The Basics of Hotel Content Standardization
By Engineering Team
Hotel content standardization is the foundation of any reliable travel platform. Without it, you're building on quicksand.
Here's everything you need to know about standardizing hotel data, from basic principles to practical implementation.
What Is Hotel Content Standardization?
Standardization means transforming diverse, inconsistent hotel data into a uniform, predictable format.
Before standardization:
Supplier A: {"hotel_name": "HILTON PARIS OPERA", "stars": "4"}
Supplier B: {"name": "Hilton Paris Opera", "category": "UPSCALE"}
Supplier C: {"hotelName": "Hilton Opera Paris", "classification": {"stars": 4}}
After standardization:
{
  "name": "Hilton Paris Opera",
  "stars": 4,
  "category": "upscale"
}
Why Standardization Matters
1. Predictable API Responses
Without standardization:
// Client code needs to handle every variation
const stars = hotel.stars || hotel.star_rating || hotel.classification?.stars;
const name = hotel.name || hotel.hotel_name || hotel.hotelName;
With standardization:
// Always the same structure
const stars = hotel.stars;
const name = hotel.name;
2. Reliable Search and Filtering
Without standardization:
- "4 star" hotels stored as: "4", 4, "****", "FOUR", 4.0
- Searching for 4-star hotels misses most results
With standardization:
- Always stored as: 4 (integer)
- Filtering works reliably
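As a rough illustration of what the search layer is spared, here is a minimal sketch that maps the variants listed above to one integer. The function name `parse_star_value` and the `STAR_WORDS` table are assumptions made for this example, not part of any supplier API.

```python
import re

# Hypothetical word-to-number table; extend it for the suppliers you ingest.
STAR_WORDS = {"ONE": 1, "TWO": 2, "THREE": 3, "FOUR": 4, "FIVE": 5}

def parse_star_value(raw):
    """Return the star count as an int, or None if it cannot be recognized."""
    if raw is None or isinstance(raw, bool):
        return None
    if isinstance(raw, (int, float)):
        # Accept whole numbers only in this simplified sketch
        return int(raw) if float(raw).is_integer() else None
    text = str(raw).strip().upper()
    if text and set(text) == {"*"}:
        return len(text)                    # "****" -> 4
    if text in STAR_WORDS:
        return STAR_WORDS[text]             # "FOUR" -> 4
    match = re.fullmatch(r"(\d+)(?:\.0+)?", text)
    return int(match.group(1)) if match else None  # "4" or "4.0" -> 4

values = ["4", 4, "****", "FOUR", 4.0]
print([parse_star_value(v) for v in values])  # [4, 4, 4, 4, 4]
```

Once every record goes through a function like this at ingestion time, a filter such as `stars == 4` matches all five variants.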
3. Accurate Deduplication
Without standardization:
- "HILTON PARIS OPERA" vs "Hilton Paris Opera" → Exact match fails, so you fall back to fuzzy matching
- Need to handle case, spacing, punctuation variations
With standardization:
- Both normalized to "Hilton Paris Opera"
- Exact match works reliably
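A minimal sketch of the idea, assuming a hypothetical `match_key` helper: build a comparison key that ignores case, punctuation, and spacing, then compare keys exactly.

```python
import re
import unicodedata

def match_key(name):
    """Build a comparison key: NFC, case-folded, punctuation and extra spaces removed."""
    name = unicodedata.normalize("NFC", name).casefold()
    name = re.sub(r"[^\w\s]", " ", name)       # punctuation becomes a space
    return re.sub(r"\s+", " ", name).strip()   # collapse runs of whitespace

a = match_key("HILTON  PARIS OPERA")
b = match_key("Hilton Paris-Opera")
print(a == b)  # True
```

This only removes formatting noise. Word-order differences such as "Hilton Opera Paris" still produce a different key and need a separate matching step.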
4. Better User Experience
Without standardization:
Search Results:
- HOLIDAY INN PARIS NOTRE DAME
- Holiday Inn Paris Notre Dame
- holiday inn paris notre dame
(Same hotel, three times)
With standardization:
Search Results:
- Holiday Inn Paris Notre Dame
(Once, correctly formatted)
What to Standardize
1. Hotel Names
Goal: Consistent capitalization, formatting, and structure.
Rules:
- Title Case: "Hilton Paris Opera" (not "HILTON PARIS OPERA")
- Remove chain prefixes: "Paris Opera" not "Hilton Hotels - Paris Opera"
- Normalize Unicode: "Hôtel" → consistent encoding (NFC)
- Trim whitespace: No leading/trailing spaces
- Collapse multiple spaces: "Paris   Opera" → "Paris Opera"
Implementation:
import re
import unicodedata

def standardize_hotel_name(name):
    if not name:
        return None

    # Convert to Unicode NFC (normalized form)
    name = unicodedata.normalize('NFC', name)

    # Title case. Note: str.title() also capitalizes after apostrophes
    # ("hilton's" -> "Hilton'S"), so such names need extra handling.
    name = name.title()

    # Remove common chain prefixes
    chain_prefixes = [
        r'^Hilton Hotels?\s*-\s*',
        r'^Marriott\s*-\s*',
        r'^IHG\s*-\s*',
        r'^Best Western\s*-\s*',
    ]
    for prefix in chain_prefixes:
        name = re.sub(prefix, '', name, flags=re.IGNORECASE)

    # Collapse multiple spaces
    name = re.sub(r'\s+', ' ', name)

    # Trim
    name = name.strip()

    return name
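To see why the NFC step matters, here is a small sketch using only the standard library. The same visible name can arrive as two different code point sequences, and they compare as unequal until both are normalized.

```python
import unicodedata

composed = "H\u00f4tel"      # "ô" as a single code point (U+00F4)
decomposed = "Ho\u0302tel"   # "o" followed by a combining circumflex (U+0302)

print(composed == decomposed)                                # False
print(len(composed), len(decomposed))                        # 5 6
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```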
Examples:
"HILTON PARIS OPERA" → "Hilton Paris Opera"
"Hilton Hotels - Paris Opera" → "Paris Opera"
"Paris   Opera" → "Paris Opera"
2. Addresses
Goal: Consistent structure and formatting.
Rules:
- Separate components: street, city, postal code, country
- Expand abbreviations: "St" → "Street", "Ave" → "Avenue"
- Standardize country codes: Use ISO 3166-1 alpha-2 ("FR", "US")
- Title case for city: "Paris" not "PARIS"
- Uppercase postal codes: "75008" or "SW1A 1AA"
Implementation:
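import re
from postal.parser import parse_address

# English street abbreviations (the default table)
ABBREVIATIONS = {
    'st': 'Street',
    'ave': 'Avenue',
    'blvd': 'Boulevard',
    'rd': 'Road',
    'dr': 'Drive',
    'ln': 'Lane',
    'ct': 'Court',
}

# Abbreviations are language-dependent: in French street names "st" means "Saint".
# Illustrative per-country overrides; extend for the markets you cover.
ABBREVIATIONS_BY_COUNTRY = {
    'FR': {'st': 'Saint', 'ste': 'Sainte', 'bd': 'Boulevard', 'av': 'Avenue'},
}

def standardize_address(raw_address, city, country_code):
    # Standardize country code first (ensure ISO-2); it selects the abbreviation table.
    # to_iso2() and get_country_name() are helpers assumed to exist elsewhere.
    country_code = to_iso2(country_code)

    # Parse address using libpostal
    parsed = parse_address(raw_address)

    # Extract components
    street_parts = []
    postal_code = None
    for component, label in parsed:
        if label in ('house_number', 'road'):
            street_parts.append(component)
        elif label == 'postcode':
            postal_code = component.upper()

    # Expand abbreviations in street
    street = ' '.join(street_parts) or None
    if street:
        abbreviations = ABBREVIATIONS_BY_COUNTRY.get(country_code, ABBREVIATIONS)
        for abbr, full in abbreviations.items():
            street = re.sub(rf'\b{abbr}\b', full, street, flags=re.IGNORECASE)
        street = street.title().strip()

    # Standardize city
    city = city.title() if city else None

    return {
        'street': street,
        'city': city,
        'postal_code': postal_code,
        'country_code': country_code,
        'country_name': get_country_name(country_code)
    }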
Examples:
Input: "108 rue st lazare, PARIS, France"
Output: {
  "street": "108 Rue Saint Lazare",
  "city": "Paris",
  "postal_code": None,
  "country_code": "FR",
  "country_name": "France"
}
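The address function calls `to_iso2` and `get_country_name` without defining them. A minimal sketch of what they might look like follows, assuming a small hand-written `COUNTRIES` table with display names. A real pipeline would load the complete ISO 3166-1 list instead.

```python
# Illustrative table only: three countries with display names.
COUNTRIES = {
    "FR": {"alpha3": "FRA", "name": "France"},
    "US": {"alpha3": "USA", "name": "United States"},
    "GB": {"alpha3": "GBR", "name": "United Kingdom"},
}
_ALPHA3_TO_ALPHA2 = {v["alpha3"]: k for k, v in COUNTRIES.items()}
_NAME_TO_ALPHA2 = {v["name"].casefold(): k for k, v in COUNTRIES.items()}

def to_iso2(value):
    """Accept an alpha-2 code, alpha-3 code, or country name; return alpha-2 or None."""
    if not value:
        return None
    text = value.strip()
    upper = text.upper()
    if upper in COUNTRIES:
        return upper
    if upper in _ALPHA3_TO_ALPHA2:
        return _ALPHA3_TO_ALPHA2[upper]
    return _NAME_TO_ALPHA2.get(text.casefold())

def get_country_name(code):
    """Return the display name for an alpha-2 code, or None if unknown."""
    entry = COUNTRIES.get(code) if code else None
    return entry["name"] if entry else None

print(to_iso2("France"), to_iso2("fra"), to_iso2("fr"))  # FR FR FR
print(get_country_name("FR"))                            # France
```

Returning None for unknown values keeps the failure visible downstream instead of storing a guessed country.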
3. Geographic Coordinates
Goal: Consistent precision and format.
Rules:
- Always float type (not string)
- Round to 6 decimal places (~0.1 meter precision)
- Validate ranges: latitude (-90 to 90), longitude (-180 to 180)
- Standardize order: always (latitude, longitude)
Implementation:
def standardize_coordinates(lat, lon, city=None, country=None):
    # Convert to float
    try:
        lat = float(lat)
        lon = float(lon)
    except (TypeError, ValueError):
        return None

    # Round to 6 decimal places
    lat = round(lat, 6)
    lon = round(lon, 6)

    # Validate ranges
    if not (-90 <= lat <= 90):
        raise ValueError(f"Invalid latitude: {lat}")
    if not (-180 <= lon <= 180):
        raise ValueError(f"Invalid longitude: {lon}")

    # Optional: Validate against expected city location.
    # geocode(), haversine() and logger are assumed to be provided elsewhere.
    if city and country:
        expected = geocode(f"{city}, {country}")
        distance = haversine(lat, lon, expected.lat, expected.lon)
        if distance > 100:  # More than 100km from city
            # Log warning but don't fail
            logger.warning(f"Coordinates {distance}km from {city}")

    return {
        'latitude': lat,
        'longitude': lon
    }
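The city check relies on a `haversine` helper that the article does not define. One possible sketch, assuming the arguments are decimal degrees and the result is wanted in kilometers:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of latitude is roughly 111 km
print(round(haversine(0.0, 0.0, 1.0, 0.0), 1))  # 111.2
```

The formula treats the Earth as a sphere, which is accurate enough for a "is this hotel anywhere near its city" sanity check.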
Examples:
"48.876100" → 48.8761
48.87610012345 → 48.8761
"invalid" → None
4. Star Ratings
Goal: Uniform 0-5 scale with half-star precision.
Rules:
- Numeric type (float)
- Range: 0.0 to 5.0
- Half-star increments: 0.0, 0.5, 1.0, 1.5, ..., 5.0
- Handle null explicitly (unknown vs. unrated)
Implementation:
import math

def standardize_star_rating(rating, scale='1-5'):
    if rating is None or rating == '':
        return None

    # Convert to float
    try:
        rating = float(rating)
    except (TypeError, ValueError):
        # Handle letter grades
        if isinstance(rating, str):
            letter_map = {'A': 5.0, 'B': 4.0, 'C': 3.0, 'D': 2.0, 'E': 1.0}
            return letter_map.get(rating.strip().upper())
        return None

    # Reject NaN and infinity, which float() accepts
    if not math.isfinite(rating):
        return None

    # Convert from different scales
    if scale == '1-7':
        rating = (rating / 7) * 5
    elif scale == '1-10':
        rating = (rating / 10) * 5

    # Round to nearest half-star. floor(x + 0.5) rounds ties up; the built-in
    # round() rounds ties to even, so 4.25 would become 4.0 instead of 4.5.
    rating = math.floor(rating * 2 + 0.5) / 2

    # Validate range
    if not (0 <= rating <= 5):
        return None

    return rating
Examples:
"4" → 4.0
4.3 → 4.5 (rounded to nearest half)
"A" → 5.0
Scale 1-7, rating 5 → 3.5 (5/7 * 5 = 3.57 → 3.5)
5. Amenities
Goal: Consistent codes and descriptions.
Rules:
- Define canonical taxonomy
- Map all supplier codes to taxonomy
- Store both code (for filtering) and name (for display)
- Include metadata (free/paid, details)
Implementation:
# Define taxonomy
AMENITY_TAXONOMY = {
    'WIFI': {
        'name': 'WiFi',
        'category': 'internet',
        'aliases': ['FREE_WIFI', 'WIRELESS', 'INTERNET', 'WIFI_FREE', 'COMPLIMENTARY_WIFI']
    },
    'PARKING': {
        'name': 'Parking',
        'category': 'parking',
        'aliases': ['CAR_PARK', 'PARKING_LOT', 'VALET', 'GARAGE']
    },
    'POOL': {
        'name': 'Swimming Pool',
        'category': 'recreation',
        'aliases': ['SWIMMING_POOL', 'OUTDOOR_POOL', 'INDOOR_POOL']
    },
    'GYM': {
        'name': 'Fitness Center',
        'category': 'recreation',
        'aliases': ['FITNESS', 'FITNESS_CENTER', 'EXERCISE_ROOM', 'WORKOUT_ROOM']
    },
    'RESTAURANT': {
        'name': 'Restaurant',
        'category': 'dining',
        'aliases': ['ON_SITE_DINING', 'DINING', 'FOOD']
    }
}
import re

def standardize_amenity(raw_amenity):
    # Normalize input: uppercase, and treat spaces/hyphens like underscores
    raw_upper = re.sub(r'[\s-]+', '_', raw_amenity.strip().upper())

    # Find match in taxonomy
    for code, meta in AMENITY_TAXONOMY.items():
        if raw_upper == code or raw_upper in meta['aliases']:
            return {
                'code': code,
                'name': meta['name'],
                'category': meta['category']
            }

    # Unknown amenity - store as-is
    return {
        'code': 'OTHER',
        'name': raw_amenity.strip().title(),
        'category': 'other'
    }

def standardize_amenities(raw_amenities):
    # Handle different input formats
    if isinstance(raw_amenities, str):
        # Parse comma-separated string
        raw_amenities = raw_amenities.split(',')

    # Standardize each, skipping empty entries
    standardized = [standardize_amenity(a) for a in raw_amenities if a and a.strip()]

    # Deduplicate by code and name. Every unknown amenity shares the code
    # 'OTHER', so deduplicating by code alone would keep only the first one.
    seen = set()
    result = []
    for amenity in standardized:
        key = (amenity['code'], amenity['name'])
        if key not in seen:
            seen.add(key)
            result.append(amenity)

    return result
Examples:
Input: "WIFI,FREE_WIFI,PARKING"
Output: [
  {"code": "WIFI", "name": "WiFi", "category": "internet"},
  {"code": "PARKING", "name": "Parking", "category": "parking"}
]
# Note: FREE_WIFI deduplicated to WIFI
6. Chain and Brand
Goal: Consistent chain/brand hierarchy.
Rules:
- Separate chain (parent company) from brand (hotel type)
- Use standardized chain codes
- Map legacy brands to current owners
- Allow null for independent hotels
Implementation:
CHAIN_TAXONOMY = {
    'HILTON': {
        'name': 'Hilton',
        'code': 'HH',
        'brands': {
            'Hilton Hotels & Resorts': 'HILTON_FLAGSHIP',
            'Hilton Garden Inn': 'HILTON_GARDEN_INN',
            'Hampton by Hilton': 'HAMPTON',
            'DoubleTree by Hilton': 'DOUBLETREE',
            'Embassy Suites': 'EMBASSY_SUITES'
        }
    },
    'MARRIOTT': {
        'name': 'Marriott International',
        'code': 'MC',
        'brands': {
            'Marriott Hotels': 'MARRIOTT_FLAGSHIP',
            'Courtyard by Marriott': 'COURTYARD',
            'Sheraton': 'SHERATON',  # Acquired from Starwood
            'Westin': 'WESTIN',  # Acquired from Starwood
            'W Hotels': 'W_HOTELS',
            'Renaissance': 'RENAISSANCE'
        }
    },
    'IHG': {
        'name': 'IHG Hotels & Resorts',
        'code': 'IH',
        'brands': {
            'Holiday Inn': 'HOLIDAY_INN',
            'Crowne Plaza': 'CROWNE_PLAZA',
            'InterContinental': 'INTERCONTINENTAL',
            'Kimpton': 'KIMPTON'
        }
    }
}
def standardize_chain(hotel_name):
    # A missing name cannot be matched to any chain
    if not hotel_name:
        return None

    # Detect chain and brand from hotel name (simple substring match)
    name_lower = hotel_name.lower()
    for chain_code, chain_meta in CHAIN_TAXONOMY.items():
        for brand_name, brand_code in chain_meta['brands'].items():
            if brand_name.lower() in name_lower:
                return {
                    'chain_code': chain_code,
                    'chain_name': chain_meta['name'],
                    'brand_code': brand_code,
                    'brand_name': brand_name
                }

    # Independent hotel
    return None
Examples:
Input: "Hilton Garden Inn Paris Opera"
Output: {
  "chain_code": "HILTON",
  "chain_name": "Hilton",
  "brand_code": "HILTON_GARDEN_INN",
  "brand_name": "Hilton Garden Inn"
}
Input: "Grand Hotel Paris"
Output: None  # Independent hotel
Building a Standardization Pipeline
Step 1: Define Schema
Create a canonical data model:
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Address:
    street: Optional[str]
    city: str
    postal_code: Optional[str]
    country_code: str
    country_name: str

@dataclass
class Location:
    latitude: float
    longitude: float

@dataclass
class Amenity:
    code: str
    name: str
    category: str

@dataclass
class Chain:
    chain_code: str
    chain_name: str
    brand_code: Optional[str]
    brand_name: Optional[str]

@dataclass
class StandardizedHotel:
    name: str
    address: Address
    location: Optional[Location]
    stars: Optional[float]
    amenities: List[Amenity]
    chain: Optional[Chain]
    description: Optional[str]
Step 2: Build Parsers
Create supplier-specific parsers:
class SupplierAParser:
    def parse(self, raw_data):
        return {
            'name': raw_data.get('hotel_name'),
            'address': raw_data.get('address'),
            'city': raw_data.get('city'),
            'country': raw_data.get('country'),
            'lat': raw_data.get('latitude'),
            'lon': raw_data.get('longitude'),
            'stars': raw_data.get('star_rating'),
            'amenities': raw_data.get('amenities', '').split(',')
        }

class SupplierBParser:
    def parse(self, raw_data):
        return {
            'name': raw_data.get('name'),
            'address': raw_data.get('addr1'),
            'city': raw_data.get('cityName'),
            'country': raw_data.get('countryCode'),
            'lat': raw_data.get('lat'),
            'lon': raw_data.get('lon'),
            'stars': self._parse_category(raw_data.get('category')),
            'amenities': raw_data.get('facilities', [])
        }

    def _parse_category(self, category):
        # Map category to stars
        category_map = {
            'LUXURY': 5, 'UPSCALE': 4, 'MIDSCALE': 3,
            'ECONOMY': 2, 'BUDGET': 1
        }
        return category_map.get(category)
Step 3: Apply Standardization
Transform parsed data:
def standardize_hotel(parsed_data, supplier_id):
    try:
        # Standardize each field
        name = standardize_hotel_name(parsed_data['name'])
        address = standardize_address(
            parsed_data['address'],
            parsed_data['city'],
            parsed_data['country']
        )
        location = standardize_coordinates(
            parsed_data['lat'],
            parsed_data['lon'],
            parsed_data['city'],
            parsed_data['country']
        )
        stars = standardize_star_rating(parsed_data['stars'])
        amenities = standardize_amenities(parsed_data['amenities'])
        chain = standardize_chain(name)

        # Create standardized object. The field helpers return plain dicts,
        # so wrap them in the schema dataclasses from Step 1.
        return StandardizedHotel(
            name=name,
            address=Address(**address),
            location=Location(**location) if location else None,
            stars=stars,
            amenities=[Amenity(**a) for a in amenities],
            chain=Chain(**chain) if chain else None,
            description=parsed_data.get('description')
        )
    except Exception as e:
        logger.error(f"Standardization failed for {supplier_id}: {e}")
        return None
Step 4: Validate
Check standardized output:
def validate_standardized_hotel(hotel):
    errors = []

    # Required fields
    if not hotel.name:
        errors.append("Missing hotel name")
    if not hotel.address or not hotel.address.city:
        errors.append("Missing city")

    # Data quality ("is not None" so that a 0.0 rating is still checked)
    if hotel.stars is not None and not (0 <= hotel.stars <= 5):
        errors.append(f"Invalid star rating: {hotel.stars}")
    if hotel.location:
        lat, lon = hotel.location.latitude, hotel.location.longitude
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            errors.append(f"Invalid coordinates: {lat}, {lon}")

    return errors
Step 5: Store
Persist standardized data:
from datetime import datetime, timezone

def store_standardized_hotel(hotel, supplier_id, supplier_hotel_id):
    # Convert to dict for database storage
    data = {
        'name': hotel.name,
        'address': {
            'street': hotel.address.street,
            'city': hotel.address.city,
            'postal_code': hotel.address.postal_code,
            'country_code': hotel.address.country_code
        },
        'location': {
            'latitude': hotel.location.latitude,
            'longitude': hotel.location.longitude
        } if hotel.location else None,
        'stars': hotel.stars,
        'amenities': [
            {'code': a.code, 'name': a.name, 'category': a.category}
            for a in hotel.amenities
        ],
        'chain': {
            'chain_code': hotel.chain.chain_code,
            'chain_name': hotel.chain.chain_name,
            'brand_code': hotel.chain.brand_code,
            'brand_name': hotel.chain.brand_name
        } if hotel.chain else None,
        'supplier_id': supplier_id,
        'supplier_hotel_id': supplier_hotel_id,
        # Timezone-aware UTC timestamp
        'standardized_at': datetime.now(timezone.utc)
    }

    # db is assumed to be a MongoDB database handle (e.g. from pymongo)
    db.standardized_hotels.insert_one(data)
Monitoring Standardization Quality
Track metrics:
Completeness
SELECT
supplier_id,
COUNT(*) as total,
SUM(CASE WHEN location IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as pct_with_coords,
SUM(CASE WHEN stars IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as pct_with_stars,
SUM(CASE WHEN chain IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as pct_with_chain
FROM standardized_hotels
GROUP BY supplier_id;
Consistency
-- Check for name case issues
SELECT name, COUNT(*)
FROM standardized_hotels
WHERE name != INITCAP(name)
GROUP BY name;
-- Check for coordinate outliers
SELECT name, city,
ST_Distance(location, ST_Point(city_center_lon, city_center_lat)) as distance_km
FROM standardized_hotels
WHERE ST_Distance(...) > 50; -- More than 50km from city center
Best Practices
- Version your standardization logic: Track changes, allow rollback
- Preserve raw data: Always keep original supplier data
- Log transformations: Record what changed during standardization
- Monitor quality continuously: Don't standardize once and forget
- Allow manual overrides: Sometimes automated standardization is wrong
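Several of these practices can live in one record structure. The sketch below is one possible shape, not a prescribed design: `StandardizationRecord`, `build_record`, and `PIPELINE_VERSION` are hypothetical names, and it assumes raw and standardized data share field names so that changes can be compared key by key.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

PIPELINE_VERSION = "1.0.0"  # hypothetical version label for the standardization logic

@dataclass
class StandardizationRecord:
    raw: dict                 # original supplier data, kept untouched
    standardized: dict        # final output, after manual overrides
    pipeline_version: str     # which logic version produced this record
    changes: list = field(default_factory=list)     # transformation log
    overrides: dict = field(default_factory=dict)   # manual corrections applied
    created_at: str = ""

def build_record(raw, standardized, overrides=None):
    """Combine automated output with manual overrides and log what changed."""
    overrides = dict(overrides or {})
    final = {**standardized, **overrides}  # manual overrides win
    changes = [
        {"field": key, "before": raw.get(key), "after": value}
        for key, value in final.items()
        if raw.get(key) != value
    ]
    return StandardizationRecord(
        raw=dict(raw),
        standardized=final,
        pipeline_version=PIPELINE_VERSION,
        changes=changes,
        overrides=overrides,
        created_at=datetime.now(timezone.utc).isoformat(),
    )

record = build_record(
    raw={"name": "HILTON PARIS OPERA", "stars": "4"},
    standardized={"name": "Hilton Paris Opera", "stars": 4.0},
    overrides={"stars": 4.5},
)
print([c["field"] for c in record.changes])  # ['name', 'stars']
```

Storing the version next to each record makes it possible to find and reprocess everything produced by a faulty release.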
Using mapping.travel
Building standardization from scratch takes months. mapping.travel provides:
- Pre-standardized hotel database (1M+ hotels)
- Standardization API (send raw data, get standardized output)
- Continuous updates (new hotels, rebranding, closures)
- Open-source normalization pipeline
Learn more: Why Hotel Data Normalization Matters
Questions about hotel content standardization? Join our Discord community or email hello@mapping.travel.