Education · January 21, 2026 · 9 min read

The Basics of Hotel Content Standardization

By Engineering Team

Hotel content standardization is the foundation of any reliable travel platform. Without it, you're building on quicksand.

Here's everything you need to know about standardizing hotel data, from basic principles to practical implementation.

What Is Hotel Content Standardization?

Standardization means transforming diverse, inconsistent hotel data into a uniform, predictable format.

Before standardization:

Supplier A: {"hotel_name": "HILTON PARIS OPERA", "stars": "4"}
Supplier B: {"name": "Hilton Paris Opera", "category": "UPSCALE"}
Supplier C: {"hotelName": "Hilton Opera Paris", "classification": {"stars": 4}}

After standardization:

{
  "name": "Hilton Paris Opera",
  "stars": 4,
  "category": "upscale"
}

Why Standardization Matters

1. Predictable API Responses

Without standardization:

// Client code needs to handle every variation
const stars = hotel.stars || hotel.star_rating || hotel.classification?.stars;
const name = hotel.name || hotel.hotel_name || hotel.hotelName;

With standardization:

// Always the same structure
const stars = hotel.stars;
const name = hotel.name;

2. Reliable Search and Filtering

Without standardization:

  • "4 star" hotels stored as: "4", 4, "****", "FOUR", 4.0
  • Searching for 4-star hotels misses most results

With standardization:

  • Always stored as a single numeric value: 4.0
  • Filtering works reliably
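
As a rough sketch of how the variants listed above could be collapsed into one value, the helper below handles plain numbers, asterisks, and spelled-out numbers. The function name `parse_star_value` and the `WORD_TO_STARS` table are illustrative assumptions, not part of any supplier API.

```python
# Hypothetical helper: the name and the word table are assumptions for illustration.
WORD_TO_STARS = {"ONE": 1.0, "TWO": 2.0, "THREE": 3.0, "FOUR": 4.0, "FIVE": 5.0}

def parse_star_value(raw):
    """Return the star rating as a float, or None if it cannot be read."""
    if raw is None or isinstance(raw, bool):
        return None
    if isinstance(raw, (int, float)):
        return float(raw)
    text = str(raw).strip().upper()
    if not text:
        return None
    # "****" style: count the asterisks
    if set(text) == {"*"}:
        return float(len(text))
    # "FOUR" style: look up the spelled-out number
    if text in WORD_TO_STARS:
        return WORD_TO_STARS[text]
    # "4", "4.0", "4 star" style: read the leading token as a number
    try:
        return float(text.split()[0])
    except ValueError:
        return None

variants = ["4", 4, "****", "FOUR", 4.0]
print([parse_star_value(v) for v in variants])  # every entry is 4.0
```

Once every variant maps to the same float, a filter such as `stars == 4.0` no longer depends on how a supplier happened to encode the rating.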

3. Accurate Deduplication

Without standardization:

  • "HILTON PARIS OPERA" vs "Hilton Paris Opera" → Exact match fails
  • Matching has to tolerate case, spacing, and punctuation variations

With standardization:

  • Both normalized to "Hilton Paris Opera"
  • Exact match works reliably
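
One way to get exact matching is to compare a derived key rather than the display name. The sketch below is a minimal version of that idea; `match_key` is a hypothetical helper name, and the choice of NFKC plus `casefold()` is an assumption rather than a fixed rule.

```python
import re
import unicodedata

def match_key(name):
    """Build a comparison key: normalize Unicode, ignore case, drop punctuation."""
    text = unicodedata.normalize("NFKC", name).casefold()
    # Replace anything that is not a word character or whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim
    return " ".join(text.split())

a = match_key("HILTON PARIS OPERA")
b = match_key("  Hilton  Paris-Opera ")
print(a == b)  # True
```

A key like this only absorbs case, spacing, and punctuation. Word-order differences such as "Hilton Opera Paris" still produce a different key and need a separate matching step.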

4. Better User Experience

Without standardization:

Search Results:
- HOLIDAY INN PARIS NOTRE DAME
- Holiday Inn Paris Notre Dame
- holiday inn paris notre dame
(Same hotel, three times)

With standardization:

Search Results:
- Holiday Inn Paris Notre Dame
(Once, correctly formatted)

What to Standardize

1. Hotel Names

Goal: Consistent capitalization, formatting, and structure.

Rules:

  • Title Case: "Hilton Paris Opera" (not "HILTON PARIS OPERA")
  • Remove chain prefixes: "Paris Opera" not "Hilton Hotels - Paris Opera"
  • Normalize Unicode: "Hôtel" → consistent encoding (NFC)
  • Trim whitespace: No leading/trailing spaces
  • Collapse multiple spaces: "Paris  Opera" → "Paris Opera"

Implementation:

import unicodedata
import re

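def standardize_hotel_name(name):
    if not name:
        return None
    
    # Convert to Unicode NFC (normalized form)
    name = unicodedata.normalize('NFC', name)
    
    # Collapse multiple spaces and trim first, so the ^-anchored prefix
    # patterns below still match names that arrive with leading whitespace
    name = re.sub(r'\s+', ' ', name).strip()
    
    # Remove common chain prefixes
    chain_prefixes = [
        r'^Hilton Hotels?\s*-\s*',
        r'^Marriott\s*-\s*',
        r'^IHG\s*-\s*',
        r'^Best Western\s*-\s*',
    ]
    for prefix in chain_prefixes:
        name = re.sub(prefix, '', name, flags=re.IGNORECASE)
    
    # Title case. Note: str.title() also capitalizes the letter after an
    # apostrophe ("King's" becomes "King'S"), so such names need extra care
    name = name.title()
    
    # A name that was only whitespace or only a prefix counts as missing
    return name or None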

Examples:

  • "HILTON PARIS OPERA" → "Hilton Paris Opera"
  • "Hilton Hotels - Paris Opera" → "Paris Opera"
  • "Paris  Opera" → "Paris Opera"
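
The Unicode rule is easy to overlook because both forms render identically. The short check below shows why NFC matters, using the two standard encodings of "ô"; it relies only on the standard library.

```python
import unicodedata

composed = "H\u00f4tel"     # "ô" stored as a single code point
decomposed = "Ho\u0302tel"  # "o" followed by a combining circumflex

# The strings look the same on screen but compare as different
print(composed == decomposed)                                # False
print(len(composed), len(decomposed))                        # 5 6

# After NFC normalization the decomposed form matches the composed one
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```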

2. Addresses

Goal: Consistent structure and formatting.

Rules:

  • Separate components: street, city, postal code, country
  • Expand abbreviations: "St" → "Street", "Ave" → "Avenue"
  • Standardize country codes: Use ISO 3166-1 alpha-2 ("FR", "US")
  • Title case for city: "Paris" not "PARIS"
  • Uppercase postal codes: "75008" or "SW1A 1AA"

Implementation:

import re

from postal.parser import parse_address

# English-language abbreviations only. "st" is ambiguous ("Street" vs.
# "Saint"), so this table should not be applied as-is to other languages.
ABBREVIATIONS = {
    'st': 'Street',
    'ave': 'Avenue',
    'blvd': 'Boulevard',
    'rd': 'Road',
    'dr': 'Drive',
    'ln': 'Lane',
    'ct': 'Court',
}

def standardize_address(raw_address, city, country_code):
    # Parse address using libpostal
    parsed = parse_address(raw_address)
    
    # Extract components
    street_parts = []
    postal_code = None
    
    for component, label in parsed:
        if label in ('house_number', 'road'):
            street_parts.append(component)
        elif label == 'postcode':
            # libpostal returns lowercase text; postal codes are stored uppercase
            postal_code = component.upper()
    
    # Expand abbreviations in street
    street = ' '.join(street_parts) or None
    if street:
        for abbr, full in ABBREVIATIONS.items():
            street = re.sub(rf'\b{abbr}\b', full, street, flags=re.IGNORECASE)
        street = street.title().strip()
    
    # Standardize city
    city = city.title() if city else None
    
    # Standardize country code (ensure ISO-2).
    # to_iso2 and get_country_name are lookup helpers defined elsewhere.
    country_code = to_iso2(country_code)
    
    return {
        'street': street,
        'city': city,
        'postal_code': postal_code,
        'country_code': country_code,
        'country_name': get_country_name(country_code)
    }

Examples:

Input:  "108 blvd haussmann, PARIS, France"
Output: {
  "street": "108 Boulevard Haussmann",
  "city": "Paris",
  "postal_code": None,
  "country_code": "FR",
  "country_name": "France"
}
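
The address code calls `to_iso2` without showing it. The sketch below is one possible shape for that helper, under the assumption that suppliers send a mix of alpha-2 codes, alpha-3 codes, and English country names. The lookup tables are deliberately tiny and illustrative; a real table would cover every ISO 3166-1 entry.

```python
# Illustrative lookup tables only, not a complete ISO 3166-1 list.
ISO3_TO_ISO2 = {"FRA": "FR", "USA": "US", "GBR": "GB", "DEU": "DE"}
NAME_TO_ISO2 = {
    "FRANCE": "FR",
    "UNITED STATES": "US",
    "UNITED KINGDOM": "GB",
    "GERMANY": "DE",
}
KNOWN_ISO2 = set(ISO3_TO_ISO2.values())

def to_iso2(value):
    """Map an alpha-2 code, alpha-3 code, or country name to alpha-2, else None."""
    if not value:
        return None
    key = value.strip().upper()
    if key in KNOWN_ISO2:
        return key
    return ISO3_TO_ISO2.get(key) or NAME_TO_ISO2.get(key)

print(to_iso2("France"), to_iso2("fra"), to_iso2("fr"))  # FR FR FR
```

Returning None for unknown input, rather than passing the raw value through, keeps unrecognized countries visible in completeness metrics.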

3. Geographic Coordinates

Goal: Consistent precision and format.

Rules:

  • Always float type (not string)
  • Round to 6 decimal places (~0.1 meter precision)
  • Validate ranges: latitude (-90 to 90), longitude (-180 to 180)
  • Standardize order: always (latitude, longitude)

Implementation:

def standardize_coordinates(lat, lon, city=None, country=None):
    # Convert to float
    try:
        lat = float(lat)
        lon = float(lon)
    except (TypeError, ValueError):
        return None
    
    # Round to 6 decimal places
    lat = round(lat, 6)
    lon = round(lon, 6)
    
    # Validate ranges
    if not (-90 <= lat <= 90):
        raise ValueError(f"Invalid latitude: {lat}")
    if not (-180 <= lon <= 180):
        raise ValueError(f"Invalid longitude: {lon}")
    
    # Optional: Validate against expected city location
    if city and country:
        expected = geocode(f"{city}, {country}")
        distance = haversine(lat, lon, expected.lat, expected.lon)
        if distance > 100:  # More than 100km from city
            # Log warning but don't fail
            logger.warning(f"Coordinates {distance}km from {city}")
    
    return {
        'latitude': lat,
        'longitude': lon
    }

Examples:

  • "48.876100" → 48.8761
  • 48.87610012345 → 48.8761
  • "invalid" → None
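
The city check above depends on a `haversine` helper that is not shown. A minimal version, assuming a spherical Earth with the mean radius of 6371 km and the same argument order as the call above, could look like this:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of latitude is roughly 111 km
print(round(haversine(0.0, 0.0, 1.0, 0.0), 1))
```

The spherical model is off by a fraction of a percent compared with an ellipsoidal one, which is negligible for a 100 km sanity threshold.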

4. Star Ratings

Goal: Uniform 0-5 scale with half-star precision.

Rules:

  • Numeric type (float)
  • Range: 0.0 to 5.0
  • Half-star increments: 0.0, 0.5, 1.0, 1.5, ..., 5.0
  • Handle null explicitly (unknown vs. unrated)

Implementation:

def standardize_star_rating(rating, scale='1-5'):
    if rating is None or rating == '':
        return None
    
    # Convert to float
    try:
        rating = float(rating)
    except (TypeError, ValueError):
        # Handle letter grades
        if isinstance(rating, str):
            letter_map = {'A': 5.0, 'B': 4.0, 'C': 3.0, 'D': 2.0, 'E': 1.0}
            return letter_map.get(rating.upper())
        return None
    
    # Convert from different scales
    if scale == '1-7':
        rating = (rating / 7) * 5
    elif scale == '1-10':
        rating = (rating / 10) * 5
    
    # Round to nearest half-star
    rating = round(rating * 2) / 2
    
    # Validate range
    if not (0 <= rating <= 5):
        return None
    
    return rating

Examples:

  • "4" → 4.0
  • 4.3 → 4.5 (rounded to nearest half)
  • "A" → 5.0
  • Scale 1-7, rating 5 → 3.5 (5/7 * 5 = 3.57 → 3.5)
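
One detail worth knowing about the rounding step: Python's built-in `round()` sends exact ties to the nearest even integer, so a rating of 4.25 becomes 4.0 while 4.75 becomes 5.0. If ties should always round up, a small alternative such as the hypothetical `round_half_star` below can be used instead.

```python
import math

def round_half_star(rating):
    """Round to the nearest 0.5, with exact ties always going up."""
    return math.floor(rating * 2 + 0.5) / 2

print(round(4.25 * 2) / 2)    # 4.0, built-in round sends the tie to the even neighbour
print(round_half_star(4.25))  # 4.5, the tie goes up
print(round_half_star(4.3))   # 4.5, same as the built-in for non-ties
```

Either behavior is defensible; what matters is that the choice is deliberate and applied to every supplier.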

5. Amenities

Goal: Consistent codes and descriptions.

Rules:

  • Define canonical taxonomy
  • Map all supplier codes to taxonomy
  • Store both code (for filtering) and name (for display)
  • Include metadata (free/paid, details)

Implementation:

# Define taxonomy
AMENITY_TAXONOMY = {
    'WIFI': {
        'name': 'WiFi',
        'category': 'internet',
        'aliases': ['FREE_WIFI', 'WIRELESS', 'INTERNET', 'WIFI_FREE', 'COMPLIMENTARY_WIFI']
    },
    'PARKING': {
        'name': 'Parking',
        'category': 'parking',
        'aliases': ['CAR_PARK', 'PARKING_LOT', 'VALET', 'GARAGE']
    },
    'POOL': {
        'name': 'Swimming Pool',
        'category': 'recreation',
        'aliases': ['SWIMMING_POOL', 'OUTDOOR_POOL', 'INDOOR_POOL']
    },
    'GYM': {
        'name': 'Fitness Center',
        'category': 'recreation',
        'aliases': ['FITNESS', 'FITNESS_CENTER', 'EXERCISE_ROOM', 'WORKOUT_ROOM']
    },
    'RESTAURANT': {
        'name': 'Restaurant',
        'category': 'dining',
        'aliases': ['ON_SITE_DINING', 'DINING', 'FOOD']
    }
}

def standardize_amenity(raw_amenity):
    # Normalize input: uppercase, and join words with underscores so that
    # "Free WiFi" or "free-wifi" matches the alias "FREE_WIFI"
    raw_upper = '_'.join(raw_amenity.upper().replace('-', ' ').split())
    
    # Find match in taxonomy
    for code, meta in AMENITY_TAXONOMY.items():
        if raw_upper == code or raw_upper in meta['aliases']:
            return {
                'code': code,
                'name': meta['name'],
                'category': meta['category']
            }
    
    # Unknown amenity - store as-is
    return {
        'code': 'OTHER',
        'name': raw_amenity.strip().title(),
        'category': 'other'
    }

def standardize_amenities(raw_amenities):
    # Handle different input formats
    if not raw_amenities:
        return []
    if isinstance(raw_amenities, str):
        # Parse comma-separated string
        raw_amenities = raw_amenities.split(',')
    
    # Standardize each, skipping blanks such as the '' left by ''.split(',')
    standardized = [
        standardize_amenity(a) for a in raw_amenities if a and a.strip()
    ]
    
    # Deduplicate by code. Unknown amenities all share the code 'OTHER',
    # so they are keyed by name as well to avoid collapsing them into one
    seen = set()
    result = []
    for amenity in standardized:
        is_other = amenity['code'] == 'OTHER'
        key = (amenity['code'], amenity['name'] if is_other else None)
        if key not in seen:
            seen.add(key)
            result.append(amenity)
    
    return result

Examples:

Input:  "WIFI,FREE_WIFI,PARKING"
Output: [
  {"code": "WIFI", "name": "WiFi", "category": "internet"},
  {"code": "PARKING", "name": "Parking", "category": "parking"}
]
# Note: FREE_WIFI deduplicated to WIFI
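
The lookup above scans the whole taxonomy for every amenity. For larger taxonomies, a reverse index built once at startup may be preferable. The sketch below uses a trimmed copy of the taxonomy so it runs on its own; `build_alias_index` is a hypothetical helper name.

```python
# Trimmed copy of the taxonomy so that the example is self-contained
AMENITY_TAXONOMY = {
    "WIFI": {"name": "WiFi", "category": "internet",
             "aliases": ["FREE_WIFI", "WIRELESS"]},
    "PARKING": {"name": "Parking", "category": "parking",
                "aliases": ["CAR_PARK", "GARAGE"]},
}

def build_alias_index(taxonomy):
    """Map every canonical code and alias to its canonical code."""
    index = {}
    for code, meta in taxonomy.items():
        for key in [code, *meta["aliases"]]:
            # An alias claimed by two codes is a taxonomy bug; fail loudly
            if key in index and index[key] != code:
                raise ValueError(f"{key!r} maps to both {index[key]} and {code}")
            index[key] = code
    return index

ALIAS_INDEX = build_alias_index(AMENITY_TAXONOMY)
print(ALIAS_INDEX.get("FREE_WIFI"))  # WIFI
print(ALIAS_INDEX.get("SAUNA"))      # None
```

Building the index also doubles as a consistency check: an alias listed under two different codes raises immediately instead of silently resolving to whichever code comes first.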

6. Chain and Brand

Goal: Consistent chain/brand hierarchy.

Rules:

  • Separate chain (parent company) from brand (hotel type)
  • Use standardized chain codes
  • Map legacy brands to current owners
  • Allow null for independent hotels

Implementation:

CHAIN_TAXONOMY = {
    'HILTON': {
        'name': 'Hilton',
        'code': 'HH',
        'brands': {
            'Hilton Hotels & Resorts': 'HILTON_FLAGSHIP',
            'Hilton Garden Inn': 'HILTON_GARDEN_INN',
            'Hampton by Hilton': 'HAMPTON',
            'DoubleTree by Hilton': 'DOUBLETREE',
            'Embassy Suites': 'EMBASSY_SUITES'
        }
    },
    'MARRIOTT': {
        'name': 'Marriott International',
        'code': 'MC',
        'brands': {
            'Marriott Hotels': 'MARRIOTT_FLAGSHIP',
            'Courtyard by Marriott': 'COURTYARD',
            'Sheraton': 'SHERATON',  # Acquired from Starwood
            'Westin': 'WESTIN',      # Acquired from Starwood
            'W Hotels': 'W_HOTELS',
            'Renaissance': 'RENAISSANCE'
        }
    },
    'IHG': {
        'name': 'IHG Hotels & Resorts',
        'code': 'IH',
        'brands': {
            'Holiday Inn': 'HOLIDAY_INN',
            'Crowne Plaza': 'CROWNE_PLAZA',
            'InterContinental': 'INTERCONTINENTAL',
            'Kimpton': 'KIMPTON'
        }
    }
}

def standardize_chain(hotel_name):
    # Detect chain and brand from hotel name
    for chain_code, chain_meta in CHAIN_TAXONOMY.items():
        for brand_name, brand_code in chain_meta['brands'].items():
            if brand_name.lower() in hotel_name.lower():
                return {
                    'chain_code': chain_code,
                    'chain_name': chain_meta['name'],
                    'brand_code': brand_code,
                    'brand_name': brand_name
                }
    
    # Independent hotel
    return None

Examples:

Input:  "Hilton Garden Inn Paris Opera"
Output: {
  "chain_code": "HILTON",
  "chain_name": "Hilton",
  "brand_code": "HILTON_GARDEN_INN",
  "brand_name": "Hilton Garden Inn"
}

Input:  "Grand Hotel Paris"
Output: None  # Independent hotel
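
Plain substring matching has two weak spots: a name like "Hilton Paris Opera" contains none of the full brand names listed above and so comes back as independent, and a short brand name can match inside an unrelated word. The sketch below shows one possible refinement, matching on word boundaries and trying longer brand names first. The `BRANDS` table is a trimmed, illustrative stand-in, and mapping the bare word "Hilton" to `HILTON_FLAGSHIP` is an assumption.

```python
import re

# Illustrative table: brand name -> (chain_code, brand_code)
BRANDS = {
    "Hilton Garden Inn": ("HILTON", "HILTON_GARDEN_INN"),
    "Hilton": ("HILTON", "HILTON_FLAGSHIP"),  # assumption: bare name means flagship
    "Holiday Inn": ("IHG", "HOLIDAY_INN"),
}

# Longest names first, so "Hilton Garden Inn" wins over "Hilton"
_PATTERNS = [
    (re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE), name, codes)
    for name, codes in sorted(BRANDS.items(), key=lambda kv: len(kv[0]), reverse=True)
]

def detect_brand(hotel_name):
    """Return chain and brand info for the first matching brand, else None."""
    for pattern, brand_name, (chain_code, brand_code) in _PATTERNS:
        if pattern.search(hotel_name):
            return {
                "chain_code": chain_code,
                "brand_code": brand_code,
                "brand_name": brand_name,
            }
    return None

print(detect_brand("Hilton Garden Inn Paris Opera"))
print(detect_brand("Chilton Lodge"))  # None: "Hilton" only matches as a whole word
```

Name-based detection remains a heuristic. Where a supplier provides an explicit chain or brand code, that field is usually a more reliable source than the hotel name.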

Building a Standardization Pipeline

Step 1: Define Schema

Create a canonical data model:

from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Address:
    street: Optional[str]
    city: str
    postal_code: Optional[str]
    country_code: str
    country_name: str

@dataclass
class Location:
    latitude: float
    longitude: float

@dataclass
class Amenity:
    code: str
    name: str
    category: str

@dataclass
class Chain:
    chain_code: str
    chain_name: str
    brand_code: Optional[str]
    brand_name: Optional[str]

@dataclass
class StandardizedHotel:
    name: str
    address: Address
    location: Optional[Location]
    stars: Optional[float]
    amenities: List[Amenity]
    chain: Optional[Chain]
    description: Optional[str]

Step 2: Build Parsers

Create supplier-specific parsers:

class SupplierAParser:
    def parse(self, raw_data):
        return {
            'name': raw_data.get('hotel_name'),
            'address': raw_data.get('address'),
            'city': raw_data.get('city'),
            'country': raw_data.get('country'),
            'lat': raw_data.get('latitude'),
            'lon': raw_data.get('longitude'),
            'stars': raw_data.get('star_rating'),
            # "or ''" also covers a key that is present but set to None
            'amenities': (raw_data.get('amenities') or '').split(',')
        }

class SupplierBParser:
    def parse(self, raw_data):
        return {
            'name': raw_data.get('name'),
            'address': raw_data.get('addr1'),
            'city': raw_data.get('cityName'),
            'country': raw_data.get('countryCode'),
            'lat': raw_data.get('lat'),
            'lon': raw_data.get('lon'),
            'stars': self._parse_category(raw_data.get('category')),
            'amenities': raw_data.get('facilities', [])
        }
    
    def _parse_category(self, category):
        # Map category to stars
        category_map = {
            'LUXURY': 5, 'UPSCALE': 4, 'MIDSCALE': 3, 
            'ECONOMY': 2, 'BUDGET': 1
        }
        return category_map.get(category)
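
For completeness, a parser for the Supplier C shape shown at the start of this article might look like the sketch below. Only `hotelName` and `classification.stars` come from that sample; every other input key here is a guess at what such a feed could contain and would need to be checked against the real payload.

```python
class SupplierCParser:
    def parse(self, raw_data):
        # Stars are nested: {"classification": {"stars": 4}}
        classification = raw_data.get("classification") or {}
        return {
            "name": raw_data.get("hotelName"),
            # The keys below are assumed for illustration only
            "address": raw_data.get("address"),
            "city": raw_data.get("city"),
            "country": raw_data.get("country"),
            "lat": raw_data.get("lat"),
            "lon": raw_data.get("lon"),
            "stars": classification.get("stars"),
            "amenities": raw_data.get("amenities") or [],
        }

sample = {"hotelName": "Hilton Opera Paris", "classification": {"stars": 4}}
parsed = SupplierCParser().parse(sample)
print(parsed["name"], parsed["stars"])  # Hilton Opera Paris 4
```

Every parser returns the same intermediate keys, so the standardization step never needs to know which supplier a record came from.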

Step 3: Apply Standardization

Transform parsed data:

def standardize_hotel(parsed_data, supplier_id):
    try:
        # Standardize each field
        name = standardize_hotel_name(parsed_data['name'])
        address = standardize_address(
            parsed_data['address'],
            parsed_data['city'],
            parsed_data['country']
        )
        location = standardize_coordinates(
            parsed_data['lat'],
            parsed_data['lon'],
            parsed_data['city'],
            parsed_data['country']
        )
        stars = standardize_star_rating(parsed_data['stars'])
        amenities = standardize_amenities(parsed_data['amenities'])
        chain = standardize_chain(name)
        
        # Create standardized object. The helpers above return plain dicts,
        # so wrap them in the schema dataclasses; validation and storage
        # rely on attribute access such as hotel.address.city
        return StandardizedHotel(
            name=name,
            address=Address(**address),
            location=Location(**location) if location else None,
            stars=stars,
            amenities=[Amenity(**a) for a in amenities],
            chain=Chain(**chain) if chain else None,
            description=parsed_data.get('description')
        )
    
    except Exception as e:
        logger.error(f"Standardization failed for {supplier_id}: {e}")
        return None

Step 4: Validate

Check standardized output:

def validate_standardized_hotel(hotel):
    errors = []
    
    # Required fields
    if not hotel.name:
        errors.append("Missing hotel name")
    if not hotel.address or not hotel.address.city:
        errors.append("Missing city")
    
    # Data quality
    if hotel.stars and not (0 <= hotel.stars <= 5):
        errors.append(f"Invalid star rating: {hotel.stars}")
    
    if hotel.location:
        lat, lon = hotel.location.latitude, hotel.location.longitude
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            errors.append(f"Invalid coordinates: {lat}, {lon}")
    
    return errors

Step 5: Store

Persist standardized data:

from datetime import datetime

def store_standardized_hotel(hotel, supplier_id, supplier_hotel_id):
    # Convert to dict for database storage
    data = {
        'name': hotel.name,
        'address': {
            'street': hotel.address.street,
            'city': hotel.address.city,
            'postal_code': hotel.address.postal_code,
            'country_code': hotel.address.country_code
        },
        'location': {
            'latitude': hotel.location.latitude,
            'longitude': hotel.location.longitude
        } if hotel.location else None,
        'stars': hotel.stars,
        'amenities': [
            {'code': a.code, 'name': a.name, 'category': a.category}
            for a in hotel.amenities
        ],
        'chain': {
            'chain_code': hotel.chain.chain_code,
            'chain_name': hotel.chain.chain_name,
            'brand_code': hotel.chain.brand_code,
            'brand_name': hotel.chain.brand_name
        } if hotel.chain else None,
        'supplier_id': supplier_id,
        'supplier_hotel_id': supplier_hotel_id,
        'standardized_at': datetime.now()
    }
    
    db.standardized_hotels.insert_one(data)
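
When the schema is built from dataclasses, the manual dict construction can often be replaced by `dataclasses.asdict`, which recurses into nested dataclasses. The sketch below uses a cut-down schema and placeholder coordinates purely for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class Location:
    latitude: float
    longitude: float

@dataclass
class Hotel:
    name: str
    location: Optional[Location]

# Placeholder values for illustration only
hotel = Hotel(name="Hilton Paris Opera", location=Location(48.8761, 2.3256))

# Nested dataclasses become nested dicts; None stays None
print(asdict(hotel))
print(asdict(Hotel(name="Grand Hotel Paris", location=None)))
```

Unlike the hand-written mapping, `asdict` includes every field, so anything that should not be persisted has to be removed explicitly afterwards.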

Monitoring Standardization Quality

Track metrics:

Completeness

SELECT
  supplier_id,
  COUNT(*) as total,
  SUM(CASE WHEN location IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as pct_with_coords,
  SUM(CASE WHEN stars IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as pct_with_stars,
  SUM(CASE WHEN chain IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as pct_with_chain
FROM standardized_hotels
GROUP BY supplier_id;

Consistency

-- Check for name case issues
SELECT name, COUNT(*)
FROM standardized_hotels
WHERE name != INITCAP(name)
GROUP BY name;

-- Check for coordinate outliers
SELECT name, city, 
  ST_Distance(location, ST_Point(city_center_lon, city_center_lat)) as distance_km
FROM standardized_hotels
WHERE ST_Distance(...) > 50;  -- More than 50km from city center

Best Practices

  1. Version your standardization logic: Track changes, allow rollback
  2. Preserve raw data: Always keep original supplier data
  3. Log transformations: Record what changed during standardization
  4. Monitor quality continuously: Don't standardize once and forget
  5. Allow manual overrides: Sometimes automated standardization is wrong
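
The first three practices can be combined into a single audit record per hotel. The sketch below is one minimal way to do that; `PIPELINE_VERSION`, `diff_fields`, and `build_audit_record` are hypothetical names, and the record layout is an assumption rather than a prescribed format.

```python
PIPELINE_VERSION = "1.0.0"  # hypothetical version label for the standardization rules

def diff_fields(raw, standardized):
    """Return {field: (before, after)} for every field whose value changed."""
    return {
        field: (raw.get(field), after)
        for field, after in standardized.items()
        if raw.get(field) != after
    }

def build_audit_record(raw, standardized):
    """Keep the raw input, the rule version, and what changed, side by side."""
    return {
        "pipeline_version": PIPELINE_VERSION,
        "raw": raw,
        "changes": diff_fields(raw, standardized),
    }

raw = {"name": "HILTON PARIS OPERA", "stars": "4"}
standardized = {"name": "Hilton Paris Opera", "stars": 4.0}
record = build_audit_record(raw, standardized)
print(record["changes"])
```

Storing the version next to each record makes it possible to find and re-run only the hotels that were processed by an older rule set.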

Using mapping.travel

Building standardization from scratch takes months. mapping.travel provides:

  • Pre-standardized hotel database (1M+ hotels)
  • Standardization API (send raw data, get standardized output)
  • Continuous updates (new hotels, rebranding, closures)
  • Open-source normalization pipeline

Learn more: Why Hotel Data Normalization Matters


Questions about hotel content standardization? Join our Discord community or email hello@mapping.travel.