Engineering · January 27, 2026 · 9 min read

Why Hotel Data Normalization Matters in Travel APIs

By Engineering Team

If you're building a travel application that aggregates hotel data from multiple sources, you've probably already discovered a painful truth: supplier data is wildly inconsistent.

Different formats, different naming conventions, different schemas, different quality standards. Without normalization, your API becomes an unreliable mess.

Here's why hotel data normalization matters and how to do it right.

The Problem: Raw Supplier Data Is Chaos

Let's look at the same hotel from three different suppliers:

Supplier A (Bedbank)

{
  "hotel_id": "HTL_12345",
  "hotel_name": "HILTON PARIS OPERA",
  "address": "108 RUE SAINT LAZARE",
  "city": "PARIS",
  "country": "FR",
  "latitude": "48.876100",
  "longitude": "2.326600",
  "star_rating": "4",
  "amenities": "WIFI,PARKING,RESTAURANT"
}

Supplier B (GDS)

{
  "propertyCode": "PAR_HIL_987",
  "name": "Hilton Paris Opera",
  "addr1": "108 Rue Saint-Lazare",
  "addr2": "",
  "cityName": "Paris",
  "countryCode": "France",
  "lat": 48.8761,
  "lon": 2.3266,
  "category": "UPSCALE",
  "facilities": ["Free WiFi", "Parking Available", "On-site Dining"]
}

Supplier C (Direct Connect)

{
  "id": "5678-FR",
  "hotelName": "Hilton Opera Paris",
  "location": {
    "street": "108 rue Saint Lazare",
    "city": "Paris",
    "postalCode": "75008",
    "country": "FRA",
    "coordinates": {
      "latitude": 48.876100,
      "longitude": 2.326600
    }
  },
  "classification": {
    "stars": 4,
    "category": "Premium"
  },
  "services": {
    "wifi": true,
    "parking": {"available": true, "type": "Paid"},
    "food_beverage": {"restaurant": true, "bar": true}
  }
}

Same hotel. Completely different formats.

What Data Normalization Solves

Normalization transforms disparate data into a consistent, canonical format:

{
  "id": "HOTEL_00001",
  "name": "Hilton Paris Opera",
  "address": {
    "street": "108 Rue Saint-Lazare",
    "city": "Paris",
    "postal_code": "75008",
    "country_code": "FR",
    "country_name": "France"
  },
  "location": {
    "latitude": 48.8761,
    "longitude": 2.3266
  },
  "classification": {
    "stars": 4,
    "category": "upscale"
  },
  "amenities": [
    {"code": "WIFI", "name": "Free WiFi"},
    {"code": "PARKING", "name": "Parking (Paid)"},
    {"code": "RESTAURANT", "name": "Restaurant"}
  ],
  "supplier_mappings": [
    {"supplier": "supplier_a", "id": "HTL_12345"},
    {"supplier": "supplier_b", "id": "PAR_HIL_987"},
    {"supplier": "supplier_c", "id": "5678-FR"}
  ]
}

This enables:

  • Consistent API responses
  • Reliable filtering and search
  • Accurate deduplication
  • Cross-supplier comparison
  • Easier downstream processing

Key Normalization Challenges

1. Name Standardization

Problems:

  • ALL CAPS vs. Title Case vs. Sentence case
  • Special characters: "Hôtel" vs "Hotel"
  • Chain prefixes: "Hilton - Paris Opera" vs "Hilton Paris Opera"
  • Word order: "Opera Paris Hilton" vs "Hilton Paris Opera"

Solution:

  • Define canonical format (e.g., Title Case)
  • Strip chain prefixes
  • Normalize Unicode (NFD → NFC)
  • Store variations for search, display canonical

Example normalization logic:

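import unicodedata

# Chain prefixes to strip, matched case-insensitively
CHAIN_PREFIXES = ["Hilton Hotels -", "Marriott -", "IHG -"]

def normalize_hotel_name(name):
    # Normalize Unicode first so composed and decomposed forms compare equal
    name = unicodedata.normalize('NFC', name).strip()

    # Remove common chain prefixes before changing case, so that "IHG -"
    # still matches (title-casing first would turn it into "Ihg -")
    for prefix in CHAIN_PREFIXES:
        if name.lower().startswith(prefix.lower()):
            name = name[len(prefix):]

    # Remove extra whitespace
    name = ' '.join(name.split())

    # Convert to title case. Note that str.title() also capitalizes after
    # apostrophes ("Hotel'S"), so such names need extra handling.
    return name.title()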
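The "store variations for search" step can be handled with a separate lookup key next to the canonical display name. The sketch below is one possible approach, not a prescribed one: `search_key` is a hypothetical helper that strips accents, punctuation, and case, then sorts the tokens so that word order no longer matters.

```python
import unicodedata


def search_key(name: str) -> str:
    """Build an accent-, case-, and order-insensitive key for lookups.

    Hypothetical helper: the canonical display name is stored separately.
    """
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Replace punctuation with spaces, then fold case
    cleaned = "".join(ch if ch.isalnum() else " " for ch in stripped).casefold()

    # Sort tokens so "Opera Paris Hilton" and "Hilton Paris Opera" agree
    return " ".join(sorted(cleaned.split()))
```

With this key, "Hôtel Hilton - Paris Opera" and "HILTON PARIS OPERA HOTEL" collapse to the same string, while the UI keeps showing the canonical name.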

2. Address Formatting

Problems:

  • Different field structures (street1/street2 vs. single line)
  • Abbreviations: "St" vs "Street", "Ave" vs "Avenue"
  • Postal codes: with/without spaces ("75008" vs "75 008")
  • Country codes: ISO-2 ("FR") vs ISO-3 ("FRA") vs full name ("France")

Solution:

  • Parse into standard components: street, city, postal_code, country
  • Expand abbreviations
  • Standardize country codes (use ISO-2 as canonical)
  • Validate against geocoding APIs when possible

Example:

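def normalize_address(raw_address):
    # parse_address, expand_abbreviations, to_iso2, and COUNTRIES are
    # placeholders for your own helpers and lookup tables

    # Parse components
    parsed = parse_address(raw_address)

    # Expand abbreviations
    street = expand_abbreviations(parsed.street)

    # Standardize country code
    country_code = to_iso2(parsed.country)

    # Strip spaces from postal codes ("75 008" -> "75008"), except where the
    # space is part of the standard format (e.g. UK "SW1A 1AA", Canada "K1A 0B1")
    postal_code = parsed.postal_code.strip()
    if country_code not in ('GB', 'CA'):
        postal_code = postal_code.replace(' ', '')

    return {
        'street': street,
        'city': parsed.city.title(),
        'postal_code': postal_code,
        'country_code': country_code,
        'country_name': COUNTRIES[country_code]
    }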
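The address example relies on a `to_iso2` helper and a `COUNTRIES` table that are not defined in the post. A minimal sketch of what they might look like follows; the lookup tables are illustrative and cover only four countries, so a real implementation would load the full ISO 3166-1 list instead.

```python
# Illustrative tables only; a real system would cover all ISO 3166-1 entries.
COUNTRIES = {"FR": "France", "GB": "United Kingdom", "DE": "Germany", "US": "United States"}
ISO3_TO_ISO2 = {"FRA": "FR", "GBR": "GB", "DEU": "DE", "USA": "US"}
NAME_TO_ISO2 = {name.casefold(): code for code, name in COUNTRIES.items()}


def to_iso2(value: str) -> str:
    """Map an ISO-2 code, ISO-3 code, or English country name to ISO-2."""
    token = value.strip()
    upper = token.upper()

    if upper in COUNTRIES:          # already ISO-2, e.g. "FR"
        return upper
    if upper in ISO3_TO_ISO2:       # ISO-3, e.g. "FRA"
        return ISO3_TO_ISO2[upper]

    iso2 = NAME_TO_ISO2.get(token.casefold())  # full name, e.g. "France"
    if iso2 is None:
        # Fail loudly rather than guessing, so bad supplier data gets reviewed
        raise ValueError(f"Unrecognized country: {value!r}")
    return iso2
```

All three supplier spellings from the opening example ("FR", "France", "FRA") resolve to "FR" with this lookup.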

3. Coordinate Precision

Problems:

  • Different precision: 48.876100 vs 48.8761 vs 48.88
  • String vs. number: "48.876100" vs 48.8761
  • Reversed lat/long: some suppliers swap them
  • Incorrect coordinates: hotel in Paris with London coordinates

Solution:

  • Standardize to 6 decimal places (~0.1 meter precision)
  • Always store as floats
  • Validate: latitude (-90 to 90), longitude (-180 to 180)
  • Cross-check with city/country (flag if > 100km away)

Example:

def normalize_coordinates(lat, lon, city, country):
    # geocode, haversine, and log_warning are placeholders for your own helpers

    # Convert to float (suppliers often send strings)
    lat = float(lat)
    lon = float(lon)

    # Validate ranges
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"Invalid coordinates: {lat}, {lon}")

    # Round to 6 decimal places
    lat = round(lat, 6)
    lon = round(lon, 6)

    # Validate against city (using geocoding service)
    expected_location = geocode(f"{city}, {country}")
    distance = haversine(lat, lon, expected_location.lat, expected_location.lon)

    if distance > 100:  # More than 100km from city center
        log_warning(f"Suspicious coordinates: {distance:.0f}km from {city}")

    # Return a dict so callers can read result['latitude'] and result['longitude']
    return {'latitude': lat, 'longitude': lon}
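The coordinate example calls a `haversine` helper without defining it. The sketch below shows the standard haversine great-circle formula, plus a hypothetical `looks_swapped` check for the "reversed lat/long" problem listed above. The 100 km threshold mirrors the one in the example and is an assumption, not a recommendation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against tiny floating-point overshoot above 1.0
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def looks_swapped(lat, lon, expected_lat, expected_lon, threshold_km=100):
    """Return True if swapping lat/lon moves the point within the threshold."""
    if haversine(lat, lon, expected_lat, expected_lon) <= threshold_km:
        return False  # already plausible as given
    if not -90 <= lon <= 90:
        return False  # the longitude could not be a valid latitude
    return haversine(lon, lat, expected_lat, expected_lon) <= threshold_km
```

A record that fails the city cross-check but passes `looks_swapped` is a good candidate for automatic correction with a logged transformation, rather than outright rejection.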

4. Star Ratings

Problems:

  • Different scales: 1-5 stars, 1-7 stars, letter grades (A-E)
  • Half stars: 4.5 represented as "4.5" or "4" or "5"
  • Unofficial ratings: "3-star equivalent"
  • Missing data: null vs 0 vs -1

Solution:

  • Standardize to 0-5 scale (allowing half stars)
  • Map alternative scales (A→5, B→4, etc.)
  • Store "official" flag (true if government-certified)
  • Handle null explicitly (unknown vs. unrated)

Example:

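def normalize_star_rating(rating, scale_type):
    # Treat null and sentinel values (0, -1, empty string) as "unknown"
    if rating is None or rating in ('', 0, -1, '0', '-1'):
        return None

    # Convert to standard 0-5 scale
    if scale_type == '1-7':
        stars = (float(rating) / 7) * 5
    elif scale_type == 'letter':
        letter_map = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'E': 1}
        stars = letter_map.get(str(rating).strip().upper())
        if stars is None:
            return None  # Unrecognized letter grade
    else:  # Assume 1-5
        stars = float(rating)

    # Round to the nearest half star and clamp to the 0-5 range
    return min(5.0, max(0.0, round(stars * 2) / 2))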

5. Amenities

Problems:

  • Freeform text: "WIFI,PARKING,RESTAURANT"
  • Verbose descriptions: "Complimentary wireless internet access throughout the property"
  • Different codes: "WIFI" vs "FREE_WIFI" vs "INTERNET"
  • Inconsistent detail: "Parking" vs "Parking (Paid)" vs "Valet Parking ($25/day)"

Solution:

  • Define standard amenity taxonomy
  • Map supplier codes to canonical codes
  • Extract metadata (free/paid, availability, price)
  • Store both code (for filtering) and display name (for UI)

Example taxonomy:

AMENITIES = {
    'WIFI': {
        'name': 'WiFi',
        'category': 'internet',
        'variants': ['FREE_WIFI', 'WIRELESS', 'INTERNET', 'WIFI_FREE']
    },
    'PARKING': {
        'name': 'Parking',
        'category': 'parking',
        'variants': ['CAR_PARK', 'PARKING_LOT', 'VALET']
    },
    # ... more amenities
}

def normalize_amenity(raw_amenity):
    key = raw_amenity.strip().upper()

    # Find matching canonical amenity: either the canonical code itself
    # (e.g. "WIFI") or one of its known variants
    for code, meta in AMENITIES.items():
        if key == code or key in meta['variants']:
            return {
                'code': code,
                'name': meta['name'],
                'category': meta['category']
            }

    # Unknown amenity
    return {
        'code': 'OTHER',
        'name': raw_amenity.title(),
        'category': 'other'
    }
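The taxonomy lookup covers code mapping, but not the "extract metadata (free/paid, availability, price)" step. The sketch below shows one way to pull pricing hints out of freeform text with a regular expression. `extract_amenity_metadata` is a hypothetical helper, and it only recognizes dollar amounts and a few English keywords, so treat it as a starting point.

```python
import re

# Matches prices such as "$25/day" or "$12.50 / night" (dollar amounts only)
PRICE_PATTERN = re.compile(r"\$(\d+(?:\.\d+)?)\s*/\s*(day|night|hour)", re.IGNORECASE)


def extract_amenity_metadata(text: str) -> dict:
    """Pull pricing hints out of a freeform amenity description."""
    lowered = text.lower()
    meta = {"paid": None, "price": None, "per": None}  # None means unknown

    match = PRICE_PATTERN.search(text)
    if match:
        meta.update(paid=True, price=float(match.group(1)), per=match.group(2).lower())
    elif "paid" in lowered or "surcharge" in lowered:
        meta["paid"] = True
    elif "free" in lowered or "complimentary" in lowered:
        meta["paid"] = False

    return meta
```

Keeping `paid` as `None` when nothing is stated preserves the difference between "free" and "not specified", which matters for filters like "free parking only".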

6. Chain and Brand

Problems:

  • Inconsistent naming: "Hilton" vs "Hilton Hotels" vs "Hilton Worldwide"
  • Brands vs. chains: "Hilton Garden Inn" (brand) vs "Hilton" (chain)
  • Acquisitions: Starwood → Marriott (legacy data still says Starwood)
  • Missing data: Independent hotels with no chain

Solution:

  • Maintain chain/brand hierarchy
  • Normalize chain names
  • Map legacy brands to current owners
  • Allow null for independent hotels

Example:

CHAINS = {
    'HILTON': {
        'name': 'Hilton',
        'brands': [
            'Hilton Hotels & Resorts',
            'Hilton Garden Inn',
            'Hampton by Hilton',
            'DoubleTree by Hilton'
        ]
    },
    'MARRIOTT': {
        'name': 'Marriott International',
        'brands': [
            'Marriott Hotels',
            'Sheraton',  # Acquired from Starwood
            'Westin',    # Acquired from Starwood
            'Courtyard by Marriott'
        ]
    }
}

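def normalize_chain(hotel_name):
    name = hotel_name.lower()

    for chain_code, chain_meta in CHAINS.items():
        # Check brands longest-first so the most specific name wins
        for brand in sorted(chain_meta['brands'], key=len, reverse=True):
            if brand.lower() in name:
                return {
                    'chain_code': chain_code,
                    'chain_name': chain_meta['name'],
                    'brand_name': brand
                }

    # Fall back to the chain keyword, so "Hilton Paris Opera" still maps to
    # Hilton even though no full brand name appears in it
    for chain_code, chain_meta in CHAINS.items():
        if chain_code.lower() in name:
            return {
                'chain_code': chain_code,
                'chain_name': chain_meta['name'],
                'brand_name': None
            }

    return None  # Independent hotel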

Building a Normalization Pipeline

Stage 1: Ingestion

  • Fetch raw data from suppliers (API, FTP, database)
  • Store raw JSON/XML in staging database
  • Preserve original data for debugging

Stage 2: Parsing

  • Convert to internal representation
  • Extract fields based on supplier schema
  • Handle missing fields gracefully
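As a rough illustration of the parsing stage, the sketch below maps the Supplier A and Supplier B payloads from the start of this post onto one flat intermediate record. The function names, the `PARSERS` registry, and the intermediate field names are all assumptions made for this example, and it returns plain dicts rather than the attribute-style object used in the pipeline code.

```python
def parse_supplier_a(raw: dict) -> dict:
    """Supplier A: flat keys, comma-separated amenities, numeric strings."""
    amenities = (raw.get("amenities") or "").split(",")
    return {
        "id": raw["hotel_id"],
        "name": raw["hotel_name"],
        "street": raw.get("address"),
        "city": raw.get("city"),
        "country": raw.get("country"),
        "lat": raw.get("latitude"),
        "lon": raw.get("longitude"),
        "amenities": [a.strip() for a in amenities if a.strip()],
    }


def parse_supplier_b(raw: dict) -> dict:
    """Supplier B: camelCase keys, split address lines, amenity list."""
    street = " ".join(part for part in (raw.get("addr1"), raw.get("addr2")) if part)
    return {
        "id": raw["propertyCode"],
        "name": raw["name"],
        "street": street or None,
        "city": raw.get("cityName"),
        "country": raw.get("countryCode"),
        "lat": raw.get("lat"),
        "lon": raw.get("lon"),
        "amenities": list(raw.get("facilities") or []),
    }


# One parser per supplier schema; adding a supplier means adding one entry
PARSERS = {"supplier_a": parse_supplier_a, "supplier_b": parse_supplier_b}


def parse_record(raw: dict, supplier_id: str) -> dict:
    """Dispatch to the supplier-specific parser; optional fields become None."""
    if supplier_id not in PARSERS:
        raise ValueError(f"No parser registered for {supplier_id!r}")
    return PARSERS[supplier_id](raw)
```

Only the identifier and name are treated as required here; every other field falls back to `None` or an empty list, so the normalization stage can decide how to handle gaps.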

Stage 3: Normalization

  • Apply transformations (case, formatting)
  • Validate data types and ranges
  • Enrich with external data (geocoding, chain lookup)

Stage 4: Matching

  • Fuzzy match hotel names
  • Validate with geographic proximity
  • Assign master hotel ID
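The matching stage might look something like the sketch below, which combines a name similarity score with a distance cutoff. It uses `difflib.SequenceMatcher` from the standard library purely to stay dependency-free; a library such as rapidfuzz would be a natural substitute. The `find_match` name, the record shape, and both thresholds are assumptions that would need tuning against real data.

```python
import math
from difflib import SequenceMatcher


def distance_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], comparing sorted tokens so word order matters less."""
    tokens_a = " ".join(sorted(a.lower().split()))
    tokens_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, tokens_a, tokens_b).ratio()


def find_match(candidate, masters, min_similarity=0.85, max_distance_km=0.5):
    """Return the ID of the best-matching master hotel, or None if nothing fits."""
    best_id, best_score = None, 0.0
    for master in masters:
        # Geographic proximity acts as a hard filter before name comparison
        if distance_km(candidate["lat"], candidate["lon"],
                       master["lat"], master["lon"]) > max_distance_km:
            continue
        score = name_similarity(candidate["name"], master["name"])
        if score >= min_similarity and score > best_score:
            best_id, best_score = master["id"], score
    return best_id
```

When `find_match` returns `None`, the pipeline would create a new master record; borderline scores are a reasonable place to route hotels to manual review.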

Stage 5: Merging

  • Combine data from multiple suppliers for same hotel
  • Resolve conflicts (different addresses, different star ratings)
  • Choose best data based on supplier priority
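A simple way to resolve conflicts by supplier priority is to walk the suppliers in ranked order and take the first non-empty value for each field. The sketch below assumes a hypothetical `SUPPLIER_PRIORITY` ranking and records which supplier each value came from; suppliers missing from the ranking are ignored.

```python
# Assumed ranking, most trusted first; in practice this could vary per field
SUPPLIER_PRIORITY = ["supplier_c", "supplier_b", "supplier_a"]

EMPTY_VALUES = (None, "", [])


def merge_records(records_by_supplier: dict) -> dict:
    """Merge normalized records for one hotel, field by field.

    For each field, the value from the highest-priority supplier that
    actually provides one wins. Returns the merged data plus its sources.
    """
    merged, sources = {}, {}
    for supplier in SUPPLIER_PRIORITY:
        record = records_by_supplier.get(supplier, {})
        for field, value in record.items():
            if field not in merged and value not in EMPTY_VALUES:
                merged[field] = value
                sources[field] = supplier
    return {"data": merged, "sources": sources}
```

Tracking the source per field makes it possible to explain later why a given address or star rating was chosen.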

Stage 6: Publishing

  • Write to production database
  • Update search indexes
  • Invalidate caches

Example pipeline code:

def normalize_hotel_pipeline(raw_data, supplier_id):
    # Stage 2: Parse (Stage 1, ingestion, has already stored raw_data)
    parsed = parse_supplier_format(raw_data, supplier_id)

    # Stage 3: Normalize fields
    normalized = {
        'name': normalize_hotel_name(parsed.name),
        'address': normalize_address(parsed.address),
        'location': normalize_coordinates(
            parsed.lat, parsed.lon, parsed.city, parsed.country
        ),
        'stars': normalize_star_rating(parsed.stars, parsed.star_scale),
        'amenities': [normalize_amenity(a) for a in parsed.amenities],
        'chain': normalize_chain(parsed.name)
    }

    # Stage 4: Match to master hotel
    master_id = match_hotel(
        name=normalized['name'],
        city=normalized['address']['city'],
        lat=normalized['location']['latitude'],
        lon=normalized['location']['longitude']
    )

    # Store the mapping; merging and publishing (Stages 5-6) run downstream
    store_mapping(master_id, supplier_id, parsed.id, normalized)

    return master_id

Data Quality Monitoring

Track normalization effectiveness:

Completeness

  • % of hotels with address
  • % with coordinates
  • % with star rating
  • % with chain/brand

Accuracy

  • % of coordinates validated by geocoding
  • % of amenities mapped to taxonomy
  • % of chain assignments verified

Consistency

  • Duplicate detection rate
  • Conflicting data resolution rate
  • Supplier agreement score

Example monitoring queries:

-- Completeness report
SELECT
  supplier_id,
  COUNT(*) as total_hotels,
  SUM(CASE WHEN latitude IS NOT NULL THEN 1 ELSE 0 END) as with_coords,
  SUM(CASE WHEN star_rating IS NOT NULL THEN 1 ELSE 0 END) as with_stars,
  SUM(CASE WHEN chain_code IS NOT NULL THEN 1 ELSE 0 END) as with_chain
FROM normalized_hotels
GROUP BY supplier_id;

-- Accuracy report
SELECT
  supplier_id,
  AVG(coordinate_confidence) as avg_coord_confidence,
  AVG(amenity_match_rate) as avg_amenity_match_rate
FROM data_quality_metrics
GROUP BY supplier_id;

Tools and Libraries

Address Parsing

  • libpostal: Statistical NLP library for parsing addresses (40+ languages)
  • pyap: Python address parsing library
  • Google Geocoding API: Validate and enrich addresses

Fuzzy Matching

  • rapidfuzz: Fast string matching (what mapping.travel uses)
  • thefuzz (formerly fuzzywuzzy): Python fuzzy string matching
  • dedupe: ML-powered deduplication library

Data Validation

  • Pydantic: Python data validation using type annotations
  • Cerberus: Lightweight schema validation
  • jsonschema: JSON Schema validation

Geocoding

  • Nominatim: Open-source geocoding (OpenStreetMap)
  • Pelias: Self-hosted geocoding engine
  • Google/Mapbox APIs: Commercial geocoding services

Best Practices

1. Preserve Raw Data

Always keep original supplier data:

  • Enables debugging ("why was this normalized this way?")
  • Allows re-normalization if logic changes
  • Supports supplier issue reporting

2. Make Normalization Idempotent

Running normalization twice should produce the same result:

  • No random IDs or timestamps in output
  • Deterministic matching algorithms
  • Version control for normalization rules
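One way to keep output deterministic is to derive identifiers from the input instead of generating them randomly, and to fingerprint the normalized record so that re-runs can be compared. The sketch below uses `uuid.uuid5` and a SHA-256 hash over canonical JSON. The namespace, the function names, and the `hotels.example.com` label are placeholders chosen for this example.

```python
import hashlib
import json
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes
HOTEL_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "hotels.example.com")


def mapping_key(supplier_id: str, supplier_hotel_id: str) -> str:
    """Deterministic ID: the same supplier record always yields the same key."""
    return str(uuid.uuid5(HOTEL_NAMESPACE, f"{supplier_id}:{supplier_hotel_id}"))


def output_fingerprint(normalized: dict) -> str:
    """Hash of the normalized record, independent of dict key order."""
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If two runs over the same raw data with the same rule version produce different fingerprints, something in the pipeline is not idempotent.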

3. Log Transformations

Track what changed during normalization:

{
  "hotel_id": "HOTEL_00001",
  "transformations": [
    {"field": "name", "from": "HILTON PARIS OPERA", "to": "Hilton Paris Opera"},
    {"field": "country_code", "from": "France", "to": "FR"},
    {"field": "stars", "from": "4", "to": 4.0}
  ]
}
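A log like the one above can be produced by comparing each field before and after normalization. The sketch below is a hypothetical `log_transformations` helper, and it assumes the raw values have already been keyed by canonical field names during parsing.

```python
def log_transformations(hotel_id: str, raw_fields: dict, normalized_fields: dict) -> dict:
    """Record every field whose value or type changed during normalization."""
    changes = []
    for field in sorted(normalized_fields):
        before = raw_fields.get(field)
        after = normalized_fields[field]
        # Compare types as well as values so that "4" -> 4.0 is recorded
        if type(before) is not type(after) or before != after:
            changes.append({"field": field, "from": before, "to": after})
    return {"hotel_id": hotel_id, "transformations": changes}
```

Fields that pass through unchanged are left out, which keeps the log small enough to store alongside every record.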

4. Surface Confidence Scores

Not all normalized data is equally reliable:

  • High confidence: Verified coordinates, official star rating
  • Medium confidence: Fuzzy-matched chain, parsed amenities
  • Low confidence: Missing data filled with defaults

5. Enable Manual Overrides

Sometimes automated normalization is wrong:

  • Allow data team to manually correct
  • Store override reason
  • Don't overwrite on re-normalization
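One way to keep overrides from being overwritten is to store them separately from normalized data and apply them as the final step of every run. The sketch below assumes a hypothetical override record with `field`, `value`, and `reason` keys.

```python
def apply_overrides(normalized: dict, overrides: list[dict]) -> dict:
    """Return a copy of the normalized record with manual corrections applied.

    Overrides live in their own store, so re-running normalization
    regenerates `normalized` but never touches the corrections themselves.
    """
    result = dict(normalized)
    applied = []
    for override in overrides:
        field = override["field"]
        result[field] = override["value"]
        # Keep the reason next to the data for later audits
        applied.append({"field": field, "reason": override["reason"]})
    result["overrides_applied"] = applied
    return result
```

Because the input record is copied rather than mutated, the automated output stays available for comparison against the corrected version.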

Normalization as a Service

Building robust normalization is complex. mapping.travel provides:

  • Pre-normalized hotel database: 1M+ hotels, standardized format
  • Normalization API: Send raw data, receive normalized output
  • Matching included: Automatic deduplication and master ID assignment
  • Continuous updates: New hotels, closures, rebranding

Example API call:

curl -X POST https://api.mapping.travel/v1/normalize \
  -H "Content-Type: application/json" \
  -d '{
    "name": "HILTON PARIS OPERA",
    "address": "108 RUE SAINT LAZARE",
    "city": "PARIS",
    "country": "FR",
    "latitude": "48.876100",
    "longitude": "2.326600"
  }'

Response:

{
  "master_id": "HOTEL_00001",
  "normalized": {
    "name": "Hilton Paris Opera",
    "address": {
      "street": "108 Rue Saint-Lazare",
      "city": "Paris",
      "postal_code": "75008",
      "country_code": "FR"
    },
    "location": {
      "latitude": 48.8761,
      "longitude": 2.3266
    }
  },
  "confidence": 0.98
}

Get Started

Ready to normalize your hotel data?

Clean, consistent data is the foundation of reliable travel applications. Let's build it together.


Questions about hotel data normalization? Join our Discord community or email hello@mapping.travel.