Engineering · January 27, 2026 · 9 min read

Why Hotel Data Normalization Matters in Travel APIs

By Engineering Team

If you're building a travel application that aggregates hotel data from multiple sources, you've probably discovered a painful truth: supplier data is wildly inconsistent.

Different formats, different naming conventions, different schemas, different quality standards. Without normalization, your API becomes an unreliable mess.

Here's why hotel data normalization matters and how to do it right.

The Problem: Raw Supplier Data Is Chaos

Let's look at the same hotel from three different suppliers:

Supplier A (Bedbank)

{
  "hotel_id": "HTL_12345",
  "hotel_name": "HILTON PARIS OPERA",
  "address": "108 RUE SAINT LAZARE",
  "city": "PARIS",
  "country": "FR",
  "latitude": "48.876100",
  "longitude": "2.326600",
  "star_rating": "4",
  "amenities": "WIFI,PARKING,RESTAURANT"
}

Supplier B (GDS)

{
  "propertyCode": "PAR_HIL_987",
  "name": "Hilton Paris Opera",
  "addr1": "108 Rue Saint-Lazare",
  "addr2": "",
  "cityName": "Paris",
  "countryCode": "France",
  "lat": 48.8761,
  "lon": 2.3266,
  "category": "UPSCALE",
  "facilities": ["Free WiFi", "Parking Available", "On-site Dining"]
}

Supplier C (Direct Connect)

{
  "id": "5678-FR",
  "hotelName": "Hilton Opera Paris",
  "location": {
    "street": "108 rue Saint Lazare",
    "city": "Paris",
    "postalCode": "75008",
    "country": "FRA",
    "coordinates": {
      "latitude": 48.876100,
      "longitude": 2.326600
    }
  },
  "classification": {
    "stars": 4,
    "category": "Premium"
  },
  "services": {
    "wifi": true,
    "parking": {"available": true, "type": "Paid"},
    "food_beverage": {"restaurant": true, "bar": true}
  }
}

Same hotel. Completely different formats.

What Data Normalization Solves

Normalization transforms disparate data into a consistent, canonical format:

{
  "id": "HOTEL_00001",
  "name": "Hilton Paris Opera",
  "address": {
    "street": "108 Rue Saint-Lazare",
    "city": "Paris",
    "postal_code": "75008",
    "country_code": "FR",
    "country_name": "France"
  },
  "location": {
    "latitude": 48.8761,
    "longitude": 2.3266
  },
  "classification": {
    "stars": 4,
    "category": "upscale"
  },
  "amenities": [
    {"code": "WIFI", "name": "Free WiFi"},
    {"code": "PARKING", "name": "Parking (Paid)"},
    {"code": "RESTAURANT", "name": "Restaurant"}
  ],
  "supplier_mappings": [
    {"supplier": "supplier_a", "id": "HTL_12345"},
    {"supplier": "supplier_b", "id": "PAR_HIL_987"},
    {"supplier": "supplier_c", "id": "5678-FR"}
  ]
}

This enables:

Consistent API responses
Reliable filtering and search
Accurate deduplication
Cross-supplier comparison
Easier downstream processing

Key Normalization Challenges

1. Name Standardization

Problems:

ALL CAPS vs. Title Case vs. Sentence case
Special characters: "Hôtel" vs "Hotel"
Chain prefixes: "Hilton - Paris Opera" vs "Hilton Paris Opera"
Word order: "Opera Paris Hilton" vs "Hilton Paris Opera"

Solution:

Define canonical format (e.g., Title Case)
Strip chain prefixes
Normalize Unicode (NFD → NFC)
Store variations for search, display canonical

Example normalization logic:

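import re
import unicodedata

# Chain prefixes to strip, matched case-insensitively at the start of the name
CHAIN_PREFIXES = ["Hilton Hotels -", "Marriott -", "IHG -"]

def normalize_hotel_name(name):
    # Normalize Unicode first so composed and decomposed forms compare equal
    name = unicodedata.normalize('NFC', name)

    # Remove chain prefixes before changing case, so "IHG -" still matches
    for prefix in CHAIN_PREFIXES:
        name = re.sub(r'^\s*' + re.escape(prefix), '', name, flags=re.IGNORECASE)

    # Remove extra whitespace
    name = ' '.join(name.split())

    # Convert to title case (note: str.title() also capitalizes after apostrophes)
    return name.title()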
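The "store variations for search" step can be sketched as a separate search key kept next to the canonical display name. This is a minimal sketch; `search_key` is a hypothetical helper name, not something the pipeline above defines. It decomposes to NFD, drops the combining marks, and case-folds, so "Hôtel" and "Hotel" end up with the same key.

```python
import unicodedata

def search_key(name: str) -> str:
    """Build an accent- and case-insensitive key for search (hypothetical helper)."""
    # NFD splits "ô" into "o" plus a combining circumflex
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks, keeping only the base characters
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower(); split/join collapses whitespace
    return " ".join(stripped.casefold().split())
```

Index on the search key and display the canonical name, so accents survive in the UI but do not break lookups.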

2. Address Formatting

Problems:

Different field structures (street1/street2 vs. single line)
Abbreviations: "St" vs "Street", "Ave" vs "Avenue"
Postal codes: with/without spaces ("75008" vs "75 008")
Country codes: ISO-2 ("FR") vs ISO-3 ("FRA") vs full name ("France")

Solution:

Parse into standard components: street, city, postal_code, country
Expand abbreviations
Standardize country codes (use ISO-2 as canonical)
Validate against geocoding APIs when possible

Example:

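def normalize_address(raw_address):
    # parse_address, expand_abbreviations, to_iso2 and COUNTRIES are
    # application-specific helpers that are not shown here

    # Parse components
    parsed = parse_address(raw_address)

    # Expand abbreviations
    street = expand_abbreviations(parsed.street)

    # Standardize country code
    country_code = to_iso2(parsed.country)

    # Some suppliers omit the postal code entirely, so guard against None
    postal_code = (parsed.postal_code or '').replace(' ', '') or None

    return {
        'street': street,
        'city': parsed.city.title(),
        'postal_code': postal_code,
        'country_code': country_code,
        'country_name': COUNTRIES[country_code]
    }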
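The example calls `to_iso2` without defining it. One possible shape for that helper is sketched below, assuming small lookup tables; `ISO3_TO_ISO2` and `NAME_TO_ISO2` are illustrative names, and a real table would cover every ISO 3166-1 entry plus common spelling variants.

```python
# Tiny illustrative tables (assumption: real ones cover all ISO 3166-1 entries)
ISO3_TO_ISO2 = {"FRA": "FR", "GBR": "GB", "DEU": "DE", "USA": "US"}
NAME_TO_ISO2 = {
    "france": "FR",
    "united kingdom": "GB",
    "germany": "DE",
    "united states": "US",
}

def to_iso2(country: str) -> str:
    """Map an ISO-2 code, ISO-3 code, or English country name to ISO-2."""
    value = country.strip()
    # Two letters: assume it is already ISO-2 (not validated against a full list here)
    if len(value) == 2 and value.isalpha():
        return value.upper()
    # Three letters: try the ISO-3 table
    if len(value) == 3 and value.upper() in ISO3_TO_ISO2:
        return ISO3_TO_ISO2[value.upper()]
    # Otherwise treat it as a country name
    key = value.casefold()
    if key in NAME_TO_ISO2:
        return NAME_TO_ISO2[key]
    # Fail loudly rather than guess, so bad supplier data gets noticed
    raise ValueError(f"Unrecognized country: {country!r}")
```

With this, the three supplier values from the opening example ("FR", "France", "FRA") all resolve to "FR".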

3. Coordinate Precision

Problems:

Different precision: 48.876100 vs 48.8761 vs 48.88
String vs. number: "48.876100" vs 48.8761
Reversed lat/long: some suppliers swap them
Incorrect coordinates: hotel in Paris with London coordinates

Solution:

Standardize to 6 decimal places (~0.1 meter precision)
Always store as floats
Validate: latitude (-90 to 90), longitude (-180 to 180)
Cross-check with city/country (flag if > 100km away)

Example:

def normalize_coordinates(lat, lon, city, country):
    # Convert to float (suppliers may send strings) and round to 6 decimal places
    lat = round(float(lat), 6)
    lon = round(float(lon), 6)

    # Validate ranges
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"Invalid coordinates: {lat}, {lon}")

    # Validate against city. geocode and haversine are assumed helpers:
    # a geocoding lookup and a great-circle distance in kilometers
    expected_location = geocode(f"{city}, {country}")
    distance = haversine(lat, lon, expected_location.lat, expected_location.lon)

    if distance > 100:  # More than 100km from city center
        log_warning(f"Suspicious coordinates: {distance:.0f}km from {city}")

    # Return keys that match the canonical "location" object
    return {'latitude': lat, 'longitude': lon}
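The example relies on a `haversine` helper and does not handle the swapped lat/long case from the problem list. Here is a minimal sketch of both. The haversine formula itself is standard; `looks_swapped` and its 100 km default are assumptions for illustration, not part of the original pipeline.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # min() guards against tiny floating-point overshoot above 1.0
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def looks_swapped(lat, lon, expected_lat, expected_lon, threshold_km=100):
    """Heuristic: True if swapping lat/lon moves the point close to the expected location."""
    if not (-90 <= lon <= 90):
        return False  # the swapped value would not be a valid latitude
    as_given = haversine(lat, lon, expected_lat, expected_lon)
    swapped = haversine(lon, lat, expected_lat, expected_lon)
    return as_given > threshold_km and swapped <= threshold_km
```

A record flagged by `looks_swapped` is a good candidate for manual review rather than a silent fix, since a wrong auto-correction is harder to spot later.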

4. Star Ratings

Problems:

Different scales: 1-5 stars, 1-7 stars, letter grades (A-E)
Half stars: 4.5 represented as "4.5" or "4" or "5"
Unofficial ratings: "3-star equivalent"
Missing data: null vs 0 vs -1

Solution:

Standardize to 0-5 scale (allowing half stars)
Map alternative scales (A→5, B→4, etc.)
Store "official" flag (true if government-certified)
Handle null explicitly (unknown vs. unrated)

Example:

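def normalize_star_rating(rating, scale_type):
    # Missing data arrives as None, an empty string, 0, or -1
    if rating is None or str(rating).strip() == '':
        return None

    if scale_type == 'letter':
        letter_map = {'A': 5.0, 'B': 4.0, 'C': 3.0, 'D': 2.0, 'E': 1.0}
        return letter_map.get(str(rating).strip().upper())  # None if unrecognized

    value = float(rating)
    if value <= 0:  # 0 and -1 are sentinels for "no rating"
        return None
    if scale_type == '1-7':
        # Rescale proportionally onto the 5-point scale
        value = value / 7 * 5

    # Snap to the nearest half star and cap at 5
    return min(5.0, round(value * 2) / 2)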

5. Amenities

Problems:

Freeform text: "WIFI,PARKING,RESTAURANT"
Verbose descriptions: "Complimentary wireless internet access throughout the property"
Different codes: "WIFI" vs "FREE_WIFI" vs "INTERNET"
Inconsistent detail: "Parking" vs "Parking (Paid)" vs "Valet Parking ($25/day)"

Solution:

Define standard amenity taxonomy
Map supplier codes to canonical codes
Extract metadata (free/paid, availability, price)
Store both code (for filtering) and display name (for UI)

Example taxonomy:

AMENITIES = {
    'WIFI': {
        'name': 'WiFi',
        'category': 'internet',
        'variants': ['FREE_WIFI', 'WIRELESS', 'INTERNET', 'WIFI_FREE']
    },
    'PARKING': {
        'name': 'Parking',
        'category': 'parking',
        'variants': ['CAR_PARK', 'PARKING_LOT', 'VALET']
    },
    # ... more amenities
}

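def normalize_amenity(raw_amenity):
    # Normalize the raw value so "Free WiFi" and "free-wifi" both become "FREE_WIFI"
    key = raw_amenity.strip().upper().replace(' ', '_').replace('-', '_')

    # Find matching canonical amenity, by canonical code or by any known variant
    for code, meta in AMENITIES.items():
        if key == code or key in meta['variants']:
            return {
                'code': code,
                'name': meta['name'],
                'category': meta['category']
            }

    # Unknown amenity
    return {
        'code': 'OTHER',
        'name': raw_amenity.strip().title(),
        'category': 'other'
    }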
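The "extract metadata" step (free/paid, price) is not covered by the taxonomy lookup. A rough sketch of how it could work on free-text amenity strings follows. `extract_amenity_metadata`, the keyword list, and the supported currency symbols are all assumptions; real supplier text will need more patterns.

```python
import re

# Matches a price such as "$25" or "€ 12.50"; the currency symbol itself is discarded
PRICE_RE = re.compile(r"[$€£]\s?(\d+(?:\.\d+)?)")

def extract_amenity_metadata(raw: str) -> dict:
    """Pull free/paid and price hints out of a free-text amenity (heuristic sketch)."""
    text = raw.casefold()
    price_match = PRICE_RE.search(raw)
    price = float(price_match.group(1)) if price_match else None

    if price is not None or "paid" in text or "surcharge" in text:
        is_free = False
    elif "free" in text or "complimentary" in text:
        is_free = True
    else:
        is_free = None  # the text does not say either way

    return {"is_free": is_free, "price": price}
```

Keeping `None` distinct from `False` matters here: "Parking" with no qualifier should not be shown as paid parking.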

6. Chain and Brand

Problems:

Inconsistent naming: "Hilton" vs "Hilton Hotels" vs "Hilton Worldwide"
Brands vs. chains: "Hilton Garden Inn" (brand) vs "Hilton" (chain)
Acquisitions: Starwood → Marriott (legacy data still says Starwood)
Missing data: Independent hotels with no chain

Solution:

Maintain chain/brand hierarchy
Normalize chain names
Map legacy brands to current owners
Allow null for independent hotels

Example:

CHAINS = {
    'HILTON': {
        'name': 'Hilton',
        'brands': [
            'Hilton Hotels & Resorts',
            'Hilton Garden Inn',
            'Hampton by Hilton',
            'DoubleTree by Hilton'
        ]
    },
    'MARRIOTT': {
        'name': 'Marriott International',
        'brands': [
            'Marriott Hotels',
            'Sheraton',  # Acquired from Starwood
            'Westin',    # Acquired from Starwood
            'Courtyard by Marriott'
        ]
    }
}

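def normalize_chain(hotel_name):
    name = hotel_name.lower()
    for chain_code, chain_meta in CHAINS.items():
        # Prefer a specific brand match, longest brand name first
        for brand in sorted(chain_meta['brands'], key=len, reverse=True):
            if brand.lower() in name:
                return {
                    'chain_code': chain_code,
                    'chain_name': chain_meta['name'],
                    'brand_name': brand
                }
        # Fall back to the chain keyword, e.g. "Hilton Paris Opera"
        if chain_code.lower() in name:
            return {
                'chain_code': chain_code,
                'chain_name': chain_meta['name'],
                'brand_name': None
            }

    return None  # No chain detected; treat as independent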

Building a Normalization Pipeline

Stage 1: Ingestion

Fetch raw data from suppliers (API, FTP, database)
Store raw JSON/XML in staging database
Preserve original data for debugging

Stage 2: Parsing

Convert to internal representation
Extract fields based on supplier schema
Handle missing fields gracefully
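One way to keep parsing declarative is a per-supplier field map with dotted paths. The sketch below uses the field names from the three sample payloads at the top of this post; `FIELD_MAPS`, `get_path`, and `parse_supplier_record` are hypothetical names, and missing fields come back as `None` instead of raising.

```python
# Internal field -> path in the supplier payload (dots walk into nested objects)
FIELD_MAPS = {
    "supplier_a": {"id": "hotel_id", "name": "hotel_name", "city": "city",
                   "country": "country", "lat": "latitude", "lon": "longitude"},
    "supplier_b": {"id": "propertyCode", "name": "name", "city": "cityName",
                   "country": "countryCode", "lat": "lat", "lon": "lon"},
    "supplier_c": {"id": "id", "name": "hotelName", "city": "location.city",
                   "country": "location.country",
                   "lat": "location.coordinates.latitude",
                   "lon": "location.coordinates.longitude"},
}

def get_path(record, path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def parse_supplier_record(raw: dict, supplier: str) -> dict:
    """Extract the internal fields for one supplier record, without normalizing values."""
    return {field: get_path(raw, path) for field, path in FIELD_MAPS[supplier].items()}
```

Adding a supplier then means adding one map entry rather than another parsing function.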

Stage 3: Normalization

Apply transformations (case, formatting)
Validate data types and ranges
Enrich with external data (geocoding, chain lookup)

Stage 4: Matching

Fuzzy match hotel names
Validate with geographic proximity
Assign master hotel ID
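A minimal sketch of the name-plus-proximity check follows. It uses the standard library's `difflib` as a stand-in for a dedicated fuzzy-matching library so it runs without dependencies. The function names and both thresholds (0.85 similarity, 0.5 km) are assumptions to tune against your own data, and `distance_km` is expected to come from a great-circle distance such as the `haversine` call in the coordinate example.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Order-insensitive similarity in [0, 1]: sort the tokens, then compare."""
    key_a = " ".join(sorted(a.casefold().split()))
    key_b = " ".join(sorted(b.casefold().split()))
    return SequenceMatcher(None, key_a, key_b).ratio()

def is_same_hotel(name_a, name_b, distance_km,
                  min_similarity=0.85, max_distance_km=0.5):
    """Treat two records as the same hotel only if names AND locations agree."""
    return (name_similarity(name_a, name_b) >= min_similarity
            and distance_km <= max_distance_km)
```

Sorting tokens first makes "Hilton Paris Opera" and "Hilton Opera Paris" identical, while the distance check keeps two same-named hotels in different parts of a city apart.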

Stage 5: Merging

Combine data from multiple suppliers for same hotel
Resolve conflicts (different addresses, different star ratings)
Choose best data based on supplier priority
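Priority-based merging can be sketched as "first non-empty value wins", walking suppliers from most to least trusted. The ranking in `SUPPLIER_PRIORITY` is purely illustrative, and `merge_records` is a hypothetical helper that assumes flat, already-normalized records.

```python
# Hypothetical ranking, most trusted first
SUPPLIER_PRIORITY = ["supplier_c", "supplier_b", "supplier_a"]

def merge_records(records: dict) -> dict:
    """Merge {supplier: record} into one record, remembering where each field came from."""
    merged, sources = {}, {}
    for supplier in SUPPLIER_PRIORITY:
        for field, value in records.get(supplier, {}).items():
            # Take the first non-empty value; lower-priority suppliers only fill gaps
            if field not in merged and value not in (None, "", []):
                merged[field] = value
                sources[field] = supplier
    return {"data": merged, "sources": sources}
```

Recording the source per field makes conflict resolution auditable: when a star rating looks wrong, you can see which supplier it came from.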

Stage 6: Publishing

Write to production database
Update search indexes
Invalidate caches

Example pipeline code:

def normalize_hotel_pipeline(raw_data, supplier_id):
    # Stage 2: Parse (raw_data was fetched and staged in Stage 1)
    parsed = parse_supplier_format(raw_data, supplier_id)

    # Stage 3: Normalize fields
    normalized = {
        'name': normalize_hotel_name(parsed.name),
        'address': normalize_address(parsed.address),
        'location': normalize_coordinates(
            parsed.lat, parsed.lon, parsed.city, parsed.country
        ),
        'stars': normalize_star_rating(parsed.stars, parsed.star_scale),
        'amenities': [normalize_amenity(a) for a in parsed.amenities],
        'chain': normalize_chain(parsed.name)
    }

    # Stage 4: Match to master hotel
    master_id = match_hotel(
        name=normalized['name'],
        city=normalized['address']['city'],
        lat=normalized['location']['latitude'],
        lon=normalized['location']['longitude']
    )

    # Store the supplier mapping so Stage 5 can merge records per master hotel
    store_mapping(master_id, supplier_id, parsed.id, normalized)

    return master_id

Data Quality Monitoring

Track normalization effectiveness:

Completeness

% of hotels with address
% with coordinates
% with star rating
% with chain/brand

Accuracy

% of coordinates validated by geocoding
% of amenities mapped to taxonomy
% of chain assignments verified

Consistency

Duplicate detection rate
Conflicting data resolution rate
Supplier agreement score

Example queries behind a monitoring dashboard:

-- Completeness report
SELECT
  supplier_id,
  COUNT(*) as total_hotels,
  SUM(CASE WHEN latitude IS NOT NULL THEN 1 ELSE 0 END) as with_coords,
  SUM(CASE WHEN star_rating IS NOT NULL THEN 1 ELSE 0 END) as with_stars,
  SUM(CASE WHEN chain_code IS NOT NULL THEN 1 ELSE 0 END) as with_chain
FROM normalized_hotels
GROUP BY supplier_id;

-- Accuracy report
SELECT
  supplier_id,
  AVG(coordinate_confidence) as avg_coord_confidence,
  AVG(amenity_match_rate) as avg_amenity_match_rate
FROM data_quality_metrics
GROUP BY supplier_id;

Tools and Libraries

Address Parsing

libpostal: Statistical NLP library for parsing addresses (40+ languages)
pyap: Python address parsing library
Google Geocoding API: Validate and enrich addresses

Fuzzy Matching

rapidfuzz: Fast string matching (what mapping.travel uses)
fuzzywuzzy (now published as thefuzz): Python fuzzy string matching
dedupe: ML-powered deduplication library

Data Validation

Pydantic: Python data validation using type annotations
Cerberus: Lightweight schema validation
jsonschema: JSON Schema validation

Geocoding

Nominatim: Open-source geocoding (OpenStreetMap)
Pelias: Self-hosted geocoding engine
Google/Mapbox APIs: Commercial geocoding services

Best Practices

1. Preserve Raw Data

Always keep original supplier data:

Enables debugging ("why was this normalized this way?")
Allows re-normalization if logic changes
Supports supplier issue reporting

2. Make Normalization Idempotent

Running normalization twice should produce the same result:

No random IDs or timestamps in output
Deterministic matching algorithms
Version control for normalization rules
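For the "no random IDs" point, a name-based UUID is one option: the same input always yields the same ID. This is a sketch; the namespace string `hotels.example.com` and the `mapping_id` helper are placeholders, not identifiers used by any real system.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "hotels.example.com")

def mapping_id(supplier: str, supplier_hotel_id: str) -> str:
    """Derive a stable ID from the supplier and its hotel ID (no randomness involved)."""
    return str(uuid.uuid5(NAMESPACE, f"{supplier}:{supplier_hotel_id}"))
```

Re-running the pipeline on the same supplier record then produces the same ID, so downstream tables can be upserted instead of duplicated.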

3. Log Transformations

Track what changed during normalization:

{
  "hotel_id": "HOTEL_00001",
  "transformations": [
    {"field": "name", "from": "HILTON PARIS OPERA", "to": "Hilton Paris Opera"},
    {"field": "country_code", "from": "France", "to": "FR"},
    {"field": "stars", "from": "4", "to": 4.0}
  ]
}
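A log like this can be produced by diffing the raw and normalized records. The sketch below assumes both are flat dicts that share field names, which is a simplification, since a renamed field such as `country` to `country_code` would need an explicit mapping. `diff_fields` is a hypothetical helper.

```python
def diff_fields(hotel_id: str, raw: dict, normalized: dict) -> dict:
    """List every shared field whose value changed during normalization."""
    transformations = [
        {"field": field, "from": raw[field], "to": value}
        for field, value in normalized.items()
        # Only log fields present in the raw record whose value actually changed
        if field in raw and raw[field] != value
    ]
    return {"hotel_id": hotel_id, "transformations": transformations}
```

Type changes are captured too: the string "4" and the float 4.0 compare as different, so the star rating conversion shows up in the log.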

4. Surface Confidence Scores

Not all normalized data is equally reliable:

High confidence: Verified coordinates, official star rating
Medium confidence: Fuzzy-matched chain, parsed amenities
Low confidence: Missing data filled with defaults

5. Enable Manual Overrides

Sometimes automated normalization is wrong:

Allow data team to manually correct
Store override reason
Don't overwrite on re-normalization
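One way to keep overrides from being lost is to store them separately and apply them as the last step of every normalization run. This is a minimal sketch; the override record shape (`field`, `value`, `reason`) and the `apply_overrides` helper are assumptions.

```python
def apply_overrides(normalized: dict, overrides: list) -> dict:
    """Return a copy of the normalized record with manual corrections applied on top."""
    result = dict(normalized)  # shallow copy, so the automated output stays untouched
    for override in overrides:
        # Each override carries a "reason" for auditing; only field and value are applied
        result[override["field"]] = override["value"]
    return result
```

Because the overrides live outside the normalized record, re-running normalization regenerates the automated values and the corrections are simply layered on again.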

Normalization as a Service

Building robust normalization is complex. mapping.travel provides:

Pre-normalized hotel database: 1M+ hotels, standardized format
Normalization API: Send raw data, receive normalized output
Matching included: Automatic deduplication and master ID assignment
Continuous updates: New hotels, closures, rebranding

Example API call:

curl -X POST https://api.mapping.travel/v1/normalize \
  -H "Content-Type: application/json" \
  -d '{
    "name": "HILTON PARIS OPERA",
    "address": "108 RUE SAINT LAZARE",
    "city": "PARIS",
    "country": "FR",
    "latitude": "48.876100",
    "longitude": "2.326600"
  }'

Response:

{
  "master_id": "HOTEL_00001",
  "normalized": {
    "name": "Hilton Paris Opera",
    "address": {
      "street": "108 Rue Saint-Lazare",
      "city": "Paris",
      "postal_code": "75008",
      "country_code": "FR"
    },
    "location": {
      "latitude": 48.8761,
      "longitude": 2.3266
    }
  },
  "confidence": 0.98
}

Get Started

Ready to normalize your hotel data?

Try the normalization API - Free tier available
View the schema docs - Standard data model
Explore the codebase - Open-source normalization pipeline

Clean, consistent data is the foundation of reliable travel applications. Let's build it together.

Questions about hotel data normalization? Join our Discord community or email hello@mapping.travel.
