Why Hotel Data Normalization Matters in Travel APIs
By Engineering Team
If you're building a travel application that aggregates hotel data from multiple sources, you've discovered a painful truth: supplier data is wildly inconsistent.
Different formats, different naming conventions, different schemas, different quality standards. Without normalization, your API becomes an unreliable mess.
Here's why hotel data normalization matters and how to do it right.
The Problem: Raw Supplier Data Is Chaos
Let's look at the same hotel from three different suppliers:
Supplier A (Bedbank)
{
  "hotel_id": "HTL_12345",
  "hotel_name": "HILTON PARIS OPERA",
  "address": "108 RUE SAINT LAZARE",
  "city": "PARIS",
  "country": "FR",
  "latitude": "48.876100",
  "longitude": "2.326600",
  "star_rating": "4",
  "amenities": "WIFI,PARKING,RESTAURANT"
}
Supplier B (GDS)
{
  "propertyCode": "PAR_HIL_987",
  "name": "Hilton Paris Opera",
  "addr1": "108 Rue Saint-Lazare",
  "addr2": "",
  "cityName": "Paris",
  "countryCode": "France",
  "lat": 48.8761,
  "lon": 2.3266,
  "category": "UPSCALE",
  "facilities": ["Free WiFi", "Parking Available", "On-site Dining"]
}
Supplier C (Direct Connect)
{
  "id": "5678-FR",
  "hotelName": "Hilton Opera Paris",
  "location": {
    "street": "108 rue Saint Lazare",
    "city": "Paris",
    "postalCode": "75008",
    "country": "FRA",
    "coordinates": {
      "latitude": 48.876100,
      "longitude": 2.326600
    }
  },
  "classification": {
    "stars": 4,
    "category": "Premium"
  },
  "services": {
    "wifi": true,
    "parking": {"available": true, "type": "Paid"},
    "food_beverage": {"restaurant": true, "bar": true}
  }
}
Same hotel. Completely different formats.
What Data Normalization Solves
Normalization transforms disparate data into a consistent, canonical format:
{
  "id": "HOTEL_00001",
  "name": "Hilton Paris Opera",
  "address": {
    "street": "108 Rue Saint-Lazare",
    "city": "Paris",
    "postal_code": "75008",
    "country_code": "FR",
    "country_name": "France"
  },
  "location": {
    "latitude": 48.8761,
    "longitude": 2.3266
  },
  "classification": {
    "stars": 4,
    "category": "upscale"
  },
  "amenities": [
    {"code": "WIFI", "name": "Free WiFi"},
    {"code": "PARKING", "name": "Parking (Paid)"},
    {"code": "RESTAURANT", "name": "Restaurant"}
  ],
  "supplier_mappings": [
    {"supplier": "supplier_a", "id": "HTL_12345"},
    {"supplier": "supplier_b", "id": "PAR_HIL_987"},
    {"supplier": "supplier_c", "id": "5678-FR"}
  ]
}
This enables:
- Consistent API responses
- Reliable filtering and search
- Accurate deduplication
- Cross-supplier comparison
- Easier downstream processing
Key Normalization Challenges
1. Name Standardization
Problems:
- ALL CAPS vs. Title Case vs. Sentence case
- Special characters: "Hôtel" vs "Hotel"
- Chain prefixes: "Hilton - Paris Opera" vs "Hilton Paris Opera"
- Word order: "Opera Paris Hilton" vs "Hilton Paris Opera"
Solution:
- Define canonical format (e.g., Title Case)
- Strip chain prefixes
- Normalize Unicode (NFD → NFC)
- Store variations for search, display canonical
Example normalization logic:
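import unicodedata

CHAIN_PREFIXES = ["Hilton Hotels -", "Marriott -", "IHG -"]

def normalize_hotel_name(name):
    # Normalize Unicode to the composed form (NFC)
    name = unicodedata.normalize('NFC', name)
    # Remove extra whitespace
    name = ' '.join(name.split())
    # Remove a leading chain prefix only; compare case-insensitively
    # because supplier casing varies ("IHG -" vs "Ihg -")
    for prefix in CHAIN_PREFIXES:
        if name[:len(prefix)].lower() == prefix.lower():
            name = name[len(prefix):].strip()
            break
    # Convert to title case (note: str.title() also capitalizes
    # the letter after an apostrophe)
    return name.title()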
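The "store variations for search" step can be sketched with two lookup keys: one that folds case and accents, and one that also ignores word order. This is a minimal sketch using only the standard library; `search_key` and `token_sort_key` are hypothetical helper names, not part of any existing codebase.

```python
import unicodedata

def search_key(name: str) -> str:
    """Return a lowercase, accent-free key for matching name variants."""
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base letters
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a stronger lower() intended for comparisons
    return " ".join(stripped.casefold().split())

def token_sort_key(name: str) -> str:
    """Like search_key, but with words sorted so word order is ignored."""
    return " ".join(sorted(search_key(name).split()))
```

With these keys, "Hôtel Opéra" and "HOTEL OPERA" share a search key, and "Opera Paris Hilton" and "Hilton Paris Opera" share a token-sorted key, while the canonical display name stays untouched.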
2. Address Formatting
Problems:
- Different field structures (street1/street2 vs. single line)
- Abbreviations: "St" vs "Street", "Ave" vs "Avenue"
- Postal codes: with/without spaces ("75008" vs "75 008")
- Country codes: ISO 3166-1 alpha-2 ("FR") vs alpha-3 ("FRA") vs full name ("France")
Solution:
- Parse into standard components: street, city, postal_code, country
- Expand abbreviations
- Standardize country codes (use the two-letter ISO 3166-1 alpha-2 code as canonical)
- Validate against geocoding APIs when possible
Example:
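def normalize_address(raw_address):
    # parse_address, expand_abbreviations, to_iso2 and COUNTRIES are
    # application-specific helpers that you supply
    # Parse components
    parsed = parse_address(raw_address)
    # Expand abbreviations
    street = expand_abbreviations(parsed.street)
    # Standardize country code
    country_code = to_iso2(parsed.country)
    # Strip spaces from the postal code; tolerate a missing value
    postal_code = (parsed.postal_code or '').replace(' ', '') or None
    return {
        'street': street,
        'city': parsed.city.title(),
        'postal_code': postal_code,
        'country_code': country_code,
        'country_name': COUNTRIES[country_code]
    }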
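The `to_iso2` helper used above is not defined in this article. One way it might look is a plain lookup that accepts two-letter codes, three-letter codes, and English country names. The tables here are deliberately tiny and illustrative; a real implementation would cover every ISO 3166-1 entry and probably localized names too.

```python
# Illustrative subsets only; a real table would cover all ISO 3166-1 entries.
ISO3_TO_ISO2 = {"FRA": "FR", "GBR": "GB", "DEU": "DE", "USA": "US"}
NAME_TO_ISO2 = {
    "france": "FR",
    "united kingdom": "GB",
    "germany": "DE",
    "united states": "US",
}

def to_iso2(country: str) -> str:
    """Map a 2-letter code, 3-letter code, or English name to a 2-letter code."""
    value = country.strip()
    upper = value.upper()
    # Already a known two-letter code
    if len(upper) == 2 and upper in ISO3_TO_ISO2.values():
        return upper
    # Three-letter code
    if upper in ISO3_TO_ISO2:
        return ISO3_TO_ISO2[upper]
    # Full country name, matched case-insensitively
    code = NAME_TO_ISO2.get(value.casefold())
    if code is None:
        # Fail loudly rather than guess, so bad supplier data gets noticed
        raise ValueError(f"Unknown country: {country!r}")
    return code
```

Applied to the three supplier records at the top of this article, "FR", "France", and "FRA" all resolve to "FR".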
3. Coordinate Precision
Problems:
- Different precision: 48.876100 vs 48.8761 vs 48.88
- String vs. number: "48.876100" vs 48.8761
- Reversed lat/long: some suppliers swap them
- Incorrect coordinates: hotel in Paris with London coordinates
Solution:
- Standardize to 6 decimal places (~0.1 meter precision)
- Always store as floats
- Validate: latitude (-90 to 90), longitude (-180 to 180)
- Cross-check with city/country (flag if > 100km away)
Example:
def normalize_coordinates(lat, lon, city, country):
    # geocode, haversine and log_warning are application-specific helpers
    # Convert to float (suppliers may send strings)
    lat = float(lat)
    lon = float(lon)
    # Round to 6 decimal places
    lat = round(lat, 6)
    lon = round(lon, 6)
    # Validate ranges
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"Invalid coordinates: {lat}, {lon}")
    # Validate against city (using geocoding service)
    expected_location = geocode(f"{city}, {country}")
    distance = haversine(lat, lon, expected_location.lat, expected_location.lon)
    if distance > 100:  # More than 100km from city center
        log_warning(f"Suspicious coordinates: {distance:.1f}km from {city}")
    return lat, lon
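The `haversine` helper called above can be written with the standard great-circle formula. The sketch below also adds a check for the "reversed lat/long" problem: if the coordinates are far from the expected location as given but close once swapped, they were probably reversed. `looks_swapped` and its 100 km default are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def looks_swapped(lat, lon, expected_lat, expected_lon, threshold_km=100):
    """True if swapping lat/lon moves the point within threshold_km of the expected location."""
    # A swapped pair is only plausible if lon is also a valid latitude
    if not -90 <= lon <= 90:
        return False
    as_given = haversine(lat, lon, expected_lat, expected_lon)
    swapped = haversine(lon, lat, expected_lat, expected_lon)
    return as_given > threshold_km and swapped <= threshold_km
```

A record flagged by `looks_swapped` is a good candidate for manual review rather than silent correction, since the supplier feed itself may need fixing.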
4. Star Ratings
Problems:
- Different scales: 1-5 stars, 1-7 stars, letter grades (A-E)
- Half stars: 4.5 represented as "4.5" or "4" or "5"
- Unofficial ratings: "3-star equivalent"
- Missing data: null vs 0 vs -1
Solution:
- Standardize to 0-5 scale (allowing half stars)
- Map alternative scales (A→5, B→4, etc.)
- Store "official" flag (true if government-certified)
- Handle null explicitly (unknown vs. unrated)
Example:
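def normalize_star_rating(rating, scale_type):
    if rating is None:
        return None
    # Convert to standard 0-5 scale
    if scale_type == 'letter':
        letter_map = {'A': 5.0, 'B': 4.0, 'C': 3.0, 'D': 2.0, 'E': 1.0}
        # Returns None for letters outside the map
        return letter_map.get(str(rating).strip().upper())
    value = float(rating)
    if value <= 0:  # Sentinels such as 0 or -1 mean "no rating"
        return None
    if scale_type == '1-7':
        value = (value / 7) * 5
    # Snap to the nearest half star so rescaled values stay on the scale
    return round(value * 2) / 2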
5. Amenities
Problems:
- Freeform text: "WIFI,PARKING,RESTAURANT"
- Verbose descriptions: "Complimentary wireless internet access throughout the property"
- Different codes: "WIFI" vs "FREE_WIFI" vs "INTERNET"
- Inconsistent detail: "Parking" vs "Parking (Paid)" vs "Valet Parking ($25/day)"
Solution:
- Define standard amenity taxonomy
- Map supplier codes to canonical codes
- Extract metadata (free/paid, availability, price)
- Store both code (for filtering) and display name (for UI)
Example taxonomy:
AMENITIES = {
    'WIFI': {
        'name': 'WiFi',
        'category': 'internet',
        'variants': ['FREE_WIFI', 'WIRELESS', 'INTERNET', 'WIFI_FREE']
    },
    'PARKING': {
        'name': 'Parking',
        'category': 'parking',
        'variants': ['CAR_PARK', 'PARKING_LOT', 'VALET']
    },
    # ... more amenities
}

def normalize_amenity(raw_amenity):
    key = raw_amenity.strip().upper()
    # Find matching canonical amenity; the canonical code itself
    # (e.g. "WIFI") must match as well as its variants
    for code, meta in AMENITIES.items():
        if key == code or key in meta['variants']:
            return {
                'code': code,
                'name': meta['name'],
                'category': meta['category']
            }
    # Unknown amenity
    return {
        'code': 'OTHER',
        'name': raw_amenity.strip().title(),
        'category': 'other'
    }
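The "extract metadata" step is not covered by the lookup above. A rough sketch of it, assuming free-text amenity strings like the ones listed under Problems, is shown here. `extract_amenity_metadata` is a hypothetical helper, and the pattern only recognizes dollar prices written as "$25/day"; other currencies and formats would need their own rules.

```python
import re

# Matches prices such as "$25/day" or "$ 12.50 / night" (dollar amounts only)
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d+)?)\s*/\s*(day|night|hour)", re.IGNORECASE)

def extract_amenity_metadata(text: str) -> dict:
    """Pull free/paid status and price out of a free-text amenity string."""
    lowered = text.lower()
    match = PRICE_RE.search(text)
    price = None
    if match:
        price = {"amount": float(match.group(1)), "per": match.group(2).lower()}
    if price or re.search(r"\bpaid\b", lowered):
        is_free = False
    elif re.search(r"\b(free|complimentary)\b", lowered):
        is_free = True
    else:
        is_free = None  # The text does not say either way
    return {"is_free": is_free, "price": price}
```

Keeping `is_free` as `None` when the text is silent avoids presenting a guess as a fact in the UI.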
6. Chain and Brand
Problems:
- Inconsistent naming: "Hilton" vs "Hilton Hotels" vs "Hilton Worldwide"
- Brands vs. chains: "Hilton Garden Inn" (brand) vs "Hilton" (chain)
- Acquisitions: Starwood → Marriott (legacy data still says Starwood)
- Missing data: Independent hotels with no chain
Solution:
- Maintain chain/brand hierarchy
- Normalize chain names
- Map legacy brands to current owners
- Allow null for independent hotels
Example:
CHAINS = {
    'HILTON': {
        'name': 'Hilton',
        'brands': [
            'Hilton Hotels & Resorts',
            'Hilton Garden Inn',
            'Hampton by Hilton',
            'DoubleTree by Hilton'
        ]
    },
    'MARRIOTT': {
        'name': 'Marriott International',
        'brands': [
            'Marriott Hotels',
            'Sheraton',  # Acquired from Starwood
            'Westin',    # Acquired from Starwood
            'Courtyard by Marriott'
        ]
    }
}

def normalize_chain(hotel_name):
    name = hotel_name.lower()
    for chain_code, chain_meta in CHAINS.items():
        result = {'chain_code': chain_code, 'chain_name': chain_meta['name']}
        # Check if any brand appears in hotel name
        for brand in chain_meta['brands']:
            if brand.lower() in name:
                return {**result, 'brand_name': brand}
        # Fall back to the chain's own name so that "Hilton Paris Opera"
        # is not treated as independent; the brand stays unknown
        if chain_meta['name'].split()[0].lower() in name:
            return {**result, 'brand_name': None}
    return None  # Independent hotel
Building a Normalization Pipeline
Stage 1: Ingestion
- Fetch raw data from suppliers (API, FTP, database)
- Store raw JSON/XML in staging database
- Preserve original data for debugging
Stage 2: Parsing
- Convert to internal representation
- Extract fields based on supplier schema
- Handle missing fields gracefully
Stage 3: Normalization
- Apply transformations (case, formatting)
- Validate data types and ranges
- Enrich with external data (geocoding, chain lookup)
Stage 4: Matching
- Fuzzy match hotel names
- Validate with geographic proximity
- Assign master hotel ID
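A matching step along these lines might look like the sketch below. It uses `difflib.SequenceMatcher` from the standard library so the example runs without extra dependencies; a library such as rapidfuzz would be the faster choice in production. The function name `find_master_match`, the record layout, and both thresholds are assumptions chosen for illustration and would need tuning against real data.

```python
import math
from difflib import SequenceMatcher

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, a)))

def name_similarity(a, b):
    """Similarity in [0, 1], comparing sorted tokens so word order is ignored."""
    key_a = " ".join(sorted(a.lower().split()))
    key_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, key_a, key_b).ratio()

def find_master_match(name, lat, lon, masters, min_similarity=0.85, max_km=0.5):
    """Return the id of the best-matching master hotel, or None.

    masters is a list of dicts with "id", "name", "latitude", "longitude".
    """
    best_id, best_score = None, 0.0
    for master in masters:
        # Geographic proximity acts as a hard filter
        if distance_km(lat, lon, master["latitude"], master["longitude"]) > max_km:
            continue
        score = name_similarity(name, master["name"])
        if score >= min_similarity and score > best_score:
            best_id, best_score = master["id"], score
    return best_id
```

When no candidate passes both checks, the function returns `None`, which is the signal to create a new master record instead of forcing a match.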
Stage 5: Merging
- Combine data from multiple suppliers for same hotel
- Resolve conflicts (different addresses, different star ratings)
- Choose best data based on supplier priority
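Priority-based merging can be sketched as "first non-empty value wins", walking suppliers from most to least trusted. The priority order below is an assumption made for this example, as is the `merge_records` name; recording which supplier supplied each field makes later debugging much easier.

```python
# Assumed trust order, most trusted first; adjust to your own suppliers.
SUPPLIER_PRIORITY = ["supplier_c", "supplier_b", "supplier_a"]

def merge_records(records_by_supplier, priority=SUPPLIER_PRIORITY):
    """Merge normalized records, taking each field from the highest-priority
    supplier that has a non-empty value for it.

    Returns (merged_record, source_supplier_per_field).
    """
    # Suppliers missing from the priority list go last, in a stable order
    ordered = list(priority) + sorted(
        s for s in records_by_supplier if s not in priority
    )
    merged, sources = {}, {}
    for supplier in ordered:
        record = records_by_supplier.get(supplier, {})
        for field, value in record.items():
            is_empty = value is None or value == "" or value == []
            if field not in merged and not is_empty:
                merged[field] = value
                sources[field] = supplier
    return merged, sources
```

A per-field priority (for example, trusting one supplier for coordinates and another for amenities) is a natural extension of the same loop.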
Stage 6: Publishing
- Write to production database
- Update search indexes
- Invalidate caches
Example pipeline code:
def normalize_hotel_pipeline(raw_data, supplier_id):
    # Stage 2: Parse (raw_data was already ingested in Stage 1)
    parsed = parse_supplier_format(raw_data, supplier_id)
    # Stage 3: Normalize fields
    # normalize_coordinates returns a (lat, lon) tuple
    lat, lon = normalize_coordinates(
        parsed.lat, parsed.lon, parsed.city, parsed.country
    )
    normalized = {
        'name': normalize_hotel_name(parsed.name),
        'address': normalize_address(parsed.address),
        'location': {'latitude': lat, 'longitude': lon},
        'stars': normalize_star_rating(parsed.stars, parsed.star_scale),
        'amenities': [normalize_amenity(a) for a in parsed.amenities],
        'chain': normalize_chain(parsed.name)
    }
    # Stage 4: Match to master hotel
    master_id = match_hotel(
        name=normalized['name'],
        city=normalized['address']['city'],
        lat=normalized['location']['latitude'],
        lon=normalized['location']['longitude']
    )
    # Store the supplier mapping for Stages 5-6 (merging and publishing)
    store_mapping(master_id, supplier_id, parsed.id, normalized)
    return master_id
Data Quality Monitoring
Track normalization effectiveness:
Completeness
- % of hotels with address
- % with coordinates
- % with star rating
- % with chain/brand
Accuracy
- % of coordinates validated by geocoding
- % of amenities mapped to taxonomy
- % of chain assignments verified
Consistency
- Duplicate detection rate
- Conflicting data resolution rate
- Supplier agreement score
Example monitoring queries:
-- Completeness report
SELECT
    supplier_id,
    COUNT(*) AS total_hotels,
    SUM(CASE WHEN latitude IS NOT NULL THEN 1 ELSE 0 END) AS with_coords,
    SUM(CASE WHEN star_rating IS NOT NULL THEN 1 ELSE 0 END) AS with_stars,
    SUM(CASE WHEN chain_code IS NOT NULL THEN 1 ELSE 0 END) AS with_chain
FROM normalized_hotels
GROUP BY supplier_id;
-- Accuracy report
SELECT
    supplier_id,
    AVG(coordinate_confidence) AS avg_coord_confidence,
    AVG(amenity_match_rate) AS avg_amenity_match_rate
FROM data_quality_metrics
GROUP BY supplier_id;
Tools and Libraries
Address Parsing
- libpostal: Statistical NLP library for parsing addresses (40+ languages)
- pyap: Python address parsing library
- Google Geocoding API: Validate and enrich addresses
Fuzzy Matching
- rapidfuzz: Fast string matching (what mapping.travel uses)
- fuzzywuzzy: Python fuzzy string matching (now maintained under the name thefuzz)
- dedupe: ML-powered deduplication library
Data Validation
- Pydantic: Python data validation using type annotations
- Cerberus: Lightweight schema validation
- jsonschema: JSON Schema validation
Geocoding
- Nominatim: Open-source geocoding (OpenStreetMap)
- Pelias: Self-hosted geocoding engine
- Google/Mapbox APIs: Commercial geocoding services
Best Practices
1. Preserve Raw Data
Always keep original supplier data:
- Enables debugging ("why was this normalized this way?")
- Allows re-normalization if logic changes
- Supports supplier issue reporting
2. Make Normalization Idempotent
Running normalization twice on the same input should produce the same result, and normalizing already-normalized data should change nothing:
- No random IDs or timestamps in output
- Deterministic matching algorithms
- Version control for normalization rules
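Two small helpers illustrate the idea. Where a generated identifier is needed, a name-based UUID (`uuid.uuid5`) derives it from the input instead of randomness, so reruns yield the same value. The namespace string and both function names are hypothetical, chosen only for this sketch.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
MAPPING_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "hotels.example.com")

def mapping_id(supplier_id: str, supplier_hotel_id: str) -> str:
    """Deterministic ID: the same supplier record always gets the same value."""
    return str(uuid.uuid5(MAPPING_NAMESPACE, f"{supplier_id}:{supplier_hotel_id}"))

def is_idempotent(normalize, value) -> bool:
    """Check that normalizing already-normalized output changes nothing."""
    once = normalize(value)
    return normalize(once) == once
```

A check like `is_idempotent` fits well in a unit test suite, run against a sample of real supplier values for each normalization function.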
3. Log Transformations
Track what changed during normalization:
{
  "hotel_id": "HOTEL_00001",
  "transformations": [
    {"field": "name", "from": "HILTON PARIS OPERA", "to": "Hilton Paris Opera"},
    {"field": "country_code", "from": "France", "to": "FR"},
    {"field": "stars", "from": "4", "to": 4.0}
  ]
}
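A log in this shape can be produced by diffing the raw and normalized values field by field. The sketch below assumes the raw record has already been mapped onto the canonical field names; `log_transformations` is a hypothetical helper.

```python
def log_transformations(hotel_id, raw, normalized):
    """Build a transformation log by comparing raw and normalized fields.

    Assumes raw has already been mapped onto the canonical field names.
    """
    changes = []
    for field, new_value in normalized.items():
        old_value = raw.get(field)
        # Compare types too, so "4" -> 4.0 and 4 -> 4.0 are both recorded
        if old_value != new_value or type(old_value) is not type(new_value):
            changes.append({"field": field, "from": old_value, "to": new_value})
    return {"hotel_id": hotel_id, "transformations": changes}
```

Fields that pass through unchanged are left out, which keeps the log small enough to store alongside every normalized record.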
4. Surface Confidence Scores
Not all normalized data is equally reliable:
- High confidence: Verified coordinates, official star rating
- Medium confidence: Fuzzy-matched chain, parsed amenities
- Low confidence: Missing data filled with defaults
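One simple way to represent these levels is an ordered enum per field, with the record-level score taken as the weakest field. The rules below are a simplification made up for this sketch (verified means high, defaulted or missing means low, everything else medium); a real system would likely use finer-grained signals.

```python
from enum import IntEnum

class Confidence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def field_confidence(value, verified=False, defaulted=False):
    """Classify one field; the rules here are illustrative assumptions."""
    if value is None or defaulted:
        return Confidence.LOW
    return Confidence.HIGH if verified else Confidence.MEDIUM

def overall_confidence(field_levels):
    """A record is only as reliable as its weakest field."""
    return min(field_levels.values(), default=Confidence.LOW)
```

Exposing the per-field levels in the API response, not just the overall value, lets consumers decide which fields they are willing to trust.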
5. Enable Manual Overrides
Sometimes automated normalization is wrong:
- Allow data team to manually correct
- Store override reason
- Don't overwrite on re-normalization
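Overrides survive re-normalization most easily when they live in their own store and are applied as the final step every time. The sketch below assumes each override is a dict with `field`, `value`, and `reason` keys; that layout and the `apply_overrides` name are illustrative assumptions.

```python
def apply_overrides(normalized, overrides):
    """Return a copy of the normalized record with manual overrides applied.

    overrides is a list of dicts with "field", "value" and "reason" keys,
    stored separately from the normalized data.
    """
    result = dict(normalized)  # Leave the automated output untouched
    for override in overrides:
        result[override["field"]] = override["value"]
    return result

# Example: the data team fixes a name the automated pipeline got wrong
overrides = [
    {"field": "name", "value": "Hilton Paris Opera", "reason": "Example reason: word order corrected"}
]
corrected = apply_overrides({"name": "Hilton Opera Paris", "stars": 4.0}, overrides)
```

Because the automated output is never modified in place, removing an override simply restores the pipeline's own value on the next run.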
Normalization as a Service
Building robust normalization is complex. mapping.travel provides:
- Pre-normalized hotel database: 1M+ hotels, standardized format
- Normalization API: Send raw data, receive normalized output
- Matching included: Automatic deduplication and master ID assignment
- Continuous updates: New hotels, closures, rebranding
Example API call:
curl -X POST https://api.mapping.travel/v1/normalize \
  -H "Content-Type: application/json" \
  -d '{
    "name": "HILTON PARIS OPERA",
    "address": "108 RUE SAINT LAZARE",
    "city": "PARIS",
    "country": "FR",
    "latitude": "48.876100",
    "longitude": "2.326600"
  }'
Response:
{
  "master_id": "HOTEL_00001",
  "normalized": {
    "name": "Hilton Paris Opera",
    "address": {
      "street": "108 Rue Saint-Lazare",
      "city": "Paris",
      "postal_code": "75008",
      "country_code": "FR"
    },
    "location": {
      "latitude": 48.8761,
      "longitude": 2.3266
    }
  },
  "confidence": 0.98
}
Get Started
Ready to normalize your hotel data?
- Try the normalization API - Free tier available
- View the schema docs - Standard data model
- Explore the codebase - Open-source normalization pipeline
Clean, consistent data is the foundation of reliable travel applications. Let's build it together.
Questions about hotel data normalization? Join our Discord community or email hello@mapping.travel.