Understanding Hotel Content Aggregation
By Engineering Team
Hotel content aggregation is the process of collecting, combining, and presenting hotel information from multiple data sources. It's fundamental to building any travel platform that offers comprehensive hotel inventory.
Here's how it works, why it's challenging, and how to do it right.
What Is Hotel Content Aggregation?
At its core, content aggregation means:
- Collecting data from multiple suppliers
- Matching hotels across suppliers (same hotel, different IDs)
- Merging data into a unified representation
- Enriching with additional data sources
- Serving via your application or API
Example: Building a hotel search for Paris
Without aggregation:
- Supplier A: 2,000 hotels
- Supplier B: 1,500 hotels
- Supplier C: 1,800 hotels
- Your platform: 2,000 hotels (just from Supplier A)
With aggregation:
- Your platform: 3,200 unique hotels (combining all suppliers, deduplicating)
Users get more choice, better prices, and higher availability.
Data Sources for Hotel Content
Primary Sources (Inventory Suppliers)
Bedbanks:
- HotelBeds
- Tourico
- WebBeds
- GTA
- Travco
GDS (Global Distribution Systems):
- Amadeus
- Sabre
- Travelport (Apollo, Galileo, Worldspan)
Direct Connects:
- Marriott API
- Hilton OnQ
- IHG Connect
- Accor D-Edge
OTA Affiliate Networks:
- Booking.com Partner Solutions
- Expedia Partner Solutions (EPS)
- Agoda Affiliate Program
Secondary Sources (Content Enrichment)
Review Data:
- TripAdvisor Content API
- Google Places API
- Trustpilot
- Scraped reviews (legal considerations apply)
Photos and Media:
- Unsplash
- Pexels
- Supplier-provided photos
- User-generated content
Points of Interest:
- Google Places
- Foursquare
- OpenStreetMap
Pricing Intelligence:
- OTA Insight
- RateGain
- Public rate shopping
The Aggregation Pipeline
Stage 1: Data Ingestion
Collect data from each supplier:
API Pulling:
def ingest_supplier_api(supplier_id):
    client = get_supplier_client(supplier_id)
    # Fetch hotel listings
    hotels = client.get_hotels(
        updated_since=last_sync_time,
        page_size=1000
    )
    for hotel in hotels:
        store_raw_data(
            supplier_id=supplier_id,
            hotel_id=hotel['id'],
            data=hotel,
            timestamp=datetime.now()
        )
File Drops (FTP/SFTP):
def ingest_supplier_ftp(supplier_id):
    ftp = connect_to_supplier_ftp(supplier_id)
    # Download latest file
    latest_file = ftp.get_latest('hotels_*.xml')
    # Parse XML
    hotels = parse_xml(latest_file)
    # Store each hotel
    for hotel in hotels:
        store_raw_data(...)
Database Replication:
def ingest_supplier_db(supplier_id):
    # Replicate from supplier's database
    source_db = get_supplier_db_connection(supplier_id)
    # Incremental sync
    new_hotels = source_db.query("""
        SELECT * FROM hotels
        WHERE updated_at > %s
    """, last_sync_time)
    for hotel in new_hotels:
        store_raw_data(...)
Stage 2: Normalization
Transform supplier-specific formats into a canonical schema:
def normalize_hotel(raw_data, supplier_id):
    parser = get_supplier_parser(supplier_id)
    # Parse supplier-specific format
    parsed = parser.parse(raw_data)
    # Normalize fields
    normalized = {
        'name': normalize_name(parsed.name),
        'address': normalize_address(parsed.address),
        'location': normalize_coordinates(parsed.lat, parsed.lon),
        'stars': normalize_star_rating(parsed.stars),
        'amenities': [normalize_amenity(a) for a in parsed.amenities],
        'chain': detect_chain(parsed.name)
    }
    return normalized
See our detailed guide: Why Hotel Data Normalization Matters
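As a rough illustration of what two of these helpers might do, here is a minimal sketch of normalize_name and normalize_star_rating. The exact rules (accent stripping, punctuation removal, accepting ratings such as "4*") are assumptions for illustration, not the rules the pipeline above necessarily uses.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so "Opéra" and "Opera" compare equal
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", without_accents.lower())
    return " ".join(cleaned.split())


def normalize_star_rating(value):
    """Extract a numeric rating from inputs like 4, '4.5' or '4*'.

    Returns None when no number is present or it falls outside 0-5.
    """
    if value is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", str(value))
    if not match:
        return None
    stars = float(match.group())
    return stars if 0 <= stars <= 5 else None


print(normalize_name("Hôtel Hilton Paris-Opéra"))  # hotel hilton paris opera
print(normalize_star_rating("4*"))                  # 4.0
```

Normalizing names this way before matching tends to reduce false negatives caused purely by accents or punctuation.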
Stage 3: Matching and Deduplication
Identify which hotels from different suppliers are the same property:
def match_hotel(normalized_hotel):
    # Find candidates using fuzzy matching
    candidates = fuzzy_search(
        name=normalized_hotel['name'],
        city=normalized_hotel['address']['city'],
        limit=10
    )
    # Rerank using semantic model
    matches = rerank_candidates(
        query=normalized_hotel,
        candidates=candidates
    )
    # Use the best match only if one exists and it clears the threshold
    if matches and matches[0].score > 0.70:
        return matches[0].master_id
    # No match found, create new master record
    return create_new_master_hotel(normalized_hotel)
See our detailed guide: How Our AI-Powered Hotel Matching Works
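To make the candidate search step concrete, here is a minimal sketch built on the standard library's difflib. It is a stand-in for illustration only: the master_hotels argument, the record keys, and the use of SequenceMatcher are assumptions, and a production system would more likely query a search index than scan a list.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_search(name, city, master_hotels, limit=10):
    """Return (score, master_id) pairs for hotels in the same city, best first."""
    scored = [
        (name_similarity(name, hotel["name"]), hotel["master_id"])
        for hotel in master_hotels
        if hotel["city"].lower() == city.lower()
    ]
    scored.sort(reverse=True)
    return scored[:limit]


# Hypothetical master records
master_hotels = [
    {"master_id": "M1", "name": "Hilton Paris Opera", "city": "Paris"},
    {"master_id": "M2", "name": "Ibis Paris Bastille", "city": "Paris"},
    {"master_id": "M3", "name": "Hilton London Metropole", "city": "London"},
]
print(fuzzy_search("Hilton Paris Opera Hotel", "paris", master_hotels))
```

Filtering by city first keeps the candidate set small, which matters because pairwise string comparison is the expensive part.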
Stage 4: Data Merging
Combine data from multiple suppliers for the same hotel:
Conflict Resolution Strategy:
Different suppliers may provide conflicting information:
Hotel: Hilton Paris Opera (MASTER_001)
Supplier A: 4 stars
Supplier B: 4 stars
Supplier C: 5 stars ← Outlier
→ Resolution: Use mode (4 stars)
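The mode-based resolution above can be sketched in a few lines. The function name and the choice to return None on a tie (so that a fallback rule or manual review can decide) are assumptions for illustration.

```python
from collections import Counter


def resolve_by_mode(values):
    """Return the most common non-null value, or None on a tie or empty input."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # Tie: no clear majority, defer to another rule
    return ranked[0][0]


# Star ratings reported by Suppliers A, B and C
print(resolve_by_mode([4, 4, 5]))  # 4
```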
Field Prioritization:
# Define which supplier is most reliable for each field
FIELD_PRIORITY = {
    'name': ['direct_connect', 'gds', 'bedbank'],
    'address': ['gds', 'direct_connect', 'bedbank'],
    'coordinates': ['gds', 'bedbank', 'direct_connect'],
    'photos': ['direct_connect', 'bedbank', 'gds'],
    'amenities': ['direct_connect', 'gds', 'bedbank']
}

def merge_hotel_data(master_id):
    # Get all supplier data for this hotel
    supplier_data = get_all_supplier_data(master_id)
    merged = {}
    # Walk every field that has a priority order defined
    for field, supplier_types in FIELD_PRIORITY.items():
        # Try suppliers in priority order
        for supplier_type in supplier_types:
            value = get_field_from_supplier(supplier_data, supplier_type, field)
            if value is not None:
                merged[field] = value
                break
    return merged
Content Combination:
For lists (amenities, photos), combine from all suppliers:
def merge_amenities(supplier_data):
    all_amenities = set()
    for supplier_id, data in supplier_data.items():
        all_amenities.update(data.get('amenities', []))
    # Deduplicate and normalize
    return list(normalize_amenity_codes(all_amenities))

def merge_photos(supplier_data):
    all_photos = []
    for supplier_id, data in supplier_data.items():
        photos = data.get('photos', [])
        # Tag with source
        for photo in photos:
            photo['source'] = supplier_id
        all_photos.extend(photos)
    # Deduplicate by URL, prioritize high quality
    return deduplicate_photos(all_photos)
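One possible shape for deduplicate_photos is sketched below. It assumes each photo dict carries optional width and height keys and that query strings only carry sizing or tracking parameters; neither assumption comes from the code above, so adjust the key function to your suppliers' URL conventions.

```python
from urllib.parse import urlsplit


def photo_key(url):
    """Identity of a photo: lowercased host plus path, ignoring query and fragment."""
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path}"


def pixel_area(photo):
    """Pixel area, treating missing dimensions as zero."""
    return photo.get("width", 0) * photo.get("height", 0)


def deduplicate_photos(photos):
    """Keep one photo per URL key, preferring the largest, sorted largest first."""
    best = {}
    for photo in photos:
        key = photo_key(photo["url"])
        if key not in best or pixel_area(photo) > pixel_area(best[key]):
            best[key] = photo
    return sorted(best.values(), key=pixel_area, reverse=True)


# Hypothetical photos from two suppliers
photos = [
    {"url": "https://cdn.example.com/a.jpg?w=200", "width": 200, "height": 100, "source": "supplier_a"},
    {"url": "https://CDN.example.com/a.jpg", "width": 800, "height": 600, "source": "supplier_b"},
    {"url": "https://cdn.example.com/b.jpg", "width": 400, "height": 300, "source": "supplier_a"},
]
print([p["source"] for p in deduplicate_photos(photos)])  # ['supplier_b', 'supplier_a']
```

Matching on URL alone will miss the same image hosted on different CDNs; perceptual hashing is a common next step if that turns out to matter.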
Stage 5: Enrichment
Add data from secondary sources:
Reviews:
def enrich_with_reviews(master_id, hotel_name, city):
    # Search TripAdvisor
    ta_id = tripadvisor.search(name=hotel_name, city=city)
    if ta_id:
        reviews = tripadvisor.get_reviews(ta_id, limit=100)
        rating = tripadvisor.get_rating(ta_id)
        store_reviews(master_id, 'tripadvisor', reviews, rating)
    # Search Google
    google_place = google_places.search(name=hotel_name, city=city)
    if google_place:
        reviews = google_places.get_reviews(google_place.id)
        rating = google_place.rating
        store_reviews(master_id, 'google', reviews, rating)
Points of Interest:
def enrich_with_poi(master_id, lat, lon):
    # Find nearby attractions
    pois = google_places.nearby_search(
        location=(lat, lon),
        radius=1000,  # 1km
        types=['tourist_attraction', 'restaurant', 'transit_station']
    )
    for poi in pois:
        distance = haversine(lat, lon, poi.lat, poi.lon)
        store_poi(
            master_id=master_id,
            poi_name=poi.name,
            poi_type=poi.type,
            distance=distance
        )
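The haversine helper used here is not defined in this article. A minimal sketch of the standard great-circle formula follows, assuming a spherical Earth with a mean radius of 6371 km and a result in kilometers, which is consistent with the kilometer thresholds used elsewhere in this article.

```python
import math

EARTH_RADIUS_KM = 6371.0  # Mean Earth radius; the sphere model is an approximation


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Central Paris to central London, roughly 344 km
print(round(haversine(48.8566, 2.3522, 51.5074, -0.1278)))
```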
Stage 6: Publishing
Make aggregated data available:
Database:
def publish_hotel(master_id, merged_data, enrichments):
    # Write to production database
    db.hotels.update_one(
        {'id': master_id},
        {'$set': {
            'name': merged_data['name'],
            'address': merged_data['address'],
            'location': merged_data['location'],
            'stars': merged_data['stars'],
            'amenities': merged_data['amenities'],
            'photos': merged_data['photos'],
            'reviews': enrichments['reviews'],
            'nearby_poi': enrichments['poi'],
            'updated_at': datetime.now()
        }},
        upsert=True
    )
Search Index:
def index_hotel(master_id, merged_data):
    # Index in Elasticsearch/Algolia for fast search
    search_index.index(
        id=master_id,
        document={
            'name': merged_data['name'],
            'city': merged_data['address']['city'],
            'country': merged_data['address']['country_code'],
            'location': {
                'lat': merged_data['location']['latitude'],
                'lon': merged_data['location']['longitude']
            },
            'stars': merged_data['stars'],
            'amenities': merged_data['amenities'],
            'chain': merged_data.get('chain'),
            # Searchable text
            'search_text': f"{merged_data['name']} {merged_data['address']['city']}"
        }
    )
Cache:
def cache_hotel(master_id, merged_data):
    # Cache frequently accessed hotels
    cache.set(
        key=f"hotel:{master_id}",
        value=json.dumps(merged_data),
        ttl=3600  # 1 hour
    )
Challenges in Content Aggregation
1. Data Quality Variance
Different suppliers have vastly different data quality:
High-quality supplier:
- Complete address with postal code
- Accurate coordinates (validated)
- Rich amenity list (50+ items)
- High-resolution photos (20+)
Low-quality supplier:
- Address: "Paris" (just city name)
- No coordinates
- Amenities: "WIFI" (only one)
- No photos
Impact: Can't blindly merge. Need quality scoring:
def calculate_data_quality_score(hotel_data):
    score = 0
    # Address completeness
    if hotel_data.get('postal_code'):
        score += 10
    if hotel_data.get('street'):
        score += 10
    # Location accuracy (compare with None so a 0.0 coordinate still counts)
    if hotel_data.get('latitude') is not None and hotel_data.get('longitude') is not None:
        score += 20
    # Content richness
    score += min(len(hotel_data.get('amenities', [])), 20)
    score += min(len(hotel_data.get('photos', [])) * 2, 20)
    # Description quality (integer division keeps the score a whole number)
    if hotel_data.get('description'):
        score += min(len(hotel_data['description']) // 50, 20)
    return score  # 0-100
Use quality scores to prioritize which supplier's data to use.
2. Update Frequency Variance
Suppliers update at different rates:
- Real-time: Direct connects, some GDS
- Daily: Most bedbanks
- Weekly: Some content-only providers
- Monthly: Small regional suppliers
Impact: Aggregated data can become stale. Strategy:
def get_freshness_weight(supplier_id, field):
    last_update = get_last_update_time(supplier_id, field)
    age_hours = (datetime.now() - last_update).total_seconds() / 3600
    # Weight decreases with age
    if age_hours < 24:
        return 1.0  # Full weight
    elif age_hours < 168:  # 1 week
        return 0.7
    elif age_hours < 720:  # 30 days
        return 0.4
    else:
        return 0.1  # Very stale
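One way to put the freshness weight to work is to multiply it by the data quality score and rank suppliers by the product. The sketch below assumes that combination rule and a list-of-dicts input shape, neither of which is prescribed above. It takes the current time as an argument so the ranking is reproducible.

```python
from datetime import datetime, timedelta, timezone


def freshness_weight(last_update, now):
    """Same age thresholds as above, with the timestamps passed in explicitly."""
    age_hours = (now - last_update).total_seconds() / 3600
    if age_hours < 24:
        return 1.0
    if age_hours < 168:  # 1 week
        return 0.7
    if age_hours < 720:  # 30 days
        return 0.4
    return 0.1


def rank_by_effective_score(candidates, now):
    """Sort candidates by quality score multiplied by freshness weight, best first."""
    return sorted(
        candidates,
        key=lambda c: c["quality"] * freshness_weight(c["last_update"], now),
        reverse=True,
    )


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
candidates = [
    {"supplier_id": "supplier_a", "quality": 90, "last_update": now - timedelta(days=20)},
    {"supplier_id": "supplier_b", "quality": 60, "last_update": now - timedelta(hours=2)},
]
# supplier_a scores 90 * 0.4 = 36, supplier_b scores 60 * 1.0 = 60
print([c["supplier_id"] for c in rank_by_effective_score(candidates, now)])
```

With this rule, a fresh record of moderate quality can outrank a richer but stale one, which is usually what you want for fields that change over time.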
3. Schema Evolution
Suppliers change their data formats:
- New fields added
- Old fields deprecated
- Data types change
- Validation rules change
Impact: Parsers break. Strategy:
# Version parsers
SUPPLIER_PARSERS = {
    'supplier_a': {
        'v1': SupplierAParserV1,
        'v2': SupplierAParserV2,
    }
}

def parse_supplier_data(supplier_id, data):
    # Detect version
    version = detect_schema_version(data)
    # Use appropriate parser
    parser_class = SUPPLIER_PARSERS[supplier_id][version]
    parser = parser_class()
    return parser.parse(data)
4. Rate Limiting and Quotas
Suppliers impose limits:
- API calls per minute
- Daily request quotas
- Concurrent connection limits
Impact: Can't refresh all data frequently. Strategy:
def prioritize_refresh(hotels):
    # Prioritize high-traffic hotels
    for hotel in hotels:
        hotel['priority'] = calculate_refresh_priority(hotel)
    hotels.sort(key=lambda h: h['priority'], reverse=True)
    return hotels

def calculate_refresh_priority(hotel):
    priority = 0
    # Recent searches
    priority += get_search_count(hotel['id'], days=7) * 10
    # Recent bookings
    priority += get_booking_count(hotel['id'], days=30) * 50
    # Data staleness
    age_hours = (datetime.now() - hotel['last_update']).total_seconds() / 3600
    priority += age_hours
    return priority
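Prioritization decides the order of refreshes; something still has to enforce the supplier's limit while working through that order. A token bucket is one common technique for this. The class below is a minimal sketch, not part of the pipeline above, and the injectable clock parameter is an assumption added to make it testable.

```python
import time


class TokenBucket:
    """Allow up to `capacity` calls in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Consume one token if available; return whether the call may proceed."""
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Example: 60 calls per minute with bursts of up to 10
bucket = TokenBucket(rate_per_sec=1.0, capacity=10)
print(bucket.try_acquire())  # True
```

A refresh worker would iterate over the prioritized hotels and call try_acquire() before each supplier request, sleeping briefly whenever it returns False.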
5. Geographic Coverage Gaps
No single supplier has global coverage:
- Supplier A: Strong in Europe, weak in Asia
- Supplier B: Strong in North America
- Supplier C: Strong in specific countries (e.g., Thailand, Indonesia)
Impact: Need multiple suppliers for global inventory.
Strategy: Track coverage metrics:
-- Coverage by region
SELECT
    country_code,
    COUNT(DISTINCT CASE WHEN 'supplier_a' = ANY(suppliers) THEN id END) as supplier_a_count,
    COUNT(DISTINCT CASE WHEN 'supplier_b' = ANY(suppliers) THEN id END) as supplier_b_count,
    COUNT(DISTINCT CASE WHEN 'supplier_c' = ANY(suppliers) THEN id END) as supplier_c_count,
    COUNT(DISTINCT id) as total_hotels
FROM hotels
GROUP BY country_code
ORDER BY total_hotels DESC;
Best Practices
1. Store Raw Data
Always preserve original supplier data:
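-- Raw data table (PostgreSQL)
CREATE TABLE supplier_raw_data (
    id SERIAL PRIMARY KEY,
    supplier_id VARCHAR(50),
    supplier_hotel_id VARCHAR(100),
    data JSONB,
    ingested_at TIMESTAMP
);
-- PostgreSQL does not accept an inline INDEX clause; create the index separately
CREATE INDEX ON supplier_raw_data (supplier_id, supplier_hotel_id);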
Why:
- Enables re-processing if normalization logic changes
- Debugging (why was this field set to X?)
- Compliance (prove where data came from)
2. Track Data Lineage
Know where each field came from:
{
    "id": "MASTER_001",
    "name": "Hilton Paris Opera",
    "name_source": "supplier_b",
    "address": "108 Rue Saint-Lazare, 75008 Paris",
    "address_source": "supplier_a",
    "coordinates": {"lat": 48.8761, "lon": 2.3266},
    "coordinates_source": "supplier_b",
    "photos": [
        {"url": "...", "source": "supplier_a"},
        {"url": "...", "source": "supplier_c"}
    ]
}
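A record like this can be produced by writing the source next to each value at merge time. The sketch below assumes a priority list keyed by supplier ID and a "<field>_source" naming scheme that mirrors the JSON above; the function name and input shapes are illustrative.

```python
def merge_with_lineage(records_by_supplier, field_priority):
    """Merge fields by supplier priority and record which supplier supplied each one."""
    merged = {}
    for field, suppliers in field_priority.items():
        for supplier_id in suppliers:
            value = records_by_supplier.get(supplier_id, {}).get(field)
            if value is not None:
                merged[field] = value
                merged[f"{field}_source"] = supplier_id
                break
    return merged


# Hypothetical supplier records for one hotel
records = {
    "supplier_a": {"name": None, "address": "108 Rue Saint-Lazare, 75008 Paris"},
    "supplier_b": {"name": "Hilton Paris Opera"},
}
priority = {
    "name": ["supplier_a", "supplier_b"],
    "address": ["supplier_a", "supplier_b"],
}
print(merge_with_lineage(records, priority))
```

Here supplier_a is preferred for both fields, but its name is missing, so the name falls through to supplier_b and the lineage records that.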
3. Implement Conflict Detection
Flag when suppliers disagree:
def detect_conflicts(supplier_data):
    conflicts = []
    # Check star rating agreement
    star_ratings = [d['stars'] for d in supplier_data if d.get('stars')]
    if len(set(star_ratings)) > 1:
        conflicts.append({
            'field': 'stars',
            'values': star_ratings
        })
    # Check coordinate agreement (within 100m); haversine is assumed to return km
    coords = [
        (d['lat'], d['lon']) for d in supplier_data
        if d.get('lat') is not None and d.get('lon') is not None
    ]
    for i, coord1 in enumerate(coords):
        for coord2 in coords[i+1:]:
            distance = haversine(*coord1, *coord2)
            if distance > 0.1:  # 100 meters
                conflicts.append({
                    'field': 'coordinates',
                    'distance': distance
                })
    return conflicts
4. Automate Quality Checks
Continuously validate aggregated data:
def validate_aggregated_hotel(hotel):
    errors = []
    # Required fields
    if not hotel.get('name'):
        errors.append('Missing hotel name')
    if not hotel.get('address'):
        errors.append('Missing address')
    # Data consistency
    if hotel.get('stars') and not (0 <= hotel['stars'] <= 5):
        errors.append(f'Invalid star rating: {hotel["stars"]}')
    # Geographic validation (needs an address, so skip it when the address is missing)
    if hotel.get('location') and hotel.get('address'):
        lat, lon = hotel['location']['latitude'], hotel['location']['longitude']
        city = hotel['address']['city']
        country = hotel['address']['country_code']
        expected_location = geocode(f"{city}, {country}")
        distance = haversine(lat, lon, expected_location.lat, expected_location.lon)
        if distance > 50:  # 50km from city center seems wrong
            errors.append(f'Coordinates {distance:.1f} km from {city}')
    return errors
5. Monitor Supplier Health
Track supplier reliability:
# Metrics to track per supplier
SUPPLIER_METRICS = {
    'supplier_a': {
        'hotels_ingested_today': 12450,
        'parse_error_rate': 0.02,
        'avg_data_quality_score': 87,
        'update_frequency_hours': 24,
        'api_error_rate': 0.001
    }
}

# Alert on anomalies
def check_supplier_health(supplier_id):
    today = get_supplier_metrics(supplier_id, date=datetime.today())
    baseline = get_supplier_metrics(supplier_id, date=datetime.today() - timedelta(days=7))
    # Hotels dropped significantly?
    if today['hotels_ingested_today'] < baseline['hotels_ingested_today'] * 0.8:
        # Subtract the counts, not the metric dicts themselves
        dropped = baseline['hotels_ingested_today'] - today['hotels_ingested_today']
        alert(f"{supplier_id}: Hotel count dropped by {dropped}")
    # Error rate spiked?
    if today['parse_error_rate'] > baseline['parse_error_rate'] * 2:
        alert(f"{supplier_id}: Parse error rate increased")
Using mapping.travel for Aggregation
Building content aggregation from scratch is complex. mapping.travel provides:
Pre-aggregated Database:
- 1M+ hotels from multiple suppliers
- Already matched and deduplicated
- Continuously updated
Aggregation API:
curl "https://api.mapping.travel/v1/hotels?city=Paris&suppliers=all"
Supplier Mapping Service:
curl -X POST https://api.mapping.travel/v1/match \
-d '{"name": "Hilton Paris Opera", "city": "Paris"}'
# Returns: master_id + all supplier IDs
Self-Hosted Option:
- Run aggregation pipeline on your infrastructure
- Full control over suppliers and logic
- Open-source matching engine
Get Started
Ready to aggregate hotel content?
- Try the API - Free tier available
- Explore the docs - Integration guides
- View the code - Open-source pipeline
Building comprehensive hotel inventory doesn't have to be hard. Let's aggregate together.
Questions about hotel content aggregation? Join our Discord community or email hello@mapping.travel.