Engineering · January 23, 2026 · 10 min read

Understanding Hotel Content Aggregation

By Engineering Team

Hotel content aggregation is the process of collecting, combining, and presenting hotel information from multiple data sources. It's fundamental to building any travel platform that offers comprehensive hotel inventory.

Here's how it works, why it's challenging, and how to do it right.

What Is Hotel Content Aggregation?

At its core, content aggregation means:

  1. Collecting data from multiple suppliers
  2. Matching hotels across suppliers (same hotel, different IDs)
  3. Merging data into a unified representation
  4. Enriching with additional data sources
  5. Serving via your application or API

Example: Building a hotel search for Paris

Without aggregation:

  • Supplier A: 2,000 hotels
  • Supplier B: 1,500 hotels
  • Supplier C: 1,800 hotels
  • Your platform: 2,000 hotels (just from Supplier A)

With aggregation:

  • Your platform: 3,200 unique hotels (5,300 supplier listings combined and deduplicated)

Users get more choice, better prices, and higher availability.
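
To make the deduplicated count concrete, here is a minimal sketch. It assumes every supplier listing has already been mapped to a master ID (the matching stage covered later in this post); the supplier names, IDs, and the `count_unique_hotels` helper are made up for illustration.

```python
# Hypothetical mapping: supplier -> {supplier_hotel_id: master_id}.
supplier_mappings = {
    "supplier_a": {"A1": "M1", "A2": "M2", "A3": "M3"},
    "supplier_b": {"B1": "M2", "B2": "M4"},
    "supplier_c": {"C1": "M1", "C2": "M4", "C3": "M5"},
}


def count_unique_hotels(mappings):
    """Count distinct master IDs across all suppliers."""
    master_ids = set()
    for hotel_map in mappings.values():
        master_ids.update(hotel_map.values())
    return len(master_ids)


total_listings = sum(len(m) for m in supplier_mappings.values())
unique_hotels = count_unique_hotels(supplier_mappings)
# 8 supplier listings collapse to 5 unique hotels.
```

The same idea scales to the Paris example: the unique count is the size of the set of master IDs, not the sum of supplier counts.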

Data Sources for Hotel Content

Primary Sources (Inventory Suppliers)

Bedbanks:

  • HotelBeds
  • Tourico
  • WebBeds
  • GTA
  • Travco

GDS (Global Distribution Systems):

  • Amadeus
  • Sabre
  • Travelport (Apollo, Galileo, Worldspan)

Direct Connects:

  • Marriott API
  • Hilton OnQ
  • IHG Connect
  • Accor D-Edge

OTA Affiliate Networks:

  • Booking.com Partner Solutions
  • Expedia Partner Solutions (EPS)
  • Agoda Affiliate Program

Secondary Sources (Content Enrichment)

Review Data:

  • TripAdvisor Content API
  • Google Places API
  • Trustpilot
  • Scraped reviews (legal considerations apply)

Photos and Media:

  • Unsplash
  • Pexels
  • Supplier-provided photos
  • User-generated content

Points of Interest:

  • Google Places
  • Foursquare
  • OpenStreetMap

Pricing Intelligence:

  • OTA Insight
  • RateGain
  • Public rate shopping

The Aggregation Pipeline

Stage 1: Data Ingestion

Collect data from each supplier:

API Pulling:

from datetime import datetime, timezone

def ingest_supplier_api(supplier_id, last_sync_time):
    client = get_supplier_client(supplier_id)
    
    # Fetch hotel listings changed since the previous sync
    hotels = client.get_hotels(
        updated_since=last_sync_time,
        page_size=1000
    )
    
    for hotel in hotels:
        store_raw_data(
            supplier_id=supplier_id,
            hotel_id=hotel['id'],
            data=hotel,
            timestamp=datetime.now(timezone.utc)  # store ingestion time in UTC
        )

File Drops (FTP/SFTP):

def ingest_supplier_ftp(supplier_id):
    ftp = connect_to_supplier_ftp(supplier_id)
    
    # Download latest file
    latest_file = ftp.get_latest('hotels_*.xml')
    
    # Parse XML
    hotels = parse_xml(latest_file)
    
    # Store each hotel
    for hotel in hotels:
        store_raw_data(...)

Database Replication:

def ingest_supplier_db(supplier_id, last_sync_time):
    # Replicate from supplier's database
    source_db = get_supplier_db_connection(supplier_id)
    
    # Incremental sync: only rows changed since the previous run
    new_hotels = source_db.query("""
        SELECT * FROM hotels
        WHERE updated_at > %s
    """, last_sync_time)
    
    for hotel in new_hotels:
        store_raw_data(...)

Stage 2: Normalization

Transform supplier-specific formats into a canonical schema:

def normalize_hotel(raw_data, supplier_id):
    parser = get_supplier_parser(supplier_id)
    
    # Parse supplier-specific format
    parsed = parser.parse(raw_data)
    
    # Normalize fields
    normalized = {
        'name': normalize_name(parsed.name),
        'address': normalize_address(parsed.address),
        'location': normalize_coordinates(parsed.lat, parsed.lon),
        'stars': normalize_star_rating(parsed.stars),
        'amenities': [normalize_amenity(a) for a in parsed.amenities],
        'chain': detect_chain(parsed.name)
    }
    
    return normalized

See our detailed guide: Why Hotel Data Normalization Matters
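
The helpers called in `normalize_hotel` are not shown in this post. As a rough sketch of what two of them might look like, the versions below assume star ratings arrive as numbers or free-form strings such as "4*" or "4,5 stars", and that name normalization only collapses whitespace. Both behaviors are assumptions for illustration, not the production rules.

```python
import re


def normalize_star_rating(raw):
    """Convert a supplier star rating to a float in [0, 5], or None."""
    if raw is None:
        return None
    if isinstance(raw, (int, float)):
        value = float(raw)
    else:
        # Pull the first number out of strings like "4*" or "4,5 stars".
        match = re.search(r"\d+(?:[.,]\d+)?", str(raw))
        if not match:
            return None
        value = float(match.group().replace(",", "."))
    # Reject values outside the usual 0-5 scale instead of guessing.
    return value if 0 <= value <= 5 else None


def normalize_name(raw):
    """Trim and collapse whitespace; keep casing and accents untouched."""
    return " ".join(raw.split())
```

Returning None for unparseable or out-of-range ratings keeps bad values out of the canonical record, so the merge stage can fall back to another supplier.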

Stage 3: Matching and Deduplication

Identify which hotels from different suppliers are the same property:

def match_hotel(normalized_hotel):
    # Find candidates using fuzzy matching
    candidates = fuzzy_search(
        name=normalized_hotel['name'],
        city=normalized_hotel['address']['city'],
        limit=10
    )
    
    # Rerank using semantic model
    matches = rerank_candidates(
        query=normalized_hotel,
        candidates=candidates
    )
    
    # Use the best match only if one exists and clears the threshold
    if matches and matches[0].score > 0.70:
        return matches[0].master_id
    
    # No match found, create new master record
    return create_new_master_hotel(normalized_hotel)

See our detailed guide: How Our AI-Powered Hotel Matching Works
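
For readers who want to experiment with candidate generation, here is a simplified stand-in for `fuzzy_search` built on Python's standard-library `difflib`. It is a sketch under stated assumptions: the in-memory `master_hotels` list, the extra `master_hotels` parameter, and the `name_similarity` helper are hypothetical, and a real system would more likely query a search index.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two lowercased names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_search(name, city, master_hotels, limit=10):
    """Return up to `limit` (score, hotel) pairs in the same city, best first."""
    scored = [
        (name_similarity(name, hotel["name"]), hotel)
        for hotel in master_hotels
        if hotel["city"].lower() == city.lower()
    ]
    # Sort on the score only, so ties never compare the hotel dicts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Made-up master records for illustration.
masters = [
    {"master_id": "MASTER_001", "name": "Hilton Paris Opera", "city": "Paris"},
    {"master_id": "MASTER_002", "name": "Ibis Paris Bastille", "city": "Paris"},
    {"master_id": "MASTER_003", "name": "Hilton Paris Opera", "city": "Lyon"},
]
candidates = fuzzy_search("Hilton Paris Opéra Hotel", "Paris", masters)
```

Filtering by city before scoring keeps the candidate set small, which is why the Lyon record never appears even though its name is identical.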

Stage 4: Data Merging

Combine data from multiple suppliers for the same hotel:

Conflict Resolution Strategy:

Different suppliers may provide conflicting information:

Hotel: Hilton Paris Opera (MASTER_001)

Supplier A: 4 stars
Supplier B: 4 stars
Supplier C: 5 stars  ← Outlier

→ Resolution: Use mode (4 stars)
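
The mode-based resolution above can be sketched in a few lines. The function name `resolve_by_mode` and the choice to return None on a tie are assumptions for illustration; a tie could equally be handed to the field priority rules described next.

```python
from collections import Counter


def resolve_by_mode(values):
    """Return the most common non-null value, or None on a tie or no data."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: defer to priority rules or manual review
    return ranked[0][0]


stars = resolve_by_mode([4, 4, 5])  # Supplier A, B, C from the example
```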

Field Prioritization:

# Define which supplier is most reliable for each field
FIELD_PRIORITY = {
    'name': ['direct_connect', 'gds', 'bedbank'],
    'address': ['gds', 'direct_connect', 'bedbank'],
    'coordinates': ['gds', 'bedbank', 'direct_connect'],
    'photos': ['direct_connect', 'bedbank', 'gds'],
    'amenities': ['direct_connect', 'gds', 'bedbank']
}

def merge_hotel_data(master_id):
    # Get all supplier data for this hotel
    supplier_data = get_all_supplier_data(master_id)
    
    merged = {}
    
    for field, supplier_types in FIELD_PRIORITY.items():
        # Try suppliers in priority order
        for supplier_type in supplier_types:
            value = get_field_from_supplier(supplier_data, supplier_type, field)
            if value is not None:
                merged[field] = value
                break
    
    return merged

Content Combination:

For lists (amenities, photos), combine from all suppliers:

def merge_amenities(supplier_data):
    all_amenities = set()
    
    for supplier_id, data in supplier_data.items():
        all_amenities.update(data.get('amenities', []))
    
    # Deduplicate and normalize
    return list(normalize_amenity_codes(all_amenities))

def merge_photos(supplier_data):
    all_photos = []
    
    for supplier_id, data in supplier_data.items():
        photos = data.get('photos', [])
        # Tag with source
        for photo in photos:
            photo['source'] = supplier_id
        all_photos.extend(photos)
    
    # Deduplicate by URL, prioritize high quality
    return deduplicate_photos(all_photos)
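
`deduplicate_photos` is not defined in this post. One possible sketch is below; it assumes each photo dict has a `url` key and an optional `width` key in pixels as the quality signal. The `width` field is a hypothetical addition, and exact-URL matching will not catch the same image hosted at different URLs.

```python
def deduplicate_photos(photos):
    """Keep one photo per URL, preferring the widest copy, widest first."""
    best_by_url = {}
    for photo in photos:
        url = photo["url"]
        current = best_by_url.get(url)
        # Replace the stored copy only if this one is strictly wider.
        if current is None or photo.get("width", 0) > current.get("width", 0):
            best_by_url[url] = photo
    return sorted(
        best_by_url.values(),
        key=lambda p: p.get("width", 0),
        reverse=True,
    )
```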

Stage 5: Enrichment

Add data from secondary sources:

Reviews:

def enrich_with_reviews(master_id, hotel_name, city):
    # Search TripAdvisor
    ta_id = tripadvisor.search(name=hotel_name, city=city)
    if ta_id:
        reviews = tripadvisor.get_reviews(ta_id, limit=100)
        rating = tripadvisor.get_rating(ta_id)
        
        store_reviews(master_id, 'tripadvisor', reviews, rating)
    
    # Search Google
    google_place = google_places.search(name=hotel_name, city=city)
    if google_place:
        reviews = google_places.get_reviews(google_place.id)
        rating = google_place.rating
        
        store_reviews(master_id, 'google', reviews, rating)

Points of Interest:

def enrich_with_poi(master_id, lat, lon):
    # Find nearby attractions
    pois = google_places.nearby_search(
        location=(lat, lon),
        radius=1000,  # 1km
        types=['tourist_attraction', 'restaurant', 'transit_station']
    )
    
    for poi in pois:
        distance = haversine(lat, lon, poi.lat, poi.lon)
        store_poi(
            master_id=master_id,
            poi_name=poi.name,
            poi_type=poi.type,
            distance=distance
        )
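
The `haversine` helper used here and in later snippets is not shown. A standard implementation of the haversine formula is sketched below. It returns kilometers, which appears to match the thresholds used elsewhere in this post (0.1 for 100 meters, 50 for 50 km), though that unit is inferred rather than stated.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Note that the `radius=1000` passed to the nearby search is in meters, so distances from this helper need converting before the two values are compared.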

Stage 6: Publishing

Make aggregated data available:

Database:

def publish_hotel(master_id, merged_data, enrichments):
    # Write to production database
    db.hotels.update_one(
        {'id': master_id},
        {'$set': {
            'name': merged_data['name'],
            'address': merged_data['address'],
            'location': merged_data['location'],
            'stars': merged_data['stars'],
            'amenities': merged_data['amenities'],
            'photos': merged_data['photos'],
            'reviews': enrichments['reviews'],
            'nearby_poi': enrichments['poi'],
            'updated_at': datetime.now()
        }},
        upsert=True
    )

Search Index:

def index_hotel(master_id, merged_data):
    # Index in Elasticsearch/Algolia for fast search
    search_index.index(
        id=master_id,
        document={
            'name': merged_data['name'],
            'city': merged_data['address']['city'],
            'country': merged_data['address']['country_code'],
            'location': {
                'lat': merged_data['location']['latitude'],
                'lon': merged_data['location']['longitude']
            },
            'stars': merged_data['stars'],
            'amenities': merged_data['amenities'],
            'chain': merged_data.get('chain'),
            # Searchable text
            'search_text': f"{merged_data['name']} {merged_data['address']['city']}"
        }
    )

Cache:

def cache_hotel(master_id, merged_data):
    # Cache frequently accessed hotels
    cache.set(
        key=f"hotel:{master_id}",
        value=json.dumps(merged_data),
        ttl=3600  # 1 hour
    )

Challenges in Content Aggregation

1. Data Quality Variance

Different suppliers have vastly different data quality:

High-quality supplier:

  • Complete address with postal code
  • Accurate coordinates (validated)
  • Rich amenity list (50+ items)
  • High-resolution photos (20+)

Low-quality supplier:

  • Address: "Paris" (just city name)
  • No coordinates
  • Amenities: "WIFI" (only one)
  • No photos

Impact: Can't blindly merge. Need quality scoring:

def calculate_data_quality_score(hotel_data):
    score = 0
    
    # Address completeness
    if hotel_data.get('postal_code'):
        score += 10
    if hotel_data.get('street'):
        score += 10
    
    # Location accuracy
    if hotel_data.get('latitude') is not None and hotel_data.get('longitude') is not None:
        score += 20
    
    # Content richness
    score += min(len(hotel_data.get('amenities', [])), 20)
    score += min(len(hotel_data.get('photos', [])) * 2, 20)
    
    # Description quality
    if hotel_data.get('description'):
        score += min(len(hotel_data['description']) / 50, 20)
    
    return score  # 0-100

Use quality scores to prioritize which supplier's data to use.

2. Update Frequency Variance

Suppliers update at different rates:

  • Real-time: Direct connects, some GDS
  • Daily: Most bedbanks
  • Weekly: Some content-only providers
  • Monthly: Small regional suppliers

Impact: Aggregated data can become stale. Strategy:

def get_freshness_weight(supplier_id, field):
    last_update = get_last_update_time(supplier_id, field)
    age_hours = (datetime.now() - last_update).total_seconds() / 3600
    
    # Weight decreases with age
    if age_hours < 24:
        return 1.0  # Full weight
    elif age_hours < 168:  # 1 week
        return 0.7
    elif age_hours < 720:  # 30 days
        return 0.4
    else:
        return 0.1  # Very stale
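
One way to put the freshness weight to work is to combine it with the quality score from the previous section when choosing between suppliers. The sketch below multiplies the two; that combination rule, the `pick_best_value` name, and the candidate dict layout are assumptions for illustration rather than a prescribed formula.

```python
def pick_best_value(candidates):
    """Pick the candidate with the highest quality_score * freshness_weight.

    Each candidate is a dict with 'supplier_id', 'value',
    'quality_score' (0-100) and 'freshness_weight' (0-1).
    """
    usable = [c for c in candidates if c["value"] is not None]
    if not usable:
        return None
    return max(usable, key=lambda c: c["quality_score"] * c["freshness_weight"])


# Made-up inputs: a high-quality but stale record vs. a fresher, plainer one.
candidates = [
    {"supplier_id": "supplier_a", "value": "108 Rue Saint-Lazare",
     "quality_score": 90, "freshness_weight": 0.4},
    {"supplier_id": "supplier_b", "value": "108 rue St Lazare",
     "quality_score": 70, "freshness_weight": 1.0},
]
best = pick_best_value(candidates)  # supplier_b: 70 * 1.0 beats 90 * 0.4
```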

3. Schema Evolution

Suppliers change their data formats:

  • New fields added
  • Old fields deprecated
  • Data types change
  • Validation rules change

Impact: Parsers break. Strategy:

# Version parsers
SUPPLIER_PARSERS = {
    'supplier_a': {
        'v1': SupplierAParserV1,
        'v2': SupplierAParserV2,
    }
}

def parse_supplier_data(supplier_id, data):
    # Detect version
    version = detect_schema_version(data)
    
    # Use appropriate parser
    parser_class = SUPPLIER_PARSERS[supplier_id][version]
    parser = parser_class()
    
    return parser.parse(data)
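
How `detect_schema_version` works depends entirely on the supplier. As a hedged sketch, the version below first looks for an explicit `schema_version` key and then falls back to structural hints. The key name and the v1/v2 layouts (flat `lat`/`lon` versus a nested `location` object) are invented for illustration.

```python
def detect_schema_version(data):
    """Guess the schema version of a supplier payload (hypothetical rules)."""
    # Prefer an explicit version marker when the supplier sends one.
    explicit = data.get("schema_version")
    if explicit:
        return str(explicit)
    # Otherwise use structural hints: assume v2 nests coordinates under
    # "location" while v1 uses flat "lat"/"lon" keys.
    if "location" in data:
        return "v2"
    if "lat" in data and "lon" in data:
        return "v1"
    raise ValueError("Unrecognized supplier schema")
```

Raising on unknown shapes, rather than defaulting to the newest parser, surfaces a supplier format change as an alertable error instead of silently corrupting records.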

4. Rate Limiting and Quotas

Suppliers impose limits:

  • API calls per minute
  • Daily request quotas
  • Concurrent connection limits

Impact: Can't refresh all data frequently. Strategy:

def prioritize_refresh(hotels):
    # Prioritize high-traffic hotels
    for hotel in hotels:
        hotel['priority'] = calculate_refresh_priority(hotel)
    
    hotels.sort(key=lambda h: h['priority'], reverse=True)
    return hotels

def calculate_refresh_priority(hotel):
    priority = 0
    
    # Recent searches
    priority += get_search_count(hotel['id'], days=7) * 10
    
    # Recent bookings
    priority += get_booking_count(hotel['id'], days=30) * 50
    
    # Data staleness
    age_hours = (datetime.now() - hotel['last_update']).total_seconds() / 3600
    priority += age_hours
    
    return priority
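
Once hotels are ordered by priority, the refresh loop still has to stay under the supplier's calls-per-minute limit. A token bucket is one common way to do that. The class below is a minimal single-threaded sketch, not a production limiter; the `clock` parameter is an assumption added so the behavior can be tested without sleeping.

```python
import time


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        """Consume one token if available; return True on success."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A refresh worker would walk the prioritized list, call `try_acquire()` before each supplier request, and pause or stop when it returns False.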

5. Geographic Coverage Gaps

No single supplier has global coverage:

  • Supplier A: Strong in Europe, weak in Asia
  • Supplier B: Strong in North America
  • Supplier C: Strong in specific countries (e.g., Thailand, Indonesia)

Impact: Need multiple suppliers for global inventory.

Strategy: Track coverage metrics:

-- Coverage by region
SELECT
  country_code,
  COUNT(DISTINCT CASE WHEN 'supplier_a' = ANY(suppliers) THEN id END) as supplier_a_count,
  COUNT(DISTINCT CASE WHEN 'supplier_b' = ANY(suppliers) THEN id END) as supplier_b_count,
  COUNT(DISTINCT CASE WHEN 'supplier_c' = ANY(suppliers) THEN id END) as supplier_c_count,
  COUNT(DISTINCT id) as total_hotels
FROM hotels
GROUP BY country_code
ORDER BY total_hotels DESC;

Best Practices

1. Store Raw Data

Always preserve original supplier data:

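-- Raw data table (PostgreSQL)
CREATE TABLE supplier_raw_data (
    id SERIAL PRIMARY KEY,
    supplier_id VARCHAR(50),
    supplier_hotel_id VARCHAR(100),
    data JSONB,
    ingested_at TIMESTAMP
);

-- PostgreSQL does not accept inline INDEX clauses, so create it separately
CREATE INDEX idx_supplier_raw_data_lookup
    ON supplier_raw_data (supplier_id, supplier_hotel_id);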

Why:

  • Enables re-processing if normalization logic changes
  • Debugging (why was this field set to X?)
  • Compliance (prove where data came from)

2. Track Data Lineage

Know where each field came from:

{
  "id": "MASTER_001",
  "name": "Hilton Paris Opera",
  "name_source": "supplier_b",
  "address": "108 Rue Saint-Lazare, 75008 Paris",
  "address_source": "supplier_a",
  "coordinates": {"lat": 48.8761, "lon": 2.3266},
  "coordinates_source": "supplier_b",
  "photos": [
    {"url": "...", "source": "supplier_a"},
    {"url": "...", "source": "supplier_c"}
  ]
}
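
A record like the one above can be produced by writing a `<field>_source` key at merge time. The sketch below is one possible shape; `merge_with_lineage`, its parameters, and the per-supplier priority list are assumptions for illustration, simplified from the priority-by-supplier-type merge shown earlier.

```python
def merge_with_lineage(supplier_records, fields, priority):
    """Merge fields by supplier priority, recording '<field>_source' for each."""
    merged = {}
    for field in fields:
        for supplier_id in priority:
            value = supplier_records.get(supplier_id, {}).get(field)
            if value is not None:
                merged[field] = value
                merged[f"{field}_source"] = supplier_id
                break  # first supplier with a value wins
    return merged


# Made-up supplier records matching the example above.
records = {
    "supplier_a": {"name": None, "address": "108 Rue Saint-Lazare, 75008 Paris"},
    "supplier_b": {"name": "Hilton Paris Opera"},
}
hotel = merge_with_lineage(
    records, ["name", "address"], ["supplier_a", "supplier_b"]
)
```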

3. Implement Conflict Detection

Flag when suppliers disagree:

def detect_conflicts(supplier_data):
    conflicts = []
    
    # Check star rating agreement
    star_ratings = [d['stars'] for d in supplier_data if d.get('stars')]
    if len(set(star_ratings)) > 1:
        conflicts.append({
            'field': 'stars',
            'values': star_ratings
        })
    
    # Check coordinate agreement (within 100m)
    coords = [
        (d['lat'], d['lon']) for d in supplier_data
        if d.get('lat') is not None and d.get('lon') is not None
    ]
    for i, coord1 in enumerate(coords):
        for coord2 in coords[i+1:]:
            distance = haversine(*coord1, *coord2)
            if distance > 0.1:  # 100 meters
                conflicts.append({
                    'field': 'coordinates',
                    'distance': distance
                })
    
    return conflicts

4. Automate Quality Checks

Continuously validate aggregated data:

def validate_aggregated_hotel(hotel):
    errors = []
    
    # Required fields
    if not hotel.get('name'):
        errors.append('Missing hotel name')
    if not hotel.get('address'):
        errors.append('Missing address')
    
    # Data consistency
    if hotel.get('stars') and not (0 <= hotel['stars'] <= 5):
        errors.append(f'Invalid star rating: {hotel["stars"]}')
    
    # Geographic validation (needs both coordinates and an address)
    if hotel.get('location') and hotel.get('address'):
        lat, lon = hotel['location']['latitude'], hotel['location']['longitude']
        city = hotel['address']['city']
        country = hotel['address']['country_code']
        
        expected_location = geocode(f"{city}, {country}")
        distance = haversine(lat, lon, expected_location.lat, expected_location.lon)
        
        if distance > 50:  # 50km from city center seems wrong
            errors.append(f'Coordinates {distance:.0f} km from {city}')
    
    return errors

5. Monitor Supplier Health

Track supplier reliability:

# Metrics to track per supplier
SUPPLIER_METRICS = {
    'supplier_a': {
        'hotels_ingested_today': 12450,
        'parse_error_rate': 0.02,
        'avg_data_quality_score': 87,
        'update_frequency_hours': 24,
        'api_error_rate': 0.001
    }
}

from datetime import datetime, timedelta

# Alert on anomalies
def check_supplier_health(supplier_id):
    today = get_supplier_metrics(supplier_id, date=datetime.today())
    baseline = get_supplier_metrics(supplier_id, date=datetime.today() - timedelta(days=7))
    
    # Hotel count more than 20% below last week's?
    if today['hotels_ingested_today'] < baseline['hotels_ingested_today'] * 0.8:
        dropped = baseline['hotels_ingested_today'] - today['hotels_ingested_today']
        alert(f"{supplier_id}: Hotel count dropped by {dropped}")
    
    # Error rate more than doubled?
    if today['parse_error_rate'] > baseline['parse_error_rate'] * 2:
        alert(f"{supplier_id}: Parse error rate increased")

Using mapping.travel for Aggregation

Building content aggregation from scratch is complex. mapping.travel provides:

Pre-aggregated Database:

  • 1M+ hotels from multiple suppliers
  • Already matched and deduplicated
  • Continuously updated

Aggregation API:

curl 'https://api.mapping.travel/v1/hotels?city=Paris&suppliers=all'

Supplier Mapping Service:

curl -X POST https://api.mapping.travel/v1/match \
  -H 'Content-Type: application/json' \
  -d '{"name": "Hilton Paris Opera", "city": "Paris"}'
# Returns: master_id + all supplier IDs

Self-Hosted Option:

  • Run aggregation pipeline on your infrastructure
  • Full control over suppliers and logic
  • Open-source matching engine

Get Started

Ready to aggregate hotel content?

Building comprehensive hotel inventory doesn't have to be hard. Let's aggregate together.


Questions about hotel content aggregation? Join our Discord community or email hello@mapping.travel.