How Our AI-Powered Hotel Matching Works
By Engineering Team
Hotel name matching is hard because different sources use different naming conventions, abbreviations, and translations for the same property. Our matching system uses a two-stage approach, fuzzy candidate retrieval followed by semantic reranking, to achieve over 92% accuracy.
Stage 1: Fuzzy Candidate Retrieval
The first stage uses rapidfuzz for fast fuzzy string matching. We retrieve the top candidates based on:
- Token-based matching - Handles word order differences ("Hilton Paris" vs "Paris Hilton Hotel")
- Geographic distance - When coordinates are available, nearby hotels are prioritized
- City name matching - Filters candidates by location
This stage is optimized for speed, processing thousands of hotels per second using PostgreSQL's trigram indexes.
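To make the three signals concrete, here is a minimal sketch of candidate retrieval. It is illustrative rather than our production code: it uses the standard library's `difflib` as a stand-in for a rapidfuzz token-based scorer, and the record layout (`name`, `city`, `coords`), the `geo_bonus` weight, and the `max_km` radius are all assumptions made for the example.

```python
import math
from difflib import SequenceMatcher


def token_sort_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100] after lowercasing and sorting tokens.

    Stand-in for a token-based fuzzy scorer: sorting the tokens
    makes the comparison insensitive to word order.
    """
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return 100.0 * SequenceMatcher(None, norm_a, norm_b).ratio()


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def retrieve_candidates(query, hotels, top_k=5, max_km=2.0, geo_bonus=10.0):
    """Rank same-city hotels by name similarity plus a proximity bonus.

    geo_bonus and max_km are hypothetical tuning knobs, not real settings.
    """
    scored = []
    for hotel in hotels:
        # City filter: skip candidates from a different city.
        if hotel["city"].lower() != query["city"].lower():
            continue
        score = token_sort_ratio(query["name"], hotel["name"])
        q_pos, h_pos = query.get("coords"), hotel.get("coords")
        if q_pos and h_pos:
            # Nearby hotels get up to geo_bonus extra points.
            dist = haversine_km(*q_pos, *h_pos)
            score += geo_bonus * max(0.0, 1.0 - dist / max_km)
        scored.append((score, hotel["name"]))
    scored.sort(reverse=True)
    return [(name, round(score, 1)) for score, name in scored[:top_k]]


query = {"name": "Hilton Paris", "city": "Paris", "coords": (48.8566, 2.3522)}
hotels = [
    {"name": "Paris Hilton Hotel", "city": "Paris", "coords": (48.8570, 2.3530)},
    {"name": "Novotel Paris", "city": "Paris", "coords": (48.8600, 2.3400)},
    {"name": "Hilton Lyon", "city": "Lyon", "coords": (45.7640, 4.8357)},
]
candidates = retrieve_candidates(query, hotels)
```

In this toy data, "Hilton Lyon" is dropped by the city filter even though its name is similar, and the remaining candidates are handed to the reranking stage.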
Stage 2: Semantic Reranking
The second stage uses a reranker built on BGE-Reranker-Large, a cross-encoder model that scores each query-candidate pair jointly for semantic similarity. Unlike traditional fuzzy matching, it can:
- Understand that "Hilton" and "Hilton Hotels" are the same chain
- Recognize abbreviations ("NYC" = "New York City")
- Handle translations and transliterations
- Distinguish between similar but different hotels
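The reranking step itself is simple once a pair scorer exists: score every (query, candidate) pair and sort. The sketch below assumes a pluggable `score_pairs` callable; `toy_scorer` is a hypothetical token-overlap stand-in so the example runs without downloading a model, and it is not a cross-encoder.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (query, candidate) text pairs to one relevance score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: List[str], score_pairs: PairScorer) -> List[Tuple[str, float]]:
    """Score each (query, candidate) pair and return candidates, best first."""
    pairs = [(query, candidate) for candidate in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(name, float(score)) for name, score in ranked]


def toy_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase tokens."""
    scores = []
    for left, right in pairs:
        a, b = set(left.lower().split()), set(right.lower().split())
        scores.append(len(a & b) / len(a | b) if a | b else 0.0)
    return scores


ranked = rerank("Hilton NYC", ["Marriott Boston", "Hilton NYC Midtown"], toy_scorer)

# With the sentence-transformers package installed, a real cross-encoder
# could be plugged in the same way (assumption: the public base checkpoint,
# not our fine-tuned production model):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("BAAI/bge-reranker-large")
#   ranked = rerank("Hilton NYC", candidates, model.predict)
```

Keeping the scorer behind a callable also makes it easy to swap models when benchmarking alternatives.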
Score Calibration
Raw model scores aren't directly interpretable as probabilities. We use Platt scaling, a logistic regression fitted to the raw scores, to convert them into calibrated probabilities. This allows us to:
- Set meaningful thresholds (0.54 for match/no-match)
- Provide confidence levels you can trust
- Optimize for precision/recall based on your needs
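As a rough illustration of Platt scaling, the sketch below fits the two parameters of `sigmoid(a * score + b)` by plain gradient descent on log loss and then applies the 0.54 cutoff. The raw scores and labels are made-up toy values, and the learning rate and epoch count are arbitrary choices for the example; a real setup would typically fit on a held-out labelled set with an off-the-shelf logistic regression.

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores: Sequence[float], labels: Sequence[int],
              lr: float = 0.1, epochs: int = 5000) -> Tuple[float, float]:
    """Fit a and b in sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw reranker score to a match probability."""
    return sigmoid(a * score + b)


# Toy data (hypothetical): raw reranker scores and whether the pair truly matched.
raw_scores = [-4.0, -2.5, -1.0, -0.5, 0.5, 1.0, 2.5, 4.0]
is_match = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(raw_scores, is_match)
MATCH_THRESHOLD = 0.54  # the match/no-match cutoff quoted above
decisions = [calibrate(s, a, b) >= MATCH_THRESHOLD for s in raw_scores]
```

Because the mapping is monotonic, calibration never changes the ranking of candidates; it only changes how the scores should be read and where the cutoff sits.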
Benchmarks
Our system achieves:
| Metric | Score |
|---|---|
| F1 Score | 0.95 |
| Precision | 0.95 |
| Recall | 0.95 |
These results were measured on a diverse dataset of 10,000 hotel pairs across multiple languages and regions.
Alternative Models
We've benchmarked several models:
| Model | Best F1 | Threshold |
|---|---|---|
| Mapping production model | 0.95 | 0.54 |
| bge_reranker_v2_m3 | 0.9188 | 0.48 |
| gte_multilingual_reranker_base | 0.9119 | 0.48 |
| ms_marco_minilm_l6_v2 | 0.8936 | 0.66 |
Our production model, a fine-tuned reranker built on BGE-Reranker-Large, provides the best balance of accuracy and speed for hotel property matching.
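The "Best F1" and "Threshold" columns come from the kind of threshold sweep sketched below: compute precision, recall, and F1 at each candidate cutoff and keep the cutoff with the highest F1. The probabilities and labels here are toy values invented for the example, and the 0.02 step size is an assumption, not the grid used in our benchmark.

```python
from typing import Sequence, Tuple


def precision_recall_f1(probs: Sequence[float], labels: Sequence[int],
                        threshold: float) -> Tuple[float, float, float]:
    """Precision, recall and F1 when predicting a match for prob >= threshold."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        predicted_match = p >= threshold
        if predicted_match and y == 1:
            tp += 1
        elif predicted_match and y == 0:
            fp += 1
        elif not predicted_match and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(probs: Sequence[float], labels: Sequence[int],
                      step: float = 0.02) -> Tuple[float, float]:
    """Sweep thresholds from 0 to 1 and return (threshold, F1) with the highest F1.

    Ties are resolved in favour of the lowest threshold.
    """
    best_t, best_f1 = 0.0, -1.0
    for i in range(int(round(1.0 / step)) + 1):
        t = round(i * step, 10)
        _, _, f1 = precision_recall_f1(probs, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy data (hypothetical): calibrated probabilities and true labels.
toy_probs = [0.10, 0.30, 0.45, 0.50, 0.60, 0.70, 0.90]
toy_labels = [0, 0, 0, 1, 0, 1, 1]
threshold, f1 = best_f1_threshold(toy_probs, toy_labels)
```

Optimizing for F1 weights precision and recall equally; a use case that cares more about one of them would pick a different point on the same sweep.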