Record Matching for Hotel Data

Dasun Pubudumal
10 min read · Jun 14, 2021

My thoughts based on my experience with record linkage for hotel data

Source: Abstract vector created by starline

When data is fed into a system from multiple sources, it is essential to identify duplicate data — repeated occurrences of the same entity — because duplicates ultimately determine the quality of your system. When multiple data nodes represent the same entity (say, hotels) with minor variations in their attributes (name, address, location, etc.), querying the system for such data yields all of those repeated nodes, which misleads the systems that consume the one in question. The modern technological landscape is flooded with data of high volume and variance, so such occurrences are very common. That large volumes of rapidly changing data affect a multitude of industries is no longer surprising, and industries like travel are no exception.

Hotel Sources

If you own a hotel, you need sales. One way of achieving sales is listing your hotel(s) in an agency system like Expedia, through which customers can reserve accommodations. This means you do not have to run a complicated system that deals with customers directly — Expedia does that for you.

Source: Expedia

Expedia is not the only agency that does this — there are many. Other such systems include HotelBeds, AlphaTravel, and the like, and their operations are quite similar to Expedia's. Now, think of a tour operator system that partners with and aggregates all these systems in one place. That is, through a call-center operative, customers state their requirements and make reservations through an intermediate proxy, and the intermediary system connects to the partnered agency systems and returns properties matching the user's criteria.

Tour Operator Systems

As explained above, tour operator systems act as aggregated platforms over a multitude of hotels from multiple agency systems such as HotelBeds, Expedia, and the like. Now, the number of hotels that each agency system carries varies, but a system like Expedia has over 1 million properties (which include accommodations other than hotels as well). This essentially means that the number of hotels tour operators deal with is very large.

Now, if you own a hotel and want sales, you wouldn't be satisfied with registering your property in only one such agency system. To maximize your revenue by exposing your hotel more widely, you would register your properties with multiple other agencies as well. This yields a better outcome for you, but underneath it lies a very complex problem for a tour operator: identifying the same hotel listed in multiple agencies.

The Problem

In most cases, there is no magic global ID for properties in travel with which to determine whether two properties are the same. Although there are initiatives such as TTI codes aiming to solve this very issue, they are yet to be fully adopted by many organizations. Therefore, you need to use the typical attributes in your data to discern between a multitude of properties. We can group such attributes into some broad categories; we'll concern ourselves with hotel properties from here on.

  • Nomenclature-related attributes: Each hotel has a name. But note that this name is not unique.
  • Location-related attributes: The position of the hotel, reflected by its address, position metrics (latitude, longitude), and the like. These may not be unique either, given that how each agency system represents addresses may differ, and the lat/lon combination may not be 100% accurate in every system.
  • Amenity-related attributes: Each hotel has its own amenities (facilities). These are far from unique.
  • Other attributes: These include contact information, star rating (4-star, 5-star, etc.) and the like.

It seems that there are enough groups of attributes for a system to distinguish two hotels. And that's actually true — in most cases it is not too difficult to discern whether two hotels are different. What's difficult is figuring out whether two hotel entities are the same. When the names of two hotels are drastically different, it is trivial to conclude that they are two different entities. The non-trivial issues arise when the two names differ only within a thin margin — that is, when the difference between the names is minute.

Quantifying differences between strings

Let us say that we need to calculate the difference between the two strings "Hotel California, United States" and "California Peek Hotel, United Kingdom". We'll try to extract the gist of an algorithm used to quantify the difference in question.

By looking at the two strings, we can identify one characteristic the algorithm should have: a notion of the common word count between the two strings. When the common word count is high, it is reasonably safe to estimate (although not with certainty) that the two strings refer to the same entity, and vice versa. Another such characteristic is the length of the two strings.
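The word-overlap notion can be sketched as a Jaccard similarity over word sets. This is an illustrative toy, not the algorithm used in any production system; the tokenization is deliberately simple.

```python
import re

# A toy word-overlap measure (Jaccard similarity over word sets),
# assuming naive tokenization; real pipelines would normalize further.
def jaccard(a: str, b: str) -> float:
    """Ratio of shared words to total distinct words (1.0 = identical word sets)."""
    sa = set(re.findall(r"\w+", a.lower()))
    sb = set(re.findall(r"\w+", b.lower()))
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

print(jaccard("Hotel California, United States",
              "California Peek Hotel, United Kingdom"))  # → 0.5
```

The two example strings share three words (hotel, california, united) out of six distinct words, hence the score of 0.5 — high enough to flag the pair for closer inspection, but far from a confident match.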

Using these notions — and a few others — people have developed many algorithms to quantify this difference, including cosine similarity, Levenshtein distance, N-gram distance, and the Jaro-Winkler algorithm. It is important to note that each algorithm has its own strengths and caveats, and should therefore be used with care following a thorough analysis.
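As a concrete example of one of these, here is a minimal sketch of the Levenshtein (edit) distance — the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute)."""
    if len(a) < len(b):          # keep b as the shorter string
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Hotel California", "Hotel Kalifornia"))  # → 1
```

A raw distance like this is usually normalized by string length (e.g. dividing by the longer string's length) before being compared against a threshold, since a one-character edit matters far more in a short name than in a long one.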

Noise & Preprocessing

Data at large is always imperfect. What's more troubling is that some kinds of noise are extremely difficult to identify. Noise in hotel corpora can be due to many factors:

  1. Human error: Book-keeping errors, calibration errors
  2. External noise: Noisy data picked up from different sources (APIs, book-keeping data)
  3. False information: Wrong data
  4. Corrupted Data: Data which were of quality but later degraded
  5. Incomplete data

Although it may not be possible to identify every occurrence of noise, there are methods of mitigating the issue to a certain extent. One is to use only the narrow subset of data essential to guide the algorithm, and drop the rest. Stop-word removal is one such method: we remove commonly used words that do not help distinguish between strings. Common words such as "Hotel", "Villa", and "Resort", domain-related hotel group tags ("Emirates", "Hilton", "Marriott", etc.), and language stop words such as "a", "the", and "an" can be removed in a preprocessing stage prior to running the algorithm.

However, it is important to emphasize that 100% automation will not yield an accurate result when mitigating noise — some form of manual intervention is usually needed to clean up the data, at least in the early stages. It can be automated later, once the patterns are figured out. One such method is frequency-based stop-word removal, where the whole hotel-name corpus is indexed and counted, and terms with relatively high frequencies are removed.
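Frequency-based stop-word removal can be sketched in a few lines. The corpus and the frequency threshold below are both illustrative assumptions; in practice the threshold would be tuned against real data.

```python
import re
from collections import Counter

# A hypothetical hotel-name corpus (illustrative only).
corpus = [
    "Hilton Hotel Paris", "Grand Hotel Paris", "Hotel California",
    "Marriott Resort Bali", "Sunset Beach Resort", "Hotel del Mar",
]

# Index and count every term in the corpus.
tokens = [t.lower() for name in corpus for t in re.findall(r"\w+", name)]
counts = Counter(tokens)

# Treat any term appearing in more than half the names as a stop word
# (the 0.5 threshold is an assumption, not a universal rule).
threshold = 0.5 * len(corpus)
stop_words = {t for t, c in counts.items() if c > threshold}

def strip_stop_words(name: str) -> str:
    """Drop high-frequency, low-information terms before matching."""
    return " ".join(t for t in re.findall(r"\w+", name.lower())
                    if t not in stop_words)

print(strip_stop_words("Hotel California"))  # → california
```

On this toy corpus only "hotel" crosses the threshold, so it is stripped while distinguishing terms like "california" survive — exactly the behavior we want from the preprocessing stage.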

Same name, different locations

Resolving two entities by nomenclature alone is impossible for the most part, certainly when it comes to hotel data. Human thought is not as diverse as early philosophers thought: the statistical likelihood of two suppliers landing on the same name for their properties is quite high. One way of resolving this, as a record linkage problem, is to use location-related attributes in addition. However, different agencies have their own location hierarchies through which they assign locations to the properties listed in their systems. Locations can be reflected by nomenclature attributes (city name, state name, country name) or numerical attributes such as a lat/lon combination. The former is problematic due to differing representations of location nomenclature: the city one lives in can be near another city, and it is not uncommon for agency systems to attribute such nearby cities as the location of a particular hotel.

Resolving this problem is a subject for another article: its breadth is too wide to be covered here. The main idea, however, is that through a clever algorithm which uses common string-matching techniques and common location data, one can prepare a map (with many-to-one cardinality) that assigns each city coming from the different agency systems to one generic location maintained by the record linkage system (i.e., the tour operator system).

Hence, prior to the record linkage process, one can replace agency hotel location details with the generic locations, and use the generic locations as pivot points to prepare the pairs required for the matching process. This blocking criterion (i.e., pair generation) is based on the premise that for two hotels to match, they should at least be in the same generic location. That is, preparing pairs of hotels in the same city is a viable strategy; the pairs are then fed into the algorithm to match other attributes such as hotel names.
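The blocking step can be sketched as follows. The hotel records, supplier IDs, and the city-to-generic-location map are all made-up examples; only the shape of the technique is the point.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records from two agency systems (illustrative only).
hotels = [
    {"id": "EXP-1", "name": "Hotel California", "city": "LA"},
    {"id": "HB-9",  "name": "California Hotel", "city": "Los Angeles"},
    {"id": "EXP-2", "name": "Sunset Resort",    "city": "Miami"},
]

# Many-to-one map from supplier city labels to a generic location
# (in practice, built via string matching against common location data).
generic_city = {"LA": "los-angeles", "Los Angeles": "los-angeles",
                "Miami": "miami"}

# Block hotels by generic location, then pair only within a block.
blocks = defaultdict(list)
for h in hotels:
    blocks[generic_city[h["city"]]].append(h)

pairs = [pair for block in blocks.values() for pair in combinations(block, 2)]
print([(a["id"], b["id"]) for a, b in pairs])  # → [('EXP-1', 'HB-9')]
```

Note that blocking drastically cuts the number of candidate pairs: instead of comparing every hotel against every other hotel across all suppliers, the matcher only ever sees pairs that share a generic location.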

Machine Learning

As we’ve seen above, hotel entities are paired, and each pair acts as a single input for our matching algorithm, which scores it. Each score can be normalized to the [0, 1] range, where 1 represents a perfect match. The algorithm itself may be either a fuzzy rule-based algorithm or a machine learning one — it depends on your data. The difference is that for a fuzzy algorithm, you set threshold (cut-off) levels beforehand. That is, if we consider only the hotel names, we may instill a rule that a perfect match is any pair whose name similarity exceeds 0.95 — these thresholds are determined empirically. In machine learning, however, we let the algorithm figure these parameters out itself by feeding it (training it on) a multitude of data. The training data has to cover almost every situation the input data may present: it needs to contain as many of the intricacies that can appear in the input data as possible. For example, the machine learning model needs to be trained (or "taught") on noisy data, incomplete data, and the like for accurate output. Common algorithms such as decision trees (and ensembles of trees such as random forests), support vector machines, and neural networks can be experimented with. Choosing the right model is always empirical.
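A minimal rule-based matcher, as a sketch: here the single feature is a name similarity computed with Python's standard-library `difflib.SequenceMatcher`, and the 0.95 cut-off is an illustrative assumption that would be tuned empirically.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(a: str, b: str, threshold: float = 0.95) -> bool:
    # Single-feature fuzzy rule; a real system would combine several features.
    return name_similarity(a, b) >= threshold

print(is_match("Hotel California", "Hotel california"))  # → True
print(is_match("Hotel California", "Sunset Resort"))     # → False
```

A machine learning variant would replace the hand-set `threshold` with a model (decision tree, SVM, etc.) trained on labeled pairs, taking the per-attribute similarity scores as its feature vector.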

False Positives

No machine learning algorithm is 100% accurate. Even as humans, our decisions aren’t correct in their entirety (c’est la vie). What would happen if a pair consisting of two hotel entities that are very close to being similar (but physically aren’t the same) is determined to be the same entity by the algorithm?

The impact is mostly financial. A false positive would, going back to basics, falsely conclude that the two hotels (from two sources) are the same. That is, the tour operator would represent the two hotels in their system as a single entity provided by one supplier.

If S (a supplier created in the tour operator system) was initially created using hotel X’s data (assume X and X’ are two hotels falsely returned as a match by our algorithm), this would mean that reserving a room in X’ is treated the same as reserving a room in X in the tour operator system!

To mitigate such false positives, it is imperative to place a fuzzy system in front of the core algorithm, so that it filters out incorrect results coming from the core. Otherwise, the financial repercussions could be devastating. In addition, manual intervention might be necessary for the uncertain outputs. Based on certain thresholds of the fuzzy system(s), the outputs can be categorized into levels such as correct matches, possible matches, and the like. According to these criteria, one can design a technique for manual intervention.
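Such a triage can be sketched as a simple banding of the normalized match score. The band boundaries below are illustrative assumptions, not recommended values:

```python
def triage(score: float) -> str:
    """Band a normalized [0, 1] match score for downstream handling."""
    if score >= 0.95:
        return "match"            # auto-accept
    if score >= 0.75:
        return "possible-match"   # route to manual review
    return "non-match"            # auto-reject

print([triage(s) for s in (0.99, 0.80, 0.40)])
# → ['match', 'possible-match', 'non-match']
```

Only the middle band reaches a human, which keeps the manual workload bounded while still catching the near-miss pairs where false positives are most likely.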

Conclusion

Record linkage problems tend to have the following steps:

  1. Data collection
  2. Preprocessing
  3. Pair generation
  4. Feature extraction
  5. Matching
  6. False-positive mitigation

In this article, I’ve tried to give a brief walkthrough of how this process can be applied to a very real problem in the travel domain. The article was purposely kept non-exhaustive in its technicality, given the depth it could reach in many areas of academic research (natural language processing, machine learning, mathematics, and the like). I’ve also purposely bypassed some non-functional technicalities, such as dealing with large volumes of data, which I have covered in other articles, because this article focuses largely on the main functional problem. It is a very real concern, however, that such a process is hefty in terms of both memory and computation, and measures are required to mitigate such issues. There always exists a trade-off between accuracy and performance (especially speed): the opportunity cost of a highly accurate algorithm is the speed of its execution. Such is the case here, where the financial implications of false positives would be devastating. Thus, it may be necessary to sacrifice speed for an accurate result.

Overall, the accuracy of the process largely depends on the number and quality of the attributes you choose for the algorithm to discern hotel entities, and on the type of the algorithm itself. Apart from the two attribute categories I’ve discussed out of the four presented, there can be other non-trivial attributes as well. But prior to using the attributes, a quality analysis may have to be carried out on them, taking a sufficient corpus of real data to understand its variance. This analysis ultimately determines your preprocessing procedures and the core matching algorithm, and, indirectly, the quality of your output.


Dasun Pubudumal

Software Engineer, CSE Graduate @ University of Moratuwa