Crossbeam finds overlaps in the data sets of disparate companies, which leads to an obvious problem: How do we decide when two records are a match? That's where our matching algorithm comes in.
The guiding force behind Crossbeam's matching algorithm is the concept of confidence. Because a match often results in data being shared, Crossbeam requires an extremely high level of confidence in order to consider two records as a "match."
In other words, false positives (when two unrelated records are incorrectly declared a match) are far worse than false negatives (when two matching records are incorrectly declared a non-match), and our algorithm is weighted accordingly.
Customers cannot customize or modify the Crossbeam matching algorithm.
We compare multiple properties on any given record to develop a confidence score, but we place an extremely strong emphasis on properties that are unique to a given person or company. Let's explore a few data points that are important to matches.
Domain names are a source of high-confidence company matches, as no two companies can have the same domain. We run domain names through a standardization process to ensure that inconsistencies in formatting don't cause matching records to be missed. We also maintain a growing list of cases where multiple domains are owned by the same company so that indirect matches can be made. Things to note about domain names:
- Crossbeam will strip out anything after the top-level domain (TLD), e.g. google.com will match google.com/en
- Subdomains are not stripped out and will not match the main domain alone, e.g. flights.google.com will not match google.com
- Capitalization and slashes do not matter, e.g. GoOgle.com will match google.com
- TLD differences (.com vs .net) will be treated as separate accounts, e.g. google.com will not match google.net
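The rules above can be illustrated with a short Python sketch. This is not Crossbeam's actual standardization code: the function names `normalize_domain` and `domains_match` are hypothetical, and the sketch assumes that a leading scheme such as `https://` should also be dropped, which the rules above do not state. It does not handle ports, and it does not cover the case where multiple domains are owned by the same company.

```python
import re


def normalize_domain(raw: str) -> str:
    """Hypothetical normalizer that mirrors the four rules listed above."""
    value = raw.strip().lower()  # capitalization does not matter
    # Assumption: a leading scheme such as "https://" is dropped as well.
    value = re.sub(r"^[a-z][a-z0-9+.-]*://", "", value)
    # Keep only the host: drop any path, query string, or fragment.
    value = re.split(r"[/?#]", value, maxsplit=1)[0]
    # Subdomains and the TLD are deliberately left untouched.
    return value.rstrip(".")


def domains_match(a: str, b: str) -> bool:
    """Two domains match only if their normalized forms are identical."""
    left, right = normalize_domain(a), normalize_domain(b)
    return left != "" and left == right
```

With this sketch, `domains_match("google.com", "google.com/en")` and `domains_match("GoOgle.com", "google.com")` are true, while `flights.google.com` and `google.net` do not match `google.com`.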
Email addresses are a source of high-confidence person matches, as no two people can have the same email address. These addresses are run through a cleansing and standardization process similar to the one used for domains. Emails have the added benefit of helping with company match resolution, as we can often determine that two companies match because they have matching people. When certain quality conditions are met, we can also use the domain name of contacts as a matching property for companies.
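As a rough illustration of how a contact's email can feed company matching, here is a minimal Python sketch. Everything in it is an assumption made for the example: the function names are hypothetical, the cleansing step is reduced to trimming and lowercasing, and the only "quality condition" modeled is excluding a small, illustrative list of free mailbox providers. The actual conditions Crossbeam applies are not described here.

```python
from typing import Optional

# Assumption: illustrative list only; the real quality conditions are not documented here.
FREE_MAIL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}


def normalize_email(raw: str) -> str:
    """Hypothetical cleansing step: trim whitespace and lowercase."""
    return raw.strip().lower()


def company_domain_from_email(raw: str) -> Optional[str]:
    """Return the email's domain if it looks usable as a company signal."""
    email = normalize_email(raw)
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return None  # not a well-formed address
    if domain in FREE_MAIL_DOMAINS:
        return None  # a shared mailbox provider says nothing about the employer
    return domain
```

For example, `company_domain_from_email("Jane@Acme.com")` yields `acme.com`, while an address at a free mailbox provider yields `None` and contributes nothing to the company match.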
DUNS Numbers are a source of high-confidence company matches, as each number is unique to a company. While DUNS data is not always present on both sides of a data comparison, it can lead to highly accurate matching when present.
Real-world ("meatspace") names are a bad source of matches on their own, whether they are identical or, worse yet, merely similar. Without a secondary characteristic to validate the match, a simple name-based comparison typically does not provide the confidence we need to make a match determination.
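To make the preference for unique identifiers concrete, here is a deliberately simplified Python sketch. It is a rule-based stand-in, not Crossbeam's confidence scoring: the `CompanyRecord` type and the `is_match` function are hypothetical, and the sketch assumes that a shared DUNS number or a shared domain is sufficient on its own. The point it illustrates is that name agreement alone never produces a match, which favors false negatives over false positives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompanyRecord:
    """Hypothetical record shape used only for this illustration."""
    name: str
    domain: Optional[str] = None
    duns: Optional[str] = None


def is_match(a: CompanyRecord, b: CompanyRecord) -> bool:
    """Declare a match only when a unique identifier agrees on both sides."""
    if a.duns and b.duns and a.duns == b.duns:
        return True
    if a.domain and b.domain and a.domain.lower() == b.domain.lower():
        return True
    # Identical or similar names are not enough without a secondary signal.
    return False
```

Under these assumptions, two records named "Acme" with no domain or DUNS number on either side are reported as a non-match, while two differently spelled records that share a domain are reported as a match.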
Our matching algorithm gets smarter every month as the amount of training data and the number of special situations we see increase. As such, you may see occasional minor shifts in the match rates between data sets. This is normal and is always associated with an increase in the quality of the matching methodology.