It’s not John, it’s James. In the US alone, it is estimated there are over 30,000 people who share the same name, James Smith. In Korea, almost 20% of the population – some 10 million people – share the same family name of Kim. The world is also home to over 150 million with the same given name – Mohamed. Cases of mistaken identity are common, particularly when searching over large volumes of data, but they needn’t be.
Centuries of tradition and culture have given us an eclectic mix of ways that we refer to one another. Names can reflect our familial ties, which generation we were born into, who our ancestors were, our clan, or may even indicate a union of two families. They are a part, not just of our heritage, but of our identity. This rich diversity, however, was not intended for the information age, where electronic records, transactions and communications often require a global, unique and unambiguous identity to be resolved. IP addresses may work well to uniquely identify devices on the global internet, but humans still require something more… human.
The stakes are high. Almost all investigatory work, whether in law enforcement, counter terrorism or within the anti-money laundering (AML) and due diligence processes of a bank, requires accurate ways of searching for and discovering specific entities in large data sets. However, poor record keeping, missing or incomplete data and legacy matching logic hamper these efforts. False positive matches (selecting the wrong entity) and, worse, false negatives (where a critical search result is missed altogether) are abundant.
When searching large datasets for names or organisations, ‘entity resolution’ refers to data analytics that aim to uniquely resolve data – often across many different sources – to a real-world entity.
An example makes the benefits clear. Our collection of James Smiths could be resolved by using other details in the data. Email addresses, dates of birth and postcodes are common attributes that help systems disambiguate, or join the dots between, multiple records about the same person, so that results are returned for the specific individual and not their namesake. Companies suffer from the same ambiguity: nearly three thousand companies registered with UK Companies House have names starting with the word “Sigma”. Using amplifying information such as address, phone number, company registration date, or any other feature of the record helps entity resolution technology narrow down the data, so that decisions affect only the intended entity and have no unintended effects.
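As a minimal sketch of this idea, the Python below treats two records as the same person only when the names agree and enough supporting attributes also agree. The Record type, the same_person function, the min_shared threshold and the sample data are all illustrative assumptions, not a description of any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout, for illustration only.
    name: str
    email: Optional[str] = None
    date_of_birth: Optional[str] = None  # ISO date string
    postcode: Optional[str] = None


def shared_attributes(a: Record, b: Record) -> int:
    """Count supporting attributes that are present on both records and agree."""
    pairs = [
        (a.email, b.email),
        (a.date_of_birth, b.date_of_birth),
        (a.postcode, b.postcode),
    ]
    return sum(
        1
        for x, y in pairs
        if x is not None and y is not None and x.strip().lower() == y.strip().lower()
    )


def same_person(a: Record, b: Record, min_shared: int = 2) -> bool:
    """Names must match AND at least min_shared attributes must agree (assumed rule)."""
    if a.name.strip().lower() != b.name.strip().lower():
        return False
    return shared_attributes(a, b) >= min_shared


customer = Record("James Smith", "j.smith@example.com", "1980-04-12", "SW1A 1AA")
duplicate = Record("James Smith", "J.Smith@example.com", "1980-04-12")
namesake = Record("James Smith", "jsmith77@example.org", "1992-11-03", "M1 1AE")

print(same_person(customer, duplicate))  # email and date of birth agree
print(same_person(customer, namesake))   # same name, no supporting attributes agree
```

A real system would weight attributes differently (a shared postcode is weaker evidence than a shared email address), but the fixed threshold keeps the sketch short.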
Most importantly to regulators, the global programme of international sanctions enforced by the US, EU, UK and almost every other country relies on high quality entity resolution. When GRU officer Yuriy Sergeyevich Andrienko was charged in connection with worldwide crimes in cyberspace, his name was added to many international watchlists. However, this name may be rendered not only in the Latin script above, but also in its native form in the Cyrillic alphabet – Юрий Сергеевич Андриенко. It may be abbreviated, re-ordered, or simply misspelled. To ensure that the sanction is effectively implemented, that records are not missed and that other people with similar names are not inadvertently punished, analysts spend large amounts of time fully assessing poor quality alerts.
Key Challenges for Entity Resolution
Entity resolution can be a powerful enabling technology that can underpin anti-money laundering and counter-terrorism programmes. In its most rudimentary form it has existed for many years with deep limitations. However, new technology such as artificial intelligence means it is an area that is rapidly evolving. We see five key challenges for data scientists to overcome to create more efficient and effective systems for countering money laundering and terrorism:
Joining automatically between structured and unstructured data – The power of entity resolution is limited when it is only able to process data from structured records such as client records, watchlists, spreadsheets and other data formatted for machines. However, perhaps more than 90% of the world’s data is unstructured, meaning vital insights may be missed. When searching for “James Smith”, modern entity resolution technology needs to ensure that data sources such as news articles, websites and other notes are also included, linking names as they appear in adverse news (articles about corruption, bribery, fraud, terrorism or any other predicate offence), for instance, with names as they appear on watchlists. Natural Language Processing (NLP) is a field of computing that allows the automated analysis of large amounts of text content. It increasingly makes use of machine learning to allow computers to understand the intricate patterns and subtle semantics of human language by learning from the seemingly limitless quantities of text found on the internet. NLP can make sense of unstructured data and extract entities across multiple languages and dialects, which is essential in order to identify and link records wherever they may appear.
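To make the linking step concrete, here is a deliberately naive Python sketch that looks for watchlist names in free text and notes any adverse terms in the same sentence. It uses plain regular expressions as a stand-in for a real NLP pipeline, which would use trained entity extraction models; the function name find_mentions, the sample article and the sample watchlist are assumptions made for illustration.

```python
import re

# Adverse terms taken from the examples in the text above.
ADVERSE_TERMS = {"corruption", "bribery", "fraud", "terrorism"}


def find_mentions(text, watchlist):
    """Return one hit per (sentence, watchlist name) pair found in the text."""
    hits = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        words = set(re.findall(r"[a-z]+", lowered))
        for name in watchlist:
            # Allow any run of whitespace between the parts of the name.
            parts = [re.escape(p) for p in name.lower().split()]
            pattern = r"\b" + r"\s+".join(parts) + r"\b"
            if re.search(pattern, lowered):
                hits.append({
                    "name": name,
                    "sentence": sentence,
                    "adverse_terms": sorted(words & ADVERSE_TERMS),
                })
    return hits


article = (
    "James Smith was charged with fraud on Monday. "
    "Separately, Anna Jones opened a bakery."
)
mentions = find_mentions(article, ["James Smith", "Anna Jones"])
for m in mentions:
    print(m["name"], m["adverse_terms"])
```

Exact string lookup like this misses name variants, other scripts and pronoun references, which is precisely the gap that model-based extraction is meant to close.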
Matching names in new ways – Not only are names not globally unique, there is also no standard way of rendering them. Thus, James Smith can be Jim Smith, J Smith, J M Smith, as well as a huge array of possible typos, transpositions, aliases, or renderings in different dialects, alphabets and scripts. Matching against “exact hit” names works when data quality is very high, but it means there are no alerts at all if names have even the slightest variation, increasing the chances of criminals slipping through the net. Similarly, so-called “fuzzy matching”, which will alert if one or two characters are different, still cannot account for the sheer variety and array of cultural nuances in how names are rendered in different types of data. The solution is to use data to drive a new type of matching logic. Technology such as that developed by Ripjar uses observations from millions of names, deriving matching logic from how the name is used in real-world situations.
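The difference between these matching styles can be sketched in a few lines of Python. This is not Ripjar's method: the NICKNAMES table below is a tiny hand-written stand-in for equivalences that a data-driven system would learn from observed names, and the function names and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical variant table; a data-driven matcher would learn such
# equivalences from real-world name usage instead of hard-coding them.
NICKNAMES = {"jim": "james", "jimmy": "james", "jamie": "james"}


def tokens(name):
    """Lower-case, drop full stops and map known nicknames to a canonical form."""
    cleaned = name.lower().replace(".", " ")
    return [NICKNAMES.get(t, t) for t in cleaned.split()]


def exact_match(a, b):
    return a.strip().lower() == b.strip().lower()


def fuzzy_match(a, b, threshold=0.85):
    """Character-level similarity, tolerant of small typos and transpositions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def variant_match(a, b):
    """Family names must agree; given names may be nicknames or initials."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb or ta[-1] != tb[-1]:
        return False
    given_a, given_b = ta[:-1], tb[:-1]
    if not given_a or not given_b:
        return False
    for x, y in zip(given_a, given_b):
        if x == y:
            continue
        if len(x) == 1 and y.startswith(x):
            continue
        if len(y) == 1 and x.startswith(y):
            continue
        return False
    return True


for candidate in ["Jim Smith", "J. M. Smith", "Jmaes Smith", "John Smith"]:
    print(
        candidate,
        exact_match("James Smith", candidate),
        fuzzy_match("James Smith", candidate),
        variant_match("James Smith", candidate),
    )
```

In this toy setup the fuzzy matcher catches the transposition "Jmaes Smith" but not "Jim Smith", while the variant-aware matcher does the opposite, which illustrates why neither exact nor character-level matching alone covers the ways names are actually written.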
Relationship Linking – No person is an island, and the relationships that an entity has with others give important context to analysts and investigators. Entities may relate to one another in a familial sense (father, brother, mother), in the context of a business (owner, shareholder, person with significant control), or through their location or address. Identifying these relationships vastly increases the likelihood that the person being searched for is correctly selected by the system, but many legacy systems do not extract relationships from the variety of data needed to give a complete and accurate picture, especially unstructured data. Extracting relationships at scale allows vast “Knowledge Graphs” to be built, which can dramatically improve decision making and entity resolution by providing a way of quickly answering many different questions from a single joined-up picture of entities and how they relate to one another.
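A very small Python sketch shows how relationships can add context. The KnowledgeGraph class, the relation labels and every entity name below are hypothetical; a production knowledge graph would sit on a dedicated graph store and carry far richer metadata.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy in-memory graph of (subject, relation, object) facts."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        # Store the edge in both directions so either end can be queried.
        self._edges[subject].add((relation, obj))
        self._edges[obj].add(("inverse:" + relation, subject))

    def related(self, entity):
        return self._edges.get(entity, set())

    def context_overlap(self, a, b):
        """Entities connected to both a and b, regardless of relation type."""
        neighbours_a = {other for _, other in self.related(a)}
        neighbours_b = {other for _, other in self.related(b)}
        return neighbours_a & neighbours_b


graph = KnowledgeGraph()
# Two customer records that share a name (hypothetical data).
graph.add("customer:james-smith-1", "shareholder_of", "company:sigma-example-ltd")
graph.add("customer:james-smith-2", "lives_at", "address:1-example-street")
# A watchlist entry for a James Smith, linked to the same company.
graph.add("watchlist:james-smith", "owner_of", "company:sigma-example-ltd")

for customer in ["customer:james-smith-1", "customer:james-smith-2"]:
    print(customer, graph.context_overlap(customer, "watchlist:james-smith"))
```

Here the shared company link makes the first customer record a much stronger candidate for the watchlist entry than the second, even though the names are identical.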
Security and Privacy – The power of entity resolution means it must be governed appropriately. Processing personal data and connecting records effectively means safeguarding the privacy and security of those customers who place their trust in the institutions that administer financial systems or government agencies. Entity resolution systems therefore must also become tightly integrated with wider audit and data governance strategies – if entity records from two distinct datasets or systems become linked through smart logic, then the resultant resolved entity must inherit the security regime of each dataset that contributed. This means policies at the national or international level can be adhered to at all times, without compromising the effectiveness of the data analytics.
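One possible reading of this inheritance rule, sketched in Python: a resolved entity requires the union of the permissions required by every record that contributed to it. The SourceRecord and ResolvedEntity types, the permission names and the all-or-nothing visibility rule are assumptions; a real deployment might instead show each user only the parts of the entity they are cleared to see.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRecord:
    record_id: str
    dataset: str
    required_permissions: frozenset  # permissions needed to read this record


@dataclass
class ResolvedEntity:
    records: list = field(default_factory=list)

    @property
    def required_permissions(self):
        """Union of the permissions required by every contributing record."""
        perms = frozenset()
        for record in self.records:
            perms |= record.required_permissions
        return perms

    def visible_to(self, user_permissions):
        """The user must hold every permission the merged entity inherited."""
        return self.required_permissions <= frozenset(user_permissions)


crm = SourceRecord("crm-17", "customer_records", frozenset({"kyc_analyst"}))
intel = SourceRecord("wl-204", "restricted_watchlist", frozenset({"sanctions_team"}))
entity = ResolvedEntity([crm, intel])

print(entity.visible_to({"kyc_analyst"}))
print(entity.visible_to({"kyc_analyst", "sanctions_team"}))
```

Because the permissions are computed from the contributing records rather than stored on the merged entity, they stay correct if a record is later added to or removed from the match.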
Evolving Understanding of Identity – Real data is not just messy and incomplete; it also evolves over time, with new facts being added and incorrect facts removed. Sometimes the addition or removal of a strong identifying fact, for example a Social Security Number or a Passport Number, can cause a new match to be made or, indeed, a previous match to be undone. To do this, entity resolution processes must store the history of matches and merges so that they can be undone in the light of new evidence that shows a previous assumption to be incorrect. Reconsidering the best possible match on seeing a new or updated piece of data also allows the system to provide the same results regardless of the order in which data is played into it. It is crucial that an entity resolution system is able to evolve to accommodate a changing landscape and correctly handle the uncertainty in the decisions it makes.
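The Python sketch below illustrates one way to get both properties. Note that it swaps in a simpler technique than reversing individual merges: it keeps an append-only history of changes and re-derives the matches from the current facts each time, so removing a fact undoes a match and the insertion order does not matter. The EntityStore class, its method names and the sample identifiers are hypothetical.

```python
from itertools import combinations


class EntityStore:
    def __init__(self):
        self.records = {}  # record_id -> dict of strong identifiers
        self.history = []  # append-only log of every change, for audit

    def upsert(self, record_id, identifiers):
        self.records[record_id] = dict(identifiers)
        self.history.append(("upsert", record_id, dict(identifiers)))

    def remove_identifier(self, record_id, key):
        self.records[record_id].pop(key, None)
        self.history.append(("remove_identifier", record_id, key))

    def clusters(self):
        """Group records that share any strong identifier (union-find)."""
        parent = {rid: rid for rid in self.records}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x

        for a, b in combinations(self.records, 2):
            ids_a, ids_b = self.records[a], self.records[b]
            if any(k in ids_b and ids_b[k] == v for k, v in ids_a.items()):
                parent[find(a)] = find(b)

        groups = {}
        for rid in self.records:
            groups.setdefault(find(rid), set()).add(rid)
        return {frozenset(g) for g in groups.values()}


store = EntityStore()
store.upsert("r1", {"passport": "X123"})
store.upsert("r2", {"passport": "X123"})
store.upsert("r3", {"ssn": "000-00-0000"})
before = store.clusters()  # r1 and r2 are matched on the passport number

# The passport number on r2 turns out to be a data entry error.
store.remove_identifier("r2", "passport")
after = store.clusters()  # the earlier match is undone
print(before, after, len(store.history))
```

Recomputing every cluster on each change is quadratic in the number of records, so a system at scale would re-evaluate only the entities touched by the new fact; the sketch favours clarity over efficiency.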
Entity Resolution is an essential capability in the fight against financial crime, fraud and terrorism. By improving the quality of the data that is used to make decisions such as enforcing international sanctions or alerting to possible corruption or fraud, it can dramatically improve the effectiveness and efficiency of human analysts and allow small teams to scale investigations to the demands of the modern information environment.
Combining recent work in entity resolution and NLP means that analysts can now see the complete picture across structured and unstructured data. Data-driven approaches to name matching, covering transliterations, scripts and other real-world name variants, can give 90% more accuracy than legacy “fuzzy matching” technology. Robust data privacy controls mean that interconnected graphs of knowledge, resolving entities from all available data sources, can now be built without compromising user privacy or data protection.
If you would like to know more about Ripjar’s approach and how we have helped global institutions roll out breakthrough innovations in entity resolution to support their counter-financial crime programmes, please download the whitepaper or get in touch with the team here.