What are entity extraction, entity disambiguation, and entity linking?
In our last post, we discussed some of the powerful capabilities of entity graphs. Entity graphs model contextually relevant relationships between an organization’s entities from unstructured data sources. Like knowledge graphs, entity graphs are comprehensive, automatically curated, and cross-linked to documents.
But how do you build entity graphs accurately? For example, say a customer opens a support ticket talking about their “iPhone”. Do you know if that’s an “iPhone 15 Pro Max” or an “iPhone 14 Pro Max”?
Building entity graphs accurately requires three processes: entity extraction, entity disambiguation, and entity linking. Let’s look at why each of these is important to building a strong entity graph.
The first step in building an entity graph is discovering the entities in context - i.e., discovering them in unstructured text. This is entity extraction, also called Named Entity Recognition (NER).
An entity can be any noun or noun-phrase that is the name of something. They can be people, places, organizations, products, and more.
Like we said, however, extraction is only the first step. The next step involves associating the extracted entity with authoritative database records of relevant entities. This is entity linking or entity disambiguation.
This linking/disambiguation is the challenging part. NER, by contrast, is a well-studied problem with several capable solutions available today. For example, given a text like “Adam went to Cairo to visit his family”, an extractor can identify that “Adam” is an entity of type Person and “Cairo” is an entity of type Location.
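To make the extraction step concrete, here is a minimal sketch in Python. It is not a real NER model: the `GAZETTEER` lookup table and the `extract_entities` function are hypothetical stand-ins, assumed here only to show the shape of an extractor's output (surface form, entity type, and position in the text).

```python
import re

# Hypothetical gazetteer: a tiny lookup table standing in for a trained NER model.
GAZETTEER = {
    "Adam": "Person",
    "Cairo": "Location",
}

def extract_entities(text):
    """Return (surface form, entity type, start offset) for each known name."""
    entities = []
    for match in re.finditer(r"\w+", text):
        token = match.group()
        if token in GAZETTEER:
            entities.append((token, GAZETTEER[token], match.start()))
    return entities

print(extract_entities("Adam went to Cairo to visit his family"))
# [('Adam', 'Person', 0), ('Cairo', 'Location', 13)]
```

Note that the output says nothing about which Adam or which Cairo was meant. That is the gap that linking and disambiguation have to close.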
The more challenging part of the problem comes in determining who “Adam” is. Which of the hundreds of Adams within the company could this be? Associating this entity with the wrong authoritative record can produce noise that makes it hard, if not impossible, for analysts to use the data to make critical decisions.
Linking entities pulled from unstructured text is hard. It’s difficult to account for the “unknown unknowns” and make an association with a high degree of confidence. This is where entity disambiguation comes in. Relying on advances in Machine Learning (ML) and Artificial Intelligence (AI), disambiguation strategies can infer the identity of an entity in unstructured text automatically, using context clues.
While we humans work far more slowly than an AI engine or other automated process, our years of experience and ability to apply complex reasoning mean we can make logical connections that computers still can’t.
Besides using AI/ML, then, entity disambiguation can also benefit from a “human in the loop”. Humans can be brought in to assist with providing disambiguation and creating connections between entities. An entity disambiguation process can loop in humans by assigning confidence scores to connections and re-routing issues to a human expert when that confidence falls below a specified threshold.
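The routing rule described above can be sketched in a few lines of Python. This is only an illustration: the `CandidateLink` type, the `route` function, and the `REVIEW_THRESHOLD` value of 0.8 are all assumptions made for the example, not part of any particular product.

```python
from dataclasses import dataclass

@dataclass
class CandidateLink:
    mention: str       # text found in the document, e.g. "Adam"
    record_id: str     # hypothetical ID of the authoritative record
    confidence: float  # 0.0 (no confidence) to 1.0 (certain)

# Assumed threshold; in practice this would be tuned per application.
REVIEW_THRESHOLD = 0.8

def route(link, threshold=REVIEW_THRESHOLD):
    """Accept confident links automatically; send the rest to a human expert."""
    return "auto_accept" if link.confidence >= threshold else "human_review"

print(route(CandidateLink("Adam", "emp-001", 0.95)))  # auto_accept
print(route(CandidateLink("Adam", "emp-042", 0.55)))  # human_review
```

Raising the threshold sends more links to reviewers and trades speed for accuracy; lowering it does the opposite.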
Adding a human in the loop helps balance speed with accuracy. It enables us to rely on AI/ML for the tasks at which they excel (churning through and summarizing vast volumes of data) while employing human intelligence to improve overall quality.
Given this, a good entity disambiguation process should have the following attributes.
An entity is the same entity regardless of how it’s labeled. “Japan” may be called 日本 (Nihon) in Japanese - but both labels refer to the same country.
A good entity disambiguation system can pick up on this and assign labels in different languages to the same entity. This requires more than translation, though. It means analyzing source documents in the original language and making that connection in context. This approach avoids any additional errors or ambiguities that might be introduced via the translation process itself.
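As a rough sketch of the data model this implies, the Python below stores labels in several languages against a single entity ID. The `ENTITY_LABELS` table, the `country:japan` ID, and the `resolve` function are hypothetical. A lookup table like this covers only the labeling side; it does not perform the in-context analysis of the source language described above.

```python
import unicodedata

# Hypothetical label table: one entity ID, many labels across languages.
ENTITY_LABELS = {
    "country:japan": ["Japan", "日本", "Nihon"],
}

def normalize(label):
    """Fold Unicode compatibility forms and letter case before lookup."""
    return unicodedata.normalize("NFKC", label).casefold().strip()

# Reverse index from each normalized label to its entity ID.
LABEL_INDEX = {
    normalize(label): entity_id
    for entity_id, labels in ENTITY_LABELS.items()
    for label in labels
}

def resolve(label):
    """Return the entity ID for a label, or None if the label is unknown."""
    return LABEL_INDEX.get(normalize(label))

print(resolve("Japan") == resolve("日本"))  # True
```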
If you’re going to have humans in the loop, you need some way to signal to them that an entity-linking operation should come under review. At Agolo, we accomplish this with an entity linking score that expresses a confidence level in any given link.
Entity linking scores should comprise a number of configurable factors. These factors should also be weighted according to the application’s domain. Different applications will need different weights on each factor based on the risks associated with an inaccurate association.
For example, in government intelligence and fraud detection use cases, you’d put more weight behind names and aliases. In a customer or technical support use case, however, you’d put more emphasis on context and entity coherence.
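One simple way to picture domain-specific weighting is a weighted average over per-factor scores. The sketch below assumes three factors (`name`, `context`, `coherence`), two illustrative weight profiles, and made-up factor values; none of these numbers come from a real system.

```python
# Hypothetical weight profiles; the numbers are illustrative only.
WEIGHTS = {
    "fraud_detection": {"name": 0.6, "context": 0.2, "coherence": 0.2},
    "customer_support": {"name": 0.2, "context": 0.4, "coherence": 0.4},
}

def linking_score(factors, domain):
    """Weighted average of per-factor scores, each between 0.0 and 1.0."""
    weights = WEIGHTS[domain]
    total = sum(weights.values())
    return sum(weights[key] * factors[key] for key in weights) / total

# Made-up factor scores for one candidate link: strong name match, weak context.
factors = {"name": 0.9, "context": 0.4, "coherence": 0.5}

print(round(linking_score(factors, "fraud_detection"), 2))   # 0.72
print(round(linking_score(factors, "customer_support"), 2))  # 0.54
```

The same candidate link scores higher under the name-heavy profile than under the context-heavy one, so the same evidence can clear a review threshold in one application and fall below it in another.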
The entity linking score can form the basis of an end-to-end disambiguation process that combines the best of artificial and human intelligence.
As we noted above, a human can be an invaluable part of the process by providing corrections to low-confidence connections made by AI/ML. We can feed this data itself back into the AI/ML engines, thus improving their analytical accuracy in future runs. Humans can also review concepts that the entity extraction process found and associated with the entity, ensuring the entity’s identified content is accurate and relevant.
Human operators can also perform more advanced corrections. For example, they can identify that two “separate entities” are the same and merge them (e.g., recognizing that a name in Mandarin Chinese refers to an entity already tracked in English). Likewise, a human operator can split two entities that automated processes merged inaccurately.
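Merge and split corrections can be sketched as two small operations on entity records. The `Entity` type, the IDs, and the `merge` and `split` functions below are hypothetical and kept deliberately simple: each entity is just a set of labels plus the set of documents that mention it.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: str
    labels: set = field(default_factory=set)
    documents: set = field(default_factory=set)

def merge(keep, duplicate):
    """Fold a duplicate entity into the one we keep, unioning labels and documents."""
    return Entity(keep.entity_id,
                  keep.labels | duplicate.labels,
                  keep.documents | duplicate.documents)

def split(entity, new_id, labels, documents):
    """Move the given labels and documents out of an entity into a new one."""
    remaining = Entity(entity.entity_id,
                       entity.labels - labels,
                       entity.documents - documents)
    moved = Entity(new_id, entity.labels & labels, entity.documents & documents)
    return remaining, moved

english = Entity("org-1", {"Bank of China"}, {"doc-1"})
mandarin = Entity("org-2", {"中国银行"}, {"doc-2"})

merged = merge(english, mandarin)
print(sorted(merged.documents))  # ['doc-1', 'doc-2']

# Undo an incorrect merge by splitting the Mandarin label back out.
remaining, moved = split(merged, "org-2", {"中国银行"}, {"doc-2"})
print(remaining.labels, moved.labels)
```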
Humans can also establish authoritative entities. Think of these like certified data sets in data management: an entity that’s been manually reviewed, curated, and deemed to represent the most up-to-date knowledge we have at this time.
An entity graph built on high-quality entity disambiguation - cross-lingual, automatically scored, and with humans in the loop - can deliver actionable, up-to-date intelligence that you can’t produce by other means.
That’s why, at Agolo, we’ve built our Entity Intelligence engine to help you sift through mountains of unstructured data. You can use entity graphs generated by Agolo to power everything from GenAI apps to business intelligence dashboards and reports.
Agolo uses proprietary, state-of-the-art technology to improve accuracy in linking. For example, we use an entity linking score computed from a number of configurable, weighted factors. These include the entity’s name, associated semantic context, entity coherence, and lexical similarity (how closely related the terms are in spelling, punctuation, etc.).
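To illustrate what a lexical similarity factor might look like, here is a minimal sketch built on Python's standard `difflib` module. It is not Agolo's implementation: the `lexical_similarity` function and its normalization step (lowercasing and dropping punctuation) are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    stripped = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(stripped.split())

def lexical_similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 over the normalized spellings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(lexical_similarity("I.B.M.", "IBM"))  # 1.0
print(lexical_similarity("Jon Smith", "John Smith") >
      lexical_similarity("Jon Smith", "Jane Doe"))  # True
```

A score like this would be only one input among several; two different people can share an identical spelling, which is why context and coherence factors matter too.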
Entity coherence measures how well a candidate entity fits the rest of the document in which the mention was discovered. For example, if a document is discussing movies, the name “Ben” more likely refers to the actor Ben Affleck than to the musician Ben Harper.
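The “Ben” example can be approximated with a simple word-overlap measure. The sketch below is a toy, not Agolo's coherence measure: the `PROFILES` term sets and the `coherence` function are invented for illustration, and a production system would use far richer context than a handful of keywords.

```python
import re

# Hypothetical context profiles: terms we expect near each candidate entity.
PROFILES = {
    "Ben Affleck": {"actor", "movie", "movies", "film", "director"},
    "Ben Harper": {"musician", "guitar", "album", "song", "tour"},
}

def coherence(document, profile_terms):
    """Fraction of a candidate's profile terms that appear in the document."""
    tokens = set(re.findall(r"[a-z]+", document.lower()))
    return len(tokens & profile_terms) / len(profile_terms)

def most_coherent(document):
    """Pick the candidate whose profile overlaps most with the document."""
    return max(PROFILES, key=lambda name: coherence(document, PROFILES[name]))

doc = "Ben talked about his new movie and the film he plans to direct next."
print(most_coherent(doc))  # Ben Affleck
```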
Agolo is designed for humans in the loop. Using the entity linking score, subject matter experts can refine the automated output of Agolo Entity Intelligence, making the system increasingly accurate the more it’s used.
Want to learn more? Contact us today for a demo.