Coreference resolution is essential to 100% accurate RAG-based responses. But prior state-of-the-art coreference resolution had a 16.4% error rate. This article teaches how BSD Neural Networks can be used to produce 100% reliable coreference resolution.

Prerequisites

This article assumes that you are already familiar with BSD Neural Networks.

If you are not familiar with this topic, kindly follow the link above before continuing with this article.

AI Hallucinations Due to Losing Track of Context

One large problem with AI language models is context. While the models are trying to predict the next appropriate word, it’s easy for them to lose track of the current context. And once the model loses track of context, it starts to spit out what appears to be “hallucinations” — utterly false statements written as if they are true.

One major reason for the model losing context is the nature of language itself. Consider pronouns as one example. A text may mention a person such as George Washington. Afterwards, the text can continue referring to Mr. Washington as ‘he’ or ‘him.’ If these pronouns are used in a very lengthy text, Large Language Models (LLMs) such as ChatGPT may eventually lose track of who ‘he’ and ‘him’ are referring to. When this happens, the model will invent a referent for ‘he’ or ‘him’ in order to generate intelligent-sounding output, but it won’t be the correct ‘he’ or ‘him.’

Pronouns aren’t the only challenge. Consider another text regarding the movie Jaws. This text may repeatedly refer to Jaws via the phrase ‘the movie.’ Once again, in a lengthy text, the model may lose track of what ‘the movie’ is referring to.

Synonyms present yet another complication. Consider a text that begins: “Yesterday, I saw the movie Jaws.” However, the text might later refer to Jaws as a ‘flick,’ a synonym for movie. When the text uses the term ‘the flick’ later in the document, the model may lose track of the fact that this term refers to ‘the movie Jaws.’

Now for the final issue that’s paramount for achieving AI results beyond human capability. LLMs can only process a certain number of tokens per exchange. In other words, the combined question and answer cannot exceed a given length. To accommodate this limitation, large texts are often split up into chunks. LLMs therefore often have access to only part of the document when trying to create an accurate answer. (That is, they often have only some of the chunks of the document, not the entire text, when generating their response.)
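To make the token limitation concrete, below is a minimal sketch of the kind of chunking step a RAG pipeline typically applies. The word-count splitting rule and chunk size here are illustrative assumptions; real pipelines usually split on token counts and sentence boundaries.

    # Illustrative only: split a document into fixed-size chunks by word count.
    def chunk_document(text: str, max_words: int = 200) -> list[str]:
        words = text.split()
        return [
            " ".join(words[i : i + max_words])
            for i in range(0, len(words), max_words)
        ]

    document = "Yesterday I saw the movie Jaws. " * 100  # stand-in for a long review
    chunks = chunk_document(document)
    # Only some chunks are retrieved per question, so any chunk that happens
    # to lack the phrase "the movie Jaws" loses that context entirely.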

Now consider the following all-too-common scenario:

  • A large document that must be split into five chunks: Chunk A, Chunk B, Chunk C, Chunk D, and Chunk E.
  • The title of the movie Jaws is only contained in Chunk A.
  • Chunks B through E reference Jaws as ‘the movie,’ ‘the flick,’ and so on.
  • Chunk E discusses Roger Ebert’s review of ‘the flick.’
  • A chatbot must answer: “What did Roger Ebert think of the movie Jaws?”

In such a case, Chunk E might not even be retrieved as relevant, since the word ‘Jaws’ does not appear in it. That’s one problem. But even if Chunk E is sent to the LLM because of the phrase ‘Roger Ebert thinks,’ the word ‘Jaws’ is still missing from Chunk E. Therefore, the LLM might say something like “the provided context does not say what Roger Ebert thinks of Jaws.”
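A toy retriever makes this failure concrete. The chunks below are invented for illustration; the point is simply that the token “Jaws” never appears in Chunk E:

    import re

    def tokens(text: str) -> set[str]:
        # Lowercase word tokens, ignoring punctuation.
        return set(re.findall(r"[a-z]+", text.lower()))

    chunks = {
        "A": "Yesterday I saw the movie Jaws at the theater.",
        "B": "The flick kept the audience on edge for two hours.",
        "E": "Roger Ebert thinks the flick is a masterpiece of suspense.",
    }

    for name, text in chunks.items():
        print(name, "mentions Jaws:", "jaws" in tokens(text))
    # Only Chunk A mentions "Jaws". A retriever keying on the title will
    # never surface Chunk E, even though it holds the answer.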

Even worse, consider the existence of other chunked documents that contain ambiguous references to other movies. Some of these chunks also contain the phrase ‘Roger Ebert thinks,’ but they refer to different movies altogether. For example, maybe Roger Ebert loved the movie Jaws but abhorred the movie We Bought a Zoo (sorry, Matt Damon). If the ambiguous chunk regarding We Bought a Zoo is sent to the LLM, then the LLM might infer from the question that the provided context is regarding Jaws. Therefore, the LLM will incorrectly write that Roger Ebert detested Jaws. Moreover, it could write that Ebert detested the scene in Jaws where the family tried to renovate a countryside zoo.

Of course that’s not the plot of Jaws. But I’ve personally seen ChatGPT make this exact type of error. ChatGPT makes assumptions regarding the topic based on the words in the question itself. Any ambiguity in the provided context can wrongly be conflated with the topic of the prompt itself.

One very fast way to reduce the occurrence of this error in your AI projects is to use a Natural Language Processing (NLP) technique called coreference resolution.

What is Coreference Resolution?

Coreference resolution is the task of finding all linguistic expressions (called mentions) in a given text that refer to the same entity. In practical terms, it refers to replacing the ambiguous references with the identity of the entity itself. For example:

  • Before: Review by Michael Wood. Yesterday I saw the movie Jaws. It was incredible. The movie left a lasting impression on me.
  • After: Review by Michael Wood. Yesterday Michael Wood saw the movie Jaws. The movie Jaws was incredible. The movie Jaws left a lasting impression on Michael Wood.

The substituted phrases (‘Michael Wood’ and ‘the movie Jaws’) were added by the coreference resolution process. Notice how some ambiguous sentences are even converted into self-standing facts. For example:

  • Ambiguous: The movie left a lasting impression on me.
  • Self-standing fact: The movie Jaws left a lasting impression on Michael Wood.

Coreference resolution is one technique for carrying content forward throughout the document. This can even result in carrying content into the chunks that are created when the document is split apart due to token limitations. Hence, coreference resolution is essential in RAG-based implementations.
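As a minimal illustration of the substitution step itself, the sketch below rewrites each mention as its canonical entity. The coreference clusters are hand-specified here; in a real system, a model produces them:

    # Given coreference clusters as (start, end) character spans, rewrite
    # each mention as the canonical name of its cluster.
    def resolve(text: str, clusters: dict[str, list[tuple[int, int]]]) -> str:
        spans = [(s, e, name) for name, sps in clusters.items() for s, e in sps]
        # Apply replacements right-to-left so earlier offsets stay valid.
        for start, end, name in sorted(spans, reverse=True):
            text = text[:start] + name + text[end:]
        return text

    text = "Review by Michael Wood. Yesterday I saw the movie Jaws. It was incredible."
    clusters = {
        "Michael Wood": [(34, 35)],    # the mention "I"
        "the movie Jaws": [(56, 58)],  # the mention "It"
    }
    print(resolve(text, clusters))
    # Review by Michael Wood. Yesterday Michael Wood saw the movie Jaws.
    # the movie Jaws was incredible.  (Capitalization handling is omitted.)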

SOTA Coreference Resolution Does Not Fulfill BSD Criteria

On the surface, neural networks trained to perform coreference resolution may appear to be doing so in a deterministic manner. Yet, the current state-of-the-art (SOTA) coreference resolution only has an accuracy of 83.6% (i.e., the Maverick_mes coreference model).

While SOTA coreference models may appear to have been trained in accordance with the BSD criteria, the reality is that they are neither deterministic (as defined in the previous article) nor bounded in scope (as defined in the previous article). In other words, they do not meet either criterion, let alone both.

For example, Maverick_mes and other SOTA models (such as lingmess) were trained on a collection of documents known as the OntoNotes corpus. That was largely because this collection contains human annotations for coreference resolution, providing the model known endpoints on which to train. Rarely discussed, however, is the fact that the human annotators themselves disagreed with each other.

The OntoNotes corpus was introduced in a paper entitled “OntoNotes: A Large Training Corpus for Enhanced Processing.” Page 5 of that paper states: “All of the coreference annotation is being doubly annotated and adjudicated. Over the first two years, the overall average agreement between individual annotators and the adjudicated result for non-appositive coreference using the MUC coreference scorer was 86%.”

Researchers only agreed with the selected annotation 86% of the time with regard to standard coreferences. “Non-appositive coreference” simply refers to typical, everyday coreferences. An example of an atypical type (an appositive) is: “My teacher Mrs. Green is a tough grader.” Here, “Mrs. Green” is an appositive coreference to “my teacher.” The researchers treat such appositives as a special case of coreference resolution. Hence, with regard to typical, everyday coreferences, the researchers disagreed with the chosen annotation 14% of the time. Given that humans only agreed 86% of the time, the dataset most certainly contains a large number of subjective (i.e., non-deterministic) labels.

The rest of the dataset also reflects subjectivity. For example, annotators were told to annotate nouns and verbs 50 sentences at a time. As long as there was 90%+ agreement among annotators, the annotations remained as is, without revision or clarification:

“A 50-sentence sample of instances is annotated and immediately checked for inter-annotator agreement for all verbs and any noun with frequency over 100. ITA scores below 90% lead to a revision and clarification of the groupings by the linguist.”

(Source: https://www.cs.cmu.edu/~hovy/papers/09OntoNotes-GALEbook.pdf)

The fact that scores can differ at all means that a deterministic process was not being applied (at least in terms of the way “deterministic” is used herein). The fact that up to 10% disagreement remains unrevised further documents the subjective nature of the process (despite the researchers referring to the allowed 10% discrepancy as an “empirical process”). Thus, OntoNotes does not meet the determinism requirement of BSD.

Nor does it meet the bounded-scope requirement. The disagreements stem from the nature of some of the documents. OntoNotes not only contains well-written documents such as news articles, but also includes broadcasts, “typically recordings of entire shows covering various topics.”

Naturally, people do not always speak in perfectly grammatical sentences, which creates occasional confusion as to what they actually mean. (This can occur even in well-thought-out writing.)

Thus, the corpus includes a wide range of texts, including those with grammatical errors and incomplete thoughts, thereby violating the bounded-scope requirement of BSD.

Grammatically correct text can be considered bounded in terms of Sentence Simplification, but it is unbounded in terms of Coreference Resolution.

Even the most complicated sentences must be structured around known grammatical rules. Thus, so long as sentences are split along clauses and prepositions, and provided each sentence is grammatically correct, it can reliably be simplified.

However, coreference resolution is much more complex. Consider an article where “John Smith” is mentioned in the second sentence of paragraph one. The word ‘he’ is used to refer back to John Smith three paragraphs later. There are a large number of complex sentences that can exist between the reference to “John Smith” and the reference to “he.” Moreover, the sentences containing the references may themselves be complex.

So even input/output pairs that meet the deterministic requirement will likely not meet the bounded requirement.

100% Accurate BSD Coreference Resolution

One way to reliably bound the problem is by applying BSD Sentence Splitting to the text (producing SS, or “Simplified Sentences”). The SS is then sent to a BSD Coreference Resolution process — a neural network that has been trained to perform coreference resolution on SS_Input / BSD_Target_Output pairs.

By using only BSD Simplified Sentences in training, the complexity is profoundly reduced, thereby bounding the size of the problem such that a relatively small neural network can drive the loss function to zero during training.
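A sketch of the two-stage pipeline appears below. The function names are hypothetical stand-ins rather than a published API, and both model calls are stubbed with trivial placeholders:

    # Hypothetical two-stage BSD pipeline; both stages are stubbed.
    def bsd_sentence_split(text: str) -> list[str]:
        # Stand-in: naive split on periods. A trained BSD splitter would
        # emit grammatically Simplified Sentences (SS) instead.
        return [s.strip() + "." for s in text.split(".") if s.strip()]

    def bsd_coref_resolve(simplified: list[str]) -> list[str]:
        # Stand-in: identity. A BSD coreference model trained on
        # SS_Input / BSD_Target_Output pairs would rewrite references here.
        return simplified

    def resolve_document(text: str) -> list[str]:
        # Bounding the input to SS is what keeps the coreference problem
        # small enough to train toward zero loss.
        return bsd_coref_resolve(bsd_sentence_split(text))

    print(resolve_document("Yesterday I saw Jaws. It was incredible."))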

Some implementations may bound the problem size even further by leaving all references beyond a certain distance unresolved. For example, if the selected maximum distance is five SS sentences, pronouns and other types of coreferences would only be resolved in the target output if the prior reference exists within the previous five SS sentences. Training could then supply, say, five paragraphs of SS in each input of the training set. Since this is an objective transformation, the neural network can (and will) learn to do the same.
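As a sketch, the distance rule could be applied as follows when constructing each target output. The window size and the sentence-index representation are assumptions for illustration:

    # Illustrative distance rule: a mention is resolved in the target output
    # only if its antecedent appeared within the last MAX_DISTANCE
    # Simplified Sentences; otherwise it is left as-is.
    MAX_DISTANCE = 5  # maximum lookback, measured in SS sentences

    def should_resolve(mention_idx: int, antecedent_idx: int) -> bool:
        return 0 <= mention_idx - antecedent_idx <= MAX_DISTANCE

    # A pronoun in SS #8 whose antecedent appeared in SS #3: resolve it.
    print(should_resolve(8, 3))   # True
    # A pronoun in SS #12 with the same antecedent: leave it unresolved.
    print(should_resolve(12, 3))  # False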

Other implementations may choose to make the target output identical to the training input for all instances of ambiguous coreference, leaving such references unresolved.

Moreover, BSD implementations must choose deterministic rules for all nouns and named entities. For example, the implementation must choose whether the resolution carries forward noun phrases, compound noun phrases, or nested noun phrases. The selected choice must be applied throughout the training dataset. The same goes for the names of people, companies, and even countries (e.g., full country names versus abbreviations).
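One way to make those choices explicit is a single rules table that is applied uniformly when generating every target output. The particular choices below are arbitrary examples, not rules prescribed by BSD:

    # Pinning down deterministic resolution rules. The specific choices are
    # arbitrary; what matters is that one choice is made and then applied to
    # every example in the training dataset.
    RESOLUTION_RULES = {
        "noun_phrase":  "full",        # carry forward the full noun phrase
        "person_name":  "first_last",  # "Michael Wood", never just "Wood"
        "company_name": "registered",  # "Apple Inc.", not "Apple"
        "country_name": "full",        # "United States", not "U.S."
    }

    def canonical_person(first: str, last: str) -> str:
        # Under the "first_last" rule, every resolved mention of a person
        # uses the same surface form.
        assert RESOLUTION_RULES["person_name"] == "first_last"
        return f"{first} {last}"

    print(canonical_person("Michael", "Wood"))  # Michael Wood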

So long as the training input is bounded (which is accomplished by using SS), and provided that the target outputs are deterministically derived from each SS, 100% accurate coreference resolution will be achieved.

Here, 100% accuracy means that any linguistic element that is rewritten will be rewritten correctly. It does not mean that every potential linguistic reference will be replaced (for the reasons stated above).

In other words, whenever a reference is replaced, you can fully rely on it being replaced with the correct reference. This is a game-changer.

BSD Is All You Need

BSD makes it simple to achieve 100% reliable coreference resolution. The search for 100% accurate natural language processing and AI is finally over. BSD is all you need.