In this article, you will learn the actual cause of chatbot hallucinations. More importantly, this article discloses the three ways to eliminate hallucinations once and for all.
Hallucinations
A hallucination in the context of LLMs typically refers to instances where the model generates information that contradicts either its training data or reality — essentially, making things up. In this article series, an LLM dutifully reproducing learned narratives is not categorized as a hallucination. Moreover, when the provided narrative conflicts with reality, it is the LLM’s job to present the narrative.
For example, a properly functioning LLM based on the Flat Earth Society’s website will state that the earth is flat. From the perspective of the information provided, that is the 100% accurate response.
Hence, for this article series, a hallucination is either something that contradicts the knowledge source or is unsupported by the knowledge source.
In other words, “hallucination rate” as used in this article refers to faithfulness — the degree to which the response remains faithful to the provided information (whether provided during training and/or at the time of query).
With this in mind, you are now ready to solve perhaps the biggest issue in AI — fully eliminating hallucinations.
Root Causes of Hallucinations
There are three reasons why LLMs deviate from the provided information. The three causes of hallucinations are: 1) malformed queries; 2) incomplete information; and 3) Noun-Phrase Collisions.
Malformed Queries
First, if a query is malformed then there is no possibility of an accurate response. Malformed queries either need to be corrected or rejected. Malformed queries include:
complex queries
misspellings
grammatically incorrect queries
ambiguous queries
An example of an ambiguous query is: “Does my aunt live in a dangerous neighborhood?” Naturally, the LLM needs the aunt’s address, which is missing from the query. A future article in this series provides methods for detecting and interactively correcting the above types of malformed queries.
After the query has been corrected, it can be sent to an LLM fine-tuned to detect whether the target LLM can handle the query. If it cannot, the query is rejected. Thus, the target LLM will only receive properly formed queries, thereby eliminating the first cause of hallucinations.
A future article explains how to implement this correct-or-reject pipeline in detail.
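For readers who want to experiment now, here is a minimal sketch of what such a screening step might look like. It assumes the openai Python client; the model name, prompt, and JSON labels are illustrative placeholders, not the pipeline described in the future article.

```python
# Hypothetical sketch of a query-screening step: correct or reject malformed
# queries before they ever reach the target LLM. The model name, prompt, and
# labels are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

SCREENING_INSTRUCTIONS = (
    "Classify the user query as WELL_FORMED, AMBIGUOUS, or MALFORMED. "
    "If AMBIGUOUS, state what information is missing. If MALFORMED "
    "(misspelled or ungrammatical), return a corrected version. "
    "Respond as JSON with keys: label, detail."
)

def screen_query(query: str) -> dict:
    """Return a label plus either the missing information or a corrected query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; could be a fine-tuned classifier
        messages=[
            {"role": "system", "content": SCREENING_INSTRUCTIONS},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# The "aunt" query above should come back AMBIGUOUS, prompting the
# application to ask the user for the missing address.
print(screen_query("Does my aunt live in a dangerous neighborhood?"))
```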
Incomplete Information
Second, if an LLM does not have the complete information, this too can result in hallucinations. Traditional RAG-based implementations completely ignore this cause of hallucinations. Consider a recipe for a chocolate cake as an example. If the RAG-based implementation only sends back part of the recipe, it would be impossible for the LLM to explain how to bake the cake.
In regard to enterprises, consider a RAG-based chatbot on a banking website and a user asking: “How do I open an account?” The RAG-based implementation needs to send the complete information to the LLM; otherwise a hallucination will ensue.
I will explain how to ensure RAG-based implementations always send complete information in the next article series, “100% Accurate RAG: Step by Step.” That series builds on top of this one; in other words, it presumes that the reader has already read all the articles in this series.
Most Common Cause of Hallucinations—Noun-Phrase Collisions
Third, the remaining cause of hallucinations is also the most common one. The vast majority of hallucinations occur when an LLM provides a wrong answer to a properly written query even though it has access to the full answer. All such hallucinations are caused by Noun-Phrase Collisions.
Key Discovery
LLMs can, and often do, hallucinate even when they have been trained on correct information. I tracked down the root cause of hundreds of such hallucinations. The root cause was always a Noun-Phrase Collision. Every single time.
A Noun-Phrase Collision occurs when two noun phrases refer to different entities yet have high vector similarity scores. For example, the names Alfonso and Afonso refer to two different entities, but they have high vector similarity scores. Thus, they form a Noun-Phrase Collision.
It appears that I was the first to discover that LLMs self-organize around noun phrases during pretraining. This discovery led to the discovery of Noun-Phrase Collisions as the cause of every hallucination where the LLM has been trained on the correct information.
I first discovered Noun-Phrase Collisions when conducting experiments in which the LLM was provided clear and correct information, yet the LLM still hallucinated. Consider the following real-world example from GPT-4:
Prompt:
Based on the following passages, tell me about Afonso?
Passage 1: Afonso’s hair is blonde.
Passage 2: Alfonso is married to Gwen.
Response:
The passages provide information that Afonso has blonde hair
and is married to a woman named Gwen.
This is a hallucination. Alfonso, not Afonso, is married to Gwen. Now consider another real-world example from GPT-4:
Prompt:
Based on the following passages, tell me about Chuck?
Passage 1: Chuck’s hair is blonde.
Passage 2: Bartholomew is married to Gwen.
Response:
The passages only provide information that Chuck has blonde hair.
There is no further information about Chuck.
Notice that this response is 100% accurate, even though it is the exact same query just using two different names. The difference is that Afonso and Alfonso have a high vector similarity score (92.3%), whereas Chuck and Bartholomew have a lower vector similarity score (76.0%).
The examples in this article use ADA-002 to obtain vector embeddings, and they use cosine similarity to compute the vector similarity scores between the obtained vector embeddings.
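The quoted scores can be approximated with a short script. This is a sketch that assumes the openai Python client and the text-embedding-ada-002 model; exact scores may differ slightly from the figures cited above.

```python
# Sketch for reproducing the similarity scores discussed above, assuming the
# openai Python client and the text-embedding-ada-002 model. Exact scores may
# differ slightly from the figures quoted in the article.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embed("Afonso"), embed("Alfonso")))     # high -> collision risk
print(cosine_similarity(embed("Chuck"), embed("Bartholomew")))  # noticeably lower
```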
In short, GPT-4 hallucinated because two different noun phrases (Alfonso and Afonso) refer to two different entities, yet they have high vector similarity scores. In other words, GPT-4 hallucinated because of a Noun-Phrase Collision.
After discovering Noun-Phrase Collisions where external knowledge was provided, I then conducted experiments on LLM parametric knowledge by tracing the hallucinations back to the training corpus, confirming that parametric hallucinations are due to the exact same cause.
For example, when asked about the mother of Alfonso II, GPT-4 gave information about Afonso VII. This is an example of a parametric hallucination as the LLM gave the answer based on its internal knowledge (not based on externally provided content). An analysis of the internet training corpus confirmed this to be due to a Noun-Phrase Collision. (As explained shortly below.)
Consider what I call “The Alfonso Debacle” as a perfect case in point.
Alfonso Debacle
I often discuss The Alfonso Debacle because it demonstrates that hallucinations are not caused by the reasons stated by LLM makers. To recap, a company called Vellum posted a ChatGPT-4 hallucination for the query: “Who was the mother of Afonso II, the third king of Portugal?”
ChatGPT 4 originally gave the wrong answer — Urraca of Castile. (You can verify this using gpt-4-0125-preview.)
OpenAI later fine-tuned ChatGPT 4 to provide the correct answer — Dulce of Aragon. However, OpenAI’s finetuning only “fixes” the original query verbatim. For example, here is a query that I submitted to ChatGPT 4 on September 2, 2024 (after the fine tuning):
Notice GPT 4 hallucinated on multiple levels:
“Afonso,” not “Alfonso,” was the third king of Portugal. There was no king of Portugal named “Alfonso II.”
The mother of “Afonso II” was Dulce of Aragon, not Urraca of Castile.
The hallucination was triggered by changing “Afonso II” in the original query to “Alfonso II.” This demonstrates that OpenAI’s fine tuning did not overcome the original issue. In other words, ChatGPT 4 still treats “Alfonso” and “Afonso” as being the same (except where it is fine tuned to behave otherwise).
Secret Behind the Alfonso Debacle
Please study this section carefully. It will guide you to 100% accurate chatbots once you fully internalize it.
All-important insights come from asking all-important questions. Here’s the all-important question regarding the Alfonso Debacle: Why did ChatGPT 4 consistently choose Alfonso VII’s mother over Afonso II’s mother even though Afonso II was referenced in the query?
Prior to OpenAI fine tuning the answer, ChatGPT 4 would routinely give Alfonso VII’s mother instead of Afonso II’s mother. Why?
Take a moment and think about this. In fact, the Noun-Phrase Dominance Model came from asking this same question on literally hundreds of queries. By searching for the answer on each query, a pattern emerged. In fact, the same pattern emerged every single time.
Let me give you a hint. Consider this webpage statement regarding Alfonso VII: “Alfonso’s Mother was Urraca (1079–1126) called the Reckless was Queen of Castile…”
Now, think about that statement with the original query in mind: “Who was the mother of Afonso II, the third king of Portugal?” Why would the above statement be such an attractive route? Remember, focus on the noun phrases. After all, they determine the route.
Hopefully you took time to study the noun phrases in the query along with the noun phrases in the statement. You will always find your answer here.
Did you notice that “mother” is in both the query and the Alfonso webpage statement? Did you notice that “mother” is right next to “Alfonso” in the statement? From the LLM’s perspective, “Alfonso” is basically the same as “Afonso” and “mother” is a direct match. Therefore, if there is no “mother” close to “Afonso” then the LLM will choose the Alfonso/mother combination. (And that’s exactly what ChatGPT 4 did until it was specifically fine tuned to behave otherwise.)
Take time to compare the location of the word “mother” for Alfonso VII to the location of “mother” for Afonso II. Notice that the word “mother” is extremely disconnected from “Afonso” in the latter source. That’s why the Alfonso route wins over Afonso for this query. It’s also the key to it all.
Every LLM hallucinates due to Noun-Phrase Collisions. For example, Noun-Phrase Collisions cause GPT-3.5 Turbo to wrongly conflate facts about magnesium with facts about calcium. They also cause GPT-3.5 Turbo to wrongly conflate facts about a Roth IRA with facts about a Roth 401k.
I created a video with demonstrations that you can conduct yourself to empirically prove that Noun-Phrase Collisions are the root cause of hallucinations. I strongly recommend you watch the video to understand the actual cause of hallucinations — and to understand how to finally eliminate them.
Noun-Phrase Collisions
I tracked down the origin of literally hundreds of hallucinations. They were all caused by such Noun-Phrase Collisions — they were all traceable to the LLM wrongly treating disparate noun phrases as synonyms due to their high vector similarity scores.
So how do you programmatically identify Noun-Phrase Collisions?
Tokens
It is important to note that LLMs typically convert text into numerical tokens. For example, GPT-4o converts the four names above into the following token IDs:
Chuck: [187874]
Bartholomew: [4622, 134710, 747, 86]
Afonso: [32, 104460]
Alfonso: [2348, 104460]
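You can inspect this tokenization yourself. The following sketch uses tiktoken with the o200k_base encoding (the one used by GPT-4o); the IDs it prints may not match the listing above exactly, depending on the tokenizer version.

```python
# Sketch for inspecting GPT-4o tokenization with tiktoken. GPT-4o uses the
# o200k_base encoding; exact IDs may differ from the article's listing
# depending on the tokenizer version.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for name in ["Chuck", "Bartholomew", "Afonso", "Alfonso"]:
    print(name, enc.encode(name))
```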
For purposes of programmatically identifying Noun-Phrase Collisions, a high vector similarity score can refer to a high similarity score for the entire noun phrase and/or a high degree of overlap between the noun phrases’ numerical tokens. The latter is a straightforward criterion. Notice, for instance, that Afonso and Alfonso share token 104460 in the second position.
This is important because GPT models do not see “1968” or “1969” as whole units; both are outside their vocabulary (i.e., there is no single token dedicated to expressing either of them). With this in mind, consider text that contains one event that occurred in 1968 and another event that occurred in 1969. GPT-4o represents the two years as follows:
1968: [6514, 23]
1969: [6514, 24]
Consider a RAG-based implementation which has both 1968 and 1969 in the provided content. During response generation, when the LLM outputs token 6514, it can wrongly conclude that it has two token paths to choose from (either token 23 or token 24). If it chooses the wrong one, then it produces a hallucination by conflating something that happened in 1968 with something that happened in 1969.
These token-level collisions are responsible for LLMs having extraordinarily high hallucination rates for dates, part numbers, PubMed IDs, and more.
When dealing with language (i.e., narrative text), text similarity scores can be measured based on the vector embeddings for the text itself. Text similarity scores can also be computed by looking for identical tokens (such as the shared token between 1968 and 1969). It’s essential to resolve both types of Noun-Phrase Collisions.
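Here is a sketch of a two-pronged check along those lines. The embedding similarity can be computed as in the earlier cosine-similarity sketch; the 0.85 threshold is an illustrative assumption, not a prescribed value.

```python
# Sketch of a two-pronged check for potential Noun-Phrase Collisions between
# two distinct phrases: (1) whole-phrase embedding similarity above a threshold,
# and (2) shared sub-tokens.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def shared_tokens(a: str, b: str) -> set:
    return set(enc.encode(a)) & set(enc.encode(b))

def possible_collision(a: str, b: str, embedding_similarity: float,
                       threshold: float = 0.85) -> bool:
    """Flag distinct phrases that are highly similar as vectors or that overlap
    at the token level; both cases can send generation down the wrong route."""
    if a == b:
        return False
    return embedding_similarity >= threshold or bool(shared_tokens(a, b))

# Per the article's token listings, 1968/1969 share a token, as do Afonso/Alfonso.
print(shared_tokens("1968", "1969"))
print(shared_tokens("Afonso", "Alfonso"))
```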
Noun-Phrase Collisions Will Always Exist in LLM Parametric Knowledge
Noun-Phrase Collisions exist because LLMs are typically trained for creative language generation. Therefore, they are trained to recognize that words such as “car,” “automobile,” and “vehicle” can often be used interchangeably, e.g.: “My automobile broke down. I don’t like this car. This is the last vehicle I’m going to buy.” In this example, the LLM needs to understand that all three words refer to the same thing, and it needs to do so in a mathematical manner.
Likewise, when generating a response, it needs to know the variety of words that it can choose from to generate a convincing answer.
Noun-Phrase Collisions occur because the LLM typically learns to treat semantically similar words as synonyms during pretraining. During instruction fine tuning, many of the errant associations get overwritten. However, it is impossible to finetune all the errant associations — paving the way for future errors (called “hallucinations”).
While conflating “Alfonso” and “Afonso” may seem somewhat intuitive, consider the fact that GPT-3.5 Turbo routinely conflates “magnesium” and “calcium.” That is because the magnesium/calcium vectors have an 87.2% similarity, and the instruction tuning was not sufficient for the LLM to learn that they refer to different things despite their high similarity score. (See video above for demonstration of the magnesium/calcium collision.)
The fact that fine tuning overcomes these errors is seen in OpenAI’s continual retraining of its models to fix widely publicized hallucinations, something they acknowledge in the GPT-4 system card. “For tackling open-domain hallucinations, we collect real-world ChatGPT data that has been flagged by users as being not factual….”
Noun-Phrase Collisions are born during model pretraining. Noun-Phrase Collisions in the pretrained model shall be referred to as inherent Noun-Phrase Collisions. Subsequent instruction tuning teaches the model to overcome the inherent Noun-Phrase Collisions. For example, GPT-4’s pretraining has a Noun-Phrase Collision for Alfonso and Afonso. The original instruction tuning did not correct for this issue (e.g., gpt-4-0125-preview). After the Afonso/Alfonso collision became public, OpenAI retrained GPT-4 using instruction tuning to correct for some of the inherent collisions (e.g., gpt-4-0613). The change in GPT-4’s behavior regarding Afonso/Alfonso demonstrates that Noun-Phrase Collisions are indeed overcome during instruction tuning.
However, the problem is that it is impossible to train away all inherent Noun-Phrase Collisions. Thus, each LLM remains ready to hallucinate each time that a user’s query evokes an inherent Noun-Phrase Collision that was not corrected during instruction tuning.
For example, GPT-4 once routinely conflated “Alfonso” and “Afonso.” OpenAI used fine tuning to fix some of the errant conflations, allowing later models to make fewer mistakes. However, not all Alfonso/Afonso conflations were fixed. The conflation is only fixed when a user types in a query that is very similar to the one(s) used during fine tuning. For other queries, GPT-4 still wrongly conflates Afonso and Alfonso. (See example above.)
This exemplifies one problem with trying to fix hallucinations through fine tuning. When fine tuning on facts, the LLM does not tend to generalize as well as it does when fine tuning on behavior. You may fix the exact, highly publicized queries, but other queries may still experience the errant conflation.
For example, consider an LLM that can perfectly answer virtually any question regarding cows, but hallucinates on questions about dogs. Rather than going back to pretraining, the LLM maker decides to “fix” the publicized dog hallucinations through fact-based fine tuning instead. Unfortunately, for every one new query about dogs that gets added in, the LLM forgets how to answer two queries about cows. While the hallucination rate for the publicized dog queries gets “fixed,” the overall hallucination rate gets much worse.
Catastrophic Forgetting
This paradox is known as catastrophic forgetting. One can accurately conceptualize fine tuning on facts as creating an idiot savant. The LLM will indeed parrot back the provided facts, but it will also develop a degree of “dementia” or “amnesia” in other areas.
That is because LLMs have a finite number of parameters (the mechanisms they use for storing both facts and learned behaviors). Fine tuning means that some parameters that once handled one or more facts must now be reallocated (i.e., they are overwritten) to account for the new fact. Thus, OpenAI’s stated method of dealing with hallucinations actually increases the overall hallucination rate of their models instead.
Eliminating Hallucinations Once And For All
Understanding the above is helpful in understanding how to finally solve the problem of LLM hallucinations once and for all. First, it is helpful to know that LLMs cannot be finetuned to learn all facts about everything. The paradox of catastrophic forgetting ensures this. Second, it is helpful to know that LLMs often erroneously conflate references to different nouns when those references have a high similarity score.
In other words, there is no way to eliminate Noun-Phrase Collisions from LLM parametric knowledge. Nor is there any way to fully eliminate the Noun-Phrase Collisions within the models’ weights and biases (i.e., parameters) due to catastrophic forgetting.
Thus, eliminating hallucinations requires accepting the existence of such collisions, and then, fully addressing the issue head on.
Why RAG Fails to Eliminate Hallucinations
RAG is often promoted as an answer to hallucinations. However, it is far from a panacea; RAG-based implementations often have double-digit hallucination rates.
First, the retrieval step frequently fails to send the complete, precise information that the query requires (the incomplete-information problem discussed above).
Second, even when the correct information is sent, LLMs can still hallucinate (due to Noun-Phrase Collisions).
Creating 100% Accurate RAG requires two steps:
Solely retrieving the precise facts that are relevant to the query (not hundreds of potentially relevant document excerpts).
Ensuring the chatbot faithfully presents the facts without any deviation whatsoever.
This current article series explains how to achieve 100% accurate faithfulness. The next article series explains how to build an information storage and retrieval system that instantly identifies the precise facts that are relevant to the user’s query. This combination provides 100% accurate RAG.
For now, it’s essential to know how to eliminate hallucinations from both internal parametric knowledge and externally retrieved knowledge. In other words, it’s essential to know how to ensure perfect faithfulness.
The ABCs of Eliminating Hallucinations described immediately below resolve the issue of faithfulness. They provide the three systematic ways to fully address the issue of noun phrase collisions.
ABCs of Eliminating Hallucinations
Given the inherent existence of Noun-Phrase Collisions inside the LLM itself, and given that these collisions result in hallucinations, there are only three ways to eliminate hallucinations:
Avoid noun-phrase collision routes during generative LLM tasks
Bypass generative LLM tasks (thereby bypassing the issue)
Correct for the errors caused by the noun-phrase collisions
I refer to this as the ABCs of Eliminating Hallucinations: (A)void, (B)ypass, and (C)orrect.
Each method is briefly introduced below. The step-by-step instructions for implementing each method are given in separate articles—one article per method.
However, all three methods depend on Formatted Facts (FFs). FFs are the fundamental building block of 100% accurate AI. They are key to completely eliminating hallucinations from chatbot responses.
Formatted Facts and Fully-Formatted Facts
As stated in the first article, Formatted Facts (FFs) are statements that are both simple and self-contained.
The concept of Fully-Formatted Facts (FFFs) goes one step further. FFFs are a collection of FFs that are devoid of Noun-Phrase Collisions.
For example, if a collection of FFs includes statements about magnesium and other statements about calcium, then that group of FFs does not qualify as an FFF.
With FFFs, all semantically similar noun phrases in the collection of FFs must refer to the same entity. For example, it is fine for the FFs to contain semantically similar words (such as car, automobile, and vehicle), provided they all refer to the same entity.
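A rough way to screen a collection of FFs is to flag any pair of distinct noun phrases whose vectors are highly similar, so a human or downstream process can confirm whether they refer to the same entity. This is a sketch assuming spaCy for noun-phrase extraction and a similarity function such as the cosine-similarity helper sketched earlier; the threshold is an assumption.

```python
# Sketch of an FFF screen: flag pairs of distinct noun phrases across a
# collection of Formatted Facts whose vector similarity is high, so they can be
# verified (or disambiguated) before the collection is treated as an FFF.
import itertools
import spacy

nlp = spacy.load("en_core_web_sm")

def flag_candidate_collisions(formatted_facts, similarity, threshold=0.85):
    phrases = {chunk.text for fact in formatted_facts for chunk in nlp(fact).noun_chunks}
    for a, b in itertools.combinations(sorted(phrases), 2):
        if similarity(a, b) >= threshold:
            yield (a, b)  # candidate Noun-Phrase Collision to resolve or remove

# Example: a collection mixing magnesium facts and calcium facts would be flagged.
```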
FFFs are the key to avoiding noun-phrase collision routes during generative LLM tasks.
The Bypassing and Correcting methods solely require FFs (not FFFs).
The fourth row will be filled in in the next article series: “100% Accurate RAG: Step by Step.”
Formatted Facts
Avoiding hallucinations relies on Fully-Formatted Facts (FFFs). Bypassing and Correcting hallucinations rely on Formatted Facts (FFs). However, given that Formatted Facts are a subcomponent of Fully-Formatted Facts, Formatted Facts are the foundation of all three methods of completely eliminating hallucinations from chatbot responses. Formatted Facts are the missing key to 100% accurate AI.
Therefore, the next article in this series (Part Four) details the various pipelines for producing Formatted Facts—from the basic pipeline up through pipelines capable of handling complex text such as scientific, medical, and financial information.
Avoiding Hallucinations
Part Five of this series explains the Avoiding method of hallucination elimination. As stated above, Avoiding hallucinations requires using Fully-Formatted Facts (FFFs). Therefore, Part Five of this series instructs on how to identify and remove noun-phrase collisions to convert a series of FFs into FFFs.
Bypassing Hallucinations
Part Six of this series explains the Bypassing method of hallucination elimination. In short, this article explains how to convert generative LLM tasks into non-generative tasks—thereby bypassing the issue of noun-phrase collisions altogether. Doing so results in 100% accurate responses—every single time.
Correcting Hallucinations
Part Seven of this series explains the Correcting method of hallucination elimination. This method is akin to what is commonly referred to as “grounding” and “reverse RAG.” The key difference is that Formatted Facts (FFs) are used for the reverse RAG process. FFs provide the missing key to 100% accurate grounding.
Roadmap Ahead
As stated above:
Part Four: Formatted Facts Pipelines
Part Five: Avoiding Method of Hallucination Elimination
Part Six: Bypassing Method of Hallucination Elimination
Part Seven: Correcting Method of Hallucination Elimination
After that, various articles discuss additional sentence simplification and self-containment processes—including a process that can be used to train tiny neural networks on 100% accurate Formatted Facts generation. Thus, you will not only learn how 100% accurate AI is generated, but also how it can be generated on extremely small models for fast, accurate, and cheap responses.
This series then ends with the “Grand Finale.”
Grand Finale
This series culminates in a very special surprise—the unveiling of an industry-disrupting demonstration empirically documenting that 100% accurate AI is finally here. This article reveals the results of a direct head-to-head with OpenAI, Anthropic, and Perplexity.
In other words, you will see empirical proof that these steps do indeed outperform all the major AI players including: OpenAI, Anthropic, Perplexity, and more.
You will see empirical proof that 100% accurate AI is already here.
Coreference resolution is essential to 100% accurate RAG-based responses. But prior state-of-the-art coreference resolution had a 17.4% error rate. This article teaches how BSD Neural Networks can be used to produce 100% reliable coreference resolution.
If you are not familiar with this topic, kindly follow the link above before continuing with this article.
AI Hallucinations Due to Losing Track of Context
One large problem with AI language models is context. While the models are trying to predict the next appropriate word, it’s easy for them to lose track of the current context. And once the model loses track of context, it starts to spit out what appears to be “hallucinations” — utterly false statements written as if they are true.
One major reason for the model losing context is due to the nature of language itself. Consider pronouns as one example. A text may mention a person such as George Washington. Afterwards, the text can continue referring to Mr. Washington as ‘he’ or ‘him.’ If these pronouns are used in a very lengthy text, Large Language Models (LLMs) such as ChatGPT may eventually lose track of who ‘he’ and ‘him’ are referring to. When this happens, the model will create a ‘he’ or ‘him’ to generate intelligent sounding output — but it won’t be the correct ‘he’ or ‘him.’
Pronouns aren’t the only challenge. Consider another text regarding the movie Jaws. This text may repeatedly refer to Jaws via the phrase ‘the movie.’ Once again, in a lengthy text, the model may lose track of what ‘the movie’ is referring to.
Synonyms present yet another complication. Consider a text that begins: “Yesterday, I saw the movie Jaws.” However, the text might later refer to Jaws as a ‘flick’ — a synonym for movie. When the text uses the term ‘the flick’ later in the document, the model may lose track that this term is a reference to ‘the movie Jaws.’
Now for the final issue that’s paramount for achieving AI results beyond human capability. LLMs can only process a certain number of tokens per exchange. In other words, the combined question and answer cannot exceed a given length. To accommodate this limitation, large texts are often split up into chunks. LLMs often only have access to part of the document when trying to create an accurate answer. (I.e. they often only have some of the chunks of the document — not the entire text — when generating their response.)
Now consider the following all-too-common scenario:
A large document that must be split into five chunks: Chunk A, Chunk B, Chunk C, Chunk D, and Chunk E.
The title of the movie Jaws is only contained in Chunk A.
Chunks B through E reference Jaws as ‘the movie,’ ‘the flick,’ and so on.
Chunk E discusses Roger Ebert’s review of ‘the flick.’
A chatbot must answer: “What did Roger Ebert think of the movie Jaws?”
In such a case, Chunk E might not even be discovered as having information since the word ‘Jaws’ is not contained within it. That’s one problem. However, even if Chunk E is chosen to be sent to the LLM because of the phrase ‘Roger Ebert thinks,’ there’s still the problem that the word ‘Jaws’ is not contained in Chunk E. Therefore, the LLM might say something like “the provided context does not say what Roger Ebert thinks of Jaws.”
Even worse, consider the existence of other chunked documents that contain ambiguous references to other movies. Some of these chunks also contain the phrase ‘Roger Ebert thinks.’ But these chunks refer to different movies altogether. For example, maybe Roger Ebert loved the movie Jaws, but abhorred the movie We Bought a Zoo (sorry Matt Damon). If the ambiguous chunk regarding We Bought a Zoo is sent to the LLM, then the LLM might infer from the question that the provided context is regarding Jaws. Therefore, the LLM will incorrectly write that Roger Ebert detested Jaws. Moreover, it could write that Ebert detested the scene in Jaws where the family tried to renovate a countryside zoo.
Of course that’s not the plot of Jaws. But I’ve personally seen ChatGPT make this exact type of error. ChatGPT makes assumptions regarding the topic based on the words in the question itself. Any ambiguity in the provided context can wrongly be conflated with the topic of the prompt itself.
One very fast way to reduce the occurrence of this error in your AI projects is to use a Natural Language Processing (NLP) technique called coreference resolution.
What is Coreference Resolution?
Coreference resolution is the task of finding all linguistic expressions (called mentions) in a given text that refer to the same entity. In practical terms, it refers to replacing the ambiguous references with the identity of the entity itself. For example:
Before: Review by Michael Wood. Yesterday I saw the movie Jaws. It was incredible. The movie left a lasting impression on me.
After: Review by Michael Wood. Yesterday **_Michael Wood_** saw the movie Jaws. **_The movie Jaws_** was incredible. The movie **_Jaws_** left a lasting impression on **_Michael Wood_**.
The text in italic-bold was added by the coreference resolution process. Notice how some ambiguous sentences are even converted into self-standing facts. For example:
Ambiguous: The movie left a lasting impression on me.
Self-standing fact: The movie Jaws left a lasting impression on Michael Wood.
Coreference resolution is one technique for carrying content forward throughout the document. This can even result in carrying content into the chunks that are created when the document is split apart due to token limitations. Hence, coreference resolution is essential in RAG-based implementations.
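To make the transformation concrete, here is a minimal sketch of the substitution step. The mention-to-entity mapping is hardcoded purely for illustration; in practice it would come from a coreference model (or, as discussed below, from a BSD coreference neural network).

```python
# Minimal illustration of the substitution step shown above. The
# mention-to-entity mapping is hardcoded for illustration only.
review = ("Review by Michael Wood. Yesterday I saw the movie Jaws. "
          "It was incredible. The movie left a lasting impression on me.")

mention_map = {
    " I ": " Michael Wood ",
    "It ": "The movie Jaws ",
    "The movie left": "The movie Jaws left",
    " me.": " Michael Wood.",
}

resolved = review
for mention, entity in mention_map.items():
    resolved = resolved.replace(mention, entity)

print(resolved)
# -> Review by Michael Wood. Yesterday Michael Wood saw the movie Jaws.
#    The movie Jaws was incredible. The movie Jaws left a lasting impression on Michael Wood.
```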
SOTA Coreference Resolution Does Not Fulfill BSD Criteria
On the surface, neural networks trained to perform coreference resolution may appear to be doing so in a deterministic manner. Yet, the current state-of-the-art (SOTA) coreference resolution only has an accuracy of 83.6% (i.e., the Maverick_mes coreference model).
While SOTA coreference models may appear to have been trained in accordance with the above, the reality is that they are neither deterministic (as defined in the previous article) nor bounded in scope (as defined in the previous article). In other words, they do not meet either criterion — let alone both.
For example, Maverick_mes and other SOTA models (such as lingmess) were trained on a collection of documents known as the OntoNotes corpus. That was largely due to the fact that this document collection contains human annotations for coreference resolution — providing the model known endpoints on which to train. However, rarely discussed is the fact that the human annotators themselves disagreed with each other.
The OntoNotes corpus was introduced in a paper entitled “OntoNotes: A Large Training Corpus for Enhanced Processing.” Page 5 of that paper states: “All of the coreference annotation is being doubly annotated and adjudicated. Over the first two years, the overall average agreement between individual annotators and the adjudicated result for non-appositive coreference using the MUC coreference scorer was 86%.”
Researchers only agreed with the selected annotation 86% of the time in regards to standard coreferences. The reference to non-appositive coreferences is a reference to typical types of coreferences. An example of an atypical type (an appositive) is: “My teacher Mrs. Green is a tough grader.” Here, “Mrs. Green” is an appositive coreference to “my teacher.” The researchers treat such appositives as a special case of coreference resolution. Hence, in regards to typical, everyday coreferences, the researchers disagreed with the chosen annotation 14% of the time. Given that humans only agreed 86% of the time, then the dataset most certainly contains a large amount of subjective (i.e., non-deterministic) labels.
The rest of the dataset also includes subjectivity. For example, annotators were told to annotate nouns and verbs 50 sentences at a time. As long as there was 90%+ agreement among annotators, the annotations remained as is — without revision and clarification.
A 50-sentence sample of instances is annotated and immediately checked for inter-annotator agreement for all verbs and any noun with frequency over 100. ITA scores below 90% lead to a revision and clarification of the groupings by the linguist.
The fact that scores can differ at all means that a deterministic process was not being applied (at least in terms of the way “deterministic” is used herein). The fact that up to 10% disagreement remains unrevised further documents the subjective nature of the process (despite the researchers referring to the allowed 10% discrepancy as being an “empirical process”). Thus, OntoNotes does not meet the determinism requirement of BSD.
Nor does it meet the bounded-scope requirement. The reason for the disagreements is due to the nature of some of the documents. OntoNotes not only contains well-written documents such as news articles, but it also includes broadcasts, “typically recordings of entire shows covering various topics.”
Naturally, people do not always speak in perfectly grammatical sentences — creating occasional confusion as to what they actually mean. (This can occur even in well-thought-out writing.)
Thus, the corpus includes a wide range of texts, including those with grammatical errors and incomplete thoughts, thereby violating the bounded-scope requirement of BSD.
Grammatically correct text can be considered bounded in terms of Sentence Simplification, but it is unbounded in terms of Coreference Resolution.
Even the most complicated sentences must be structured around known grammatical rules. Thus, when splitting sentences, so long as it is done using clauses and prepositions, and provided the sentence is grammatically correct, the sentence can reliably be simplified.
However, coreference resolution is much more complex. Consider an article where “John Smith” is mentioned in the second sentence of paragraph one. The word ‘he’ is used to refer back to John Smith three paragraphs later. There are a large number of complex sentences that can exist between the reference to “John Smith” and the reference to “he.” Moreover, the sentences containing the references may themselves be complex.
So even input/output pairs that finally meet the deterministic requirement likely will not meet the bounded requirement.
100% Accurate BSD Coreference Resolution
One way to reliably bound the problem is by applying BSD Sentence Splitting to the text (producing SS, or “Simplified Sentences”). The SS is then sent to a BSD Coreference Resolution process — a neural network that has been trained to perform coreference resolution on SS_Input / BSD_Target_Output pairs.
By solely using BSD Simplified Sentences in the training, the complexity is profoundly reduced — thereby bounding the size of the problem, such that a relatively small neural network can achieve zero as the output from the loss function during training.
Some implementations may bound the problem size even further by leaving all references beyond a certain distance unresolved. Training could include supplying five paragraphs of SS in each input of the training set. For example, if the selected maximum distance is five SS sentences, pronouns and other types of coreferences would only be resolved in the target output if the prior reference exists within the prior five SS sentences. Since this is an objective transformation, the neural network can (and will) learn to do the same.
Other implementations may choose for the target output to be the same as the training input for all instances of ambiguous coreference resolution.
Moreover, BSD implementations must choose deterministic rules for all nouns and named entities. For example, the implementation must choose whether the resolution carries forward noun phrases, compound noun phrases, or nested noun phrases. The selected choice must be applied throughout the training dataset. The same goes for the names of people, companies, and even countries (e.g., full country names and/or abbreviation).
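As a concrete illustration, here is a hypothetical sketch of building SS_Input / BSD_Target_Output pairs with a five-sentence maximum distance. The Mention record and the annotation source are assumptions for illustration; the point is that one objective rule is applied uniformly to every training example.

```python
# Hypothetical sketch of constructing SS_Input / BSD_Target_Output pairs with a
# maximum distance of five simplified sentences.
from dataclasses import dataclass

MAX_DISTANCE = 5  # resolve only if the antecedent is within the prior 5 SS sentences

@dataclass
class Mention:
    sentence_index: int    # which simplified sentence contains the mention
    surface: str           # e.g., "he"
    antecedent: str        # e.g., "John Smith"
    antecedent_index: int  # simplified sentence that introduced the antecedent

def build_target_output(simplified_sentences, mentions):
    target = list(simplified_sentences)
    for m in mentions:
        distance = m.sentence_index - m.antecedent_index
        if 0 < distance <= MAX_DISTANCE:
            # One deterministic choice: replace the mention with the full antecedent.
            target[m.sentence_index] = target[m.sentence_index].replace(
                m.surface, m.antecedent, 1)
        # Otherwise the sentence is left exactly as it appears in the input.
    return target
```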
So long as the training input is bounded (which is accomplished by using SS), and provided that the target outputs are deterministically derived from each SS, 100% accurate coreference resolution will be achieved.
Here, the metric of 100% accurate means that any linguistic elements that are rewritten will be done correctly. It does not mean that every potential linguistic reference will be replaced (for reasons stated above).
In other words, whenever a reference is replaced, you can fully rely on it being replaced with the correct reference. This is a gamechanger.
BSD Is All You Need
BSD makes it simple to achieve 100% reliable coreference resolution. The search for 100% accurate natural language process and AI is finally over. BSD is all you need.
Do you want to train neural networks to achieve 100% accuracy on virtually every language task — including 100% accurate chatbot responses? Bounded-Scope Deterministic (BSD) Neural Networks are what you’ve been looking for. BSD is all you need.
With the BSD training method, neural networks achieve the same precision with language that deterministic programming does with numbers, thereby solving virtually every natural language processing (NLP) task all at once. This includes:
100% accurate low-level NLP tasks, such as Sentence Splitting and Named Entity Recognition.
100% accurate high-level NLP tasks, such as Summarization and Coreference Resolution.
100% accurate LLM tasks, such as 100% hallucination-free Question/Answering and even lengthy exposition.
This article teaches the broader skill of BSD neural network training by showing how to use it for the NLP task of Sentence Splitting. Future articles will teach how to use BSD for 100% accurate coreference resolution, 100% accurate summarization, and even 100% accurate chatbot responses.
BSD neural network training is the foundation for it all.
Series Contents
This series consists of one dozen articles. Everything is laid bare.
This pipeline transforms text into independent, self-contained statements (which my company calls “Formatted Facts”).
But this simple pipeline had a very big problem. There was no reliable method for splitting sentences; nor was there any reliable method for performing coreference resolution.
I currently work at Acurai Inc. At Acurai we have a saying, “Less Broken is Still Broken.” In other words, a chatbot that answers questions correctly 80% of the time is still “broken.”
To date, researchers have been pursuing less broken, but still broken, methods for natural language processing (NLP) tasks such as sentence splitting and coreference resolution. For example, SOTA sentence splitting has an error rate of approximately 18.4%, and SOTA coreference resolution has an error rate of approximately 17.4%.
This is problematic on two fronts. First, as data passes through a pipeline, the errors of each component compound. Thus, even this simple two-step pipeline has roughly a 33% error rate (0.816 × 0.826 ≈ 0.67 accuracy) caused by the compounding of errors.
Second, State-of-the-Art (SOTA) Sentence Splitting was achieved by fine tuning LLMs. In other words, even fine-tuned LLMs have an approximately 20% error rate for Sentence Splitting. (See link above.) Thus, LLMs are not powerful enough in and of themselves.
BSD—Answer To OpenAI’s Hallucination Paradox
OpenAI’s models have been hallucinating more than their predecessors, and “OpenAI has no idea why.” GPT-4o hallucinates more than GPT-4. o1 hallucinates more than GPT-4o. o3 hallucinates more than o1.
As explained in detail below, BSD is the only way to achieve 100% accurate neural networks. More importantly, the rate of hallucinations is proportional to the degree to which neural network training deviates from BSD. OpenAI’s training has been increasingly moving away from BSD, thereby resulting in increasingly more hallucinations.
BSD is the answer to the hallucination paradox. BSD explains everything.
BSD — Fundamental Building Block of AGI
I find it odd that chatbot makers keep hyping “AGI” when LLMs cannot even accurately split complex sentences into simpler ones; nor can LLMs even count the number of “R’s” in raspberry.
LLMs were notoriously mocked for not being able to count the number of R’s in strawberry. Therefore, they were eventually fine tuned to be able to do so. But that didn’t teach them how to count letters in other words (such as raspberry). Consider o1:
It may be possible that o1 has been fine tuned at this point for raspberry. But let’s face it, fine tuning models on a per question basis is not any form of “intelligence.” One can simply create a large database to unintelligently retrieve the answers in that case.
Artificial General Intelligence (AGI) is going to need to access information in real time, and to digest the facts contained in that information with 100% accuracy. Thus, solving the 100% accurate Formatted Facts (FFs) Pipeline is an essential step towards AGI.
I invented BSD Neural Networks to solve the 100% accurate Formatted Facts (FFs) Pipeline. However, I quickly realized that BSD Neural Network training was what AI enthusiasts have been looking for all along. In addition to 100% accurate sentence splitting and coreference resolution, BSD Neural networks can be used for 100% accurate named entity recognition, parts-of-speech tagging, document summarization, and even 100% accurate chatbot responses.
Moreover, BSD Neural Networks are the essential building blocks of reasoning and AGI. In fact, future articles will disclose the secrets to reliable AI reasoning (with BSD Neural Networks as the guarantors of that reliability).
Sentence Splitting as Fact Extraction
BSD would be a remarkable discovery even if it only resulted in 100% accurate sentence splitting. After all, data scientists have been pursuing accurate sentence splitting for over 55 years. The quest for accurate Sentence Splitting has gone from rule-based approaches (1960s-1970s), to statistical approaches (1980s-1990s), to machine learning (2000s-2010s), to deep learning and neural networks (2010s-present).
Sentence splitting has long been recognized as a key component of fact extraction. Consider the following two real-world sentences:
The last 4 kilometres (2.5 mi) of the remaining original _Reichsautobahn_, a section of A 11 northeast of Berlin near Gartz built in 1936 — the westernmost remainder of the never-finished Berlinka — was scheduled for replacement around 2015.[_needs update_] Roadway condition is described as “deplorable”; the 25 metres (82 ft)-long concrete slabs, too long for proper expansion, are cracking under the weight of the traffic as well as the weather.
Too many people use run-on sentences—including Wikipedia authors and news journalists (and including this present author as well :-)). Meanwhile, LLMs struggle with complex sentences—both when pre-training parametric knowledge and when such content is sent as input in RAG-based implementations.
Now consider the above sentence after being split by a fine-tuned BSD Neural Network using the same step-by-step method disclosed below:
The last 4 kilometres (2.5 mi) of the remaining original _Reichsautobahn_ was scheduled for replacement around 2015.
[_needs update_].
The last 4 kilometres of the remaining original _Reichsautobahn_ is a section of A 11.
The section of A 11 is northeast of Berlin.
The section of A 11 is near Gartz.
The section of A 11 was built in 1936.
The section of A 11 is the westernmost remainder of the never-finished Berlinka.
Roadway condition is described as “deplorable”.
The 25 metres (82 ft)-long concrete slabs are too long for proper expansion.
The 25 metres (82 ft)-long concrete slabs are cracking under the weight of the traffic.
The 25 metres (82 ft)-long concrete slabs are cracking under the weather.
Now it is trivial for LLMs to accurately answer questions.
AI Accuracy Plateau
This article teaches BSD using Sentence Splitting for a number of reasons. One of which is to demonstrate why NLP tasks have accuracy plateaus and how to permanently fix the plateau issue.
It’s no secret that AI has hit an accuracy plateau in terms of extractive question and answering. Thus, AI powerhouses have moved on to focus on images & video, mathematical & scientific reasoning, and code generation. Consider OpenAI’s very impressive strides in its newest image generator, and its current $3 billion pursuit of acquiring Windsurf.
However, little press is given to the fact that the major AI players have moved past trying to solve the extractive QA accuracy plateau. Nevertheless, TechCrunch recently noted “OpenAI’s new reasoning AI models hallucinate more.” TechCrunch states:
According to OpenAI’s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company’s previous reasoning models — o1, o1-mini, and o3-mini — as well as OpenAI’s traditional, “non-reasoning” models, such as GPT-4o.
Perhaps more concerning, the ChatGPT maker doesn’t really know why it’s happening.
In its technical report for o3 and o4-mini, OpenAI writes that “more research is needed” to understand why hallucinations are getting worse as it scales up reasoning models. O3 and o4-mini perform better in some areas, including tasks related to coding and math. But because they “make more claims overall,” they’re often led to make “more accurate claims as well as more inaccurate/hallucinated claims,” per the report.
For now, OpenAI is not focusing on eliminating extractive QA hallucinations.
BSD — Key to Breaking through the AI Accuracy Plateau
SOTA methods for extracting facts from complex sentences hit an 80% accuracy ceiling — a seemingly insurmountable plateau. As the SOTA researchers stated:
The results are shown in Table 2. In all tables, best results are shown in bold. In both cases, scores increased across metrics as training size increased, although by smaller increments from the 3K mark upwards. Best results are split between the two largest training datasets on both BiSECT and DeSSE, indicative of a potential ceiling in terms of improvements from training data augmentation. It might also be the case that the observed plateauing resulted from a lack of variety in the added training data, but verifying this hypothesis was beyond the scope of this work.
The researchers discovered that accuracy did not improve as more training data was added. They reached a plateau.
Importantly, the researchers hypothesized that the plateau could be solved by adding variety to the dataset. This hypothesis is not only wrong, it is actually backwards. As you will see below, the problem is due to variety. Thus, variety isn’t the solution, it’s the problem.
Fortunately, this bold statement can be empirically demonstrated—showing that the industry-wide assumption on neural network training is literally backwards.
5-Entry BSD Dataset Outperforms 1-Million-Entry SOTA Datasets
To empirically demonstrate that BSD is the missing key, a 5-entry BSD dataset was tested. It produced a result unheard of in the field—the 5-entry dataset literally produced zero errors.
As data scientists often say: “The proof is in the pudding.” I couldn’t agree more. Now, here’s the pudding.
The SOTA datasets used to train neural networks for Sentence Splitting and Rephrasing are:
DeSSE
BiSECT
WikiSplit
WebSplit.
BiSECT has 928,440 entries. WebSplit has 1,331,515 entries.
Despite training datasets containing as many as a million examples, SOTA Sentence Splitting hit an accuracy ceiling of approximately 80%. Meanwhile, a 5-entry BSD set achieved 100% accuracy on a much more stringent test than was used to assess the SOTA method.
You read that correctly. A 5-entry BSD dataset outperforms a 1-million-entry dataset. Moreover, the BSD dataset demonstrated 100% accuracy as well. BSD truly is the revolution that the AI industry has been searching for.
Method: A 5-Sample BSD Dataset was used in few shot prompting to split sentences from 500 BBC news articles.
Result: The BSD method split the sentences with 100% accuracy.
Meanwhile, the SOTA method first fine tunes an LLM on up to 1 million examples, and then prompts the LLM to split sentences — resulting in an 18.4% error rate on narrative text (the same type of text used in BBC news articles).
The 5-sample few shot prompt demonstrates the remarkable breakthrough of BSD. Naturally, fine tuning models on a larger BSD dataset is recommended to ensure 100% accuracy on an ongoing basis. But the fact that a 5-sample few shot prompt results in 100% accuracy on a stringent test shows BSD to be the correct way to produce perfectly reliable results.
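For illustration, here is a sketch of that few-shot setup, assuming the openai Python client. The example pairs shown are hypothetical stand-ins in the spirit of the deterministic transformations described below, not Acurai’s actual 5-entry dataset.

```python
# Sketch of the few-shot setup described above. The example pairs are
# hypothetical stand-ins, not Acurai's actual 5-entry dataset.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = [
    ("The slabs are too long and are cracking under the traffic.",
     "The slabs are too long. The slabs are cracking under the traffic."),
    # ...four more deterministic input => output pairs would follow...
]

def split_sentence(sentence: str) -> str:
    messages = [{"role": "system",
                 "content": "Split the sentence exactly as shown in the examples."}]
    for source, target in FEW_SHOT:
        messages.append({"role": "user", "content": source})
        messages.append({"role": "assistant", "content": target})
    messages.append({"role": "user", "content": sentence})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```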
Current Method of Training Language Models is Literally Backwards
5 BSD entries significantly outperformed neural networks trained on over one million other types of entries. So what is the “secret”? The secret is that variety is the source of hallucinations—not the answer to them.
Each BSD dataset entry is structured in a very specific manner that communicates to the neural network precisely what it needs to learn to do. This has been the missing key to 100% accurate NLP neural networks, at least in terms of using supervised training to teach neural networks to perform natural language processing (NLP) tasks.
Supervised neural network training consists of providing the model multiple input => output pairs. The dataset tells the model what the output should be for any given input. The goal is for the model to discover the patterns that exist across the dataset so that it can then transform inputs that it has never seen before into the desired outputs.
However, the industry has been training language-based neural networks using stochastic, non-deterministic methods. For example, the WebSplit dataset provides many grammatically correct outputs for each input. Consider the following sentence:
Input: “Auburn is part of Lee County in Alabama which is situated within the state of Alabama in the United States where one of the ethnic groups in the United States are the African Americans.”
The WebSplit dataset contains 64 alternative splits for this sentence alone. In other words, there are 64 entries in the dataset where the input is this same sentence, and each of the 64 outputs provides one grammatically correct alternative for splitting it.
For example, three of the 64 outputs are:
Output: “Auburn is part of Lee County in Alabama . Lee County is situated within the state of Alabama . Alabama is in the United States . One of the ethnic groups in the United States are the African Americans .”
Output: “Auburn is part of Lee County , Alabama in the United States . African Americans are an ethnic group within the United States .”
Output: “Auburn , Alabama is part of Lee County . Lee County is in the state of Alabama . Alabama is in the United States . African Americans are an ethnic group within the United States .”
And so on. That’s 64 entries. Each entry has the same input. Each entry has a different output.
Notice that this is the opposite of determinism. Determinism, by definition, means that any given input will be transformed into only one correct output. Thus, WebSplit is a stochastic, non-deterministic dataset.
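Here is a small sketch of how a dataset can be checked against this definition: every unique input must map to exactly one unique output. The function and data layout are illustrative assumptions.

```python
# Sketch of checking a dataset against this definition of determinism: every
# unique input must map to exactly one unique output. A stochastic dataset such
# as WebSplit (64 outputs for the Auburn sentence alone) fails the check.
from collections import defaultdict

def non_deterministic_inputs(pairs):
    """Return the inputs that map to more than one distinct output."""
    outputs_by_input = defaultdict(set)
    for source, target in pairs:
        outputs_by_input[source].add(target)
    return {src: outs for src, outs in outputs_by_input.items() if len(outs) > 1}

# An empty result is a necessary (though not sufficient) condition for BSD.
```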
On one hand, the industry may seem to be pursuing the correct path. After all, there are many grammatically correct ways to split a larger sentence. Therefore, it can even seem incorrect for a neural network’s training to assign a penalty cost to a grammatically correct split during training.
Yet, as will be made clear shortly, BSD intentionally causes the neural network training to assign a penalty cost to grammatically correct sentence splits. In fact, counterintuitively, BSD often requires the model’s loss function to assign a cost to the vast majority of grammatically correct splits.
BSD requires that there is only one unique output for each unique input. Assuming there are only 64 ways to split the above sentence, this means that 63 out of 64 splits will be deemed an error during training, even though they are grammatically correct. In terms of this sentence, that means 98% of the grammatically correct splits are counted as being errors.
If there are more than 64 grammatically correct splits, then more than 98% of the grammatically correct splits will be considered errors when training a neural network using BSD.
Thus, BSD Neural Network training is the opposite of the way that language models have been trained. The following section unveils BSD’s revolutionary training method.
BSD Neural Network’s Seven Criteria (Steps)
BSD NLP stands for Bounded-Scope Deterministic NLP. The NLP part of the name signifies that the input text must contain at least one human-language sentence. The BSD part is built on two aspects: bounded in scope, and deterministic. Bounded scope refers to the number of required transformations being small enough to be learned (e.g., small enough to achieve a zero cost value from the loss function during training). As for the determinism aspect of BSD, there are seven criteria:
1) There is only one unique output per unique input.
2) The unique output must be deterministically derived from the input text.
3) The selection of transformations that produce the output must be deterministically derived from the input.
4) The selected transformations must be uniformly applied to all outputs.
5) Where the resulting output has multiple values, such that the order of the values can be changed without information loss, the order of the values must be sorted in a deterministic manner. Preferably, first positional occurrence sorting is used (see the sketch after this list).
6) Where the deterministic selection of transformations can be null, there must be at least one input => output pair in which the inputs and corresponding outputs are identical in every respect. The inclusion of additional such pairs will reduce both the size of the neural network required and the training time and cost.
7) Where selection counter examples exist, they must be provided in the input, and the corresponding outputs must be identical to the input.
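As a minimal illustration of criterion 5, the following sketch sorts a set of extracted values by the position at which each first appears in the input text. The example sentence is hypothetical.

```python
# Minimal illustration of "first positional occurrence sorting": when the order
# of multiple output values is otherwise arbitrary, sort them by where each
# value first appears in the input text.
def first_occurrence_sort(values, input_text):
    return sorted(values, key=lambda value: input_text.find(value))

text = "Lee County is in Alabama, and Auburn is part of Lee County."
print(first_occurrence_sort(["Auburn", "Alabama", "Lee County"], text))
# -> ['Lee County', 'Alabama', 'Auburn']
```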
Contrasting SOTA Sentence Splitting to BSD Neural Network Training
Training neural networks on WebSplit does not involve any of the above steps. Training neural networks on the rest of the SOTA datasets does not involve implementing criteria 2–6. Yet, as is explained below, steps 2, 3, and 4 are core criteria; and steps 5, 6, and 7 are conditional core criteria. Hence, SOTA training lacks all of the core criteria (at least in terms of SOTA sentence splitting).
The following explains how to train a neural network to accurately split larger sentences into smaller ones.
Consider a simple transformation (Transformation X): Remove the word ‘and’; if the next word is a noun, then add the same punctuation used at the end and capitalize the next word; if the next word is a verb, add the same punctuation used at the end, add the noun subject of the prior statement, and capitalize the added noun subject.
On the surface, splitting a sentence on the word ‘and’ appears trivial. However, even Transformation X is insufficient to qualify as being deterministic. What if the noun subject is a nested noun phrase? What gets added to the beginning of the new split: the entire nested noun phrase, the complex noun phrase, the noun phrase, or the root noun phrase? Each implementation must make a deterministic choice, and apply that choice consistently.
An ideal implementation would use the entire noun phrase (including nesting) to ensure the preservation of meaning. Consider the following sentence: “The old man and woman sat on the bench.” Is the woman old too? The sentence can be read in two ways. Preserving the entire noun phrase (e.g. “old man and woman”) in splits ensures preservation of the original language intent—even when the intent is ambiguous.
Most importantly, this deterministic criterion means that there is only one correct choice for what gets added to the beginning of the new split. One correct choice, and only one. Everything else is an error when computing the loss function — regardless of whether it is grammatically correct or not. Adding this step to Transformation X results in Deterministic Transformation X.
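Here is a simplified, illustrative sketch of Deterministic Transformation X using spaCy. It splits only on an ‘and’ that coordinates clauses and deterministically carries the entire subject noun phrase (including coordination) into the new split. It is an approximation for demonstration purposes, not the production implementation, and it assumes the en_core_web_sm model is installed.

```python
# Simplified, illustrative sketch of Deterministic Transformation X using spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")

def transformation_x(sentence: str):
    doc = nlp(sentence)
    # Deterministic choice: the full subtree of the subject token, so nested or
    # coordinated noun phrases (e.g., "The old man and woman") are preserved.
    subject_root = next((t for t in doc if t.dep_ in ("nsubj", "nsubjpass")), None)
    subject = ("".join(t.text_with_ws for t in subject_root.subtree).strip()
               if subject_root is not None else "")
    splits, current = [], []
    for token in doc:
        # Split on an "and" coordinating verbs/clauses; leave an "and" inside a
        # noun phrase (whose head is a noun) untouched.
        if token.lower_ == "and" and token.dep_ == "cc" and token.head.pos_ == "VERB":
            splits.append("".join(current).strip().rstrip(",") + ".")
            current = [subject + " "]
        else:
            current.append(token.text_with_ws)
    splits.append("".join(current).strip())
    return splits

print(transformation_x("The old man and woman sat on the bench and waved at the children."))
# Expected (parse-dependent): ['The old man and woman sat on the bench.',
#                              'The old man and woman waved at the children.']
```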
Even though Deterministic Transformation X is only a very simple example of criteria 2 and 3, notice already that none of the SOTA training methods do either of these. In other words, even before introducing additional transformations, BSD NLP is already different from SOTA sentence splitting.
Consider step #2: deterministically derive the output from the input. WikiSplit annotators had a free hand in choosing where to split. They also freely added words of their own choosing. Thus, step #2 was not performed in the creation of the WikiSplit dataset. The other training datasets also gave the annotators a free hand on where to split, and the annotators also added words of their own choosing. Thus, none of them implemented step #2.
This is literally the opposite of Deterministic Transformation X. Notice how Deterministic Transformation X dictates the precise words that must be added (e.g., the entire subject noun phrase, including nesting). That is the mirror opposite of allowing annotators to choose. In BSD NLP, the D means there are no choices during training. If the deterministic transformation has two or more viable alternatives, then it is not a deterministic transformation in the first place.
Consider step #3: deterministically choose the selected transformation based on the input. Once again, the creation of the SOTA datasets did not include this step. WikiSplit and BiSect always split the input into two sentences. This means that the annotator subjectively chooses whether to split a particular sentence on “and,” or “but,” or “wherein,” etc. There is no deterministic selection of transformation based on the input.
However, Deterministic Transformation X always results in one split for each ‘and’ that serves as a coordinating conjunction. If there is one such ‘and,’ then there is one split. If there are two such ‘ands,’ then there are two splits. And so forth.
The mere fact that WikiSplit and BiSect force the input into two splits further demonstrates that step #3 was not used (in addition to not using step #2). Likewise, the annotators of DeSSE were instructed to pick one to four splits of their own choosing from a list of recommended splits. Hence, DeSSE implemented neither step #2 nor step #3.
Just as step #2 is the mirror opposite of SOTA training, so too is step #3.
Now consider step #4: The selected transformations must be uniformly applied to all outputs. As stated above, in regards to Deterministic Transformation X, the transformation must be applied every time the word ‘and’ serves as a coordinating conjunction. Also as stated above, none of the SOTA training sets uniformly applied even one transformation across the entire training set, thereby not implementing this step as well.
Backwards Premise of SOTA NLP Training
SOTA NLP training is based on the premise that neural networks learn intelligence: the idea is that if the neural network is given a variety of correct ways to split a sentence, then it can learn to choose the best way for any given new sentence.
BSD NLP is based on the exact opposite premise, which is why the steps are literally the mirror opposite of SOTA training methods. BSD NLP is based on the premise that every choice introduced in the outputs adds a degree of error — not a degree of intelligence. The fundamental training premises could not be more different. Thus, it deserves to be repeated:
Every choice introduced in the outputs adds a degree of error—not a degree of intelligence. (per internal testing at Acurai)
If you take away nothing else from this article, you will be well served in confirming the above truth for yourself. After all, this is the missing key to training neural networks to achieve 100% accuracy on natural language tasks.
The Need for Matching Input => Output Pairs
Now consider step #6: Where the deterministic selection of transformations can be null, there must be input => output pairs in which the inputs and corresponding outputs are identical in every respect.
Not all sentences need to be split. For example, where splitting is solely based on Deterministic Transformation X, then sentences that do not have the word ‘and’ should not be split. Therefore, the training data needs to contain examples of when not to split. That is the meaning of step #6 as it relates to sentence splitting.
Yet, notice that none of the SOTA training sets contain even one instance where the output is identical to the input. Unlike SOTA, BSD NLP says that neural networks do not learn intelligence; rather, they learn the path of least resistance. Thus, the neural network needs to be told when to do nothing so that doing nothing is included in its learned path of least resistance.
Notice that Deterministic Transformation X makes an evaluation on the word ‘and.’ It evaluates whether the word is serving as a coordinating conjunction.
Consider the following sentence: “Tom and Mary walked into the house and sat down.” Only the second ‘and’ serves as a coordinating conjunction. The first ‘and’ does not.
Step #7 means that there should be counterexample inputs for every evaluation made by the deterministic selectors.
In terms of Transformation X, this simply means there need to be inputs that include the word ‘and’ where ‘and’ is not being used as a coordinating conjunction, and therefore there is no split. Hence, the output equals the input.
Again, since all the datasets solely contain splits, they do not implement step #7 either.
In short, there are two types of non-splits (i.e., two types of output = input): inputs where no transformation is even selected, and inputs where the selected transformation declines to perform the transformation due to one or more deterministic evaluations. The criteria in steps #6 and #7 define the types of inputs to include so that the corresponding output signifies that a transformation did not take place. Alternatively, a predefined value (such as “[BLANK]”) can be returned as the target output, as this equally signifies when a transformation did not take place.
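To make steps #6 and #7 concrete, the following minimal sketch (reusing the hypothetical deterministic_transformation_x from the earlier sketch) assembles training pairs that include both kinds of non-splits, with the predefined “[BLANK]” marker as an optional alternative.

```python
# Sketch of assembling BSD training pairs that include non-split cases
# (steps #6 and #7). "[BLANK]" is one possible no-op marker; using the
# unchanged input as the target output works equally well.
USE_BLANK_MARKER = False

def make_pair(text):
    output = deterministic_transformation_x(text)   # hypothetical sketch from above
    if output == text and USE_BLANK_MARKER:
        output = "[BLANK]"                          # signifies: no transformation took place
    return {"input": text, "target": output}

training_pairs = [
    make_pair("The cat sat on the chair and it was purring."),  # split applied
    make_pair("Tom drove home."),          # step #6: no transformation selected
    make_pair("Tom and Mary drove home."), # step #7: 'and' present, but no split
]
for pair in training_pairs:
    print(pair)
```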
Once the steps are understood, they can easily be applied to training a neural network on virtually any NLP task, including sentence splitting. And because the training is based on the inverse of SOTA methods, it produces profoundly different results. In fact, where all the steps are followed in producing the input / output pairs, the resulting BSD NLP Network can achieve 100% accuracy — a significant leap in performance over prior methods.
Target BSD Output
An ideal BSD NLP implementation will employ all seven criteria/steps. However, steps 2–4 are core BSD NLP criteria. Steps 5–7 are conditional core BSD NLP criteria (i.e., they are core components for NLP tasks that meet the stated condition of the criteria). Consider a training task in which a transformation selection can be null. For such a task, step #6 is a core component because of this condition.
An ideal implementation will include all core criteria, and it will include all conditional core components that match the conditions of the NLP task being trained. Such an implementation produces Perfect BSD Target Outputs from the corresponding training inputs.
While the combination of core criteria ensures 100% accuracy, some NLP tasks may only require implementing some of the core criteria to significantly improve accuracy — even to the point of 100% accuracy. Moreover, BSD criteria are so transformative that even applying them to part of a dataset can significantly improve performance.
BSD Target Output refers to output produced by implementing at least one core criterion for transforming inputs containing human-language sentences into deterministically transformed NLP output. Where all core criteria are applied, as well as all conditional core criteria that are applicable to the conditions of the implementation, the NLP deterministic transformation of such sentence-containing training input is called Perfect BSD Target Output.
BSD NLP First Positional Occurrence Sorting
None of the sentence splitting datasets implement step #5 because it does not apply to splitting a complex sentence into multiple sentences. The task itself results in ordered output — in order to preserve the meaning of pronouns.
However, some NLP tasks can result in output containing multiple values that can be presented in at least one different order while preserving all information. Such NLP tasks meet the condition of step #5, and therefore, ideal implementations would include step #5 to ensure 100% accuracy.
Moreover, ideal implementations will use first positional occurrence sorting. This simply means sorting the order of the values based on the order in which they first appear in the input.
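As a minimal sketch (with a hypothetical helper name), first positional occurrence sorting can be expressed as ordering the extracted values by the position at which each first appears in the input:

```python
# First positional occurrence sorting: order values by where each first
# appears in the input text (a minimal, illustrative sketch).
def first_positional_sort(values, text):
    return sorted(set(values), key=lambda v: text.find(v))

text = "Bob met Alice. Alice thanked Bob."
print(first_positional_sort(["Alice", "Bob"], text))  # ['Bob', 'Alice']
```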
For complex NLP tasks based on multiple steps, a separate first positional occurrence sorting can be applied at each step. This is explained immediately below.
Consider the task of extracting facts about people in a text. Here, the task may involve two levels (i.e., two steps): identify all people, and identify all facts in the input about each person.
When there are multiple levels of an NLP task, ideal BSD implementations use first positional occurrence sorting for each level. Consider a series of self-contained statements. Some statements are about Alice, and others are about Bob. Alice is mentioned first. However, some of the statements about Alice occur after Bob is mentioned.
One deterministic method is to use a one-pass first positional occurrence sorting across the dataset. Thus, the Alice and Bob extractions will occur left to right in a single pass. As a result, some of the Alice statements will indeed appear in the target output after some Bob extracted statements.
However, a multi-level first positional occurrence would allow the target output to be deterministically organized as: {name}:\nFact_1\nFact_2\n… In other words, the facts about each person are grouped together immediately after the person’s name.
Since this is a two-level task, a two-pass first positional occurrence sorting can be used. The sort order of the names is determined by the first pass. The order of the extracted facts is determined by the second pass. In this way, all of the statements regarding Alice and Bob are grouped together under their respective names while still preserving the requirement of deterministic first positional occurrence sorting.
As long as each name is selected in the order in which they appear in the text; and as long as the facts regarding each name are listed in the order they appear in the text; and as long as the extraction of the facts is done in a deterministic manner (e.g., preserving the facts verbatim), the BSD neural network can now extract grouped facts about people with 100% accuracy.
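A two-pass version of the Alice/Bob example can be sketched as follows, reusing the first_positional_sort helper from the earlier sketch. The tagging of each verbatim statement with the person it concerns is assumed to have been produced upstream by a deterministic extraction step.

```python
# Two-pass first positional occurrence sorting (illustrative sketch):
# pass 1 orders the names by first mention; pass 2 keeps each person's
# facts in their order of appearance in the text.
def group_facts(statements, text):
    # statements: list of (person, verbatim_fact) tuples
    names = first_positional_sort([p for p, _ in statements], text)   # pass 1
    grouped = []
    for name in names:
        facts = sorted([f for p, f in statements if p == name],
                       key=lambda f: text.find(f))                    # pass 2
        grouped.append(f"{name}:\n" + "\n".join(facts))
    return "\n\n".join(grouped)

text = ("Alice lives in Dallas. Bob plays chess. "
        "Alice teaches math. Bob owns a dog.")
statements = [("Alice", "Alice lives in Dallas."), ("Bob", "Bob plays chess."),
              ("Alice", "Alice teaches math."), ("Bob", "Bob owns a dog.")]
print(group_facts(statements, text))
# Alice:
# Alice lives in Dallas.
# Alice teaches math.
#
# Bob:
# Bob plays chess.
# Bob owns a dog.
```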
BSD Neural Network Training revolutionizes the use of neural networks within NLP, a core subfield of AI. It consistently results in 100% accuracy, even on complex language tasks.
At first blush, the preference for first positional occurrence sorting may seem insignificant. However, modern language models are built on token-based transformers. These transformers do not have any inherent awareness of the individual characters in the words they are processing. Hence, using alphabetical sorting would require increasing the size of the model by many orders of magnitude (if the limitation can be overcome at all). However, token-based transformers inherently possess positional awareness. By basing the sorting on position, the sorting is built on the inherent capabilities of the architecture, thereby allowing smaller models to achieve 100% accuracy.
Example Implementation of a BSD Neural Network
Note: The language in this section is lifted from a patent application. Hence, it has a formal legal tone. However, it’s included here because it can be very helpful for those unfamiliar with supervised neural network training.
BSD Target Output refers to a target output that is deterministically derived from a training input in accordance with the above criteria.
Figure 1 and Figure 2 illustrate an example embodiment of a BSD Neural Network. Figure 1 depicts example hardware.
Figure 1 shows a BSD neural network 100 (e.g., an NLP server) that includes a volatile storage 101 and a non-volatile storage 102 communicatively connected to a processor 103. The processor 103 is communicatively connected to a network controller 104 that communicatively connects the BSD neural network 100 to an external network 105.
Figure 2 depicts an example process flow for training a neural network.
The Training Inputs 200 contain at least one human language component. Training inputs are converted into numerical sequences (usually by tokenization), such as converting text into token IDs using tiktoken (as OpenAI does for its GPT models). Another popular method is to use SentencePiece to convert text into numerical sequences (as the Llama family of LLMs does). Any method for converting text into numerical sequences falls within the spirit and scope of this step. The numerical sequences are the actual input into the electronic Neural Network 202. Example neural networks include RNNs, CNNs, and transformer-based models (such as GPT). Any supervised neural network can be used, provided that it supports training on text inputs and outputs. The training method depicted in Figure 2 can be applied to both seq2seq and autoregressive models. Those ordinarily skilled in the art know how to set up the supervised training of seq2seq, autoregressive, and other supervised neural networks. They also know how to choose the model architecture for the given NLP task at hand.
In seq2seq, each input 200 would be sent to the Neural Network. In autoregressive training, a sliding window would likely be used where each numerical token from the target output 205 is appended token-by-token to the input 200 to form another input; whereas the next token in the target output is the desired result in the given iteration. Those ordinarily skilled in the art know how to implement both seq2seq and autoregressive networks without further explanation.
For each iteration (i.e., epoch), the Loss Function 204 computes the difference between the output 203 of the Neural Network 202 and the corresponding BSD Target Output 205. It is this step where the Loss Function 204 uses BSD Target Outputs to compute the “loss” (or “cost”), and where over 98% of grammatically correct sentence splits can be assigned a penalty cost during BSD NLP training on sentence splitting.
Embodiments can use Cross-Entropy Loss (Log Loss), KL Divergence, Reinforcement Learning, Contrastive Loss, or any other loss method. Any loss method that computes cost relative to the output of the Neural Network and at least one BSD Target Output is a novel innovation, and therefore falls within the spirit and scope of this disclosure (where the BSD Target Output is a bounded-scope, deterministic transformation of the correlating Training Input).
Herein, for simplicity, Loss Function shall refer to loss functions known in the art, as well as other measurements such as those used in reinforcement learning. While loss functions would typically be used for computing token-by-token differences in NLP neural networks (such as Large Language Models), Reward Signals could be used on a whole-sequence basis and are therefore simply referred to as Loss Function herein. Thus, the term Loss Function is not meant to limit the seq2seq or token-by-token loss calculations chosen for any given embodiment. The limitation is that at least one BSD Target Output be used when computing such. This is the step that can transform the current art from 80% accuracy to literally 100% accuracy. This step can be applied to virtually any Low-Level NLP Neural Network to profoundly increase accuracy. Where a zero loss is eventually reached, the accuracy can literally be 100%.
If the loss during the iteration is less than or equal to the chosen threshold 206 then the training is done 207. The current state of the trained parameters allows for the Neural Network to accomplish its task with optimal accuracy. The state of the trained parameters can be stored in RAM, on disk, in the cloud, or via any other method (thereby allowing the model and its optimal parameters to be replicated on various devices). Moreover, the model with the optimized parameters can be saved as a whole to permanent storage.
Once the threshold has been reached, any input can now be sent to the Neural Network, and the output will be accurate (up to 100% accurate where a zero loss has been reached).
If the threshold has not been reached 206, then the trainable parameters are adjusted relative to the loss 201. Methods for adjusting the parameters (such as weights and biases) are well-known in the art (such as using back propagation and gradient descent with optimizers such as Adam and RMSProp). As previously stated, the innovative step of determining loss based on outputs that are bounded-scope, deterministic transformations of the input can profoundly improve the accuracy of a multitude of NLP Neural Networks. Alternatively, where the scope cannot be bounded, determining loss based on deterministic transformation of the input will profoundly improve accuracy (where deterministic transformation meets the novel criteria disclosed herein). Hence, such would still fall within the spirit and scope of this disclosure.
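For readers who prefer code to patent prose, the following is a minimal sketch of the Figure 2 flow using PyTorch and a small seq2seq model from Hugging Face Transformers. The model choice (“t5-small”), optimizer, learning rate, threshold, and toy data are illustrative assumptions only; as stated above, any supervised architecture and any loss method computed against BSD Target Outputs would fit the same flow.

```python
# Illustrative sketch of the Figure 2 training flow (not the original embodiment).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Training Inputs (200) paired with BSD Target Outputs (205).
pairs = [
    ("The cat sat on the chair and it was purring.",
     "The cat sat on the chair. It was purring."),
    ("Tom drove home.", "Tom drove home."),   # output equals input (step #6)
]

loss_threshold = 0.01                          # element 206
for epoch in range(100):
    inputs = tokenizer([i for i, _ in pairs], return_tensors="pt", padding=True)
    labels = tokenizer([o for _, o in pairs], return_tensors="pt", padding=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
    loss = model(**inputs, labels=labels).loss        # Loss Function (204) vs. BSD targets
    if loss.item() <= loss_threshold:
        break                                          # training done (207)
    optimizer.zero_grad()
    loss.backward()                                    # adjust trainable parameters (201)
    optimizer.step()
```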
BSD for 100% Accurate Sentence Splitting
BSD revolutionizes the technological field of Natural Language Processing (NLP) by yielding 100% accuracy for low-level NLP tasks. Herein, BSD shall be used as shorthand for BSD NLP.
It bears noting that BSD training data can alternatively be used in few-shot prompting, in addition to or in lieu of being used for fine-tuning. In fact, a 5-shot prompt using the following training data resulted in 0 hallucinations when simplifying 2,500 sentences from BBC articles.
A simple sentence splitting implementation could include splitting complex sentences based on coordinating clauses that start with the word “and” (or another coordinating conjunction such as “but,” “or,” “for,” “nor,” “yet,” or “so”). The transformation must also dictate under what deterministic conditions words will be added, and there must be a deterministic method for knowing precisely which words will be added (e.g., the entire subject noun phrase including nesting). In this situation, there is one objective transformation for converting each input into the target output, thereby satisfying the “determinism” aspect of BSD.
In regards to 100% accurate sentence splitting, consider the following input/output pairs:
Training Input: The cat sat on the chair and it was purring.
Target Output: The cat sat on the chair. It was purring.
Training Input: Tom drove home.
Target Output: Tom drove home.
The above is based on a single objective transformation of training input to target output. The sentences are split on the word ‘and’ where it serves as a coordinating conjunction and where the word that follows it begins a noun phrase. Since the second training input does not contain the word ‘and,’ no transformation is selected, resulting in the target output being equal to the training input.
Now, consider another simple BSD implementation with multiple objective transformations. As a reminder, where multiple objective transformations exist, the selection of such transformation(s) must be deterministically derived from the input itself.
With this in mind, another implementation could include splitting complex sentences using two objective transformations. The first objective transformation (OT) could be to split on coordinating clauses that begin with the word ‘and’ whenever the following word is not a verb (Deterministic Transformation Y). The second OT could be to split on coordinating clauses that begin with the word ‘but’ whenever the following word is not a verb (Deterministic Transformation Z). The multiple OTs would result in deterministically producing the following input/output training pairs:
Training Input 1: The cat was sitting on the chair and it was purring.
Target Output 1: The cat was sitting on the chair. It was purring.
Training Input 2: The dog wanted the bone but it was out of reach.
Target Output 2: The dog wanted the bone. It was out of reach.
Training Input 3: The dog was sitting on the chair and it wanted the bone but it was out of reach.
Target Output 3: The dog was sitting on the chair. It wanted the bone. It was out of reach.
Training Input 4: Harry met Sally.
Target Output 4: Harry met Sally.
Training Input 5: Tom and Mary drove home.
Target Output 5: Tom and Mary drove home.
Training Input 6: But, he chose to come over.
Target Output 6: But, he chose to come over.
While such an implementation would require a larger neural network than the prior example, the number of learnable parameters would still be quite small compared to some of the most popular models in the art.
Notice also that the correct splitting may be one sentence (no splitting), two sentences, or even three sentences. Where objective transformations are applied, the number of output sentences can vary. In fact, splitting complex sentences can result in anywhere from one to a dozen (or even more) simpler sentences in certain implementations.
Notice how the entries conform to the criteria:
Pair 1: Selecting and Implementing Deterministic Transformation Y
Pair 2: Selecting and Implementing Deterministic Transformation Z
Pair 3: Selecting and Implementing Deterministic Transformation Y & Selecting and Implementing Deterministic Transformation Z
Pair 4: Null Selection of Transformations (i.e., no transformations selected)
Pair 5: Selecting and Declining Deterministic Transformation Y
Pair 6: Selecting and Declining Deterministic Transformation Z
Hence, Pair 4 is an example of step #6. Pairs 5 and 6 are examples of step #7.
Deterministic Transformation Y makes a deterministic evaluation based on the word ‘and.’ The determination is whether to implement the transformation or decline to do so. Therefore, the neural network needs a training entry for each of these scenarios (e.g., Pair 1 and Pair 5).
Likewise, Deterministic Transformation Z makes a similar deterministic evaluation on the word ‘but.’ Hence, the neural network needs an example of both scenarios (e.g., Pair 2 and Pair 6).
Thus, the seven steps/criteria guide the creation of entries for various deterministic decisions (e.g., Select and Implement Y, Select and Decline Y, Select and Implement Z, Select and Decline Z, null Selection (i.e., no Selection)). It is in this way that the path of least resistance equals performing the desired task with 100% accuracy.
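As noted earlier, BSD training pairs can also be used for few-shot prompting instead of (or alongside) fine-tuning. The following is a minimal sketch of such a prompt built from the pairs above, using the OpenAI Python client; the system instruction, model name, and example selection are illustrative assumptions, not the exact 5-shot prompt used in the BBC test.

```python
# Sketch of few-shot prompting with BSD pairs (illustrative assumptions only).
from openai import OpenAI

client = OpenAI()

bsd_examples = [
    ("The cat was sitting on the chair and it was purring.",
     "The cat was sitting on the chair. It was purring."),
    ("The dog wanted the bone but it was out of reach.",
     "The dog wanted the bone. It was out of reach."),
    ("The dog was sitting on the chair and it wanted the bone but it was out of reach.",
     "The dog was sitting on the chair. It wanted the bone. It was out of reach."),
    ("Harry met Sally.", "Harry met Sally."),
    ("Tom and Mary drove home.", "Tom and Mary drove home."),
]

def split_sentence(sentence: str) -> str:
    messages = [{"role": "system",
                 "content": "Split the sentence exactly as the examples show. "
                            "If no split applies, return the sentence unchanged."}]
    for source, target in bsd_examples:
        messages.append({"role": "user", "content": source})
        messages.append({"role": "assistant", "content": target})
    messages.append({"role": "user", "content": sentence})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

print(split_sentence("Sue baked a pie and it smelled wonderful."))
```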
Neural Networks Learn Path of Least Resistance—Not Intelligence
Neural networks take the path of least resistance during the training process. For example, a neural network trained to detect pneumonia in chest X-rays learned to focus on metadata or markers in the images rather than the actual lung features. This occurred because certain hospitals included different markers or annotations in their X-rays, and the model learned to correlate those with the presence of pneumonia.
As another example, a study showed that image classification models like convolutional neural networks (CNNs) trained on the ImageNet dataset tend to rely on texture rather than shape for classification. For example, a neural network might classify a picture of a cat-like object covered in “elephant skin texture” as an elephant. This preference for textures is easier to exploit than learning the shapes and semantics of objects.
Given the importance of this phenomenon, consider a final example from dermatology image classification. Models trained to detect skin cancer have relied on artifacts such as rulers or measurement tools often included in malignant samples. A model learned to associate the presence of a ruler with malignancy, a clear shortcut that bypassed the need for true diagnostic reasoning.
The key is to make 100% accuracy the path of least resistance. Applying the above BSD steps accomplishes this.
Sophisticated Sentence Splitting
A more sophisticated sentence splitting machine can include a set of objective transformations based on both clauses and prepositions. It can even include rewriting words, provided that the rewriting is deterministic.
For example, when choosing to write noun phrases during sentence splitting, an objective transformation must choose whether to consistently use a noun phrase, a complete compound noun phrase, a complete nested noun phrase, etc. The same objective transformation is applied consistently throughout the training set.
Likewise, consistency may be applied in regards to person named entities. For example, the chosen objective transformation may use the full name, or the last name, or an abbreviation, etc., provided that such is applied consistently throughout the training set.
Consider the following complex sentence: “Tom Smith of Dallas and husband of Mary loves to barbecue and he enjoys drinking beer.”
If the objective transformation is based on noun phrase, there is only one correct split (and therefore, the correct split is objectively deterministic):
Tom Smith of Dallas and husband of Mary loves to barbecue. Tom Smith enjoys drinking beer.
Any other split would be incorrect.
If the objective transformation is based on complex noun phrases, there is only one correct split:
Tom Smith of Dallas and husband of Mary loves to barbecue. Tom Smith of Dallas enjoys drinking beer.
Any other split, including the prior example, would be incorrect.
If the objective transformation is based on nested noun phrases, there is only one correct split:
Tom Smith of Dallas and husband of Mary loves to barbecue. Tom Smith of Dallas and husband of Mary enjoys drinking beer.
Any other split would be incorrect, including the prior two examples.
The repeated subject phrases at the start of each second sentence illustrate how the objective application of a deterministic transformation provides the consistency that the neural network needs in order to fully master the task.
While all three choices (and others) are linguistically correct, 100% accuracy comes from teaching the neural network one consistent objective. The current SOTA wrongly believes that neural networks will try to figure out the best alternative. BSD NLP is based on the correct understanding that neural networks do the opposite — they consistently look for the path of least resistance instead. Thus, BSD provides the path of least resistance to ensure the task is truly mastered.
This is the missing key over SOTA training.
There is no set of 64 purportedly correct alternatives for a given input, as is the case for neural networks trained on WebSplit.
There are no variations of purportedly correct outputs caused by various annotators choosing different ways to split the sentences (e.g., one annotator uses noun phrases, another uses complex noun phrases, another sometimes uses nested noun phrases and other times leaves the pronoun alone, etc.).
There is no starting with subjective human summaries (as in the case of DeSSE).
There is no starting with non-deterministic sentence graphs.
BSD NLP is the literal opposite of SOTA NLP models that are based on the faulty premise that neural networks can learn to choose the best alternatives. For 100% accuracy, neural networks need to be trained on only one definitive, deterministic transformation for each potential input type. The rest of neural network training can proceed as usual.
BSD — Literally The Only Way to Achieve 100% Accuracy
BSD is literally the only way to train neural networks to achieve 100% accuracy on language tasks. How can I make such a bold statement? A model’s hallucination rate is proportional to the degree to which the neural network and other models deviate from BSD. Conversely, the closer neural networks and models are to BSD, the greater their accuracy.
Consider LIMO (Less Is More For Reasoning) as a perfect case in point. While the researchers did not apply a deterministic transformation, they did apply a more-normalized transformation—thereby inadvertently moving the training closer to a BSD model. Because it is not deterministic, they did not achieve 100%. But the mere fact of normalization profoundly improved accuracy.
For example, the prior SOTA on the AIME reasoning benchmark was 6.5% (using 100,000 training samples). Meanwhile, LIMO achieved 57.1% (using only 817 samples). In other words, LIMO achieved a 778% gain while using less than 1% of the training data.
The industry is beginning to move in the direction of BSD. And it will continue to do so, because the closer it gets the better the results.
In short, BSD is the only way to achieve 100% accuracy because any deviation from it introduces errors (i.e. hallucinations).
Dawn of 100% Accurate AI
Acurai has already confirmed the 100% accuracy of BSD three times over.
The results of the BSD Sentence Splitting test have already been discussed above.
Acurai also tested BSD Summarization. More specifically, we wanted to compare the results of BSD to the challenges Apple was facing in creating headline summaries of BBC News articles.
For those unfamiliar, BBC News filed a formal complaint with Apple regarding hallucinations in its automated summary headlines.
For example, one headline read: “Brazilian tennis player, Rafael Nadal, comes out as gay.” This short headline includes three hallucinations:
The story was about Joao Lucas Reis da Silva (not Rafael Nadal).
Rafael is not Brazilian.
Rafael has not come out as gay.
As another example, one Apple-generated summary claimed that Luigi Mangione had shot himself; he did not. Yet another claimed that Netanyahu had been arrested; he was not.
The summarization issue plagued other Apple services as well, including messaging summarization.
For example, Andrew Schmidt’s mother texted: “That hike almost killed me!” However, the summary Schmidt first saw was a notice that his mom had attempted suicide.
Summarization Method #1
The first method we tried was as follows:
1) Use BSD Sentence Simplification => BSD Coreference Resolution to create Formatted Facts (FFs) from the article.
2) Ask the LLM to choose the Formatted Fact that best represents the overall article.
While this approach led to 100% hallucination-free summarization, it suffered from suboptimal relevance. The LLM was not capable of choosing the optimal FF.
Summarization Method #2
The second method we tried was as follows:
1) Standardize the article using a Spelling/Grammar Correction model.
2) Use BSD Sentence Simplification => BSD Coreference Resolution to create Formatted Facts (FFs) from the article.
3) Ask the LLM to create its own one-sentence summary.
4) Use vector and index-based searching to locate the FF that is most similar to the sentence produced by the LLM.
This achieved 100% hallucination-free summarization that was also very relevant.
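The final matching step can be sketched with sentence embeddings and cosine similarity. The sentence-transformers library, the model name, and the sample facts below are illustrative assumptions, not Acurai’s production stack; the key property is that the returned headline is a verbatim Formatted Fact, so it cannot introduce new claims.

```python
# Sketch of locating the Formatted Fact closest to the LLM's own summary.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def closest_formatted_fact(llm_summary: str, formatted_facts: list[str]) -> str:
    summary_emb = embedder.encode(llm_summary, convert_to_tensor=True)
    fact_embs = embedder.encode(formatted_facts, convert_to_tensor=True)
    scores = util.cos_sim(summary_emb, fact_embs)[0]
    return formatted_facts[int(scores.argmax())]    # headline is a verbatim FF

facts = [
    "Joao Lucas Reis da Silva is a Brazilian tennis player.",
    "Joao Lucas Reis da Silva announced that he is gay.",
]
print(closest_formatted_fact("A Brazilian tennis player has come out as gay.", facts))
```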
100% Hallucination Elimination on RAGTruth for GPT-4 and GPT-3.5 Turbo
I have previously written about Acurai’s 100% hallucination elimination on RAGTruth for GPT-4 and GPT-3.5 Turbo.
This was accomplished using Formatted Facts. This present article discloses how to perform BSD Sentence Simplification. The next article will teach how to perform BSD Coreference Resolution. Thus, you will know how Acurai produces Formatted Facts—step by step—with full transparency.
Acurai’s Methods Fully Revealed
I currently serve as the Chief Technology Officer at Acurai Inc. Acurai is shorthand for Accurate AI. Our mission is to deliver 100% accurate AI across various NLP tasks and knowledge domains.
I have received permission to share Acurai’s proprietary methods. Perhaps these methods can inspire you to develop new ones. However, if you want to use Acurai’s methods (or a derivation of them), it’s important to contact Acurai for permission to do so.
Also, if you want to learn Acurai’s proprietary methods prior to the publication of future articles, I encourage you to go straight to Acurai’s patent application.