100% Accurate AI Step by Step

Author: mwood

100% Accurate AI Step-by-Step (Part One): BSD Neural Networks

Do you want to train neural networks to achieve 100% accuracy on virtually every language task — including 100% accurate chatbot responses? Bounded-Scope Deterministic (BSD) Neural Networks are what you’ve been looking for. BSD is all you need.

With the BSD training method, neural networks achieve the same precision with language that deterministic programming does with numbers, thereby solving virtually every natural language processing (NLP) task all at once. This includes:

  • 100% accurate low-level NLP tasks, such as Sentence Splitting and Named Entity Recognition.
  • 100% accurate high-level NLP tasks, such as Summarization and Coreference Resolution.
  • 100% accurate LLM tasks, such as 100% hallucination-free Question/Answering and even lengthy exposition.

This article teaches the broader skill of BSD neural network training by showing how to use it for the NLP task of Sentence Splitting. Future articles will teach how to use BSD for 100% accurate coreference resolution, 100% accurate summarization, and even 100% accurate chatbot responses.

BSD neural network training is the foundation for it all.

Formatted Facts and the Discovery of BSD

I discovered BSD Neural Network training when working on the issue of converting text into Formatted Facts (FFs). Formatted Facts are the cornerstone of 100% accurate chatbot responses.

On the surface, creating formatted facts seems simple:

  • First, split complex sentences into simpler ones.
  • Second, apply coreference resolution to the simple sentences.

This pipeline transforms text into independent, self-contained statements (which my company calls “Formatted Facts”).

But this simple pipeline had a very big problem. There was no reliable method for splitting sentences; nor was there any reliable method for performing coreference resolution.

I currently work at Acurai Inc. At Acurai we have a saying, “Less Broken is Still Broken.” In other words, a chatbot that answers questions correctly 80% of the time is still “broken.”

To date, researchers have been pursuing less broken, but still broken, methods for natural language processing (NLP) tasks such as sentence splitting and coreference resolution.

This is problematic on two fronts. First, as data passes through a pipeline, the errors of each component multiply. For example, two components that are each roughly 82% accurate compound to 0.82 × 0.82 ≈ 67% accuracy. Thus, even this simple two-step pipeline has a 33% error rate caused by the compounding of errors.

Second, state-of-the-art (SOTA) sentence splitting was achieved by fine-tuning LLMs. In other words, even fine-tuned LLMs have an approximately 20% error rate for sentence splitting. Thus, LLMs are not powerful enough in and of themselves.

BSD — Fundamental Building Block of AGI

I find it odd that chatbot makers keep hyping “AGI” when LLMs cannot even accurately split complex sentences into simpler ones; nor can LLMs even count the number of “R’s” in raspberry.

LLMs were notoriously mocked for not being able to count the number of R’s in strawberry. Therefore, they were eventually fine-tuned to be able to do so. But that didn’t teach them how to count letters in other words (such as raspberry). Consider o1, which at the time of writing still miscounted the R’s in raspberry.

It may be that o1 has since been fine-tuned for raspberry. But let’s face it: fine-tuning models on a per-question basis is not any form of “intelligence.” One can simply create a large database to unintelligently retrieve the answers in that case.

Artificial General Intelligence (AGI) is going to need to be able to access information in real time, and be able to digest the facts contained in the information with 100% accuracy. Thus, solving the 100% accurate Formatted Facts (FFs) Pipeline is an essential step towards AGI.

I invented BSD Neural Networks to solve the 100% accurate Formatted Facts (FFs) Pipeline. However, I quickly realized that BSD Neural Network training was what AI enthusiasts have been looking for all along. In addition to 100% accurate sentence splitting and coreference resolution, BSD Neural networks can be used for 100% accurate named entity recognition, parts-of-speech tagging, document summarization, and even 100% accurate chatbot responses.

Not to mention that in doing so, BSD Neural Networks are the essential building blocks of reasoning and AGI. In fact, future articles will disclose the secrets to reliable AI reasoning (with BSD Neural Networks as the guarantors of that reliability).

Sentence Splitting as Fact Extraction

BSD would be a remarkable discovery even if it only resulted in 100% accurate sentence splitting. After all, data scientists have been pursuing accurate sentence splitting for over 55 years. The quest for accurate Sentence Splitting has gone from rule-based approaches (1960s-1970s), to statistical approaches (1980s-1990s), to machine learning (2000s-2010s), to deep learning and neural networks (2010s-present).

Sentence splitting has long been recognized as a key component of fact extraction. Consider the following two real-world sentences:

  • The last 4 kilometres (2.5 mi) of the remaining original _Reichsautobahn_, a section of A 11 northeast of Berlin near Gartz built in 1936 — the westernmost remainder of the never-finished Berlinka — was scheduled for replacement around 2015. [_needs update_] Roadway condition is described as “deplorable”; the 25 metres (82 ft)-long concrete slabs, too long for proper expansion, are cracking under the weight of the traffic as well as the weather.

Too many people use run-on sentences—including Wikipedia authors and news journalists (and the present author as well :-)). Meanwhile, LLMs struggle with complex sentences—both when pre-training parametric knowledge and when such content is sent as input in RAG-based implementations.

Now consider the above sentence after being split by a fine-tuned BSD Neural Network using the same step-by-step method disclosed below:

  • The last 4 kilometres (2.5 mi) of the remaining original _Reichsautobahn_ was scheduled for replacement around 2015.
  • [_needs update_].
  • The last 4 kilometres of the remaining original _Reichsautobahn_ is a section of A 11.
  • The section of A 11 is northeast of Berlin.
  • The section of A 11 is near Gartz.
  • The section of A 11 was built in 1936.
  • The section of A 11 is the westernmost remainder of the never-finished Berlinka.
  • Roadway condition is described as “deplorable”.
  • The 25 metres (82 ft)-long concrete slabs are too long for proper expansion.
  • The 25 metres (82 ft)-long concrete slabs are cracking under the weight of the traffic.
  • The 25 metres (82 ft)-long concrete slabs are cracking under the weather.

Now it is trivial for LLMs to accurately answer questions about this passage.

AI Accuracy Plateau

This article teaches BSD using sentence splitting for a number of reasons, one of which is to demonstrate why NLP tasks have accuracy plateaus and how to fix the plateau issue permanently.

It’s no secret that AI has hit an accuracy plateau in terms of extractive question answering. Thus, AI powerhouses have moved on to focus on images & video, mathematical & scientific reasoning, and code generation. Consider OpenAI’s very impressive strides in its newest image generator, and its reported $3 billion pursuit of acquiring Windsurf.

However, little press is given to the fact that the major AI players have moved past trying to solve the extractive QA accuracy plateau. Nevertheless, TechCrunch recently noted that “OpenAI’s new reasoning AI models hallucinate more.” TechCrunch states:

According to OpenAI’s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company’s previous reasoning models — o1, o1-mini, and o3-mini — as well as OpenAI’s traditional, “non-reasoning” models, such as GPT-4o.

Perhaps more concerning, the ChatGPT maker doesn’t really know why it’s happening.

In its technical report for o3 and o4-mini, OpenAI writes that “more research is needed” to understand why hallucinations are getting worse as it scales up reasoning models. O3 and o4-mini perform better in some areas, including tasks related to coding and math. But because they “make more claims overall,” they’re often led to make “more accurate claims as well as more inaccurate/hallucinated claims,” per the report.

For now, OpenAI is not focusing on eliminating extractive QA hallucinations.

BSD — Key to Breaking through the AI Accuracy Plateau

SOTA methods for extracting facts from complex sentences hit an 80% accuracy ceiling — a seemingly insurmountable plateau. As the SOTA researchers stated:

The results are shown in Table 2. In all tables, best results are shown in bold. In both cases, scores increased across metrics as training size increased, although by smaller increments from the 3K mark upwards. Best results are split between the two largest training datasets on both BiSECT and DeSSE, indicative of a potential ceiling in terms of improvements from training data augmentation. It might also be the case that the observed plateauing resulted from a lack of variety in the added training data, but verifying this hypothesis was beyond the scope of this work.

The researchers discovered that accuracy did not improve as more training data was added. They reached a plateau.

Importantly, the researchers hypothesized that the plateau could be solved by adding variety to the dataset. This hypothesis is not only wrong, it is actually backwards. As you will see below, the problem is due to variety. Variety isn’t the solution; it’s the problem.

Fortunately, this bold statement can be empirically demonstrated—showing that the industry-wide assumption on neural network training is literally backwards.

5-Entry BSD Dataset Outperforms 1-Million-Entry SOTA Datasets

To empirically demonstrate that BSD is the missing key, a 5-entry BSD dataset was tested. It produced a result unheard of in the field: the 5-entry dataset produced literally zero errors.

As data scientists often say: “The proof is in the pudding.” I couldn’t agree more. Now, here’s the pudding.

The SOTA datasets used to train neural networks for Sentence Splitting and Rephrasing are:

  • DeSSE
  • BiSECT
  • WikiSplit
  • WebSplit

BiSECT has 928,440 entries. WebSplit has 1,331,515 entries.

Despite having datasets with one million training examples, SOTA Sentence Splitting hit an accuracy ceiling of approximately 80%. Meanwhile, a 5-entry BSD set achieved 100% accuracy on a much more stringent test than was used to assess the SOTA method.

You read that correctly. A 5-entry BSD dataset outperforms a 1-million-entry dataset. Moreover, the BSD dataset demonstrated 100% accuracy as well. BSD truly is the revolution that the AI industry has been searching for.

Method: A 5-sample BSD dataset was used in few-shot prompting to split sentences from 500 BBC news articles.

Result: The BSD method split the sentences with 100% accuracy.

Meanwhile, the SOTA method first fine-tunes an LLM on up to 1 million examples, and then prompts the LLM to split sentences — resulting in an 18.4% error rate on narrative text (the same type of text used in BBC news articles).

The 5-sample few-shot prompt demonstrates the remarkable breakthrough of BSD. Naturally, fine-tuning models on a larger BSD dataset is recommended to ensure 100% accuracy on an ongoing basis. But the fact that a 5-sample few-shot prompt results in 100% accuracy on a stringent test shows BSD to be the correct way to produce perfectly reliable results.

Current Method of Training Language Models is Literally Backwards

Five BSD entries significantly outperformed neural networks trained on over one million entries of other types. So what is the “secret”? The secret is that variety is the source of hallucinations—not the answer to them.

Each BSD dataset entry is structured in a very specific manner that communicates to the neural network precisely what it needs to learn to do. This has been the missing key to 100% accurate NLP neural networks, at least in terms of using supervised training to teach neural networks to perform natural language processing (NLP) tasks.

Supervised neural network training consists of providing the model multiple input => output pairs. The dataset tells the model what the output should be for any given input. The goal is for the model to discover the patterns that exist across the dataset so that it can then transform inputs that it has never seen before into the desired outputs.
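For concreteness, here is what that pair structure can look like in code. This is a minimal illustrative sketch (the entries are examples, not Acurai’s actual training data):

```python
# A minimal illustration of the input => output pair structure used in
# supervised training. The entries here are hypothetical examples.
DATASET = [
    {"input": "The cat sat on the chair and it was purring.",
     "output": "The cat sat on the chair. It was purring."},
    {"input": "Tom drove home.",
     "output": "Tom drove home."},  # identity pair: the model learns when to do nothing
]
```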

However, the industry has been training language-based neural networks using stochastic, non-deterministic methods. For example, the WebSplit dataset provides many grammatically correct outputs for each input. Consider the following sentence:

  • Input: “Auburn is part of Lee County in Alabama which is situated within the state of Alabama in the United States where one of the ethnic groups in the United States are the African Americans.”

The WebSplit dataset contains 64 alternative splits for this sentence alone. In other words, there are 64 entries in the dataset where the input is this same sentence, and each of the 64 outputs provides one grammatically correct alternative for splitting it. For example, three of the outputs include:

  • Output: “Auburn is part of Lee County in Alabama . Lee County is situated within the state of Alabama . Alabama is in the United States . One of the ethnic groups in the United States are the African Americans .”
  • Output: “Auburn is part of Lee County , Alabama in the United States . African Americans are an ethnic group within the United States .”
  • Output: “Auburn , Alabama is part of Lee County . Lee County is in the state of Alabama . Alabama is in the United States . African Americans are an ethnic group within the United States .”

And so on. That’s 64 entries. Each entry has the same input. Each entry has a different output.

Notice that this is the opposite of determinism. Determinism, by definition, means that any given input will be transformed into only one correct output. Thus, WebSplit is a stochastic, non-deterministic dataset.

On one hand, the industry may seem to be pursuing the correct path. After all, there are many grammatically correct ways to split a larger sentence. Therefore, it can even seem incorrect for training to assign a penalty cost to a grammatically correct split.

Yet, as will be made clear shortly, BSD intentionally causes the neural network training to assign a penalty cost to grammatically correct sentence splits. In fact, counterintuitively, BSD often requires the model’s loss function to assign a cost to the vast majority of grammatically correct splits.

BSD requires that there is only one unique output for each unique input. Assuming there are only 64 ways to split the above sentence, this means that 63 out of 64 splits will be deemed an error during training, even though they are grammatically correct. In terms of this sentence, that means 98% of the grammatically correct splits are counted as being errors.

If there are more than 64 grammatically correct splits, then more than 98% of the grammatically correct splits will be considered errors when training a neural network using BSD.

Thus, BSD Neural Network training is the opposite of the way that language models have been trained. The following section unveils BSD’s revolutionary training method.

BSD Neural Network’s Seven Criteria (Steps)

BSD NLP stands for Bounded-Scope Deterministic NLP. The NLP part of the name signifies that the input text must contain at least one human-language sentence. The BSD part is built on two aspects: bounded in scope, and deterministic. Bounded scope refers to the number of required transformations being small enough to be learned (e.g., small enough to achieve a zero cost value from the loss function during training). As for the determinism aspect of BSD, there are seven criteria:

  • 1) There is only one unique output per unique input.
  • 2) The unique output must be deterministically derived from the input text.
  • 3) The selection of transformations that produce the output must be deterministically derived from the input.
  • 4) The selected transformations must be uniformly applied to all outputs.
  • 5) Where the resulting output has multiple values, such that the order of the values can be changed without information loss, the order of the values must be sorted in a deterministic manner. Preferably, first positional occurrence sorting is used.
  • 6) Where the deterministic selection of transformations can be null, there must be at least one input => output pair in which the inputs and corresponding outputs are identical in every respect. The inclusion of additional such pairs will reduce both the size of the neural network required and the training time and cost.
  • 7) Where selection counter examples exist, they must be provided in the input, and the corresponding outputs must be identical to the input.
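Some of these criteria can be checked mechanically. Here is a small sketch of such checks, assuming a dataset of (input, output) string pairs; criteria 2–4 concern how the outputs were derived and cannot be verified after the fact, but criterion 1 and the identity pairs of criteria 6–7 can be:

```python
def check_bsd_criteria(pairs):
    """Sanity-check a list of (input, output) pairs against criteria 1, 6, and 7."""
    seen = {}
    for source, target in pairs:
        # Criterion 1: only one unique output per unique input.
        if source in seen and seen[source] != target:
            raise ValueError(f"Two different outputs for one input: {source!r}")
        seen[source] = target
    # Criteria 6/7: at least one identity pair (output == input) must exist.
    if not any(source == target for source, target in pairs):
        raise ValueError("No identity (output == input) pairs found")
```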

Contrasting SOTA Sentence Splitting to BSD Neural Network Training

Training neural networks on WebSplit does not involve any of the above steps. Training neural networks on the rest of the SOTA datasets does not involve implementing criteria 2–6. Yet, as is explained below, steps 2, 3, and 4 are core criteria; and steps 5, 6, and 7 are conditional core criteria. Hence, SOTA training lacks all of the core criteria (at least in terms of SOTA sentence splitting).

The following explains how to train a neural network to accurately split larger sentences into smaller ones.

Consider a simple transformation (Transformation X): Remove the word ‘and’; if the next word is a noun, then add the same punctuation used at the end of the sentence and capitalize the next word; if the next word is a verb, then add the same punctuation used at the end, prepend the noun subject of the prior statement, and capitalize that added noun subject.

On the surface, splitting a sentence on the word ‘and’ appears trivial. However, even Transformation X is insufficient to qualify as being deterministic. What if the noun subject is a nested noun phrase? What gets added to the beginning of the new split: the entire nested noun phrase, the complex noun phrase, the noun phrase, or the root noun phrase? Each implementation must make a deterministic choice, and apply that choice consistently.

An ideal implementation would use the entire noun phrase (including nesting) to ensure the preservation of meaning. Consider the following sentence: “The old man and woman sat on the bench.” Is the woman old too? The sentence can be read in two ways. Preserving the entire noun phrase (e.g. “old man and woman”) in splits ensures preservation of the original language intent—even when the intent is ambiguous.

Most importantly, this deterministic criterion means that there is only one correct choice for what gets added to the beginning of the new split. One correct choice, and only one. Everything else is an error when computing the loss function — regardless of whether it is grammatically correct or not. Adding this step to Transformation X results in Deterministic Transformation X.
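Here is a minimal sketch of Deterministic Transformation X, assuming spaCy’s dependency labels (where "cc" marks a coordinating conjunction and the conjunction attaches to the first conjunct). It handles only the first eligible "and" and assumes sentence-final periods for simplicity; it illustrates the determinism requirement rather than Acurai’s production splitter:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def transformation_x(text: str) -> str:
    doc = nlp(text)
    for tok in doc:
        # Only an "and" that coordinates clauses (its head is a verb) qualifies.
        if (
            tok.text.lower() == "and"
            and tok.dep_ == "cc"
            and tok.head.pos_ in ("VERB", "AUX")
            and tok.i + 1 < len(doc)
        ):
            nxt = doc[tok.i + 1]
            left = doc[: tok.i].text.rstrip() + "."
            if nxt.pos_ == "VERB":
                # Verb follows: prepend the entire subject noun phrase
                # (including nesting) -- the single deterministic choice.
                subject = next((np.text for np in doc.noun_chunks if np.start == 0), None)
                if subject is None:
                    continue
                return left + " " + subject[0].upper() + subject[1:] + " " + doc[tok.i + 1 :].text
            if nxt.pos_ in ("NOUN", "PROPN", "PRON"):
                # Noun follows: close the sentence and capitalize the next word.
                return left + " " + nxt.text[0].upper() + nxt.text[1:] + " " + doc[nxt.i + 1 :].text
    return text  # no eligible "and": output equals input (criterion 6)

print(transformation_x("The cat sat on the chair and it was purring."))
# The cat sat on the chair. It was purring.
print(transformation_x("Tom and Mary drove home."))
# Tom and Mary drove home.  (this "and" coordinates nouns, so no split)
```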

Even though Deterministic Transformation X is only a very simple example of criteria 2 and 3, notice already that none of the SOTA training methods do either of these. In other words, even before introducing additional transformations, BSD NLP is already different from SOTA sentence splitting.

Consider step #2: deterministically derive the output from the input. WikiSplit annotators had a free hand in choosing where to split. They also freely added words of their own choosing. Thus, step #2 was not performed in the creation of the WikiSplit dataset. The other training datasets also gave the annotators a free hand on where to split, and the annotators also added words of their own choosing. Thus, none of them implemented step #2.

This is literally the opposite of Deterministic Transformation X. Notice how Deterministic Transformation X dictates the precise words that must be added (e.g., the entire subject noun phrase, including nesting). That is the mirror opposite of allowing annotators to choose. In BSD NLP, the D means there are no choices during training. If the deterministic transformation has two or more viable alternatives, then it is not a deterministic transformation in the first place.

Consider step #3: deterministically choose the selected transformation based on the input. Once again, the creation of the SOTA datasets did not include this step. WikiSplit and BiSECT always split the input into exactly two sentences. This means that the annotator subjectively chooses whether to split a particular sentence on “and,” or “but,” or “wherein,” etc. There is no deterministic selection of transformation based on the input.

However, Deterministic Transformation X always results in one split for each ‘and’ that serves as a coordinating conjunction. If there is one such ‘and,’ then there is one split. If there are two such ‘ands,’ then there are two splits. And so forth.

The mere fact that WikiSplit and BiSECT force the input into two splits further demonstrates that step #3 was not used (in addition to step #2 not being used). Likewise, the annotators of DeSSE were instructed to pick one to four splits of their own choosing from a list of recommended splits. Hence, DeSSE implemented neither step #2 nor step #3.

Just as step #2 is the mirror opposite of SOTA training, so too is step #3 another step that is mirror opposite of SOTA training.

Now consider step #4: the selected transformations must be uniformly applied to all outputs. As stated above, in regards to Deterministic Transformation X, the transformation must be applied every time the word ‘and’ serves as a coordinating conjunction. Also as stated above, none of the SOTA training sets uniformly applied even one transformation across the entire training set, thereby failing to implement this step as well.

Backwards Premise of SOTA NLP Training

SOTA NLP training is based on the premise that neural networks learn intelligence, with the idea being that if the neural network is given a variety of correct ways to split a sentence, then it can learn to choose the best way for any given new sentence.

BSD NLP is based on the exact opposite premise, which is why the steps are literally the mirror opposite of SOTA training methods. BSD NLP is based on the premise that every choice introduced in the outputs adds a degree of error — not a degree of intelligence. The fundamental training premises could not be more different. Thus, it deserves to be repeated:

Every choice introduced in the outputs adds a degree of error—not a degree of intelligence. (per internal testing at Acurai)

If you take away nothing else from this article, you will be well served in confirming the above truth for yourself. After all, this is the missing key to training neural networks to achieve 100% accuracy on natural language tasks.

The Need for Matching Input => Output Pairs

Now consider step #6: Where the deterministic selection of transformations can be null, there must be input => output pairs in which the inputs and corresponding outputs are identical in every respect.

Not all sentences need to be split. For example, where splitting is solely based on Deterministic Transformation X, then sentences that do not have the word ‘and’ should not be split. Therefore, the training data needs to contain examples of when not to split. That is the meaning of step #6 as it relates to sentence splitting.

Yet, notice that none of the SOTA training sets contain even one instance where the output is identical to the input. Unlike SOTA, BSD NLP says that neural networks do not learn intelligence; rather, they learn to perform the path of least resistance. Thus, the neural network needs to be told when to do nothing so that doing nothing is included in its learned path of least resistance.

Notice that Deterministic Transformation X makes an evaluation on the word ‘and.’ It evaluates whether the word is serving as a coordinating conjunction.

Consider the following sentence: “Tom and Mary walked into the house and sat down.” Only the second ‘and’ serves as a coordinating conjunction. The first ‘and’ does not.

Step #7 means that there should be counter example inputs for every evaluation made by the deterministic selectors.

In terms of Transformation X, this simply means there need to be inputs that include the word ‘and’ where ‘and’ is not being used as a coordinating conjunction and, therefore, there is no split. Hence, the output equals the input.

Again, since all the datasets solely contain splits, they do not implement step #7 either.

In short, there are two types of non-splits (i.e., two types of output = input): inputs where no transformation is even selected, and inputs where the selected transformation declines to act due to one or more deterministic evaluations. The criteria in steps #6 and #7 define the types of inputs to include so that the corresponding output signifies that a transformation did not take place. Alternatively, a predefined value (such as “[BLANK]”) can be returned as the target output, as this likewise signifies that a transformation did not take place.

Once the steps are understood, they can easily be applied to training a neural network on virtually any NLP task, including sentence splitting. And because the training is based on the inverse of SOTA methods, it produces profoundly different results. In fact, where all the steps are followed in producing the input / output pairs, the resulting BSD NLP Network can achieve 100% accuracy — a significant leap in performance over prior methods.

Target BSD Output

An ideal BSD NLP implementation will employ all seven criteria/steps. However, steps 2–4 are core BSD NLP criteria. Steps 5–7 are conditional core BSD NLP criteria (i.e., they are core components in NLP tasks that meet the stated condition of the criteria). Consider a training task in which a transformation selection can be null. For such a task, step #6 is a core component because of this condition.

An ideal implementation will include all core criteria, and it will include all conditional core components that match the conditions of the NLP task being trained. Such an implementation produces Perfect BSD Target Outputs from the corresponding training inputs.

While the combination of core criteria ensures 100% accuracy, some NLP tasks may only require implementing some of the core criteria to significantly improve accuracy — even to the point of 100% accuracy. Moreover, BSD criteria are so transformative that even applying them to part of a dataset can significantly improve performance.

BSD Target Output refers to implementing at least one core criteria for transforming inputs containing human-language sentences into deterministically transformed NLP output. Where all core criteria are applied, as well as all conditional core criteria that are applicable to the conditions of the implementation, the NLP deterministic transformation of such sentence-containing training input is called Perfect BSD Target Output.

BSD NLP First Positional Occurrence Sorting

None of the sentence splitting datasets implement step #5 because it does not apply to splitting a complex sentence into multiple sentences. The task itself results in ordered output — in order to preserve the meaning of pronouns.

However, some NLP tasks can result in the output containing multiple values whose values can be presented in at least one different order while preserving all information. Such NLP tasks meet the condition of step #5, and therefore, ideal implementations would include step #5 to ensure 100% accuracy.

Moreover, ideal implementations will use first positional occurrence sorting. This simply means sorting the order of the values based on the order in which they first appear in the input.

For complex NLP tasks based on multiple steps, a separate first positional occurrence sorting can be applied at each step. This is explained immediately below.

Consider the task of extracting facts about people in a text. Here, the task may involve two levels (i.e., two steps): identify all people, and identify all facts in the input about each person.

When there are multiple levels of an NLP task, ideal BSD implementations use first positional occurrence sorting for each level. Consider a series of self-contained statements. Some statements are about Alice, and others are about Bob. Alice is mentioned first. However, some of the statements about Alice occur after Bob is mentioned.

One deterministic method is to use one-pass first positional occurrence sorting across the input. The Alice and Bob extractions then occur left to right in a single pass, so some of the Alice statements will indeed be included in the target output after some of the Bob statements.

However, multi-level first positional occurrence sorting allows the target output to be deterministically organized as: {name}:\nFact_1\nFact_2\n… In other words, the facts about each person are grouped together immediately after the person’s name.

Since this is a two-level task, a two-pass first positional occurrence sorting can be used. The sort order of the names is determined by the first pass. The order of the extracted facts is determined by the second pass. In this way, all of the statements regarding Alice and Bob are grouped together under their respective names while still preserving the requirement of deterministic first positional occurrence sorting.

As long as each name is selected in the order in which it first appears in the text; and as long as the facts regarding each name are listed in the order they appear in the text; and as long as the extraction of the facts is done in a deterministic manner (e.g., preserving the facts verbatim), the BSD neural network can now extract grouped facts about people with 100% accuracy.
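To make this concrete, here is a minimal sketch of two-level first positional occurrence sorting. It assumes the facts have already been extracted verbatim along with their character positions in the input; a single position-based sort plus insertion-ordered grouping realizes both passes:

```python
def group_facts(facts):
    """facts: list of (person, fact, position_in_input) tuples, in any order."""
    # Both passes at once: sorting by input position orders the names by
    # first appearance, and orders each person's facts by appearance.
    grouped = {}
    for person, fact, _ in sorted(facts, key=lambda f: f[2]):
        grouped.setdefault(person, []).append(fact)  # dicts preserve insertion order
    return "\n".join(f"{name}:\n" + "\n".join(items) for name, items in grouped.items())

facts = [
    ("Bob", "Bob joined in 2019.", 54),
    ("Alice", "Alice leads the team.", 10),
    ("Alice", "Alice hired Bob.", 80),  # an Alice fact appearing after Bob is mentioned
]
print(group_facts(facts))
# Alice:
# Alice leads the team.
# Alice hired Bob.
# Bob:
# Bob joined in 2019.
```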

BSD Neural Network Training revolutionizes the use of neural networks for NLP. It consistently results in 100% accuracy, even on complex language tasks.

At first blush, the preference for first positional occurrence sorting may seem insignificant. However, modern language models are built on token-based transformers. These transformers do not have any inherent awareness of the individual characters in the words they are processing. Hence, using alphabetical sorting would require increasing the size of the model by many orders of magnitude (if a larger model could overcome the limitation at all). However, token-based transformers inherently possess positional awareness. By basing the sorting on position, the sorting rests on the inherent capabilities of the architecture, thereby allowing smaller models to achieve 100% accuracy.

Example Implementation of a BSD Neural Network

Note: The language in this section is lifted from a patent application. Hence, it has a formal legal tone. However, it’s included here because it can be very helpful for those unfamiliar with supervised neural network training.

BSD Target Output refers to a target output that is deterministically derived from a training input in accordance with the above criteria.

Figure 1 and Figure 2 illustrate an example embodiment of a BSD Neural Network. Figure 1 depicts example hardware.

Figure 1 shows a BSD neural network 100 (e.g., an NLP server) that includes a volatile storage 101 and a non-volatile storage 102 communicatively connected to a processor 103. The processor 103 is communicatively connected to a network controller 104 that communicatively connects the BSD neural network 100 to an external network 105.

Figure 2 depicts an example process flow for training a neural network.

The Training Inputs 200 contain at least one human language component. Training inputs are converted into numerical sequences (usually by tokenization), such as converting text to token IDs with OpenAI’s tiktoken (as OpenAI does for its GPT models). Another popular method is to use SentencePiece to convert text into numerical sequences (as the Llama family of LLMs does). Any method for converting text into numerical sequences falls within the spirit and scope of this step. The numerical sequences are the actual input into the electronic Neural Network 202. Example neural networks include RNNs, CNNs, and transformer-based models (such as GPT). Any supervised neural network can be used, provided that it supports training on text inputs and outputs. The training method depicted in Figure 2 can be applied to both seq2seq and autoregressive models. Those ordinarily skilled in the art know how to set up the supervised training of seq2seq, autoregressive, and other supervised neural networks. They also know how to choose the model architecture for the given NLP task at hand.
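To make the tokenization step concrete, here is a quick illustration using OpenAI’s tiktoken library (SentencePiece works analogously):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4-era models
ids = enc.encode("The cat sat on the chair and it was purring.")
print(ids)              # a list of integer token IDs -- the network's actual input
print(enc.decode(ids))  # round-trips back to the original text
```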

In seq2seq training, each input 200 would be sent to the Neural Network as a whole. In autoregressive training, a sliding window would likely be used, where each numerical token from the target output 205 is appended token-by-token to the input 200 to form another input, with the next token in the target output serving as the desired result for that iteration. Those ordinarily skilled in the art know how to implement both seq2seq and autoregressive networks without further explanation.

For each iteration (i.e., epoch), the Loss Function 204 computes the difference between the output 203 of the Neural Network 202 and the corresponding BSD Target Output 205. This is the step where the Loss Function 204 uses BSD Target Outputs to compute the “loss” (or “cost”), and where over 98% of grammatically correct sentence splits can be assigned a penalty cost during BSD NLP training on sentence splitting.

Embodiments can use Cross-Entropy Loss (Log Loss), KL Divergence, Reinforcement Learning, Contrastive Loss, or any other loss method. Any loss method that computes cost relative to the output of the Neural Network and at least one BSD Target Output is a novel innovation, and therefore, falls within the spirit and scope of this disclosure (where the BSD Target Output is a bounded-scope, deterministic transformation of the correlating Training Input).

Herein, for simplicity, Loss Function shall refer to loss functions known in the art, as well as other measurements such as those used in reinforcement learning. While loss functions would typically be used for computing token-by-token differences in NLP neural networks (such as Large Language Models), Reward Signals could be used on a whole-sequence basis and are therefore simply referred to as Loss Function herein. Thus, the term Loss Function is not meant to limit the seq2seq or token-by-token loss calculations chosen for any given embodiment. The limitation is that at least one BSD Target Output be used when computing such. This is the step that can transform the current art from 80% accuracy to literally 100% accuracy. This step can be applied to virtually any Low-Level NLP Neural Network to profoundly increase accuracy. Where a zero loss is eventually reached, the accuracy can literally be 100%.

If the loss during the iteration is less than or equal to the chosen threshold 206, then the training is done 207. The current state of the trained parameters allows the Neural Network to accomplish its task with optimal accuracy. The state of the trained parameters can be stored in RAM, on disk, in the cloud, or via any other method (thereby allowing the model and its optimal parameters to be replicated on various devices). Moreover, the model with the optimized parameters can be saved as a whole to permanent storage.

Once the threshold has been reached, any input can now be sent to the Neural Network, and the output will be accurate (up to 100% accurate where a zero loss has been reached).

If the threshold has not been reached 206, then the trainable parameters are adjusted relative to the loss 201. Methods for adjusting the parameters (such as weights and biases) are well-known in the art (such as using back propagation and gradient descent with optimizers such as Adam and RMSProp). As previously stated, the innovative step of determining loss based on outputs that are bounded-scope, deterministic transformations of the input can profoundly improve the accuracy of a multitude of NLP Neural Networks. Alternatively, where the scope cannot be bounded, determining loss based on deterministic transformation of the input will profoundly improve accuracy (where deterministic transformation meets the novel criteria disclosed herein). Hence, such would still fall within the spirit and scope of this disclosure.
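The process flow of Figure 2 can be sketched in PyTorch as follows. This is a hedged illustration, not a production recipe: `model` is assumed to be any autoregressive network returning per-token logits, and `tokenizer.encode` is assumed to return a list of token IDs. The only BSD-specific element is that each input is paired with exactly one deterministic target output:

```python
import torch
import torch.nn.functional as F

def train_bsd(model, tokenizer, pairs, optimizer, threshold=1e-4):
    """pairs: list of (training_input, bsd_target_output) strings."""
    while True:
        total_loss = 0.0
        for source, target in pairs:
            # Autoregressive setup: the model must predict each next token of
            # the sequence formed from the input (200) and target output (205).
            ids = tokenizer.encode(source + " => " + target)
            x = torch.tensor(ids[:-1]).unsqueeze(0)
            y = torch.tensor(ids[1:]).unsqueeze(0)
            logits = model(x)  # shape: (1, seq_len, vocab_size)
            # Loss Function 204: difference between output 203 and BSD Target
            # Output 205. (A production setup would mask loss on source tokens.)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
            optimizer.zero_grad()
            loss.backward()   # backpropagation (201)
            optimizer.step()  # gradient-based parameter adjustment
            total_loss += loss.item()
        if total_loss / len(pairs) <= threshold:  # threshold check (206)
            return model  # training done (207)
```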

BSD for 100% Accurate Sentence Splitting

BSD revolutionizes the technological field of Natural Language Processing (NLP) by yielding 100% accuracy for low-level NLP tasks. Herein, BSD shall be used as shorthand for BSD NLP.

It bears noting that BSD training data can alternatively be used in few-shot prompting, in addition to or in lieu of being used for fine-tuning. In fact, a 5-shot prompt using the following training data resulted in 0 hallucinations when simplifying 2,500 sentences from BBC articles.

A simple sentence splitting implementation could include splitting complex sentences based on coordinating clauses that start with the word “and” (or another coordinating conjunction such as “but,” “or,” “for,” “nor,” “yet,” or “so”). The transformation must also dictate under what deterministic conditions words will be added, and there must be a deterministic method for knowing precisely what words will be added (e.g., the entire subject noun phrase including nesting). In this situation, there is one objective transformation for converting each input into the target output, thereby satisfying the “determinism” aspect of BSD.

In regards to 100% accurate sentence splitting, consider the following input/output pairs:

  • Training Input: The cat sat on the chair and it was purring.
    Target Output: The cat sat on the chair. It was purring.
  • Training Input: Tom drove home.
    Target Output: Tom drove home.

The above is based on a single objective transformation of training input to target output. The sentences are split on the word ‘and’ where it introduces a coordinating clause and where the word that follows begins a noun phrase. Since the second training input does not contain the word ‘and,’ no transformation is selected, resulting in the target output being equal to the training input.
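As noted above, the same pairs can drive few-shot prompting instead of fine-tuning. Here is a sketch, assuming an OpenAI-compatible client; the model name and prompt wording are illustrative, not Acurai’s exact prompt:

```python
from openai import OpenAI

client = OpenAI()

BSD_PAIRS = [
    ("The cat sat on the chair and it was purring.",
     "The cat sat on the chair. It was purring."),
    ("Tom drove home.",
     "Tom drove home."),  # identity pair: teaches when NOT to split
]

def split_sentence(sentence: str) -> str:
    shots = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in BSD_PAIRS)
    prompt = f"{shots}\nInput: {sentence}\nOutput:"
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```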

Now, consider another simple BSD implementation with multiple objective transformations. As a reminder, where multiple objective transformations exist, the selection of such transformation(s) must be deterministically derived from the input itself.

With this in mind, another implementation could include splitting complex sentences using two objective transformations. The first objective transformation (OT) could be to split on coordinating clauses that begin with the word ‘and’ whenever the following word is not a verb (Deterministic Transformation Y). The second OT could be to split on coordinating clauses that begin with the word ‘but’ whenever the following word is not a verb (Deterministic Transformation Z). The multiple OTs would result in deterministically producing the following input/output training pairs:

  • Training Input 1: The cat was sitting on the chair and it was purring.
    Target Output 1: The cat was sitting on the chair. It was purring.
  • Training Input 2: The dog wanted the bone but it was out of reach.
    Target Output 2: The dog wanted the bone. It was out of reach.
  • Training Input 3: The dog was sitting on the chair and it wanted the bone but it was out of reach.
    Target Output 3: The dog was sitting on the chair. It wanted the bone. It was out of reach.
  • Training Input 4: Harry met Sally.
    Target Output 4: Harry met Sally.
  • Training Input 5: Tom and Mary drove home.
    Target Output 5: Tom and Mary drove home.
  • Training Input 6: But, he chose to come over.
    Target Output 6: But, he chose to come over.

While such an implementation would require a larger neural network than the prior example, the number of learnable parameters would still be quite small compared to some of the most popular models in the art.

Notice also that the correct splitting may be one sentence (no splitting), two sentences, or even three sentences. Where objective transformations are applied, the number of output sentences can vary. In fact, splitting complex sentences can result in anywhere from one to a dozen (or even more) simpler sentences in certain implementations.

Notice how the entries conform to the criteria:

  • Pair 1: Selecting and Implementing Deterministic Transformation Y
  • Pair 2: Selecting and Implementing Deterministic Transformation Z
  • Pair 3: Selecting and Implementing Deterministic Transformation Y & Selecting and Implementing Deterministic Transformation Z
  • Pair 4: Null Selection of Transformations (i.e., no transformations selected)
  • Pair 5: Selecting and Declining Deterministic Transformation Y
  • Pair 6: Selecting and Declining Deterministic Transformation Z

Hence, Pair 4 is an example of step #6. Pairs 5 and 6 are examples of step #7.

Deterministic Transformation Y makes a deterministic evaluation based on the word ‘and.’ The determination is whether to implement the transformation or decline to do so. Therefore, the neural network needs a training entry for each of these scenarios (e.g., Pair 1 and Pair 5).

Likewise, Deterministic Transformation Z makes a similar deterministic evaluation on the word ‘but.’ Hence, the neural network needs an example of both scenarios (e.g., Pair 2 and Pair 6).

Thus, the seven steps/criteria guide the creation of entries for various deterministic decisions (e.g., Select and Implement Y, Select and Decline Y, Select and Implement Z, Select and Decline Z, null Selection (i.e., no Selection)). It is in this way that the path of least resistance equals performing the desired task with 100% accuracy.
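Notice that the six pairs double as a unit test: whatever implements Transformations Y and Z (whether rules or a trained model) must reproduce every target exactly, including the decline and null-selection cases. A sketch, where `split_yz` is a hypothetical implementation under test:

```python
PAIRS = [
    ("The cat was sitting on the chair and it was purring.",
     "The cat was sitting on the chair. It was purring."),
    ("The dog wanted the bone but it was out of reach.",
     "The dog wanted the bone. It was out of reach."),
    ("The dog was sitting on the chair and it wanted the bone but it was out of reach.",
     "The dog was sitting on the chair. It wanted the bone. It was out of reach."),
    ("Harry met Sally.", "Harry met Sally."),                        # null selection
    ("Tom and Mary drove home.", "Tom and Mary drove home."),        # decline Y
    ("But, he chose to come over.", "But, he chose to come over."),  # decline Z
]

def test_determinism(split_yz):
    """split_yz: a hypothetical function implementing Transformations Y and Z."""
    for source, target in PAIRS:
        assert split_yz(source) == target, f"Deviation on: {source!r}"
```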

Neural Networks Learn Path of Least Resistance—Not Intelligence

Neural networks take the path of least resistance during the training process. For example, a neural network trained to detect pneumonia in chest X-rays learned to focus on metadata or markers in the images rather than the actual lung features. This occurred because certain hospitals included different markers or annotations in their X-rays, and the model learned to correlate those with the presence of pneumonia.

As another example, a study showed that image classification models like convolutional neural networks (CNNs) trained on the ImageNet dataset tend to rely on texture rather than shape for classification. For example, a neural network might classify a picture of a cat-like object covered in “elephant skin texture” as an elephant. This preference for textures is easier to exploit than learning the shapes and semantics of objects.

Given the importance of this phenomenon, consider a final example from dermatology image classification. Models trained to detect skin cancer have relied on artifacts such as rulers or measurement tools often included in malignant samples. A model learned to associate the presence of a ruler with malignancy, a clear shortcut that bypassed the need for true diagnostic reasoning.

I appear to be the first to have realized that this same form of self-organization found in image-based CNNs also occurs in transformer-based language models. Most importantly, I realized that this phenomenon can be transformed from being a problem into being the key to producing smaller models that are profoundly more accurate than models 10–100 times their size (even more accurate than models 1,000 times their size).

The key is to make 100% accuracy the path of least resistance. Applying the above BSD steps accomplishes this.

Sophisticated Sentence Splitting

A more sophisticated sentence splitting implementation can include a set of objective transformations based on both clauses and prepositions. It can even include rewriting words, provided that the rewriting is deterministic.

For example, when choosing how to write noun phrases during sentence splitting, an objective transformation must choose whether to consistently use the noun phrase, the complete compound noun phrase, the complete nested noun phrase, etc. The same objective transformation is applied consistently throughout the training set.

Likewise, consistency may be applied in regards to person named entities. For example, the chosen objective transformation may use the full name, or the last name, or an abbreviation, etc., provided that such is applied consistently throughout the training set.

Consider the following complex sentence: “Tom Smith of Dallas and husband of Mary loves to barbecue and he enjoys drinking beer.”

If the objective transformation is based on noun phrases, there is only one correct split (and therefore, the correct split is objectively deterministic):

  • Tom Smith of Dallas and husband of Mary loves to barbecue. Tom Smith enjoys drinking beer.

Any other split would be incorrect.

If the objective transformation is based on complex noun phrases, there is only one correct split:

  • Tom Smith of Dallas and husband of Mary loves to barbecue. Tom Smith of Dallas enjoys drinking beer.

Any other split, including the prior example, would be incorrect.

If the objective transformation is based on nested noun phrases, there is only one correct split:

  • Tom Smith of Dallas and husband of Mary loves to barbecue. Tom Smith of Dallas and husband of Mary enjoys drinking beer.

Any other split would be incorrect, including the prior two examples.

The differing subject phrases across these three splits illustrate how the objective application of a deterministic transformation provides the consistency that the neural network needs in order to fully master the task.

While all three choices (and others) are linguistically correct, 100% accuracy comes from teaching the neural network one consistent objective. The current SOTA wrongly believes that neural networks will try to figure out the best alternative. BSD NLP is based on the correct understanding that neural networks do the opposite — they consistently look for the path of least resistance instead. Thus, BSD provides the path of least resistance to ensure the task is truly mastered.

This is the missing key over SOTA training.

  • There are not 64 “correct” alternatives for a given input, as is the case for neural networks trained on WebSplit.
  • There are no variations of purportedly correct outputs caused by various annotators choosing different ways to split the sentences (e.g., one annotator uses noun phrases, another uses complex noun phrases, another sometimes uses nested noun phrases and other times leaves the pronoun alone, etc.).
  • There is no starting with subjective human summaries (as in the case of DeSSE).
  • There is no starting with non-deterministic sentence graphs.

BSD NLP is the literal opposite of SOTA NLP models that are based on the faulty premise that neural networks can learn to choose the best alternatives. For 100% accuracy, neural networks need to be trained on only one definitive, deterministic transformation for each potential input type. The rest of neural network training can proceed as usual.

BSD — Literally The Only Way to Achieve 100% Accuracy

BSD is literally the only way to train neural networks to achieve 100% accuracy on language tasks. How can I make such a bold statement? Because a model’s hallucination rate is proportional to the degree that its training deviates from BSD. The inverse also holds: the closer neural networks and models are to BSD, the greater their accuracy.

Consider LIMO (Less Is More For Reasoning) as a perfect case in point. While the researchers did not apply a deterministic transformation, they did apply a more normalized transformation—thereby inadvertently moving the training closer to a BSD model. Because it is not deterministic, they did not achieve 100%. But the mere fact of normalization profoundly improved accuracy.

For example, the prior SOTA on the AIME reasoning benchmark was 6.5% (using 100,000 training samples). Meanwhile, LIMO achieved 57.1% (using only 817 samples). In other words, LIMO achieved a 778% gain while using less than 1% of the training data.

The industry is beginning to move in the direction of BSD. And it will continue to do so, because the closer it gets the better the results.

In short, BSD is the only way to achieve 100% accuracy because any deviation from it introduces errors (i.e. hallucinations).

Dawn of 100% Accurate AI

Acurai has already confirmed the 100% accuracy of BSD three times over.

The results of the BSD Sentence Splitting test have already been discussed above.

Acurai also tested BSD Summarization. More specifically, we wanted to compare the results of BSD to the challenges that Apple was facing in creating headline summaries of BBC News articles and the like.

For those unfamiliar, BBC News filed a formal complaint with Apple regarding hallucinations in its automated summary headlines.

For example, one headline read: “Brazilian tennis player, Rafael Nadal, comes out as gay.” This short headline includes three hallucinations:

  • The story was about Joao Lucas Reis da Silva (not Rafael Nadal).
  • Rafael is not Brazilian.
  • Rafael has not come out as gay.

As another example, one Apple-generated headline falsely claimed that Luigi Mangione shot himself. He did not.

As another example, a headline falsely claimed that Netanyahu was arrested. He was not.

The summarization issue plagued other Apple services as well, including messaging summarization.

For example, Andrew Schmidt’s mother texted: “That hike almost killed me!” However, the summary that Schmidt saw was a notice that his mother had attempted suicide.

Summarization Method #1

The first method we tried was as follows:

  • Use BSD Sentence Simplification => BSD Coreference Resolution to create Formatted Facts (FFs) from the article.
  • Ask the LLM to choose the Formatted Fact that most represented the overall article.

While this approach led to 100% hallucination-free summarization, it suffered from suboptimal relevance. The LLM was not capable of choosing the optimal FF.

Summarization Method #2

The second method we tried was as follows:

  • Standardize the article using a Spelling/Grammar Correction model.
  • Then use BSD Sentence Simplification => BSD Coreference Resolution to create Formatted Facts (FFs) from the article.
  • Then ask the LLM to create its own one-sentence summary.
  • Then use vector and index-based searching to locate the FF that was most similar to the sentence produced by the LLM.

This achieved 100% hallucination-free summarization that was also very relevant.

100% Hallucination Elimination on RAGTruth for GPT-4 and GPT-3.5 Turbo

I have previously written about Acurai’s 100% hallucination elimination on RAGTruth for GPT-4 and GPT-3.5 Turbo.

This was accomplished using Formatted Facts. This present article discloses how to perform BSD Sentence Simplification. The next article will teach how to perform BSD Coreference Resolution. Thus, you will know how Acurai produces Formatted Facts—step by step—with full transparency.

Acurai’s Methods Fully Revealed

I currently serve as the Chief Technology Officer at Acurai Inc. Acurai is shorthand for Accurate AI. Our mission is to deliver 100% accurate AI across various NLP tasks and knowledge domains.

I have received permission to share Acurai’s proprietary methods. Perhaps these methods can inspire you to develop new ones. However, if you want to use Acurai’s methods (or a derivative of them), it’s important to contact Acurai for permission to do so.

Also, if you want to learn Acurai’s proprietary methods prior to the publication of future articles, I encourage you to go straight to Acurai’s patent application.

The Insanity of Relying on Vector Embeddings: Why RAG Fails

In RAG, the goal is to locate the stored information that has the highest percentage of sameness to the provided query. Vector similarity search does not do this. That’s why RAG fails.

Wrong Tool for the Job

RAG fails in production because vector embeddings are the wrong choice for determining percentage of sameness. This is easily demonstrated. Consider the following three words:

  • King
  • Queen
  • Ruler

King and ruler can refer to the same person (and are thus considered synonyms). But king and queen are distinctly different people. From the perspective of percentage of sameness, king/ruler should have a high score and king/queen should be literally zero.

In other words, if the query is asking something about a “king” then chunks discussing a “queen” would be irrelevant; but chunks discussing a “ruler” might be relevant. Yet, vector embeddings consider “queen” to be more relevant to a search on “king” than “ruler.” Here are the vector similarity scores for queen and ruler when compared to king using OpenAI’s ADA-002 embeddings:

  • King
  • Queen: 92%
  • Ruler: 83%

When asking for information regarding a king, passages with information regarding a queen will take precedence over passages regarding a ruler; even though the queen passages cannot be relevant at all.
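These scores are straightforward to reproduce. Here is a sketch assuming access to OpenAI’s ADA-002 embeddings; exact percentages may vary slightly by model version:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king = embed("King")
print(f"Queen: {cosine(king, embed('Queen')):.0%}")  # ~92%
print(f"Ruler: {cosine(king, embed('Ruler')):.0%}")  # ~83%
```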

Vector Embeddings are Wrong for: Who, What, When, Where, and How Questions

The vector embedding problem does not only occur with words referring to people (such as king); it also occurs with words referring to things.

Consider a query asking about the traits of a cat. Passages discussing dogs should have a score of zero in regards to percentage of sameness; and passages dealing with felines should have an extremely high score. Yet, once again, vector embeddings get this wrong:

  • Cat
  • Dog: 86%
  • Feline: 85%

Even though the scores are 1% different, this still means that passages regarding dogs take precedence over passages regarding felines; even though the dog passages have zero relevance and the feline passages are extremely relevant.

The vector embedding issue isn’t even just confined to people and things, but also affects searches regarding time.

Consider a question regarding the 1900s. From a percentage of sameness standpoint, passages regarding the 1700s should be zero percent, and passages regarding the 20th century should literally be 100% (as ‘1900s’ and ‘20th century’ are interchangeably the same). Yet, once again, vector embeddings misrepresent degree of sameness:

  • 1900s
  • 1700s: 91%
  • 20th century: 89%

Notice that the 1700s are considered far more similar (despite 0% relevancy) than the 20th century (despite the 20th century being literally the exact same thing as the 1900s).

Words that mean the exact same thing are called absolute synonyms or perfect synonyms. Yet, even in regards to absolute synonyms, vector embeddings give priority to things that are not even synonyms at all—as the following example further demonstrates.

“The Big Apple” is a direct reference to New York City. Now consider Susan, a New Jersey resident who wrote a slew of blog posts regarding the restaurants, museums, and other places she visits in her home state. However, one of Susan’s posts states that she got married in “The Big Apple.” A visitor to Susan’s website asks the chatbot: “Has Susan ever been to New York?”

Unfortunately, the numerous entries regarding New Jersey would take precedence over Susan’s marriage posting. Why? From a vector embedding perspective, “New Jersey” is more semantically similar to “New York” than “The Big Apple” is:

  • New York
  • New Jersey: 90%
  • The Big Apple: 89%

Depending on the number of postings involving “New Jersey,” the reference to “The Big Apple” might not be included even if the chatbot requests hundreds of potential candidates. Thus, vector embeddings can fail regarding locations (e.g., New York), just as they can for people (e.g., kings), things (e.g., cats), and time (e.g., the 1900s).

In fact, vector embeddings can fail for instructions as well.

  • bake a cake
  • bake a pie: 93%
  • make a chocolate cake: 92%

Consider a query asking how to “bake a cake.” Passages that discuss “bake a pie” (93% score) will take precedence over passages stating “make a chocolate cake” (92% score); even though the former is completely irrelevant and the latter is directly relevant.

The above examples show that vector similarity is not a reliable measurement of percentage of sameness. In fact, it is not reliable for people (king), things (cat), times (1900s), locations (New York), or even instructions (bake a cake). In other words, vector embeddings do not reliably measure percentage of sameness for questions regarding who (people), what (things), when (times), where (location), and “how to” (instructions). Said another way, vector embeddings are fundamentally flawed for virtually every type of question that a person can ask.

Query Context Won’t Save You

Critics of an earlier version of this article unanimously shout “context matters.” They argue that the similarity of individual words doesn’t matter because the context of the query somehow resolves everything.

First, these critics completely ignored all the studies detailed below. The studies on OP-RAG, KG-RAG, RankRAG, LongRAG, etc. document that the query context does not magically resolve the math.

Second, these critics need to take the time to apply the same math above to multiple words (a study I have personally conducted). If they did, they would see that the math gets worse as more words are added, not better, most especially when a keyword in the query is paired with the wrong semantically similar word.

As one example, ChatGPT-4 used to give the wrong mother for Afonso II: it gave the mother of Alfonso VII (an entirely different person) instead. The reason is that “Afonso” and “Alfonso” are semantically similar (even though they are 0% the same). More importantly, ChatGPT-4 gave the wrong answer because of query context. Consider the following query: “Who was the mother of Afonso II, the third king of Portugal?”

  • In the training data, the word “mother” is found close to the word Alfonso.
  • There was no word “mother” close to the word Afonso in the training data.

Therefore, the context of “mother” caused ChatGPT to overlook the fact that Afonso II and Alfonso VII are entirely different people. The query context made the matter worse, not better. For a more detailed explanation of the Afonso Debacle, see the link to my tutorial at the end of this article.

OpenAI has since fine-tuned the Afonso answer, just as it does with other public hallucinations, only making ChatGPT even worse.

The same goes for vector embeddings by themselves. If the same training data were used to provide chunks for RAG, the RAG-based chatbot would give the same result: “mother” + “Alfonso” has greater vector similarity to the query than “Afonso” alone.

  • mother of Afonso
  • mother of Alfonso: 93%
  • Afonso: 90%

Thus, the query context only made things worse, not better.
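The same check can be run on multi-word phrases to see the context effect directly. Again, this is a hedged sketch with an assumed model; the claim to test is the ordering, not the exact percentages.

```python
# Sketch: does adding the context word "mother" pull the wrong person
# (Alfonso) above the right one (Afonso)?
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

query_vec = model.encode("mother of Afonso", convert_to_tensor=True)
for candidate in ["mother of Alfonso", "Afonso"]:
    candidate_vec = model.encode(candidate, convert_to_tensor=True)
    score = util.cos_sim(query_vec, candidate_vec).item()
    print(f"{candidate!r}: {score:.0%}")
```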

What RAG Traditionalists Are Not Telling You

Perhaps you wonder whether the above examples are cherry-picked, or whether the percentage scores don’t actually matter. So let’s take a look at what RAG enthusiasts aren’t telling you by comparing the gaslighting presentation of RAG with how RAG actually works.

  • Gaslighting Presentation of RAG: Store the vector embeddings of millions of chunks in a vector database. Get the vector embedding of the user’s query. Using cosine similarity, find the top three matching chunks and send them to the LLM with the query. This is a “fast, accurate, and scalable” solution (quote from a leading AI author whose company has taught over 400,000 people — see below).
  • How State-of-the-Art RAG Actually Works: Load vectors for thousands of documents into a vector database. Retrieve almost 50,000 characters of chunks to send to the LLM along with the query, resulting in an unreliable chatbot (e.g. an F1 score lower than 50).

Consider the release of OP-RAG on September 3, 2024.

OP-RAG is the work of three Nvidia researchers, so the study comes from a reputable source.

Also, the results reported in the OP-RAG paper are for the EN.QA dataset. Here are the first two questions in that dataset:

  • when is the last episode of season 8 of the walking dead
  • in greek mythology who was the goddess of spring growth

Thus, the answers are short; they do not require lengthy exposition. Moreover, the dataset covers only 3.8% of the larger Wikipedia corpus.

Yet, even with all the resources of Nvidia, a relatively modest dataset size, and relatively short answers, the researchers broke the prior state of the art with a new RAG method that achieved a 47.25 F1 score by sending 48K characters of chunks along with the query (sending fewer characters results in an even lower F1 score).

Did these Nvidia researchers fail to get the memo that they should have been able to store more than 25 times that number of vectors and consistently find the relevant answer in the top three matches? Of course not. That’s not how RAG works in the real world. Also see Nvidia’s LongRAG, released on November 1, 2024, as another perfect case in point.

Advanced RAG Won’t Save You

I’m writing this article because of the many forum posts I see where data scientists and programmers believe they are doing something wrong. Usually, some well-intentioned person will throw out a myriad of things to try: reranking, query rewriting, BM25, Knowledge Graphs, etc., throwing everything against the wall and hoping that something sticks.

Reranking

Reranking is perhaps the most recommended Advanced RAG strategy. However, as the RankRAG study shows, even using a fine-tuned model for reranking results in only a 54.2 score on EN.QA. General-purpose reranking models scored even worse.
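For readers unfamiliar with the technique, reranking adds a second-pass model that re-scores the retrieved chunks jointly with the query. The sketch below shows the generic shape of such a pipeline; it is not RankRAG itself, and both model names are assumptions.

```python
# Generic rerank sketch (not RankRAG): retrieve candidates with embeddings,
# then re-score each (query, chunk) pair jointly with a cross-encoder.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")              # assumed
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed

chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]  # your corpus
query = "in greek mythology who was the goddess of spring growth"

# First pass: cosine similarity over embeddings.
chunk_vecs = retriever.encode(chunks, convert_to_tensor=True)
query_vec = retriever.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_vec, chunk_vecs, top_k=50)[0]

# Second pass: the cross-encoder reads query and chunk together.
candidates = [chunks[hit["corpus_id"]] for hit in hits]
scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
```

The second pass can improve ordering because the cross-encoder sees the query and chunk together, but as the RankRAG numbers show, it cannot rescue retrieval that failed to surface the answer in the first place.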

GraphRAG and Knowledge Graphs

A recent study on KG-RAG (RAG enhanced with Knowledge Graphs) showed an F1 score of 25% and an accuracy of 32% on the CWQ dataset. Interestingly, Knowledge Graph RAG had a lower accuracy than regular embedding RAG (which had 46% accuracy).

As for Microsoft’s GraphRAG, Microsoft itself admits that it only achieves a level equal to naive RAG! As stated by Microsoft: “Results show that GraphRAG achieves a similar level of faithfulness to baseline RAG.” “As baseline RAG in this comparison we use LangChain’s Q&A” (aka naive RAG). See “GraphRAG: Unlocking LLM discovery on narrative private data”.

Keyword Hybrid Search

Even adding BM25 keyword search and/or HyDE and/or summarization still results in an average score below 0.50 across benchmarks.

Source: “Searching for Best Practices in Retrieval-Augmented Generation”

The combination of various Advanced RAG search methods resulted in a top average score of 0.446. However, even this level of “accuracy” is impractical in real-world chatbots. In the study, the mere combination of BM25 + HyDE took 11.71 seconds per query.
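For context, hybrid search blends a keyword score with a dense embedding score, which is also why it pays the latency of two passes per query. Below is a minimal sketch under an assumed model and an assumed blend weight; real systems tune the normalization and weighting.

```python
# Sketch of hybrid retrieval: blend a BM25 keyword score with a dense
# embedding score. Illustrative only.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = ["...document 1...", "...document 2...", "...document 3..."]
query = "bake a cake"

# Keyword pass: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
keyword_scores = bm25.get_scores(query.lower().split())
keyword_norm = keyword_scores / max(keyword_scores.max(), 1e-9)

# Dense pass: cosine similarity over embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
doc_vecs = model.encode(docs, convert_to_tensor=True)
dense_scores = util.cos_sim(model.encode(query, convert_to_tensor=True), doc_vecs)[0]

alpha = 0.5  # assumed blend weight
hybrid = [alpha * k + (1 - alpha) * d
          for k, d in zip(keyword_norm, dense_scores.tolist())]
ranked = sorted(zip(hybrid, docs), reverse=True)
```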

Real-World vs Hype

There simply is no study showing that vector embeddings, combined with dozens of Advanced RAG techniques, results in a reliable chatbot in production environments containing numerous documents. Moreover, the added latency of many Advanced RAG techniques makes them impractical for real-world chatbots—irrespective of the accuracy issue.

Bigger LLMs Won’t Save You

Consider the Databricks study released in October 2024.

In order to get over 80% correctness, RAG needed to send 64K characters of chunks to OpenAI’s o1. None of the other models reached 80%, including GPT-4o, GPT-4 Turbo, and Claude-3.5 Sonnet. Yet, there are numerous problems with the o1 results.

First, the hallucination rate is still too high.

Second, o1 is extremely slow even when processing short contexts. Processing 64K of context is unbearably slow.

Third, o1 is expensive to run.

To top it all off, word on the street is that the latest batch of upcoming models fails to deliver any significant improvement over already released models, with Anthropic even indefinitely delaying the release of any new model.

But even if larger models could overcome the problem, they would be slower and more expensive. In other words, they’d be too slow and too expensive for any practical purpose. Would companies pay more for a chatbot than for a person, when the chatbot would require up to a minute for each unreliable answer?

That’s the actual state of RAG. That’s the actual outcome of relying on vector embeddings.

It’s Not You. It’s Them.

The problem is that what is being taught to hundreds of thousands of people is patently untrue. The following is from a book updated in October 2024, written by cofounders of a company that has taught over 400,000 people:

RAG is best suited for scenarios where you need to process large datasets that cannot fit within a single LLM context window and when fast response times and low latency are necessary. …

Nowadays, a RAG system has a standard architecture already implemented in popular frameworks, so developers don’t have to reinvent the wheel. …

Once the data is converted into embeddings, vector databases can quickly find similar items because similar items are represented by vectors close to each other in the vector space, which we refer to as a vector store (storing vectors). Semantic search, which searches within vector stores, understands the meaning of a query by comparing its embedding with the embeddings of the stored data. This ensures that the search results are relevant and match the intended meaning, regardless of the specified words used in the query or the type of data being searched.

As the math shows, vector embeddings do not find items based on percentage of sameness. They do not understand the meaning of a query. They most certainly do not “ensure” that search results are “relevant” even with the simplest queries, let alone “regardless of the specified words used in the query or the type of data being searched.”

As the research paper on OP-RAG shows, even with 400 chunks retrieved via vector searching, the LLM can fail to find relevant information more than 50% of the time on the most simple of benchmarks. Nevertheless, data scientists are taught in textbooks: “In a real-world project, one might upload a whole website or course to Deep Lake [vector database] to search across thousands or millions of documents. … To generate a response, we retrieve the top-k (e.g. top-3) chunks most similar to the user’s question, format the prompt, and send it to the model at 0 temperature.”

Textbooks currently teach students that vector embeddings are so powerful that they can store “millions of documents” and then find the relevant answer to queries in the “top-3” chunks. Again, the math and cited research studies show this to be patently untrue.
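To be clear about what is being criticized, here is what that textbook recipe amounts to in code. This is a hedged sketch of the naive pipeline with assumed model names and an assumed prompt format; the math and studies above show why its top-3 chunks so often miss the answer.

```python
# Sketch of the "naive RAG" recipe: embed chunks, take the top-3 by cosine
# similarity, and prepend them to the prompt for an LLM at temperature 0.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3...", "...chunk 4..."]
query = "Has Susan ever been to New York?"

chunk_vecs = model.encode(chunks, convert_to_tensor=True)
top3 = util.semantic_search(model.encode(query, convert_to_tensor=True),
                            chunk_vecs, top_k=3)[0]

context = "\n".join(chunks[hit["corpus_id"]] for hit in top3)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` would then be sent to the LLM at temperature 0.
```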

The Road to 100% Accurate Responses

The answer to the problem is to stop relying on vector embeddings.

Does this mean that vector embeddings are useless? No! Not at all! They have a very important use in Natural Language Processing (NLP).

For example, vector embeddings are a powerful tool for words that have multiple meanings. Consider the word ‘glasses’: it can refer to drinking glasses and eyewear glasses (among other things).

Now consider the following query: What type of glasses does Julia Roberts wear? Vector embeddings help ensure that chunks regarding eyeglasses rank above chunks referring to drinking glasses. That’s where their semantic power lies.

The launch of ChatGPT brought about a rather unfortunate shift in the data science community. Important NLP tools such as the use of synonyms, hyponyms, hypernyms, holonyms, and more were set aside in favor of chatbot queries.

There is no doubt that LLMs obviated some parts of NLP. But we are currently in the stage where the data science community has thrown out the proverbial baby with the bathwater.

LLMs and vector embeddings are the missing piece of the NLP puzzle. They are not the entire picture in and of themselves.

For example, companies have long noticed that visitors leave their sites when chatbots don’t provide the product listings they are looking for. Therefore, companies tried replacing their keyword-based search with synonym-based search.

The synonym-based search did find products that the keyword-based search could not. But it came at a price: words with multiple meanings often caused irrelevant information to drown out what the visitor wanted. For example, a visitor looking for drinking glasses might get a lot of listings presenting eyewear glasses instead.

Yet, rather than throw the whole thing out, this is where vector embeddings can come to the rescue. Instead of relying on vector embeddings as the primary search, use them as a refinement: rely on the synonym-based search, then use vector embeddings to bring the most relevant listings to the top, as in the sketch below.
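Here is a minimal sketch of that two-stage idea, using WordNet for the synonym stage and embeddings only for ordering. The listing data, helper name, and model are illustrative assumptions, not Acurai’s production method.

```python
# Sketch: synonym-based recall first, embedding-based ordering second.
# Assumes nltk's WordNet data is available (nltk.download("wordnet")).
from nltk.corpus import wordnet
from sentence_transformers import SentenceTransformer, util

def expand(term: str) -> set[str]:
    """Return the term plus its WordNet synonyms, lowercased."""
    names = {term}
    for synset in wordnet.synsets(term):
        names.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())
    return {name.lower() for name in names}

listings = [
    "Set of 6 crystal drinking glasses",
    "Designer eyewear glasses with UV protection",
    "Stainless steel water bottle",
]
query = "drinking glasses"

# Stage 1 (recall): keep any listing that mentions the term or a synonym.
terms = expand("glasses")
recalled = [l for l in listings if any(t in l.lower() for t in terms)]

# Stage 2 (ordering): embeddings decide rank, not membership.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
scores = util.cos_sim(model.encode(query, convert_to_tensor=True),
                      model.encode(recalled, convert_to_tensor=True))[0]
ranked = [l for _, l in sorted(zip(scores.tolist(), recalled), reverse=True)]
```

The design point is the division of labor: synonyms decide what is retrieved, and embeddings only decide the order, so a multi-meaning word cannot silently crowd out the listings the visitor wanted.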

Once you have the relevant listings, methods such as those disclosed in the Acurai research paper can be used to produce 100% accurate, hallucination-free responses. These methods will soon be included in my series on Eliminating Hallucinations.

I’ll also be adding a section on RAG — including novel search methods that, when combined, rapidly pinpoint relevant sentences and sections from within millions of documents. The retrieved information can then be converted to Fully-Formatted Facts for 100% accurate, hallucination-free responses.

I’ll hopefully have time to add to the series over the holidays. However, I’ve been wanting to write this present article for a long time, due to the number of people who feel they are somehow failing to implement what they’ve been taught. For now my message is short: It’s not you. It’s them.

100% Accurate AI Claimed by Acurai — OpenAI and Anthropic Confirm Acurai’s Discoveries

Acurai’s audacious claims to have discovered how LLMs operate are now confirmed by studies conducted by OpenAI and Anthropic.

In March 2024, this present author published “Eliminate Chatbot Hallucinations — Yes, Eliminate Them.” This article made the audacious claim that LLMs self-organize around Noun Phrases, and that the behavior of LLMs can be controlled through Noun Phrase manipulation. Recent studies by Anthropic and OpenAI now confirm these to be empirical truths. This is wonderful news! After all, these truths are the basis for eliminating hallucinations — yes, eliminating them.

Noun-Phrase Dominance Model

In March 2024, I presented the revolutionary discovery of the “Noun-Phrase Dominance Model”:

This present inventor’s Noun-Phrase Collision Model led to the development of the higher-level Noun-Phrase Dominance Model — the model that is the key to using LLM token prediction to consistently generate factually accurate output. The Noun-Phrase Dominance Model is perhaps best understood from the perspective of another type of neural network — CNNs (Convolutional Neural Networks).

CNNs are often used for image identification. For example, CNNs can be trained to distinguish images of people, pets, boats, etc. CNNs consist of multiple layers of neurons. Remarkably, during training, these layers self-organize. For example, the early layers self-organize around detecting simple patterns such as edges and textures. The latter layers self-organize by combining the information from earlier layers into more complex patterns like shapes — shapes including the recognition of eyes, ears, legs, steering wheels, etc.

No one tells the CNN to do this. Even though CNNs are merely a collection of neurons with probabilistic weights and biases, CNNs automatically self-organize in this manner in order to fulfill the training objective. While much is discussed in the literature regarding the self-organizing nature of CNN neural networks, little if anything is discussed regarding the self-organizing nature of Transformer Neural Networks — the type of neural network used to construct the most popular Large Language Models such as ChatGPT.

This present inventor’s Noun-Phrase Dominance Model states that neural networks self-organize around noun phrases during the training of Large Language Models.

(emphasis in original)

The article then discusses controlling LLM behavior (e.g. ensuring 100% accurate responses) by manipulating the noun phrases that are sent in the query and in the passages of RAG-based chatbots.
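As a concrete starting point, noun phrases can be extracted programmatically. Below is a minimal sketch using spaCy; the model name is an assumption, and this shows only the extraction step, not the manipulation method itself.

```python
# Sketch: extract the noun phrases in a query or passage.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English model

text = "Who was the mother of Afonso II, the third king of Portugal?"
doc = nlp(text)

# spaCy exposes flat noun phrases via doc.noun_chunks.
for chunk in doc.noun_chunks:
    print(chunk.text)  # e.g. "the mother", "Afonso II", "Portugal"
```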

Anthropic and OpenAI Studies Now Confirm Noun-Phrase Dominance Model

LLMs are constructed from multiple layers. In other words, the input (prompt) passes through many layers to generate the output.

Each layer contains many neurons. Each neuron has various values it has learned during training (such as weights and biases).

The Noun-Phrase Dominance Model says that neurons don’t operate on their own, but rather self-organize around noun phrases. Both OpenAI and Anthropic recently discovered this to be the empirical truth — the actual way that LLMs operate under the hood.

As reported by Axios AI+ on August 23, 2024:

One way AI researchers are trying to understand how models work is by looking at the combinations of artificial neurons that are activated in an AI model’s neural network when a user enters an input.

These combinations, referred to as “features,” relate to different places, people, objects and concepts.

Researchers at Anthropic used this method to map a layer of the neural network inside its Claude Sonnet model and identified different features for people (Albert Einstein, for example) or concepts such as “inner conflict.”

They found that some features are located near related terms: For example, the “inner conflict” feature is near features related to relationship breakups, conflicting allegiances and the notion of a catch-22.

When the researchers manipulated features, the model’s responses changed, opening up the possibility of using features to steer a model’s behavior.

OpenAI similarly looked at a layer near the end of its GPT-4 network and found 16 million features, which are “akin to the small set of concepts a person might have in mind when reasoning about a situation,” the company said in a post about the work.

[Bolding added]

First, notice that Anthropic and OpenAI now confirm that neurons do indeed self-organize — just as the Noun-Phrase Dominance Model stated.

Second, notice that the self-organization is not around verbs, adjectives, adverbs, etc. In stark contrast, the neurons self-organize around “places, people, objects and concepts.” In other words, the neurons self-organize around noun phrases — just as the Noun-Phrase Dominance Model stated.

Third, noun phrase groupings (i.e. features) cluster “near related terms,” affirming the existence of Noun-Phrase Routes — just as the Noun-Phrase Dominance Model stated.

Fourth, notice that Anthropic and OpenAI found that manipulating noun phrases can be used to “steer a model’s behavior”—just as the Noun-Phrase Dominance Model stated.

Eliminate Hallucinations—Yes, Eliminate Them

This is remarkable news. After all, the Noun-Phrase Dominance Model is the key to eliminating hallucinations. However, the research community has somehow ignored this model, all while continuing to proclaim hallucinations to be an intractable issue.

Since the March 2024 article, I created a video that uses real-world demonstrations to document the Noun-Phrase Dominance Model, and explains how this is the key to building 100% accurate, hallucination-free, chatbots.

The Noun-Phrase Dominance Model is real. And so is the solution to finally eliminating hallucinations once and for all.

You can build 100% accurate chatbots … today.
