Do you want to train neural networks to achieve 100% accuracy on virtually every language task — including 100% accurate chatbot responses? Bounded-Scope Deterministic (BSD) Neural Networks are what you’ve been looking for. BSD is all you need.

With the BSD training method, neural networks achieve the same precision with language that deterministic programming does with numbers, thereby solving virtually every natural language processing (NLP) task all at once. This includes:

  • 100% accurate low-level NLP tasks, such as Sentence Splitting and Named Entity Recognition.
  • 100% accurate high-level NLP tasks, such as Summarization and Coreference Resolution.
  • 100% accurate LLM tasks, such as 100% hallucination-free Question/Answering and even lengthy exposition.

This article teaches the broader skill of BSD neural network training by showing how to use it for the NLP task of Sentence Splitting. Future articles will teach how to use BSD for 100% accurate coreference resolution, 100% accurate summarization, and even 100% accurate chatbot responses.

BSD neural network training is the foundation for it all.

Formatted Facts and the Discovery of BSD

I discovered BSD Neural Network training when working on the issue of converting text into Formatted Facts (FFs). Formatted Facts are the cornerstone of 100% accurate chatbot responses.

On the surface, creating formatted facts seems simple:

  • First, split complex sentences into simpler ones.
  • Second, apply coreference resolution to the simple sentences.

This pipeline transforms text into independent, self-contained statements (which my company calls “Formatted Facts”).

But this simple pipeline had a very big problem. There was no reliable method for splitting sentences; nor was there any reliable method for performing coreference resolution.

I currently work at Acurai Inc. At Acurai we have a saying, “Less Broken is Still Broken.” In other words, a chatbot that answers questions correctly 80% of the time is still “broken.”

To date, researchers have been pursuing less broken, but still broken, methods for natural language processing (NLP) tasks such as sentence splitting and coreference resolution. For example, even state-of-the-art sentence splitting still carries an error rate of roughly 18–20%, and coreference resolution is no more reliable.

This is problematic on two fronts. First, as data passes through a pipeline, the errors of each component compound. With each step only about 82% accurate, even this simple two-step pipeline has an error rate of roughly 33% (0.82 × 0.82 ≈ 0.67 accuracy).

Second, State-of-the-Art (SOTA) Sentence Splitting was achieved by fine-tuning LLMs. In other words, even fine-tuned LLMs have an approximately 20% error rate for Sentence Splitting. (See link above.) Thus, LLMs are not powerful enough in and of themselves.
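
A quick back-of-the-envelope check makes the compounding concrete (the per-step error rates here are illustrative, assuming each stage of the pipeline performs roughly as well as SOTA sentence splitting):

```python
# Illustrative compounding of per-step error rates in a two-step pipeline.
step_accuracy = 1 - 0.184            # assumed per-step accuracy (~18.4% SOTA splitting error)
pipeline_accuracy = step_accuracy ** 2   # two steps: splitting, then coreference resolution
print(f"{1 - pipeline_accuracy:.0%}")    # -> 33% pipeline error rate
```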

BSD — Fundamental Building Block of AGI

I find it odd that chatbot makers keep hyping “AGI” when LLMs cannot even accurately split complex sentences into simpler ones; nor can LLMs even count the number of “R’s” in raspberry.

LLMs were notoriously mocked for not being able to count the number of R’s in strawberry. Therefore, they were eventually fine-tuned to be able to do so. But that didn’t teach them how to count letters in other words (such as raspberry). Consider o1:

It may be possible that o1 has been fine-tuned at this point for raspberry. But let’s face it, fine-tuning models on a per-question basis is not any form of “intelligence.” One can simply create a large database to unintelligently retrieve the answers in that case.

Artificial General Intelligence (AGI) is going to need to be able to access information in real time, and be able to digest the facts contained in the information with 100% accuracy. Thus, solving the 100% accurate Formatted Facts (FFs) Pipeline is an essential step towards AGI.

I invented BSD Neural Networks to solve the 100% accurate Formatted Facts (FFs) Pipeline. However, I quickly realized that BSD Neural Network training was what AI enthusiasts have been looking for all along. In addition to 100% accurate sentence splitting and coreference resolution, BSD Neural Networks can be used for 100% accurate named entity recognition, part-of-speech tagging, document summarization, and even 100% accurate chatbot responses.

In doing so, BSD Neural Networks also serve as the essential building blocks of reasoning and AGI. In fact, future articles will disclose the secrets to reliable AI reasoning (with BSD Neural Networks as the guarantors of that reliability).

Sentence Splitting as Fact Extraction

BSD would be a remarkable discovery even if it only resulted in 100% accurate sentence splitting. After all, data scientists have been pursuing accurate sentence splitting for over 55 years. The quest for accurate Sentence Splitting has gone from rule-based approaches (1960s-1970s), to statistical approaches (1980s-1990s), to machine learning (2000s-2010s), to deep learning and neural networks (2010s-present).

Sentence splitting has long been recognized as a key component of fact extraction. Consider the following two real-world sentences:

  • The last 4 kilometres (2.5 mi) of the remaining original _Reichsautobahn_, a section of A 11 northeast of Berlin near Gartz built in 1936 — the westernmost remainder of the never-finished Berlinka — was scheduled for replacement around 2015.[_needs update_] Roadway condition is described as “deplorable”; the 25 metres (82 ft)-long concrete slabs, too long for proper expansion, are cracking under the weight of the traffic as well as the weather.

Too many people write run-on sentences, including Wikipedia authors and news journalists (and this present author as well :-)). Meanwhile, LLMs struggle with complex sentences, both when pre-training parametric knowledge and when such content is sent as input in RAG-based implementations.

Now consider the above sentence after being split by a fine-tuned BSD Neural Network using the same step-by-step method disclosed below:

  • The last 4 kilometres (2.5 mi) of the remaining original _Reichsautobahn_ was scheduled for replacement around 2015.
  • [_needs update_].
  • The last 4 kilometres of the remaining original _Reichsautobahn_ is a section of A 11.
  • The section of A 11 is northeast of Berlin.
  • The section of A 11 is near Gartz.
  • The section of A 11 was built in 1936.
  • The section of A 11 is the westernmost remainder of the never-finished Berlinka.
  • Roadway condition is described as “deplorable”.
  • The 25 metres (82 ft)-long concrete slabs are too long for proper expansion.
  • The 25 metres (82 ft)-long concrete slabs are cracking under the weight of the traffic.
  • The 25 metres (82 ft)-long concrete slabs are cracking under the weather.

Now it is trivial for LLMs to accurately answer questions about this passage.

AI Accuracy Plateau

This article teaches BSD using Sentence Splitting for a number of reasons, one of which is to demonstrate why NLP tasks hit accuracy plateaus and how to fix the plateau issue permanently.

It’s no secret that AI has hit an accuracy plateau in terms of extractive question answering. Thus, AI powerhouses have moved on to focus on images & video, mathematical & scientific reasoning, and code generation. Consider OpenAI’s very impressive strides in its newest image generator, and its current $3 billion pursuit of acquiring Windsurf.

However, little press is given to the fact that the major AI players have moved past trying to solve the extractive QA accuracy plateau. Nevertheless, TechCrunch recently noted that “OpenAI’s new reasoning AI models hallucinate more.” TechCrunch states:

According to OpenAI’s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company’s previous reasoning models — o1, o1-mini, and o3-mini — as well as OpenAI’s traditional, “non-reasoning” models, such as GPT-4o.

Perhaps more concerning, the ChatGPT maker doesn’t really know why it’s happening.

In its technical report for o3 and o4-mini, OpenAI writes that “more research is needed” to understand why hallucinations are getting worse as it scales up reasoning models. O3 and o4-mini perform better in some areas, including tasks related to coding and math. But because they “make more claims overall,” they’re often led to make “more accurate claims as well as more inaccurate/hallucinated claims,” per the report.

For now, OpenAI is currently not focusing on eliminating extractive QA hallucinations.

BSD — Key to Breaking through the AI Accuracy Plateau

SOTA methods for extracting facts from complex sentences hit an 80% accuracy ceiling — a seemingly insurmountable plateau. As the SOTA researchers stated:

The results are shown in Table 2. In all tables, best results are shown in bold. In both cases, scores increased across metrics as training size increased, although by smaller increments from the 3K mark upwards. Best results are split between the two largest training datasets on both BiSECT and DeSSE, indicative of a potential ceiling in terms of improvements from training data augmentation. It might also be the case that the observed plateauing resulted from a lack of variety in the added training data, but verifying this hypothesis was beyond the scope of this work.

The researchers discovered that accuracy did not improve as more training data was added. They reached a plateau.

Importantly, the researchers hypothesized that the plateau could be solved by adding variety to the dataset. This hypothesis is not only wrong, it is actually backwards. As you will see below, the problem is due to variety. Thus, variety isn’t the solution; it’s the problem.

Fortunately, this bold statement can be empirically demonstrated—showing that the industry-wide assumption on neural network training is literally backwards.

5-Entry BSD Dataset Outperforms 1-Million-Entry SOTA Datasets

To empirically demonstrate that BSD is the missing key, a 5-entry BSD dataset was tested. It not only produced a result unheard of in the field; it produced literally zero errors.

As data scientists often say: “The proof is in the pudding.” I couldn’t agree more. Now, here’s the pudding.

The SOTA datasets used to train neural networks for Sentence Splitting and Rephrasing are:

  • DeSSE
  • BiSECT
  • WikiSplit
  • WebSplit.

BiSECT has 928,440 entries. WebSplit has 1,331,515 entries.

Despite training on datasets of roughly one million examples, SOTA Sentence Splitting hit an accuracy ceiling of approximately 80%. Meanwhile, a 5-entry BSD set achieved 100% accuracy on a much more stringent test than the one used to assess the SOTA method.

You read that correctly. A 5-entry BSD dataset outperformed a 1-million-entry dataset, and it did so with 100% accuracy. BSD truly is the revolution that the AI industry has been searching for.

Method: A 5-Sample BSD Dataset was used in few shot prompting to split sentences from 500 BBC news articles.

Result: The BSD method split the sentences with 100% accuracy.

Meanwhile, the SOTA method first fine-tunes an LLM on up to 1 million examples, and then prompts the LLM to split sentences — resulting in an 18.4% error rate on narrative text (the same type of text used in BBC news articles).

The 5-sample few-shot prompt demonstrates the remarkable breakthrough of BSD. Naturally, fine-tuning models on a larger BSD dataset is recommended to ensure 100% accuracy on an ongoing basis. But the fact that a 5-sample few-shot prompt results in 100% accuracy on a stringent test shows BSD to be the correct way to produce perfectly reliable results.

Current Method of Training Language Models is Literally Backwards

Five BSD entries significantly outperformed neural networks trained on over one million entries of other types. So what is the “secret”? The secret is that variety is the source of hallucinations, not the answer to them.

Each BSD dataset entry is structured in a very specific manner that communicates to the neural network precisely what it needs to learn to do. This has been the missing key to 100% accurate NLP neural networks, at least in terms of using supervised training to teach neural networks to perform natural language processing (NLP) tasks.

Supervised neural network training consists of providing the model multiple input => output pairs. The dataset tells the model what the output should be for any given input. The goal is for the model to discover the patterns that exist across the dataset; so that it can then transform inputs that it has never seen before into the desired outputs.
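
As a minimal illustration (the pairs below are toy examples borrowed from later in this article, not a real training set), a supervised dataset is nothing more than a list of input => output pairs, and the hope is that the learned mapping generalizes to inputs that never appeared in training:

```python
# Toy illustration of a supervised dataset: simply a list of input => output pairs.
train_pairs = [
    ("The cat sat on the chair and it was purring.",
     "The cat sat on the chair. It was purring."),
    ("Tom drove home.", "Tom drove home."),
]

# The model is trained to reproduce each target output from its input; the goal is
# for the learned mapping to also handle inputs never seen during training, e.g.:
unseen_input = "The dog lay on the rug and it was snoring."
```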

However, the industry has been training language-based neural networks using stochastic, non-deterministic methods. For example, the WebSplit dataset provides many grammatically correct outputs for each input. Consider the following sentence:

  • Input: “Auburn is part of Lee County in Alabama which is situated within the state of Alabama in the United States where one of the ethnic groups in the United States are the African Americans.”

The WebSplit dataset contains 64 alternative splits for this sentence alone. In other words, there are 64 entries in the dataset where the input is this same sentence, and each of the 64 outputs provides one grammatically correct alternative for splitting it. For example, three of the outputs are:

  • Output: “Auburn is part of Lee County in Alabama . Lee County is situated within the state of Alabama . Alabama is in the United States . One of the ethnic groups in the United States are the African Americans .”
  • Output: “Auburn is part of Lee County , Alabama in the United States . African Americans are an ethnic group within the United States .”
  • Output: “Auburn , Alabama is part of Lee County . Lee County is in the state of Alabama . Alabama is in the United States . African Americans are an ethnic group within the United States .”

And so on. That’s 64 entries. Each entry has the same input. Each entry has a different output.

Notice that this is the opposite of determinism. Determinism, by definition, means that any given input will be transformed into only one correct output. Thus, WebSplit is a stochastic, non-deterministic dataset.
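
To make the structural difference concrete, here is a sketch of the two shapes of training data (only three of WebSplit’s 64 alternative outputs are reproduced; the BSD pair is a toy example used later in this article):

```python
# WebSplit-style data: one input maps to many acceptable outputs.
websplit_style = {
    "Auburn is part of Lee County in Alabama which is situated within the state "
    "of Alabama in the United States where one of the ethnic groups in the "
    "United States are the African Americans.": [
        "Auburn is part of Lee County in Alabama . Lee County is situated within "
        "the state of Alabama . Alabama is in the United States . One of the "
        "ethnic groups in the United States are the African Americans .",
        "Auburn is part of Lee County , Alabama in the United States . African "
        "Americans are an ethnic group within the United States .",
        "Auburn , Alabama is part of Lee County . Lee County is in the state of "
        "Alabama . Alabama is in the United States . African Americans are an "
        "ethnic group within the United States .",
        # ... 61 more alternatives in the actual dataset
    ],
}

# BSD-style data: each input maps to exactly one deterministically derived output.
bsd_style = {
    "The cat sat on the chair and it was purring.":
        "The cat sat on the chair. It was purring.",
}
```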

On one hand, the industry may seem to be pursuing the correct path. After all, there are many grammatically correct ways to split a larger sentence. Therefore, it can even seem wrong for training to assign a penalty cost to a grammatically correct split.

Yet, as will be made clear shortly, BSD intentionally causes the neural network training to assign a penalty cost to grammatically correct sentence splits. In fact, counterintuitively, BSD often requires the model’s loss function to assign a cost to the vast majority of grammatically correct splits.

BSD requires that there is only one unique output for each unique input. Assuming there are only 64 ways to split the above sentence, this means that 63 out of 64 splits will be deemed an error during training, even though they are grammatically correct. In terms of this sentence, that means 98% of the grammatically correct splits are counted as being errors.

If there are more than 64 grammatically correct splits, then more than 98% of the grammatically correct splits will be counted as errors when training a neural network using BSD.

Thus, BSD Neural Network training is the opposite of the way that language models have been trained. The following section unveils BSD’s revolutionary training method.

BSD Neural Network’s Seven Criteria (Steps)

BSD NLP stands for Bounded-Scope Deterministic NLP. The NLP part of the name signifies that the input text must contain at least one human-language sentence. The BSD part is built on two aspects: bounded in scope, and deterministic. Bounded scope refers to the number of required transformations being small enough to be learned (e.g., small enough to achieve a zero cost value from the loss function during training). As for the determinism aspect of BSD, there are seven criteria (a small validation sketch follows the list):

  • 1) There is only one unique output per unique input.
  • 2) The unique output must be deterministically derived from the input text.
  • 3) The selection of transformations that produce the output must be deterministically derived from the input.
  • 4) The selected transformations must be uniformly applied to all outputs.
  • 5) Where the resulting output has multiple values, such that the order of the values can be changed without information loss, the order of the values must be sorted in a deterministic manner. Preferably, first positional occurrence sorting is used.
  • 6) Where the deterministic selection of transformations can be null, there must be at least one input => output pair in which the inputs and corresponding outputs are identical in every respect. The inclusion of additional such pairs will reduce both the size of the neural network required and the training time and cost.
  • 7) Where selection counter examples exist, they must be provided in the input, and the corresponding outputs must be identical to the input.
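
As a small sanity-check sketch (the helper name and the toy pairs are illustrative, not part of any formal BSD specification), criteria 1, 6, and 7 can be verified mechanically over a candidate dataset:

```python
from collections import defaultdict

def check_bsd_criteria(pairs):
    """Check criterion 1 (one unique output per unique input) and flag whether the
    dataset contains identity pairs (the output = input cases required by criteria 6/7)."""
    outputs_per_input = defaultdict(set)
    for inp, out in pairs:
        outputs_per_input[inp].add(out)

    # Criterion 1 violations: any input associated with more than one target output.
    violations = {inp: outs for inp, outs in outputs_per_input.items() if len(outs) > 1}

    # Criteria 6/7: at least one pair whose output is identical to its input
    # (a "do nothing" example or a declined-transformation counter example).
    has_identity_pair = any(inp == out for inp, out in pairs)
    return violations, has_identity_pair

pairs = [
    ("The cat sat on the chair and it was purring.",
     "The cat sat on the chair. It was purring."),
    ("Tom drove home.", "Tom drove home."),                     # null selection (criterion 6)
    ("Tom and Mary drove home.", "Tom and Mary drove home."),   # counter example (criterion 7)
]
violations, has_identity = check_bsd_criteria(pairs)
print(violations)     # {} -> no input has more than one target output
print(has_identity)   # True
```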

Contrasting SOTA Sentence Splitting to BSD Neural Network Training

Training neural networks on WebSplit does not involve any of the above steps. Training neural networks on the rest of the SOTA datasets does not involve implementing criteria 2–6. Yet, as is explained below, steps 2, 3, and 4 are core criteria; and steps 5, 6, and 7 are conditional core criteria. Hence, SOTA training lacks all of the core criteria (at least in terms of SOTA sentence splitting).

The following explains how to train a neural network to accurately split larger sentences into smaller ones.

Consider a simple transformation (Transformation X): Remove the word ‘and’; if the next word is a noun, then end the first clause with the same punctuation used at the end of the sentence and capitalize the next word; if the next word is a verb, end the first clause with the same punctuation, prepend the noun subject of the prior statement to the second clause, and capitalize that added noun subject.

On the surface, splitting a sentence on the word ‘and’ appears trivial. However, even Transformation X is insufficient to qualify as being deterministic. What if the noun subject is a nested noun phrase? What gets added to the beginning of the new split: the entire nested noun phrase, the complex noun phrase, the noun phrase, or the root noun phrase? Each implementation must make a deterministic choice, and apply that choice consistently.

An ideal implementation would use the entire noun phrase (including nesting) to ensure the preservation of meaning. Consider the following sentence: “The old man and woman sat on the bench.” Is the woman old too? The sentence can be read in two ways. Preserving the entire noun phrase (e.g. “old man and woman”) in splits ensures preservation of the original language intent—even when the intent is ambiguous.

Most importantly, this deterministic criterion means that there is only one correct choice for what gets added to the beginning of the new split. One correct choice, and only one. Everything else is an error when computing the loss function — regardless of whether it is grammatically correct or not. Adding this step to Transformation X results in Deterministic Transformation X.
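
Here is a deliberately toy sketch of Deterministic Transformation X (the function name is illustrative; the part-of-speech category of the word after ‘and’ and the full subject noun phrase are assumed to be supplied by the caller, since a real implementation would rely on a deterministic parser):

```python
def deterministic_transformation_x(sentence, next_word_pos, subject_phrase):
    """Toy split of `sentence` at a single coordinating 'and', per Transformation X:
    - drop 'and';
    - if the word after 'and' is a noun: end the first clause with the sentence's
      final punctuation and capitalize the next word;
    - if it is a verb: do the same, but prepend the (full, nested) subject noun
      phrase of the first clause to the second clause."""
    punctuation = sentence[-1] if sentence[-1] in ".!?" else "."
    left, right = sentence.rstrip(".!?").split(" and ", 1)   # assumes one coordinating 'and'

    if next_word_pos == "noun":
        second = right[0].upper() + right[1:]
    elif next_word_pos == "verb":
        second = subject_phrase[0].upper() + subject_phrase[1:] + " " + right
    else:
        return sentence  # declined: 'and' not followed by a noun or verb

    return f"{left}{punctuation} {second}{punctuation}"

# Example: the word after 'and' ("it") is treated as (part of) a noun phrase here.
print(deterministic_transformation_x(
    "The cat sat on the chair and it was purring.", "noun", "The cat"))
# -> "The cat sat on the chair. It was purring."
```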

Even though Deterministic Transformation X is only a very simple example of criteria 2 and 3, notice already that none of the SOTA training methods do either of these. In other words, even before introducing additional transformations, BSD NLP is already different from SOTA sentence splitting.

Consider step #2: deterministically derive the output from the input. WikiSplit annotators had a free hand in choosing where to split. They also freely added words of their own choosing. Thus, step #2 was not performed in the creation of the WikiSplit dataset. The other training datasets also gave the annotators a free hand on where to split, and the annotators also added words of their own choosing. Thus, none of them implemented step #2.

This is literally the opposite of Deterministic Transformation X. Notice how Deterministic Transformation X dictates the precise words that must be added (e.g., the entire subject noun phrase, including nesting). That is the mirror opposite of allowing annotators to choose. In BSD NLP, the D means there are no choices during training. If the deterministic transformation has two or more viable alternatives, then it is not a deterministic transformation in the first place.

Consider step #3: deterministically choose the selected transformation based on the input. Once again, the creation of the SOTA datasets did not include this step. WikiSplit and BiSECT always split the input into two sentences. This means that the annotator subjectively chooses whether to split a particular sentence on “and,” or “but,” or “wherein,” etc. There is no deterministic selection of transformation based on the input.

However, Deterministic Transformation X always results in one split for each ‘and’ that serves as a coordinating conjunction. If there is one such ‘and,’ then there is one split. If there are two such ‘ands,’ then there are two splits. And so forth.

The mere fact that WikiSplit and BiSECT force the input into two splits further demonstrates that step #3 was not used (in addition to not using step #2). Likewise, the annotators of DeSSE were instructed to pick one to four splits of their own choosing from a list of recommended splits. Hence, DeSSE did not implement step #2 or step #3 either.

Just as step #2 is the mirror opposite of SOTA training, so too is step #3 another step that is mirror opposite of SOTA training.

Now consider step #4: The selected transformations must be uniformly applied to all outputs. As stated above, in regards to Deterministic Transformation X, the transformation must be applied every time the word ‘and’ serves as a coordinating conjunction. Also as stated above, none of the SOTA training sets uniformly applied even one transformation across the entire training set, thereby failing to implement this step as well.

Backwards Premise of SOTA NLP Training

SOTA NLP training is based on the premise that neural networks learn intelligence, with the idea being that if the neural network is given a variety of correct ways to split a sentence, then it can learn to choose the best way for any given new sentence.

BSD NLP is based on the exact opposite premise, which is why the steps are literally the mirror opposite of SOTA training methods. BSD NLP is based on the premise that every choice introduced in the outputs adds a degree of error — not a degree of intelligence. The fundamental training premises could not be more different. Thus, it deserves to be repeated:

Every choice introduced in the outputs adds a degree of error—not a degree of intelligence. (per internal testing at Acurai)

If you take away nothing else from this article, you will be well served in confirming the above truth for yourself. After all, this is the missing key to training neural networks to achieve 100% accuracy on natural language tasks.

The Need for Matching Input => Output Pairs

Now consider step #6: Where the deterministic selection of transformations can be null, there must be input => output pairs in which the inputs and corresponding outputs are identical in every respect.

Not all sentences need to be split. For example, where splitting is solely based on Deterministic Transformation X, then sentences that do not have the word ‘and’ should not be split. Therefore, the training data needs to contain examples of when not to split. That is the meaning of step #6 as it relates to sentence splitting.

Yet, notice that none of the SOTA training sets contain even one instance where the output is identical to the input. Unlike SOTA, BSD NLP holds that neural networks do not learn intelligence; rather, they learn the path of least resistance. Thus, the neural network needs to be told when to do nothing so that doing nothing is included in its learned path of least resistance.

Notice that Deterministic Transformation X makes an evaluation on the word ‘and.’ It evaluates whether the word is serving as a coordinating conjunction.

Consider the following sentence: “Tom and Mary walked into the house and sat down.” Only the second ‘and’ serves as a coordinating conjunction. The first ‘and’ does not.

Step #7 means that there should be counter example inputs for every evaluation made by the deterministic selectors.

In terms of Transformation X, this simply means there need to be inputs that include the word ‘and’ where ‘and’ is not being used as a coordinating conjunction; and therefore, there is no split. Hence, the output equals the input.

Again, since all the datasets solely contain splits, they do not implement step #7 either.

In short, there are two types of non-splits (i.e. two types of output = input): inputs where no transformation is even selected, and inputs where the selected transformation declines to perform the transformation due to one or more deterministic evaluations. The criteria in steps #6 and #7 define the types of inputs to include to produce a corresponding output that signifies that a transformation did not take place. Hence, an alternative output to accomplish the same thing can be to return a predefined value (such as “[BLANK]”) as the target output, as this accomplishes the criteria of signifying when a transformation did not take place.

Once the steps are understood, they can easily be applied to training a neural network on virtually any NLP task, including sentence splitting. And because the training is based on the inverse of SOTA methods, it produces profoundly different results. In fact, where all the steps are followed in producing the input / output pairs, the resulting BSD NLP Network can achieve 100% accuracy — a significant leap in performance over prior methods.

Target BSD Output

An ideal BSD NLP implementation will employ all seven criteria/steps. However, steps 2–4 are core BSD NLP criteria. Steps 5–7 are conditional core BSD NLP criteria (i.e., they are core components in NLP tasks that meet the stated condition of the criteria). Consider a training task in which a transformation selection can be null. For such a task, step #6 is a core component because of this condition.

An ideal implementation will include all core criteria, and it will include all conditional core components that match the conditions of the NLP task being trained. Such an implementation produces Perfect BSD Target Outputs from the corresponding training inputs.

While the combination of core criteria ensures 100% accuracy, some NLP tasks may only require implementing some of the core criteria to significantly improve accuracy — even to the point of 100% accuracy. Moreover, BSD criteria are so transformative that even applying them to part of a dataset can significantly improve performance.

BSD Target Output refers to implementing at least one core criteria for transforming inputs containing human-language sentences into deterministically transformed NLP output. Where all core criteria are applied, as well as all conditional core criteria that are applicable to the conditions of the implementation, the NLP deterministic transformation of such sentence-containing training input is called Perfect BSD Target Output.

BSD NLP First Positional Occurrence Sorting

None of the sentence splitting datasets implement step #5 because it does not apply to splitting a complex sentence into multiple sentences. The task itself results in ordered output — in order to preserve the meaning of pronouns.

However, some NLP tasks can result in an output containing multiple values that can be presented in at least one different order while preserving all information. Such NLP tasks meet the condition of step #5, and therefore, ideal implementations would include step #5 to ensure 100% accuracy.

Moreover, ideal implementations will use first positional occurrence sorting. This simply means sorting the order of the values based on the order in which they first appear in the input.
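
In code, single-level first positional occurrence sorting is nearly a one-liner (a minimal sketch; the function name is illustrative):

```python
def first_occurrence_sort(values, text):
    # Order extracted values by where they first appear in the input text.
    return sorted(set(values), key=text.find)

text = "Bob met Alice. Alice waved."
print(first_occurrence_sort(["Alice", "Bob"], text))  # -> ['Bob', 'Alice']
```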

For complex NLP tasks based on multiple steps, a separate first positional occurrence sorting can be applied at each step. This is explained immediately below.

Consider the task of extracting facts about people in a text. Here, the task may involve two levels (i.e., two steps): identify all people, and identify all facts in the input about each person.

When there are multiple levels of an NLP task, ideal BSD implementations use first positional occurrence sorting for each level. Consider a series of self-contained statements. Some statements are about Alice, and others are about Bob. Alice is mentioned first. However, some of the statements about Alice occur after Bob is mentioned.

One deterministic method is to use a one-pass first positional occurrence sorting across the input. The Alice and Bob extractions then occur left to right in a single pass, so some of the Alice statements will indeed appear in the target output after some of the Bob statements.

However, a multi-level first positional occurrence sort would allow the target output to be deterministically organized as: {name}:\nFact_1\nFact_2\n… In other words, the facts about each person are grouped together immediately after the person’s name.

Since this is a two-level task, a two-pass first positional occurrence sorting can be used. The sort order of the names is determined by the first pass. The order of the extracted facts is determined by the second pass. In this way, all of the statements regarding Alice and Bob are grouped together under their respective names while still preserving the requirement of deterministic first positional occurrence sorting.

As long as each name is selected in the order in which they appear in the text; and as long as the facts regarding each name are listed in the order they appear in the text; and as long as the extraction of the facts is done in a deterministic manner (e.g., preserving the facts verbatim), the BSD neural network can now extract grouped facts about people with 100% accuracy.
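
A minimal sketch of the two-pass version follows (the statements and the naive name-matching are illustrative assumptions; a real pipeline would use deterministic fact extraction):

```python
# Two-pass (two-level) first positional occurrence sorting for the Alice/Bob example:
# pass 1 orders the names by first mention; pass 2 keeps each person's facts in
# their original positional order, grouped under the name.
statements = [
    "Alice joined the team in 2019.",
    "Bob leads the design group.",
    "Alice manages the Berlin office.",
    "Bob joined the team in 2021.",
]
text = " ".join(statements)
names = ["Alice", "Bob"]

ordered_names = sorted(names, key=text.find)          # pass 1: names by first occurrence

grouped_output = ""
for name in ordered_names:                            # pass 2: facts in positional order
    facts = [s for s in statements if name in s]
    grouped_output += name + ":\n" + "\n".join(facts) + "\n"

print(grouped_output)
# Alice:
# Alice joined the team in 2019.
# Alice manages the Berlin office.
# Bob:
# Bob leads the design group.
# Bob joined the team in 2021.
```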

BSD Neural Network Training revolutionizes the use of neural networks for NLP and the NLP subfield of AI. It consistently results in 100% accuracy, even on complex language tasks.

At first blush, the preference of first positional occurrence sorting may seem insignificant. However, modern language models are built on token-based transformers. These transformers do not have any inherent awareness of the individual characters in the words they are processing. Hence, using alphabetical sorting would require increasing the size of the model by many orders of magnitude (if such scaling could even overcome the limitation). However, token-based transformers inherently possess positional awareness. By basing the sorting on position, the sorting is based on the inherent capabilities of the architecture, thereby allowing smaller models to achieve 100% accuracy.

Example Implementation of a BSD Neural Network

Note: The language in this section is lifted from a patent application. Hence, it has a formal legal tone. However, it’s included here because it can be very helpful for those unfamiliar with supervised neural network training.

BSD Target Output refers to a target output that is deterministically derived from a training input in accordance with the above criteria.

Figure 1 and Figure 2 illustrate an example embodiment of a BSD Neural Network. Figure 1 depicts example hardware.

Figure 1 shows a BSD neural network 100 (e.g., an NLP server) that includes a volatile storage 101 and a non-volatile storage 102 communicatively connected to a processor 103. The processor 103 is communicatively connected to a network controller 104 that communicatively connects the BSD neural network 100 to an external network 105.

Figure 2 depicts an example process flow for training a neural network.

The Training Inputs 200 contain at least one human-language component. Training inputs are converted into numerical sequences (usually by tokenization), such as converting text into numerical tokens with tiktoken (as OpenAI does for its GPT models). Another popular method is to use SentencePiece to convert text into numerical sequences (as the Llama family of LLMs does). Any method for converting text into numerical sequences falls within the spirit and scope of this step. The numerical sequences are the actual input into the electronic Neural Network 202. Example neural networks include RNNs, CNNs, and transformer-based networks (such as GPT). Any supervised neural network can be used, provided that it supports training on text inputs and outputs. The training method depicted in Figure 2 can be applied to both seq2seq and autoregressive models. Those ordinarily skilled in the art know how to set up the supervised training of seq2seq, autoregressive, and other supervised neural networks. They also know how to choose the model architecture for the given NLP task at hand.

In seq2seq, each input 200 would be sent to the Neural Network. In autoregressive training, a sliding window would likely be used, where each numerical token from the target output 205 is appended token-by-token to the input 200 to form another input, and the next token in the target output is the desired result for that iteration. Those ordinarily skilled in the art know how to implement both seq2seq and autoregressive networks without further explanation.

For each iteration (i.e., epoch), the Loss Function 204 computes the difference between the output 203 of the Neural Network 202 and the corresponding BSD Target Output 205. It is this step where a Loss Function 204 uses BSD Target Outputs to compute the “loss” (or “cost”). It is this step where over 98% of grammatically correct sentence splits can be assigned a penalty cost during BSD NLP training on sentence splitting.

Embodiments can use Cross-Entropy Loss (Log Loss), KL Divergence, Reinforcement Learning, Contrastive Loss or any other loss methods. Any loss method that computes cost relative to the output of the Neural Network and at least one BSD Target Output is a novel innovation, and therefore, falls within the spirit and scope of this disclosure (where the BSD Target Output is a bounded-scope, deterministic transformation of the correlating Training Input).

Herein, for simplicity, Loss Function shall refer to loss functions known in the art, as well as other measurements such as those used in reinforcement learning. While loss functions would typically be used for computing token-by-token differences in NLP neural networks (such as Large Language Models), Reward Signals could be used on a whole-sequence basis and are therefore simply referred to as Loss Function herein. Thus, the term Loss Function is not meant to limit the seq2seq or token-by-token loss calculations chosen for any given embodiment. The limitation is that at least one BSD Target Output be used when computing such. This is the step that can transform the current art from 80% accuracy to literally 100% accuracy. This step can be applied to virtually any Low-Level NLP Neural Network to profoundly increase accuracy. Where a zero loss is eventually reached, the accuracy can literally be 100%.

If the loss during the iteration is less than or equal to the chosen threshold 206 then the training is done 207. The current state of the trained parameters allows for the Neural Network to accomplish its task with optimal accuracy. The state of the trained parameters can be stored in RAM, on disk, in the cloud, or via any other method (thereby allowing the model and its optimal parameters to be replicated on various devices). Moreover, the model with the optimized parameters can be saved as a whole to permanent storage.

Once the threshold has been reached, any input can now be sent to the Neural Network, and the output will be accurate (up to 100% accurate where a zero loss has been reached).

If the threshold has not been reached 206, then the trainable parameters are adjusted relative to the loss 201. Methods for adjusting the parameters (such as weights and biases) are well-known in the art (such as using back propagation and gradient descent with optimizers such as Adam and RMSProp). As previously stated, the innovative step of determining loss based on outputs that are bounded-scope, deterministic transformations of the input can profoundly improve the accuracy of a multitude of NLP Neural Networks. Alternatively, where the scope cannot be bounded, determining loss based on deterministic transformation of the input will profoundly improve accuracy (where deterministic transformation meets the novel criteria disclosed herein). Hence, such would still fall within the spirit and scope of this disclosure.
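
For readers who prefer code to patent prose, here is a minimal PyTorch sketch of the Figure 2 flow (the toy vocabulary, the tiny GRU model, and the threshold value are all illustrative assumptions; any supervised architecture and loss could be substituted, as noted above):

```python
import torch
import torch.nn as nn

BSD_PAIRS = [
    ("The cat sat on the chair and it was purring.",
     "The cat sat on the chair. It was purring."),
    ("Tom drove home.", "Tom drove home."),
]

# Toy word-level vocabulary built from the training pairs (Training Inputs 200).
words = sorted({w for pair in BSD_PAIRS for text in pair for w in text.split()})
vocab = {"<sep>": 0, "<eos>": 1}
vocab.update({w: i + 2 for i, w in enumerate(words)})

def encode(src, tgt):
    # One autoregressive sequence: input tokens, a separator, then the BSD Target Output 205.
    ids = [vocab[w] for w in src.split()] + [vocab["<sep>"]] + \
          [vocab[w] for w in tgt.split()] + [vocab["<eos>"]]
    return torch.tensor(ids)

class TinyModel(nn.Module):  # stand-in for Neural Network 202
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

model = TinyModel(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()   # Loss Function 204
threshold = 0.01                  # element 206: stop once average loss <= threshold

for epoch in range(1000):
    total_loss = 0.0
    for src, tgt in BSD_PAIRS:
        seq = encode(src, tgt).unsqueeze(0)
        logits = model(seq[:, :-1])                       # network output 203
        loss = loss_fn(logits.reshape(-1, len(vocab)),    # compare to the shifted sequence,
                       seq[:, 1:].reshape(-1))            # which embeds BSD Target Output 205
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                  # adjust trainable parameters 201
        total_loss += loss.item()
    if total_loss / len(BSD_PAIRS) <= threshold:          # threshold reached -> done 207
        break
```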

BSD for 100% Accurate Sentence Splitting

BSD revolutionizes the technological field of Natural Language Processing (NLP) by yielding 100% accuracy for low-level NLP tasks. Herein, BSD shall be used as shorthand for BSD NLP.

It bears noting that BSD training data can alternatively be used in few-shot prompting in addition to or in lieu of being used for fine-tuning. In fact, a 5-shot prompt using the following training data resulted in 0 hallucinations when simplifying 2,500 sentences from BBC articles.

A simple sentence splitting implementation could include splitting complex sentences based on coordinating clauses that start with the word “and” (or another coordinating conjunction such as “but,” “or,” “for,” “nor,” “yet,” or “so”). The transformation must also dictate under what deterministic conditions words will be added, and there must be a deterministic method for knowing precisely what words will be added (e.g., the entire subject noun phrase including nesting). In this situation, there is one objective transformation for converting each input into the target output, thereby satisfying the “determinism” aspect of BSD.

In regards to 100% accurate sentence splitting, consider the following input/output pairs:

  • Training Input: The cat sat on the chair and it was purring.
    Target Output: The cat sat on the chair. It was purring.
  • Training Input: Tom drove home.
    Target Output: Tom drove home.

The above is based on a single objective transformation of training input to target output. The sentences are split on the word ‘and’ where the word is being used as a coordinating conjunction and where the word that follows ‘and’ begins a noun phrase. Since the second training input does not contain the word ‘and,’ no transformation is selected, resulting in the target output being equal to the training input.
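
As noted earlier, BSD pairs can also be used for few-shot prompting rather than fine-tuning. Here is a minimal sketch of how such pairs might be packed into a prompt (the prompt wording and function name are illustrative assumptions, not Acurai’s actual prompt):

```python
bsd_pairs = [
    ("The cat sat on the chair and it was purring.",
     "The cat sat on the chair. It was purring."),
    ("Tom drove home.", "Tom drove home."),
]

def build_few_shot_prompt(pairs, new_sentence):
    # Each BSD pair becomes one in-context example; the new sentence is appended last.
    shots = "\n\n".join(f"Input: {src}\nOutput: {tgt}" for src, tgt in pairs)
    return f"{shots}\n\nInput: {new_sentence}\nOutput:"

prompt = build_few_shot_prompt(
    bsd_pairs, "The dog wanted the bone but it was out of reach.")
print(prompt)
```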

Now, consider another simple BSD implementation with multiple objective transformations. As a reminder, where multiple objective transformations exist, the selection of such transformation(s) must be deterministically derived from the input itself.

With this in mind, another implementation could include splitting complex sentences using two objective transformations. The first objective transformation (OT) could be to split on coordinating clauses that begin with the word ‘and’ whenever the following word is not a verb (Deterministic Transformation Y). The second OT could be to split on coordinating clauses that begin with the word ‘but’ whenever the following word is not a verb (Deterministic Transformation Z). The multiple OTs would result in deterministically producing the following input/output training pairs:

  • Training Input 1: The cat was sitting on the chair and it was purring.
    Target Output 1: The cat was sitting on the chair. It was purring.
  • Training Input 2: The dog wanted the bone but it was out of reach.
    Target Output 2: The dog wanted the bone. It was out of reach.
  • Training Input 3: The dog was sitting on the chair and it wanted the bone but it was out of reach.
    Target Output 3: The dog was sitting on the chair. It wanted the bone. It was out of reach.
  • Training Input 4: Harry met Sally.
    Target Output 4: Harry met Sally.
  • Training Input 5: Tom and Mary drove home.
    Target Output 5: Tom and Mary drove home.
  • Training Input 6: But, he chose to come over.
    Target Output 6: But, he chose to come over.

While such an implementation would require a larger neural network than the prior example, the number of learnable parameters would still be quite small compared to some of the most popular models in the art.

Notice also that the correct splitting may be one sentence (no splitting), two sentences, or even three sentences. Where objective transformations are applied, the number of output sentences can vary. In fact, splitting complex sentences can result in anywhere from one to a dozen (or even more) simpler sentences in certain implementations.

Notice how the entries conform to the criteria:

  • Pair 1: Selecting and Implementing Deterministic Transformation Y
  • Pair 2: Selecting and Implementing Deterministic Transformation Z
  • Pair 3: Selecting and Implementing Deterministic Transformation Y & Selecting and Implementing Deterministic Transformation Z
  • Pair 4: Null Selection of Transformations (i.e., no transformations selected)
  • Pair 5: Selecting and Declining Deterministic Transformation Y
  • Pair 6: Selecting and Declining Deterministic Transformation Z

Hence, Pair 4 is an example of step #6. Pairs 5 and 6 are examples of step #7.

Deterministic Transformation Y makes a deterministic evaluation based on the word ‘and.’ The determination is whether to implement the transformation or decline to do so. Therefore, the neural network needs a training entry for each of these scenarios (e.g., Pair 1 and Pair 5).

Likewise, Deterministic Transformation Z makes a similar deterministic evaluation on the word ‘but.’ Hence, the neural network needs an example of both scenarios (e.g., Pair 2 and Pair 6).

Thus, the seven steps/criteria guide the creation of entries for various deterministic decisions (e.g., Select and Implement Y, Select and Decline Y, Select and Implement Z, Select and Decline Z, null Selection (i.e., no Selection)). It is in this way that the path of least resistance equals performing the desired task with 100% accuracy.

Neural Networks Learn Path of Least Resistance—Not Intelligence

Neural networks take the path of least resistance during the training process. For example, a neural network trained to detect pneumonia in chest X-rays learned to focus on metadata or markers in the images rather than the actual lung features. This occurred because certain hospitals included different markers or annotations in their X-rays, and the model learned to correlate those with the presence of pneumonia.

As another example, a study showed that image classification models like convolutional neural networks (CNNs) trained on the ImageNet dataset tend to rely on texture rather than shape for classification. For example, a neural network might classify a picture of a cat-like object covered in “elephant skin texture” as an elephant. This preference for textures is easier to exploit than learning the shapes and semantics of objects.

Given the importance of this phenomenon, consider a final example from dermatology image classification. Models trained to detect skin cancer have relied on artifacts such as rulers or measurement tools often included in malignant samples. A model learned to associate the presence of a ruler with malignancy, a clear shortcut that bypassed the need for true diagnostic reasoning.

I appear to be the first to have realized this same form of self-organization found in image-based CNNs also occurs in transformer-based language models. Most importantly, I realized that this phenomenon can be transformed from being a problem into being the key to producing smaller models that are profoundly more accurate than larger models 10–100 times their size (even more accurate than models 1,000 times their size).

The key is to make 100% accuracy the path of least resistance. Applying the above BSD steps accomplishes this.

Sophisticated Sentence Splitting

A more sophisticated sentence splitting implementation can include a set of objective transformations based on both clauses and prepositions. It can even include rewriting words, provided that the rewriting is deterministic.

For example, when choosing to write noun phrases during sentence splitting, an objective transformation must choose whether to consistently use a noun phrase, a complete compound noun phrase, a complete nested noun phrase, etc. The same objective transformation is applied consistently throughout the training set.

Likewise, consistency may be applied in regards to person named entities. For example, the chosen objective transformation may use the full name, or the last name, or an abbreviation, etc., provided that such is applied consistently throughout the training set.

Consider the following complex sentence: “Tom Smith of Dallas and husband of Mary loves to barbecue and he enjoys drinking beer.”

If the objective transformation is based on noun phrase, there is only one correct split (and therefore, the correct split is objectively deterministic):

  • Tom Smith of Dallas and husband of Mary loves to barbecue. Tom Smith enjoys drinking beer.

Any other split would be incorrect.

If the objective transformation is based on complex noun phrases, there is only one correct split:

  • Tom Smith of Dallas and husband of Mary loves to barbecue. Tom Smith of Dallas enjoys drinking beer.

Any other split, including the prior example, would be incorrect.

If the objective transformation is based on nested noun phrases, there is only one correct split:

  • Tom Smith of Dallas and husband of Mary loves to barbecue. Tom Smith of Dallas and husband of Mary enjoys drinking beer.

Any other split would be incorrect, including the prior two examples.

The repeated subject phrases in the examples above illustrate how the objective application of a deterministic transformation provides the consistency that the neural network needs in order to fully master the task.

While all three choices (and others) are linguistically correct, 100% accuracy comes from teaching the neural network one consistent objective. The current SOTA wrongly believes that neural networks will try to figure out the best alternative. BSD NLP is based on the correct understanding that neural networks do the opposite — they consistently look for the path of least resistance instead. Thus, BSD provides the path of least resistance to ensure the task is truly mastered.

This is the missing key over SOTA training.

  • There are not 64 correct alternatives for a given input, as is the case for neural networks trained on WebSplit.
  • There are no variations of purportedly correct outputs caused by various annotators choosing different ways to split the sentences (e.g., one annotator uses noun phrases, another uses complex noun phrases, another sometimes uses nested noun phrases and other times leaves the pronoun alone, etc.).
  • There is no starting with subjective human summaries (as in the case of DeSSE).
  • There is no starting with non-deterministic sentence graphs.

BSD NLP is the literal opposite of SOTA NLP models that are based on the faulty premise that neural networks can learn to choose the best alternatives. For 100% accuracy, neural networks need to be trained on only one definitive, deterministic transformation for each potential input type. The rest of neural network training can proceed as usual.

BSD — Literally The Only Way to Achieve 100% Accuracy

BSD is literally the only way to train neural networks to achieve 100% accuracy on language tasks. How can I make such a bold statement? Because a model’s hallucination rate is proportional to the degree that it deviates from BSD. The inverse is also true: the closer neural networks and models are to BSD, the greater their accuracy.

Consider LIMO (Less Is More For Reasoning) as a perfect case in point. While the researchers did not apply a deterministic transformation, they did apply a more-normalized transformation—thereby inadvertently moving the training closer to a BSD model. Because it is not deterministic, they did not achieve 100%. But the mere fact of normalization profoundly improved accuracy.

For example, the prior SOTA on the AIME reasoning benchmark was 6.5% (using 100,000 training samples). Meanwhile, LIMO achieved 57.1% (using only 817 samples). In other words, LIMO achieved a 778% gain while using roughly 1% of the training data.

The industry is beginning to move in the direction of BSD. And it will continue to do so, because the closer it gets the better the results.

In short, BSD is the only way to achieve 100% accuracy because any deviation from it introduces errors (i.e. hallucinations).

Dawn of 100% Accurate AI

Acurai has already confirmed the 100% accuracy of BSD three times over.

The results of the BSD Sentence Splitting test have already been discussed above.

Acurai also tested BSD Summarization. More specifically, we wanted to compare the results of BSD against the challenges that Apple was facing in creating headline summaries of BBC News articles.

For those unfamiliar, BBC News filed a formal complaint with Apple regarding hallucinations in its automated summary headlines.

For example, one headline read: “Brazilian tennis player, Rafael Nadal, comes out as gay.” This short headline includes three hallucinations:

  • The story was about Joao Lucas Reis da Silva (not Rafael Nadal).
  • Rafael is not Brazilian.
  • Rafael has not come out as gay.

As another example, one Apple-generated headline claimed that Luigi Mangione shot himself. No, Luigi Mangione did not shoot himself.

Yet another claimed that Netanyahu was arrested. No, Netanyahu was not arrested.

The summarization issue plagued other Apple services as well, including messaging summarization.

For example, Andrew Schmidt’s mother texted: “That hike almost killed me!” However, the summary notification that Schmidt first saw claimed that his mom had attempted suicide.

Summarization Method #1

The first method we tried was as follows:

  • Use BSD Sentence Simplification => BSD Coreference Resolution to create Formatted Facts (FFs) from the article.
  • Ask the LLM to choose the Formatted Fact that most represented the overall article.

While this approach led to 100% hallucination-free summarization, it suffered from suboptimal relevance. The LLM was not capable of choosing the most relevant FF.

Summarization Method #2

The second method we tried was as follows:

  • Standardize the article using a Spelling/Grammar Correction model.
  • Then use BSD Sentence Simplification => BSD Coreference Resolution to create Formatted Facts (FFs) from the article.
  • Then ask the LLM to create its own one-sentence summary.
  • Then use vector and index-based searching to locate the FF that was most similar to the sentence produced by the LLM.

This achieved 100% hallucination-free summarization that was also very relevant.
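
For illustration only, here is a skeleton of Method #2 (the model-dependent steps are hypothetical stubs, and the final vector/index search is approximated with a simple bag-of-words cosine similarity; none of this is Acurai’s production code):

```python
import math
from collections import Counter

def standardize(article):             # stub: spelling/grammar correction model
    return article

def make_formatted_facts(article):    # stub: BSD splitting => BSD coreference resolution
    return [s.strip() + "." for s in article.split(".") if s.strip()]

def llm_one_sentence_summary(article):  # stub: the LLM's own one-sentence summary
    return article.split(".")[0] + "."

def cosine(a, b):
    # Simple bag-of-words cosine similarity as a stand-in for vector/index search.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def headline(article):
    facts = make_formatted_facts(standardize(article))
    draft = llm_one_sentence_summary(article)
    # Return the Formatted Fact closest to the LLM's draft, so the final headline
    # is always a statement grounded in the article itself.
    return max(facts, key=lambda ff: cosine(ff, draft))
```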

100% Hallucination Elimination on RAGTruth for GPT-4 and GPT-3.5 Turbo

I have previously written about Acurai’s 100% hallucination elimination on RAGTruth for GPT-4 and GPT-3.5 Turbo.

This was accomplished using Formatted Facts. The present article discloses how to perform BSD Sentence Simplification. The next article will teach how to perform BSD Coreference Resolution. Thus, you will know how Acurai produces Formatted Facts, step by step, with full transparency.

Acurai’s Methods Fully Revealed

I currently serve as the Chief Technology Officer at Acurai Inc. Acurai is shorthand for Accurate AI. Our mission is to deliver 100% accurate AI across various NLP tasks and knowledge domains.

I have received permission to share Acurai’s proprietary methods. Perhaps these methods can inspire you to develop new ones. However, if you want to use Acurai’s methods (or a derivation of them), it’s important to contact Acurai for permission to do so.

Also, if you want to learn Acurai’s proprietary methods prior to the publication of future articles, I encourage you to go straight to Acurai’s patent application.