Have you been struggling to implement enterprise-ready AI? There’s a secret you need to know in order to get there.
If you want to implement enterprise-ready AI, you need to understand why today’s models are more powerful than ever yet hallucinate far more often.
As The New York Times reported last week, OpenAI and Google “don’t know why” their latest models perform remarkably better on math and science benchmarks while simultaneously hallucinating far more often.
This article explains exactly why this is happening. Most importantly, it also explains exactly how to fix it.
Essential AI Vocabulary
ChatGPT captured the world’s attention in November 2022—two and a half years ago. Its success introduced a new vocabulary: generative AI, hallucinations, Retrieval Augmented Generation (RAG), etc.
However, the technology has been growing so fast that we still need new words to conceptualize it. For example, with the current vocabulary it’s hard to understand how “better” AI can have more “hallucinations.” But once we introduce some new vocabulary, the mystery of why “better” AI hallucinates more becomes self-evident.
The tremendous issue that’s perplexing OpenAI, Google, and others becomes apparent the moment you understand three categories of generative AI:
Recombinant AI
Recall AI
Reasoning AI
The moment you understand the three types of generative AI, you will be ready to implement 100% accurate, enterprise-ready chatbots. I promise.
Executive Summary
In short, there are three types of generative AI. Each has its own purpose, and each needs to be trained differently.
Recombinant AI: Involves generatively mixing learnings into new patterns. This ranges from ideation and essay writing to coding and even recombinant summaries.
Recall AI: This involves providing accurate answers that are extracted from one or more learned sources (i.e. answering questions through recall).
Reasoning AI: This involves making accurate deductions.
ChatGPT launched as a recombinant AI service. It wasn’t intended to be used for factual recall. In fact, the way that it was trained literally makes 100% accurate recall impossible. (More on this below.)
Enterprises are looking for a different type of generative AI—Recall AI. However, OpenAI and others leapfrogged over training Recall AI in order to pursue Reasoning AI instead.
Training Recall AI requires a precise set of criteria. The farther an LLM’s training diverges from these criteria, the more hallucinations the LLM will produce. And that’s precisely what has been happening. The newer models diverge further from these training criteria; therefore, they hallucinate much more often.
This fact bears repeating:
Training Recall AI requires a precise set of criteria. The farther an LLM’s training diverges from these criteria, the more hallucinations the LLM will produce. And that’s precisely what has been happening. The newer models diverge further from these training criteria; therefore, they hallucinate much more often.
Fortunately, once the criteria are adhered to, 100% accurate Recall AI results—providing enterprises the AI they have been looking for all along.
Ilya Sutskever thought people would find ChatGPT “boring” because it was designed to be recombinant (not useful for accurate recall). As Sutskever stated:
When you asked it a factual question, it gave you a wrong answer. I thought it was going to be so unimpressive that people would say, ‘Why are you doing this? This is so boring!’ — Ilya Sutskever regarding the launch of ChatGPT
If you have been struggling to implement enterprise AI, it’s important to know the following: ChatGPT was not originally intended to provide factual answers to questions. It was intended to offer recombinant processing of source materials—not accurate recall of them.
Nevertheless, ChatGPT revolutionized AI. After all, there is tremendous value in Recombinant AI for ideation, coding, and more. Recombinant AI is also spectacular for image and video generation. All of these areas have been truly transformed by Recombinant AI.
However, Recombinant AI must be trained stochastically in order to fulfill its purpose. In other words, it must be trained to be probabilistic (the antithesis of deterministic).
Hallucinations Are Inherent In Recombinant AI
Recall AI must be deterministic. Every question must be answered directly from the provided sources without deviation.
The stochastic training used in recombinant AI produces degrees of deviation. Such deviation inherently results in hallucinations. In fact, you can think of hallucinations as deviation errors.
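To make the stochastic-versus-deterministic distinction concrete, here is a minimal, illustrative sketch (my own toy example, not any vendor’s actual decoder) contrasting deterministic greedy decoding with stochastic temperature sampling over a hypothetical next-token distribution:

```python
# Toy illustration of deterministic (greedy) vs. stochastic (sampled) decoding.
# The vocabulary and logits are invented purely for illustration.
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    scaled = np.array(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

# Hypothetical next-token scores for: "The capital of France is ___"
vocab = ["Paris", "Lyon", "London", "baguettes"]
logits = [9.0, 5.5, 4.0, 3.0]

# Deterministic decoding: always pick the single most probable token.
greedy_choice = vocab[int(np.argmax(logits))]  # always "Paris"

# Stochastic decoding: sample from the distribution, so the output can deviate.
rng = np.random.default_rng()
sampled_choice = rng.choice(vocab, p=softmax(logits, temperature=1.2))

print("greedy :", greedy_choice)
print("sampled:", sampled_choice)  # usually "Paris", occasionally not
```

The sampled path is what gives Recombinant AI its creative variety; it is also exactly the kind of deviation that surfaces as a hallucination when the task is factual recall.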
Square Peg, Meet Round Hole
OpenAI and other LLM makers quickly realized the mass interest in using chatbots for Question Answering (Q&A). Therefore, they sought to add this capability to their models. The problem is that these LLM makers tried to force stochastic Recombinant AI to perform the deterministic recall task.
Where deviation errors do not occur, the results of the models are stunning. However, when deviation errors occur, the models’ results can be utterly ridiculous. That’s why ChatGPT can be both stunning and ridiculous at the same time.
At first, LLM makers kept trying to force recombinant architectures to produce accurate recall. But they hit an inevitable ceiling, as demonstrated by the disappointing release of GPT-4.5.
In view of GPT-4.5’s dismal performance, OpenAI officially declared that GPT-4.5 is the last of its recombinant models. OpenAI is now exclusively pursuing reasoning models instead.
o1: Dawn of Reasoning LLMs
OpenAI has officially shifted its focus to reasoning models, which are designed for deduction. Training Reasoning AI involves techniques like the following (a minimal prompting sketch appears after the list):
Chain-of-thought (CoT) prompting or training.
Intermediate step supervision (e.g., supervising intermediate thoughts, not just final answers).
Private chain of thought (as in o3): the model reasons internally before generating an answer.
Enhanced tool use, planning modules, or scratchpads for intermediate computation.
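For readers unfamiliar with the first technique, here is a minimal sketch of chain-of-thought prompting. The `complete()` helper is a hypothetical stand-in for whatever LLM client you actually use; the point is the prompt structure, not the API.

```python
# Minimal chain-of-thought prompting sketch. `complete()` is a hypothetical
# placeholder for your actual LLM client call.

def complete(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion endpoint."""
    raise NotImplementedError("plug in your LLM client here")

question = "A train departs at 2:15 PM and arrives at 5:40 PM. How long is the trip?"

# Direct prompting: ask only for the answer.
direct_prompt = f"Question: {question}\nAnswer:"

# Chain-of-thought prompting: ask for intermediate reasoning steps first.
cot_prompt = (
    f"Question: {question}\n"
    "Reason through the problem step by step, then give the final answer "
    "on its own line, prefixed with 'Answer:'."
)

# answer = complete(cot_prompt)
```

Notice that this style of training and prompting optimizes for deduction over a problem, not faithful recall of a source—which is exactly the divergence discussed next.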
While such techniques do indeed improve deduction, they cause training to diverge further from the criteria needed to train Recall AI. This increased divergence causes increased hallucinations. This is the answer to the tremendous issue that is currently perplexing OpenAI, Google, and other LLM makers.
This bears repeating:
Such techniques do indeed improve deduction. However, they cause training to diverge further from the criteria needed to train Recall AI. This increased divergence causes increased hallucinations.
Yes, recall hallucinations are indeed “worse than ever.” However, now that we know the cause, we also know the solution.
BSD: Dawn of Recall LLMs
I work at a company called Acurai Inc. Acurai has taken the road less travelled. At Acurai, we focus on the boring side of AI—100% accurate recall.
I have received permission from Acurai’s CEO (Adam Forbes) to publish every detail of our company’s proprietary Bounded-Scope Deterministic (BSD) Models—the first models in the category of Recall AI.
In short, BSD introduces deterministic training to natural language models, thereby producing 100% consistent results. Everything is disclosed in the series linked above.
Why RAG Fails
Retrieval Augmented Generation (RAG) is the most popular approach to addressing hallucinations. However, it routinely fails to eliminate hallucinations.
On the surface, the intuition for RAG seems sound: If I send the facts to the LLM then it cannot hallucinate when providing the answer.
So why does the LLM still hallucinate? The answer is that you are sending the facts to a recombinant processor that inherently deviates from the provided facts. Such deviation is inherent in the stochastic training.
This is why LLMs fail to produce accurate answers even when you provide the answers using RAG.
Build Enterprise-Ready AI… Today
With BSD, enterprise-ready AI is finally available. Recall AI is what enterprises have been looking for all along.
If you want to know every step in building 100% accurate Recall AI, I encourage you to read the entire series. Enterprise-ready AI is already here. You just have to know where to look. 🙂
In RAG, the goal is to locate the stored information that has the highest percentage of sameness to the provided query. Vector similarity search does not do this. That’s why RAG fails.
Wrong Tool for the Job
RAG fails in production because vector embeddings are the wrong choice for determining percentage of sameness. This is easily demonstrated. Consider the following three words:
King
Queen
Ruler
King and ruler can refer to the same person (and are thus considered synonyms). But king and queen are distinctly different people. From the perspective of percentage of sameness, king/ruler should have a high score and king/queen should be literally zero.
In other words, if the query is asking something about a “king” then chunks discussing a “queen” would be irrelevant; but chunks discussing a “ruler” might be relevant. Yet, vector embeddings consider “queen” to be more relevant to a search on “king” than “ruler.” Here are the vector similarity scores for queen and ruler when compared to king using OpenAI’s ADA-002 embeddings:
King
Queen: 92%
Ruler: 83%
When asking for information regarding a king, passages with information regarding a queen will take precedence over passages regarding a ruler, even though the queen passages cannot be relevant at all.
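If you want to reproduce scores like these yourself, here is a minimal sketch using OpenAI’s embeddings API. It assumes the official `openai` Python package and an API key in the OPENAI_API_KEY environment variable; exact percentages will vary slightly by model and version.

```python
# Minimal sketch for computing pairwise embedding similarity scores like the
# ones quoted above. Exact numbers vary by embedding model and version.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

king, queen, ruler = embed("King"), embed("Queen"), embed("Ruler")
print(f"King vs Queen: {cosine(king, queen):.0%}")
print(f"King vs Ruler: {cosine(king, ruler):.0%}")
```

The same few lines reproduce the cat/dog, 1900s, New York, and bake-a-cake comparisons below; only the input strings change.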
Vector Embeddings Are Wrong for Who, What, When, Where, and How Questions
The vector embedding problem does not occur only with words referring to people (such as king); it also occurs with words referring to things.
Consider a query asking about the traits of a cat. Passages discussing dogs should have a percentage-of-sameness score of zero, and passages dealing with felines should have an extremely high score. Yet, once again, vector embeddings get this wrong:
Cat
Dog: 86%
Feline: 85%
The scores are only one percentage point apart, but that still means passages regarding dogs take precedence over passages regarding felines, even though the dog passages have zero relevance and the feline passages are extremely relevant.
The vector embedding issue isn’t confined to people and things; it also affects searches regarding time.
Consider a question regarding the 1900s. From a percentage-of-sameness standpoint, passages regarding the 1700s should score zero percent, and passages regarding the 20th century should score literally 100% (as “1900s” and “20th century” are interchangeable). Yet, once again, vector embeddings misrepresent the degree of sameness:
1900s
1700s: 91%
20th century: 89%
Notice that the 1700s (literally 0% relevant) are scored as considerably more similar than the 20th century (literally the exact same thing as the 1900s).
Words that mean the exact same thing are called absolute synonyms or perfect synonyms. Yet even for absolute synonyms, vector embeddings give priority to things that are not synonyms at all, as the following example further demonstrates.
“The Big Apple” is a direct reference to New York City. Now consider Susan, a New Jersey resident who wrote a slew of blog posts regarding the restaurants, museums, and other places she visits in her home state. However, one of Susan’s posts states that she got married in “The Big Apple.” A visitor to Susan’s website asks the chatbot: “Has Susan ever been to New York?”
Unfortunately, the numerous entries regarding New Jersey would take precedence over Susan’s marriage posting. Why? From a vector embedding perspective, “New Jersey” is more semantically similar to “New York” than “The Big Apple” is:
New York
New Jersey: 90%
The Big Apple: 89%
Depending on the number of postings involving “New Jersey,” the reference to “The Big Apple” might not be included even if the chatbot requests hundreds of potential candidates. Thus, vector embeddings can fail for locations (e.g. New York), just as they can for people (e.g. kings), things (e.g. cats), and time (e.g. the 1900s).
In fact, vector embeddings can fail for instructions as well.
bake a cake
bake a pie: 93%
make a chocolate cake: 92%
Consider a query asking how to “bake a cake.” Passages that discuss “bake a pie” (93% score) will take precedence over passages stating “make a chocolate cake” (92% score), even though the former is completely irrelevant and the latter is directly relevant.
The above examples show that vector similarity is not a reliable measurement of percentage of sameness. In fact, it is not reliable for people (king), things (cat), times (1900s), locations (New York), or even instructions (bake a cake). In other words, vector embeddings do not reliably measure percentage of sameness for questions regarding who (people), what (things), when (times), where (location), and “how to” (instructions). Said another way, vector embeddings are fundamentally flawed for virtually every type of question that a person can ask.
Query Context Will Not Save You
Critics of an earlier version of this article unanimously shout “context matters.” They argue that the similarity of individual words doesn’t matter because the context of the query somehow resolves everything.
First, these critics completely ignored all the studies detailed below. The studies on OP-RAG, KG-RAG, RankRAG, LongRAG, etc. document that the query context does not magically resolve the math.
Second, these critics need to apply the same math above to multiple words, an analysis I have personally conducted. If they did, they would see that the math gets worse as more words are added, not better, especially when a keyword in the query is paired with the wrong semantically similar word.
As one example, ChatGPT-4 used to give the wrong mother for Afonso II. Instead, it gave the mother of Alfonso VII (an entirely different person). The reason is that Afonso and Alfonso are semantically similar (even though they are 0% the same). More importantly, ChatGPT-4 gave the wrong answer because of query context. Consider the following query: “Who was the mother of Afonso II, the third king of Portugal?”
In the training data, the word “mother” is found close to the word Alfonso.
There was no word “mother” close to the word Afonso in the training data.
Therefore, the context of “mother” caused ChatGPT to overlook the fact that Afonso II and Alfonso VII are entirely different people. The query context made the matter worse, not better. For a more detailed explanation of the Afonso Debacle, see the link to my tutorial at the end of this article.
OpenAI has since fine-tuned the Afonso answer, just as it does with other public hallucinations, which only makes ChatGPT even worse.
The same goes for vector embeddings by themselves. If the same training data were used to provide chunks for RAG, a RAG-based chatbot would give the same result: “mother” + “Alfonso” has greater vector similarity to the query than “Afonso” alone.
mother of Afonso
mother of Alfonso: 93%
Afonso: 90%
Thus, the query context only made things worse, not better.
What RAG Traditionalists are not Telling You
Perhaps you suspect the above examples are cherry-picked, or that the percentage scores don’t actually matter. So let’s take a look at what RAG enthusiasts aren’t telling you by comparing the gaslighting presentation of RAG with how RAG actually works.
Gaslighting Presentation of RAG: Store the vector embeddings of millions of chunks in a vector database. Get the vector embedding of the user’s query. Using cosine similarity, find the top three matching chunks and send them to the LLM with the query. This is a “fast, accurate, and scalable” solution (quote from a leading AI author whose company has taught over 400,000 people — see below).
How State-of-the-Art RAG Actually Works: Load vectors for thousands of documents into a vector database. Retrieve almost 50,000 characters of chunks to send to the LLM along with the query, resulting in an unreliable chatbot (e.g. an F1 score lower than 50).
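For reference, here is the “textbook” pipeline in minimal form, so it is clear exactly what is being critiqued. It is a sketch under simplifying assumptions (pre-computed chunk vectors, brute-force cosine similarity), not a recommendation:

```python
# The "textbook" naive-RAG retrieval step, sketched so the critique is concrete.
# Chunk vectors are assumed to be pre-computed with whatever embedding model
# you use; a real system would use a vector database instead of brute force.
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: list, chunk_texts: list, k: int = 3):
    """Rank chunks by cosine similarity to the query and return the top k texts."""
    def cosine(v):
        return float(query_vec @ v / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
    ranked = sorted(range(len(chunk_texts)), key=lambda i: cosine(chunk_vecs[i]), reverse=True)
    return [chunk_texts[i] for i in ranked[:k]]

# The retrieved chunks are then pasted into the prompt along with the query:
# prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The argument of this section is that the top-k chunks returned by this step frequently are not the relevant ones, and everything downstream inherits that failure.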
Consider the release of OP-RAG on September 3, 2024.
OP-RAG is the work of three Nvidia researchers, so the study comes from a reputable source.
Also, the results in the chart above are for the EN.QA dataset. Here are the first two questions in that dataset:
when is the last episode of season 8 of the walking dead
in greek mythology who was the goddess of spring growth
Thus, the answers are short. They do not require lengthy exposition. Moreover, the dataset consists of only 3.8% of the larger Wikipedia corpus.
Yet, even with all the resources of Nvidia, a relatively modest dataset, and relatively short answers, the researchers broke the prior state of the art with a new RAG method that achieved a 47.25 F1 score by sending 48K characters of chunks along with the query (sending fewer characters results in an even lower F1 score).
Did these Nvidia researchers fail to get the memo that they should have been able to store more than 25 times the number of vectors and consistently find the relevant answer in the top three matches? Of course not; that’s simply not how RAG works in the real world. Also see Nvidia’s LongRAG, released on November 1, 2024, as another perfect case in point.
Advanced RAG Won’t Save You
I’m writing this article because of the many forum posts I see where data scientists and programmers believe they are doing something wrong. Usually, some well-intentioned person will throw out a myriad of things to try: reranking, query rewriting, BM25, Knowledge Graphs, and so on, throwing everything against the wall and hoping that something sticks.
Reranking
Reranking is perhaps the most recommended Advanced RAG strategy. However, as the RankRAG study shows, even using a fine-tuned model for reranking only yields a 54.2 score on EN.QA. General-purpose reranking models scored even worse.
GraphRAG and Knowledge Graphs
A recent study on KG-RAG (RAG enhanced with Knowledge Graphs) showed an F1 score of 25% and an accuracy of 32% on the CWQ dataset. Interestingly, Knowledge Graph RAG had a lower accuracy than regular embedding RAG (which had 46% accuracy).
As for Microsoft’s GraphRAG, Microsoft itself admits that it only achieves a level equal to naive RAG! As stated by Microsoft: “Results show that GraphRAG achieves a similar level of faithfulness to baseline RAG.” “As baseline RAG in this comparison we use LangChain’s Q&A” (aka naive RAG). See “GraphRAG: Unlocking LLM discovery on narrative private data”.
Keyword Hybrid Search
Even adding BM25 keyword search and/or HyDE and/or summarization still results in an average score of less than 0.50 across benchmarks.
The combination of various Advanced RAG search methods resulted in a top average score of 0.446. However, even this level of “accuracy” is impractical in real-world chatbots. In the study, the mere combination of BM25 + HyDE took 11.71 seconds per query.
Real-World vs Hype
There simply is no study showing that vector embeddings, combined with dozens of Advanced RAG techniques, results in a reliable chatbot in production environments containing numerous documents. Moreover, the added latency of many Advanced RAG techniques makes them impractical for real-world chatbots—irrespective of the accuracy issue.
In order to get over 80% correctness, RAG needed to send 64K characters of chunks to OpenAI’s o1. None of the other models reached 80%, including GPT-4o, GPT-4 Turbo, and Claude-3.5 Sonnet. Yet, there are numerous problems with the o1 results.
First, the hallucination rate is still too high.
Second, o1 is extremely slow even when processing short contexts. Processing 64K of context is unbearably slow.
Third, o1 is expensive to run.
To top it all off, word on the street is that the latest batch of upcoming models fails to deliver any significant improvement over already released models—with Anthropic even indefinitely delaying the release of any new model.
But even if larger models could overcome the problem, they would be slower and more expensive. In other words, they’d be too slow and too expensive for any practical purpose. Would companies pay more for a chatbot than for a person, when the chatbot would require up to a minute for each unreliable answer?
That’s the actual state of RAG. That’s the actual outcome of relying on vector embeddings.
It’s Not You. It’s Them.
The problem is that what is being taught to hundreds of thousands of people is patently untrue. The following is from a book updated in October 2024, written by cofounders of a company that has taught over 400,000 people:
RAG is best suited for scenarios where you need to process large datasets that cannot fit within a single LLM context window and when fast response times and low latency are necessary. …
Nowadays, a RAG system has a standard architecture already implemented in popular frameworks, so developers don’t have to reinvent the wheel. …
Once the data is converted into embeddings, vector databases can quickly find similar items because similar items are represented by vectors close to each other in the vector space, which we refer to as a vector store (storing vectors). Semantic search, which searches within vector stores, understands the meaning of a query by comparing its embedding with the embeddings of the stored data. This ensures that the search results are relevant and match the intended meaning, regardless of the specified words used in the query or the type of data being searched.
As the math shows, vector embeddings do not find items based on percentage of sameness. They do not understand the meaning of a query. They most certainly do not “ensure” that search results are “relevant” even with the simplest queries, let alone “regardless of the specified words used in the query or the type of data being searched.”
As the research paper on OP-RAG shows, even with 400 chunks retrieved via vector searching, the LLM can fail to find relevant information more than 50% of the time on the most simple of benchmarks. Nevertheless, data scientists are taught in textbooks: “In a real-world project, one might upload a whole website or course to Deep Lake [vector database] to search across thousands or millions of documents. … To generate a response, we retrieve the top-k (e.g. top-3) chunks most similar to the user’s question, format the prompt, and send it to the model at 0 temperature.”
Textbooks currently teach students that vector embeddings are so powerful that they can store “millions of documents” and then find the relevant answer to queries in the “top-3” chunks. Again, the math and cited research studies show this to be patently untrue.
The Road to 100% Accurate Responses
The answer to the problem is to stop relying on vector embeddings.
Does this mean that vector embeddings are useless? No! Not at all! They have a very important use in Natural Language Processing (NLP).
For example, vector embeddings are a powerful tool to use with words that have multiple meanings. Consider the word ‘glasses’ as an example. This word can refer to drinking glasses and eyewear glasses (among other things).
Now consider the following query: What type of glasses does Julia Roberts wear? Vector embeddings will help ensure that chunks regarding eyeglasses will be above chunks that refer to drinking glasses. That’s where their semantic power lies.
The launch of ChatGPT brought about a rather unfortunate shift in the data science community. Important NLP tools such as synonyms, hyponyms, hypernyms, holonyms, and more were set aside in favor of chatbot queries.
There is no doubt that LLMs obviated some parts of NLP. But we are currently in the stage where the data science community has thrown out the proverbial baby with the bathwater.
LLMs and vector embeddings are the missing piece of the NLP puzzle. They are not the entire picture in and of themselves.
For example, companies have long noticed that visitors leave their sites when chatbots don’t provide the product listings they are looking for. Therefore, companies tried replacing their keyword-based search with synonym-based search.
The synonym-based search did find products that the keyword-based search could not. But it came at a price. Words that have multiple meanings often caused irrelevant information to drown out that which the visitor wanted. For example, a visitor looking for drinking glasses might get a lot of listings presenting eyewear glasses instead.
Yet, rather than throw the whole thing out, this is where vector embeddings can come to the rescue: not as the primary retrieval mechanism, but as a refinement. Rely on synonym-based search first, and then use vector embeddings to bring the most relevant listings to the top (a minimal sketch of this ordering follows).
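Here is a minimal sketch of that ordering, using the “glasses” example from above. The synonym table, the listing format, and the `embed()` helper referenced in the usage comment are illustrative placeholders, not Acurai’s production method; the point is that embeddings only rerank candidates that a synonym/keyword pass has already retrieved.

```python
# Sketch of "synonym-based retrieval first, vector embeddings as a reranker."
# The synonym table and listing format are illustrative placeholders only.
import numpy as np

SYNONYMS = {"glasses": {"glasses", "eyeglasses", "spectacles", "eyewear"}}

def synonym_retrieve(query_terms, listings):
    """Return listings sharing a query term or one of its synonyms."""
    expanded = set()
    for term in query_terms:
        expanded |= SYNONYMS.get(term, {term})
    return [l for l in listings if expanded & set(l["text"].lower().split())]

def rerank(query_vec, candidates):
    """Reorder the synonym hits by embedding similarity to the full query."""
    def cosine(v):
        return float(query_vec @ v / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
    return sorted(candidates, key=lambda l: cosine(l["vec"]), reverse=True)

# Usage (embed() as in the earlier sketch; each listing is {"text": ..., "vec": ...}):
# candidates = synonym_retrieve(["glasses"], listings)
# results = rerank(embed("What type of glasses does Julia Roberts wear?"), candidates)
```

Embeddings then do the job they are genuinely good at (word-sense disambiguation: eyewear versus drinkware), while retrieval itself stays keyword-grounded.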
Once you have the relevant listings, methods such as those disclosed in the Acurai research paper can be used to produce 100% accurate, hallucination-free responses. These methods will soon be included in my series on Eliminating Hallucinations.
I’ll also be adding a section on RAG — including novel search methods that, when combined, rapidly pinpoint relevant sentences and sections from within millions of documents. The retrieved information can then be converted to Fully-Formatted Facts for 100% accurate, hallucination-free responses.
I’ll hopefully have time to add to the series over the holidays. However, I’ve been wanting to write this present article for a long time, due to the number of people who feel they are somehow failing to implement what they’ve been taught. For now my message is short: It’s not you. It’s them.