How RAG Chatbots Work

Retrieval augmented generation is the process of supplementing a user’s input to a large language model with additional information that we have retrieved from somewhere else. The LLM can then use that information to augment the response that it generates.
It starts with a user’s question.
The first thing that happens is the retrieval step. In it, we take the user’s question and search for the most relevant content from a knowledge base that might answer it.
It is by far the most important and most complex part of the RAG chain. Essentially, it’s about pulling out the best chunks of information from the knowledge base that relate to the user’s query.
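To make that a little more concrete, here is a minimal sketch of one common way to implement retrieval: a vector similarity search over pre-chunked, pre-embedded documents. The `embed` function, the chunk texts, and their embeddings are placeholders for whatever embedding model and knowledge base you actually use; this is an illustration, not a prescribed implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    # How similar two embedding vectors are, ignoring their magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(question, chunks, chunk_embeddings, embed, top_k=3):
    """Return the top_k knowledge-base chunks most relevant to the question.

    `chunks` is a list of text snippets, `chunk_embeddings` their pre-computed
    vectors, and `embed` is a placeholder for your embedding model.
    """
    query_vector = embed(question)
    scores = [cosine_similarity(query_vector, vec) for vec in chunk_embeddings]
    best = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [chunks[i] for i in best]
```

In practice a vector database usually handles the indexing and search for you, but the idea is the same: embed the question, compare it against the embedded chunks, and keep only the few that score highest.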
We cannot simply send the entire knowledge base to the LLM, for a few reasons:
- Models have built-in limits on how much text they can consume at a time (though these limits are quickly increasing).
- Cost: sending huge amounts of text gets quite expensive.
- Quality: there is evidence suggesting that sending small amounts of relevant information results in better answers.
Once we’ve gotten the relevant information out of our knowledge base, we send it, along with the user’s question, to the LLM, which then “reads” the provided information and answers the question. This is the augmented generation step.
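Here is a minimal sketch of that augmented generation step using the OpenAI chat API. The model name and the exact prompt wording are just illustrative assumptions; the point is that the retrieved chunks get stuffed into the prompt alongside the question.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question, retrieved_chunks):
    # Join the retrieved chunks into a single block of context for the prompt.
    context = "\n\n".join(retrieved_chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Answer the user's question using only the provided information."},
            {"role": "user",
             "content": f"Information:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```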
Let’s start at the prompt level because, as I see it, what we are augmenting is the prompt, using the output of the retrieval step.
So, let’s assume we already have the relevant information pulled from our knowledge base that we think answers the question. How do we use that to generate an answer?
We give the LLM custom instructions with the system prompt.
The first component is the system prompt. The system prompt gives the language model its guidance. For ChatGPT, the system prompt is something like “You are a helpful assistant.”
We use a system prompt because we want the LLM to do something more specific. And, since it’s a language model, we can just tell it what we want it to do. Here’s a short example system prompt that gives the LLM more detailed instructions:
You are a Knowledge…