## TL;DR Retrieval Augmented Generation (RAG) is a way to make LLMs like GPT-4 more accurate and personalized to your specific data. - LLMs are powerful as hell, but they’re also __generic__: they’re trained on all data on the internet ever! - RAG helps you get more personalized responses tailored to your data by __embedding your data in your model prompts__ - RAG relies on the model’s __context window__, which is how much data in can take in a prompt - Today’s RAG pipelines are pretty complex and rely on __embedding models__ and __vector__ databases Alongside old school fine tuning, RAG is becoming the standard way to get better, more personalized results out of state of the art LLMs. ## Back to the future: training models The funny thing about RAG is that the basic concept has been around for as long as machine learning has. Long time readers will recall that back in the day, I studied Data Science in undergrad. “Old School” machine learning, before everyone was calling it AI, was __entirely predicated__ on training a new model for every problem. ### How old school ML worked: custom models Imagine you’re a Data Scientist tasked with understanding and predicting customer churn (your employer has a big churn problem). Your goal is to be able to predict a brand new customer’s chances of churning, so your marketing team can give them discounts and winback offers before they leave. Here’s what you might do: 1. __You gather a curated data set__ You spend time gathering all of the [historical data](/posts/how-do-product-analytics-work) your company has about churn: who churned, when, what characteristics they had when they did, and anything they did beforehand. Each customer gets a label: churned, or didn’t churn. 2. __You train a model on the data set__ Using either simple linear regression or something as complicated as deep learning with neural networks, your model goes through the data and tries to find patterns. It eventually learns (or tries to learn) which characteristics tend to lead to a customer churning, and which don’t. 3. __You test the model on new data__ To make sure the model isn’t just spitting your data back at you, you test it on new data and see how it performs. The model needs to __generalize__, meaning perform well on new data that doesn’t look _exactly_ like the data you trained it on. Models that are trained _too well_ on the training data and don’t generalize well are called “overfit.” ![machine learning nutshell](/images/posts/what-is-rag/machine-learning-nutshell.jpeg) The important theme here is that each model – whether you trained it from scratch, or took an existing model off the shelf – needed to be customized to _your_ data set. This was how everyone thought about machine learning when I was doing it professionally. Everyone has different problems, so everyone needs different models. ### Generative AI: not customized by default The Generative AI that we use today, like [ChatGPT or Claude](https://read.technically.dev/p/technically-dispatch-chatgpt-and), __isn’t like this at all__. Instead, it’s trained on one, colossally large data set – all of the internet – that isn’t curated and doesn’t belong to your business at all. You prompt the model to focus it on a specific problem, say, [generating an outreach email](/posts/comparing-available-llms), and it outputs something that represents the data it was trained on. The broad strokes, non-customized nature of these GenAI models is fine for some use cases. But for many, especially business use cases, you _need_ models to be aware of __your data__ and give you those kinds of tailored responses. You can’t build a customer support chatbot that doesn’t know anything about your product . So how _do_ you customize GenAI models to work with your data? ## The basic idea of RAG: data in context windows The most straightforward way to customize a model like GPT-4 would be to __retrain__ it on your unique data, updating the model itself along the way. You can do this – it’s called [fine tuning](https://openai.com/index/introducing-improvements-to-the-fine-tuning-api-and-expanding-our-custom-models-program/) (future post forthcoming) – but it’s expensive and requires a lot of infrastructure to do. What if there was a way to keep the models _themselves_ the same, but somehow get them to _output_ more customized responses aware of your data? In late 2020, researchers from [FAIR](https://ai.meta.com/) – Meta’s AI research organization – [penned a paper](https://arxiv.org/pdf/2005.11401) introducing the concept of Retrieval Augmented General, or RAG. The idea is that instead of retraining a model to better know your data, what if you just __included that data in the prompt?__ In a sense, it’s like including all the information the LLM would need to answer your question…in the question itself (🤯). To understand how this works, we need to talk about __context windows__. When you prompt [an LLM](/posts/deep-dive-how-do-large-language-models-work), it has a limited amount of data it can accept in that prompt, usually measured in “tokens.” You can think of a token as a single word in your prompt, although this is a bit of an oversimplification (every model tokenizes text differently). A context window is simply the maximum amount of tokens you can include in a prompt. Here are some examples of maximum context windows for popular LLMs at the time of writing: - __Llama3__: 8,000 - __Mistral Large 2__: 256,000 - ChatGPT: 32,000 - __Gemini__: 128,000 – 1M Early versions of LLMs had tiny context windows, so including data other than your pithy prompt would have been impossible. But today, you can throw a bunch of data in there and have plenty of room left. Context windows have been getting larger and larger with better research, and it’s likely they’ll eventually cease to be a constraint at all. So what does it look like to build RAG practically? ## A real world RAG workflow Though you could theoretically just paste some relevant data into a ChatGPT prompt, this isn’t how developers do RAG today. You’re a diligent Data Scientist, so you spent plenty of time gathering data from around the dusty corners of your company, from user data to customer support transcripts to documentation. You can’t just throw it all into every prompt. How do you prepare that data for the model prompt? Where do you store that data? And how do you decide which parts of that data to include for each different prompt? Enter the __vector database__, a topic that will get its own post one day. It’s a specialized [data store](/posts/the-beginners-guide-to-databases) for data that you intend to use for machine learning: you store your data in a simplified format called [an embedding](https://read.technically.dev/i/103876466/neural-networks-and-language-models), that ML models tend to like. These kinds of databases are optimized for efficiently storing data in these vector formats, and very quickly retrieving the data when a model needs it. [Pinecone](https://www.pinecone.io/), [Chroma](https://www.trychroma.com/), and [Weaviate](https://weaviate.io/) are some of the more popular vector DBs. Once you’ve got your data stored in a vector database, you can run a fast, efficient RAG pipeline: ![gmail filter steps](/images/posts/what-is-rag/gmail-filter-steps.jpeg) The steps go like this: 1. You prompt the model. 2. The pipeline takes your prompt, and uses it to search through your data to find relevant pieces of information to include. 3. Your prompt, plus the data, get sent to the model. 4. Profit?? The vector database plays two important roles here: not just as a place to store your data in the correct format, but also as a layer to __query__ that data and search it intelligently. Since vectors are blobs of text, searching isn’t that simple: there are special algorithms [baked into DBs like Pinecone](https://docs.pinecone.io/guides/data/query-data) that take a search and return their best guess at what data you might want. You, of course, don’t need to build this yourself from scratch. Frameworks like [LangChain](https://www.langchain.com/) and [LlamaIndex](https://www.llamaindex.ai/) give developers the building blocks to assemble these kinds of pipelines without having to write tons of code. ## Recap - Old school ML was all about __custom training__ on your unique data, but LLMs are generically trained on the whole of the internet - To get great responses from LLMs, you need to __personalize them__ to your specific data - __Retrieval Augmented Generation__ helps you do that by attaching your data into your model’s context window - A __context window__ is the amount of data that a model lets you include in a prompt - To build a RAG pipeline, you will want to store your data in a model friendly format in a __vector database__