Learn how to enhance the capabilities of large language models with Retrieval Augmented Generation (RAG) systems. Explore the limitations of traditional fine-tuning methods, the role of knowledge bases, and the implementation of RAG for more nuanced responses. Dive into a practical example of improving YouTube comment responses and discover the potential of RAG for leveraging specialized knowledge.
Table of Contents:
- Introduction to Retrieval Augmented Generation (RAG)
- Understanding the Limitations of Large Language Models
- The Role of Knowledge Bases in RAG Systems
- Leveraging Text Embeddings for Semantic Search
- Building a Knowledge Base: From Documents to Vector Databases
- Implementing the Retriever: Retrieving Relevant Context for RAG
- Constructing Prompts with Context: Enhancing Model Responses
- Fine-tuning vs. RAG: Comparing Approaches for Specialized Knowledge
- Case Study: Enhancing YouTube Comment Responses with RAG
- Looking Ahead: Exploring Advanced Applications of RAG Systems
Introduction to Retrieval Augmented Generation (RAG)
In this article, I'll discuss how we can improve LLM-based systems using retrieval augmented generation, or RAG for short. I'll start with a high-level overview of RAG before diving into a concrete example with code. And if you're new here, welcome! I make articles about data science and entrepreneurship, and if you enjoy this article, please consider subscribing; that's a great no-cost way you can support me in all the content that I make. A fundamental feature of large language models is the ability to compress world knowledge.
The way this works is that you take a huge slice of the world's knowledge, more books and documents than anyone could ever read in a lifetime, and use it to train a large language model. In this training process, all the knowledge, concepts, theories, and events represented in the text of the training data get stored in the model's weights. So essentially,
Understanding the Limitations of Large Language Models
we've compressed all that information into a single language model, which has led to some of the biggest AI innovations the world has ever seen. There are, however, two key limitations to compressing knowledge in this way. The first limitation is that the knowledge compressed into a large language model is static, which means it doesn't get updated as new events happen and new information becomes available. Anyone who has used ChatGPT and asked it a question about current events has probably seen a message like this: "As of my last update in January 2022, I don't have access to real-time information,
so I can't provide specific events from February 2024." The second limitation is that these large language models are trained on a massive corpus of text. As a result, they're really good at general knowledge, but they tend to fall short when it comes to more niche and specialized information, mainly because that specialized information wasn't very prominent in their training data. When I asked how old I was, for example, it said there was no widely available information about my age, and that I might be a private individual or not widely known in public domains. One way we can mitigate both of these limitations is using retrieval augmented generation, or RAG for short. Let's start with the basic question:
what is RAG? This is where we augment an existing large language model using a specialized and mutable knowledge base. In other words, we can have a knowledge base that contains domain-specific information and is updatable, so we can add and remove information as needed. The typical way we use a large language model is to pass it a prompt and have it spit out a response; this basic usage relies on the internal knowledge of the model to generate the response. If we want to add RAG into the mix, it looks a bit different: instead of starting with a prompt, we start with a user query, which gets passed into a RAG module. What the RAG module does is connect to a specialized knowledge base and grab pieces of information
The Role of Knowledge Bases in RAG Systems
which are relevant to the user's query, then create a prompt that we can pass into the large language model. Notice that we're not fundamentally changing how we use the large language model; it's still prompt in, response out. The only thing we're doing is augmenting the workflow with this RAG module: instead of passing a user query or prompt directly to the model, we add a pre-processing step to ensure that the proper context and information is included in the prompt. One question
you might wonder is why we have to build out this RAG module at all. Can't we just fine-tune the large language model on the specialized knowledge so we can use it in the standard way? The answer is yes, you can definitely fine-tune a large language model with specialized knowledge to "teach" it that information, so to speak. However, empirically, fine-tuning a model seems to be a less effective way of giving it specialized knowledge; if you want to read more about that, check out source number one linked in the description below. With this basic understanding of what RAG is,
let's take a deeper look at the RAG module to see how it actually works. The RAG module consists of two key elements: the retriever and the knowledge base. The way these two work together is that a user query comes in and gets passed to the retriever, which takes the query and searches the knowledge base for relevant pieces of information. It then extracts that relevant information and uses it to output a prompt. This retrieval step typically works using so-called text embeddings; a conceptual sketch of the module is shown below.
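To make the flow concrete before we get into the real code, here is a minimal conceptual sketch, not a specific library's API; `retriever.search` is a hypothetical method standing in for whatever embedding-based search the knowledge base provides.

```python
# Conceptual sketch of a RAG module: retrieve relevant chunks, then build a prompt.
def build_rag_prompt(user_query: str, retriever, top_k: int = 3) -> str:
    # `retriever.search` is hypothetical; it returns the top_k most relevant text chunks.
    chunks = retriever.search(user_query, top_k=top_k)
    context = "\n\n".join(chunks)
    return (
        "Use the context below to respond to the query.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {user_query}"
    )
```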
Leveraging Text Embeddings for Semantic Search
Before we talk about how we can use text embeddings to do search, let's talk about what they are exactly. Put simply, text embeddings are numbers that represent the meaning of some given text. Say we have a collection of words: tree, lotus flower, daisy, sun, Saturn, Jupiter, basketball, baseball, satellite, spaceship. Text embeddings are a set of numbers associated with each of these words and concepts, but they're not just any set of numbers; they capture the meaning of the underlying text, such that if we plot them on an x-y axis, similar concepts end up close together, while concepts that are very different from each other end up far apart.
Here, the plants tend to be located close together, the celestial bodies tend to be close together, the sports balls tend to be close together, and the things you typically see in space tend to be close together. Notice that the balls are closer to the celestial bodies than they are to, say, the plants, perhaps because balls look more like celestial bodies than they do plants and trees. The way we can use this for search is to treat each of these items as a piece of information in our knowledge base: a description of the tree, a description of the lotus flower, a description of Jupiter, and so on.
We represent each item in our knowledge base as a point in this embedding space, and then represent the user's query as another point in the same space. To do a search, we simply look at the items in the knowledge base that are closest to the query and return them as search results; a minimal sketch of this idea is shown below. That's all I'll say about text embeddings and embedding-based search for now. It's actually a pretty rich and deep topic, which I don't want to get into in this article but will cover in the next article of this series. Next, let's talk about the knowledge base. Say you have a stack of documents
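As a rough illustration of embedding-based search, here is a minimal sketch assuming the sentence-transformers package and the BAAI/bge-small-en-v1.5 embedding model (the same one used later in this article); the toy "knowledge base" and query are made up for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Embedding model (same one used later with LlamaIndex).
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Toy knowledge base: each item is a short piece of text.
knowledge_base = [
    "A lotus flower is an aquatic plant that grows in shallow, murky water.",
    "Jupiter is the largest planet in the solar system.",
    "A basketball is an inflated ball used in a popular team sport.",
]
kb_embeddings = model.encode(knowledge_base, normalize_embeddings=True)

# Embed the query and rank items by cosine similarity
# (a dot product, since the vectors are normalized).
query_embedding = model.encode("Which planet is the biggest?", normalize_embeddings=True)
scores = kb_embeddings @ query_embedding
print(knowledge_base[int(np.argmax(scores))])  # expect the Jupiter sentence
```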
that you want to provide to the large language model so you can do some kind of question answering or search over them. The process of taking those raw files and turning them into a knowledge base can be broken down into four steps. The first step is to load the documents: gather the collection of documents you want to include in the knowledge base and get them into a ready-to-parse format. The key thing here is to ensure the critical information in your documents is in text format, because at the end of the day large language models only understand text, so any information
Building a Knowledge Base: From Documents to Vector Databases
you want to pass to them needs to be in that format. The next step is to chunk the documents. This is important because large language models have a fixed context window, which means you can't just dump all your documents into the prompt; they need to be split into smaller pieces. Even if you have a model with a gigantic context window, chunking the documents also leads to better system performance, because the model often doesn't need the whole document, it might just need one or two sentences out of it. By chunking, you can ensure that only relevant information gets passed to the prompt. The third step is to take each of these chunks and translate it into a text embedding, as described earlier.
This takes a chunk of text and translates it into a vector, a set of numbers that represents the meaning of that text. Finally, we take all of these vectors and load them into a vector database, over which we can do the embedding-based search described above. Now that we have a basic understanding of RAG and some key concepts surrounding it, let's see what this looks like in code. Here we're going to improve the YouTube comment responder from the previous article with RAG: we'll provide the fine-tuned model from the previous article with articles from my Medium blog so that it can better respond to technical data science questions.
This example is available on Google Colab as well as in the GitHub repository. The articles used for the RAG system are also available on GitHub, and the fine-tuned model is available on the Hugging Face Hub. We start by importing the proper libraries. This code comes from the Google Colab notebook, so there are a few libraries that are not standard, including LlamaIndex, the PEFT library (the parameter-efficient fine-tuning library from Hugging Face), AutoGPTQ (which we need to import the fine-tuned model), as well as Optimum and bitsandbytes. If you're not running on Colab, also make sure that you install the Transformers library from Hugging Face. With all the libraries installed,
we can import a bunch of things from LlamaIndex. Next, we set up the knowledge base. There are a few settings we need to configure in order to do this, the first of which is the embedding model. The default embedding model in LlamaIndex is actually OpenAI's, but for this example I wanted to keep everything within the Hugging Face ecosystem, so I used the HuggingFaceEmbedding object, which allows us to use any embedding model available on the Hugging Face Hub.
Implementing the Retriever: Retrieving Relevant Context for RAG
I went with one from BAAI called bge-small-en-v1.5, but there are hundreds if not thousands of embedding models available on the Hugging Face Hub. The next thing I do is set the LLM setting to None; this gives me a bit more flexibility in configuring the prompt that I pass into the fine-tuned model. The last two things I set are the chunk size, which I set to 256, and the chunk overlap. Overlap wasn't something I talked about earlier, but having some overlap between chunks helps avoid abruptly chopping a chunk in the middle of a key idea or piece of information that you want to pass to the model. A minimal sketch of this configuration is shown below.
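Here is a sketch of those settings, assuming a recent LlamaIndex (0.10+) with the llama-index-embeddings-huggingface package installed; the chunk overlap value is my own assumption, since the exact number isn't stated above.

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use a Hugging Face embedding model instead of the default (OpenAI).
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Disable LlamaIndex's own LLM step; we only want retrieval here and will
# build the prompt for the fine-tuned model ourselves.
Settings.llm = None

# Chunking settings: chunk size of 256 with a small overlap between chunks.
Settings.chunk_size = 256
Settings.chunk_overlap = 25  # assumed value for illustration
```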
With all these settings configured, we can create a list of documents using the SimpleDirectoryReader object and its load_data method. I have a folder called articles, which contains three articles in PDF format from my Medium blog. This one line of code automatically goes through the folder, reads the PDFs, chunks them, and stores the result in a list called documents, so there's actually a lot of magic happening under the hood here.
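A sketch of the document-loading step, assuming the PDFs sit in a local folder named articles:

```python
from llama_index.core import SimpleDirectoryReader

# Read every file in the "articles" folder; LlamaIndex parses the PDFs
# into Document objects behind the scenes.
documents = SimpleDirectoryReader("articles").load_data()
print(len(documents))
```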
The next thing I do is a little bit of ad hoc pre-processing of the text. Some chunks don't include any information relevant to the meat of the articles themselves, because these PDFs were printed directly from the Medium website, so there's a lot of text before and after each article that isn't relevant to this use case. Here are the three ad hoc rules I created for filtering chunks.
First, I remove any chunk that includes the text "Member-only story". This is typically the text that appears before the article: it will say "Member-only story", then the title of the article, then my name (the author's name), where it was published, that it's an 11-minute read, when it was published, the image caption, and other text that's irrelevant to the article itself. The second rule is that I remove any chunk that includes "The Data Entrepreneurs".
Constructing Prompts with Context: Enhancing Model Responses
This is text I include in the footer of each of my articles, which links the reader to The Data Entrepreneurs community. To see what that might look like: one filtered chunk contains only the last sentence of an article ("although each approach has its limitations, they provide practitioners with quantitative ways of comparing the fat-tailedness of empirical data"), which is probably not helpful for any question you'd ask about the article, and the rest of it is just text from the footer. Finally, I remove any chunk that contains "min read", which typically comes from the recommendations that appear after the article, as we can see in this chunk of text here. Of course, this isn't a super robust way of filtering the chunks; a sketch of these filters is shown below.
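Here is a sketch of the three filters; the exact strings to match are assumptions based on the description above.

```python
# Drop chunks that look like Medium boilerplate rather than article content.
documents = [
    doc for doc in documents
    if "Member-only story" not in doc.text
    and "The Data Entrepreneurs" not in doc.text
    and " min read" not in doc.text
]
print(len(documents))  # 61 chunks remained in this example
```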
A lot of times, though, your pre-processing doesn't have to be perfect; it just has to be good enough for the particular use case. Finally, we can store the remaining 61 chunks in a vector database with a single line of code, shown below. With that, our knowledge base is set up: the index is the vector database we'll be using for retrieval.
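That one line looks roughly like this, again assuming LlamaIndex 0.10+:

```python
from llama_index.core import VectorStoreIndex

# Embed the remaining chunks and store them in an in-memory vector database.
index = VectorStoreIndex.from_documents(documents)
```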
With the knowledge base set up, the next thing to set up is the retriever. First we define the number of chunks to retrieve from the knowledge base, then we pass that, along with the index (our vector database), into a VectorIndexRetriever object. Next, we assemble the query engine, which brings everything together: it takes in the user query and spits out the relevant context. Say the query is "What is fat-tailedness?", the same technical question we passed to the fine-tuned model in the previous article of the series; the query engine spits out a response object.
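A sketch of the retriever and query engine described above:

```python
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

top_k = 3  # number of chunks to return from the search

# The retriever needs the index (our vector database) and the number of chunks to return.
retriever = VectorIndexRetriever(index=index, similarity_top_k=top_k)

# The query engine ties it together: user query in, relevant context out.
query_engine = RetrieverQueryEngine(retriever=retriever)

query = "What is fat-tailedness?"
response = query_engine.query(query)
```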
This response object includes the top three most relevant chunks, but it also includes a lot of other information, such as the file name each chunk was retrieved from, the page number, the date accessed, and some other metadata. To take this response and turn it into something we can actually pass to a large language model, we need to do a little reformatting, and then we can print what that looks like. The text is probably small on your screen, but you can see three chunks of text, ready to be passed into a prompt and subsequently fed into a large language model. A sketch of this reformatting step is shown below.
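The reformatting can be as simple as concatenating the text of the retrieved chunks; the exact formatting here is my own choice.

```python
# Pull the text out of the retrieved chunks and stack it into one context block.
context = "Context:\n"
for i in range(top_k):
    context += response.source_nodes[i].text + "\n\n"
print(context)
```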
Fine-tuning vs. RAG: Comparing Approaches for Specialized Knowledge
At this point, we have all the ingredients of our RAG module: a knowledge base created from three PDFs, and a retrieval step that takes in a user query and returns the relevant context from the knowledge base. The next thing we need to do is import the fine-tuned model so that we can generate a response to the user query.
Here we import a few things from the PEFT and Transformers libraries. We load the base model that we fine-tuned in the previous article, transform it into the fine-tuned model based on the config file available on the Hugging Face Hub, and then load the tokenizer. This is all covered in the previous article of the series, so check that out if you're curious to learn more.
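A sketch of the model import, using a placeholder ID for the fine-tuned adapter (substitute the model from the previous article):

```python
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ID for the fine-tuned adapter on the Hugging Face Hub.
adapter_id = "your-username/your-finetuned-adapter"

# The adapter's config file tells us which base model it was trained on.
config = PeftConfig.from_pretrained(adapter_id)
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path, device_map="auto"
)

# Apply the fine-tuned adapter on top of the base model and load the tokenizer.
model = PeftModel.from_pretrained(base_model, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, use_fast=True)
```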
Now that we have the fine-tuned model imported, let's use it to respond to a technical question without the RAG system. We create a prompt, the same prompt from the previous article, building a prompt template with a lambda function that dynamically takes the instruction string and the user comment to create the prompt. When we print it, we see the instruction start and end special tokens, the instruction string (which goes on for quite a while), and then the comment.
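A sketch of this no-RAG generation path; the [INST]/[/INST] markers and the max_new_tokens value are assumptions based on the prompt format described in the previous article, and the instruction string itself is elided.

```python
instructions_string = "..."  # same instruction string as in the previous article (elided here)

# Prompt template: a lambda that injects the instruction string and the user comment.
prompt_template = lambda comment: f"[INST] {instructions_string}\n{comment} [/INST]"

comment = "What is fat-tailedness?"
prompt = prompt_template(comment)

# Tokenize the prompt, generate a response, and decode it back to text.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=280)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```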
With the prompt defined, we pass it into our tokenizer to translate it from words into tokens, pass those tokens into our model to generate a response, and then print the output. This is what the model says without any context from the RAG system: "Great question! Fat-tailedness is a statistical property of a distribution ...
Case Study: Enhancing YouTube Comment Responses with RAG
... the concept comes up in risk management and financial modeling. I hope this helps! Let me know if you have any questions. -ShawGPT". Overall, the tone and the format are really good here, and it explains things in simple terms. However, the description doesn't really match the description in my article on fat-tailedness. So let's see what happens when we inject the proper context into the prompt using the RAG system we created earlier. We'll again create a lambda function, but now we're not just passing in the comment, we're also passing in context for the comment. This is something
we'll get from our RAG system. Given that we have both, we keep the same instruction string as before, inject the context into the prompt, and then put the comment in the same place it was before. With this new prompt template, we can pass in the context and the comment to get a new prompt; a sketch is shown below.
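A sketch of the context-augmented prompt template and generation step; the wording around the injected context is my own assumption, and the key idea is simply placing the retrieved context between the instructions and the comment.

```python
# Same instruction string as before, but now the retrieved context is injected
# between the instructions and the comment.
prompt_template_w_context = lambda context, comment: f"""[INST] {instructions_string}

{context}
Please respond to the following comment. Use the context above if it is helpful.

{comment} [/INST]"""

prompt = prompt_template_w_context(context, comment)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=280)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```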
We pass the prompt into the tokenizer in exactly the same way as before to generate tokens, pass those tokens into the model to generate an output, and then print the results. Here's the model's response when we include context using RAG: "Great question! Fat-tailedness is a property of a distribution that describes how much the presence of extreme values (rare events) affects the overall statistics of the distribution. In the context of the article, the author discusses
Looking Ahead: Exploring Advanced Applications of RAG Systems
how Mediocristan and Extremistan distributions differ in terms of fat-tailedness. Mediocristan distributions have a relatively small impact from extreme values, while Extremistan distributions are heavily influenced by extreme values. The author also mentions that fat-tailedness is a more general concept than specific distributions like Pareto or power laws; instead, it lives on a spectrum ranging from thin-tailed (Gaussian) to very fat-tailed (Pareto 80-20). I hope that clarifies things a bit! Let me know if you have any questions. -ShawGPT". This does a much better job of capturing the way I describe fat-tailedness in my article. It defines fat-tailedness as how much rare events
drive the overall statistics of a distribution, and it even brings up Mediocristan and Extremistan, which I used in the article to frame the difference between thin-tailed and fat-tailed distributions. I also like that it mentions fat-tailedness is not a binary thing, something else I talked about in the article, but rather lives on a spectrum from not fat-tailed to very fat-tailed. Looking ahead to the next article of the series,
I'm going to dive more deeply into text embeddings, which were an essential part of the RAG system. I'll talk in greater detail about text embeddings and discuss two major use cases, namely semantic search and text classification. If you enjoyed this article and want to learn more, check out the blog on Towards Data Science; even though it's a member-only story, you can access it completely for free using the friend link in the description below. As always, thank you so much for your time, and thanks for reading.