QLoRA—How to Fine-tune an LLM on a Single GPU - icoversai

Discover how to set up, fine-tune, and optimize quantized language models using advanced techniques like QLoRA and Retrieval-Augmented Generation. Learn best practices for prompt engineering, tokenizer usage, and performance enhancement in NLP.


Table of Contents:

  • Introduction to Model Fine-Tuning Techniques
    • Overview of Fine-Tuning Techniques
    • Importance of Parameter Efficiency
  • Memory Efficiency in Large Language Models
    • Traditional Fine-Tuning and Memory Requirements
    • Reducing Memory Footprint with Low-Rank Adaptation (LoRA)
  • Advancements with QLoRA and Low-Rank Adaptation (LoRA)
    • Utilizing 4-bit Normal Floats with QLoRA
    • Combining Double Quantization with LoRA
  • Practical Example: Fine-Tuning Mistral 7B Instruct
    • Using Google Colab for Fine-Tuning
    • Loading Quantized Models from Hugging Face
  • Preparing the Model and Environment
    • Importing Required Libraries and Handling Compatibility Issues
    • Loading and Tokenizing Models and Datasets
  • Prompt Engineering for Improved Model Responses
    • Crafting Effective Prompts for YouTube Comments
    • Utilizing Feedback to Enhance Model Performance
  • Implementing Parameter-Efficient Fine-Tuning (PEFT)
    • Configuring LoRA for Efficient Training
    • Understanding Parameter Reduction and Memory Savings
  • Training the Model: Configuration and Execution
    • Setting Hyperparameters and Training Arguments
    • Managing Training Process and Monitoring Performance
  • Evaluating and Refining Model Outputs
    • Analyzing Model Responses and Identifying Issues
    • Techniques for Improving Response Quality
  • Exploring Retrieval-Augmented Generation (RAG)
    • Benefits of Incorporating Specialized Domain Knowledge
    • Future Directions for Enhancing Model Responses


Introduction to Model Fine-Tuning

Overview of Fine-Tuning Techniques

Fine-tuning is when we tweak an existing model for a particular use case. Although this is a simple idea, applying it to large language models can be complicated. The key challenge is that large language models are very computationally expensive, so fine-tuning them in a standard way is not something you can do on a typical computer or laptop. In this article, I'm going to talk about QLoRA, a technique that makes fine-tuning large language models much more accessible.

If you're new here, welcome. I'm Icoversai, and I make content about data science and entrepreneurship. If you enjoy this article, please consider subscribing; that's a great no-cost way to support me and all the content I make.

Since I talked in depth about fine-tuning in a previous article in this series, here I'll just give a high-level recap of the basic idea. As I said before, fine-tuning is tweaking an existing model for a particular use case. An analogy: fine-tuning is like taking a raw diamond and refining and distilling it into something more practical and usable, like the diamond you might put on a diamond ring.

Importance of Parameter Efficiency

To extend the diamond analogy from above: the raw diamond is your base model, something like GPT-3.

The final diamond you come away with is your fine-tuned model, something like ChatGPT. Again, the core problem with fine-tuning large language models is that they are computationally expensive. To get a sense of the scale, say you have a pretty powerful laptop with a CPU that has 16 GB of RAM and a GPU that has 16 GB of VRAM, and you want to fine-tune a 10-billion-parameter model. Each of these parameters corresponds to a number that we need to represent on our machine.

A standard way of doing this is the FP16 number format, which requires about two bytes of memory per parameter. Doing some simple math, 10 billion parameters times 2 bytes per parameter comes to 20 GB of memory just to store the model parameters. One problem here is that this 20 GB model won't fit on the CPU or the GPU alone, but maybe we can get clever in how we distribute the memory, splitting the load between the CPU and GPU. That would at least allow us to do things like inference, that is, make predictions with the model.

However, when we talk about fine-tuning, we're talking about retraining the model parameters, which requires more than just storing them. Another thing we need is the gradients.

Memory Efficiency in Large Language Models

Traditional Fine-Tuning and Memory Requirements

These are numbers used to update the model parameters during training; there is one gradient for every parameter in the model, so this adds another 20 GB of memory. We've gone from 20 to 40 GB, and now even if we get super clever with how we distribute it across our CPU and GPU, it's still not going to fit.

We'd actually need to add another GPU just to make that work. But of course, this isn't the whole story: you also need room for the optimizer states. If you're using an optimizer like Adam, which is very widely used, these states take up the bulk of the memory footprint for model training. That's because Adam stores a momentum value and a variance value for each parameter in your model, so we have two additional numbers per parameter.

Reducing Memory Footprint with Low-Rank Adaptation (LoRA)

These values also need to be encoded with higher precision, so instead of the FP16 format they are stored in FP32. When it's all said and done, there's about a 12x multiplier on the memory footprint from these optimizer states, which means we're going to need a lot more GPUs to actually fine-tune this model.

These calculations are based on reference number two, the ZeRO paper, which describes a method for efficiently training deep neural networks. We come to a grand total of 160 GB of memory required to train a 10-billion-parameter model. Memory requirements this large aren't going to fit on your laptop; they require some heavy hardware. If you use 80 GB GPUs like the A100, you'll need at least two of them, and those are about $20,000 a pop, so you're probably talking about $50,000 in hardware just to fine-tune a 10-billion-parameter model the standard way. This is where QLoRA comes in: QLoRA is a technique that makes the whole fine-tuning process much more efficient, so much so that you can run it on your laptop without the need for all these extra GPUs.
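To make the arithmetic above concrete, here is a quick back-of-the-envelope sketch, assuming mixed-precision Adam training as accounted for in the ZeRO paper (2 bytes per parameter for FP16 weights, 2 for gradients, and roughly 12 for the FP32 optimizer states):

```python
# Back-of-the-envelope memory estimate for standard fine-tuning
# (mixed-precision Adam, per the ZeRO paper's accounting).
def full_finetune_memory_gb(n_params: float) -> float:
    bytes_weights = 2 * n_params      # FP16 copy of the weights
    bytes_grads = 2 * n_params        # FP16 gradients
    bytes_optimizer = 12 * n_params   # FP32 weights + momentum + variance (Adam)
    return (bytes_weights + bytes_grads + bytes_optimizer) / 1e9

print(full_finetune_memory_gb(10e9))  # ~160 GB for a 10B-parameter model
```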

Before diving into QLoRA, a key concept we need to understand is quantization. Even though quantization might sound like a scary, sophisticated word, it's actually a very simple idea: whenever you hear quantization, just think of splitting a range of numbers into buckets. As an example, consider any number between 0 and 100. Obviously, there are infinitely many numbers that fit in this range: 27, 55.3, 83.7, and so on and so forth.

Quantization consists of taking this infinite range of numbers and splitting it into discrete bins. One way of doing this is to quantize the range using whole numbers: 27 goes into the 27 bucket, 55.3 goes into the 55 bucket, and 83.7 goes into the 83 bucket. If we instead split the range into just ten buckets, 27 goes to 20, 55.3 goes to 50, and 83.7 goes to 80. That's the basic idea.

The reason this is important is that quantization is required whenever you want to represent numbers in a computer. Encoding a single number that lives in an infinite range of possibilities exactly would require infinite bytes of memory, which can't be done in a physically constrained system like a computer, so at some point you have to make approximations. If we go from the infinite range to the range quantized by whole numbers, that requires about 0.875 bytes (7 bits) per number; if we go one step further and split it into just ten buckets, it requires about half a byte per number.
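Here's a minimal sketch of these two bucketing schemes in Python (the bucket boundaries are just the illustrative ones from the example above):

```python
import math

values = [27, 55.3, 83.7]

# Quantize to whole numbers: 101 possible buckets (0..100),
# which needs ceil(log2(101)) = 7 bits, i.e. 0.875 bytes per number.
whole_number_buckets = [int(v) for v in values]
bits_whole = math.ceil(math.log2(101))

# Quantize to ten equally spaced buckets of width 10,
# which needs ceil(log2(10)) = 4 bits, i.e. 0.5 bytes per number.
ten_buckets = [10 * (int(v) // 10) for v in values]
bits_ten = math.ceil(math.log2(10))

print(whole_number_buckets, bits_whole / 8)  # [27, 55, 83] 0.875
print(ten_buckets, bits_ten / 8)             # [20, 50, 80] 0.5
```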

Advancements with QLoRA and Low-Rank Adaptation (LoRA)

Utilizing 4-bit Normal Floats with QLoRA

One thing to point out here is that there's a natural tradeoff. We could use a lot of buckets, which gives us a lot of precision but increases the memory footprint of our model. Or we could use very few buckets, which minimizes the memory footprint but gives a pretty crude approximation of the model we're working with. Balancing this tradeoff is a key contribution of QLoRA.

There are actually four ingredients that come together to make up QLoRA. The first is the 4-bit NormalFloat, the second is double quantization, the third is paged optimizers, and the fourth is LoRA. I'm going to talk through each of these ingredients one by one, starting with ingredient one, the 4-bit NormalFloat. All this is is a better way to bucket numbers, a better way to do quantization. Let's break it down.

When we say something is 4-bit, we mean we're using four binary digits to represent that piece of information. Since each digit can be either zero or one, this gives us 16 unique combinations, so with a 4-bit representation we have 16 buckets at our disposal for quantization. Compressing a range of numbers into just 16 buckets is great for memory savings: four bits translates to half a byte per parameter, so 10 billion parameters translates to 5 GB of memory. But of course this brings up the same problem I mentioned earlier, the tradeoff: we get huge memory savings, but now we have a very crude approximation of the numbers we're trying to represent.



The way ingredient one, the 4-bit NormalFloat, deals with this is by bucketing the numbers in a particularly clever way. Suppose we take all the parameters in our model and plot their distribution. For deep neural networks, it turns out that most parameter values sit around zero, with very few values much smaller or much larger than zero. In other words, the model parameters roughly follow a normal distribution.

If we follow the quantization strategy I described a couple of slides ago, where we split the numbers into equally spaced buckets, we get a pretty crude approximation of the model parameters, because most of the numbers end up sitting in the two middle buckets with very few numbers in the outer buckets. There is an alternative way to do quantization that avoids this.

Instead of using equally spaced buckets, we can use equally sized buckets, so that each bucket ends up holding roughly the same number of parameter values. Now we have a much more even distribution of model parameters across the buckets, and this is exactly the idea the 4-bit NormalFloat uses to balance the tradeoff between low memory and accurately representing model parameters.
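A minimal sketch of the difference, using NumPy quantiles as a stand-in for the NF4 code-book (the real 4-bit NormalFloat uses quantiles of a theoretical normal distribution rather than the empirical ones shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(loc=0.0, scale=1.0, size=100_000)  # stand-in for model weights

n_levels = 16  # 4 bits -> 16 buckets

# Equally spaced buckets over the observed range.
spaced_edges = np.linspace(params.min(), params.max(), n_levels + 1)
spaced_counts = np.histogram(params, bins=spaced_edges)[0]

# "Equally sized" buckets: edges placed at quantiles, so each bucket
# holds roughly the same number of values (the NF4 idea).
quantile_edges = np.quantile(params, np.linspace(0, 1, n_levels + 1))
quantile_counts = np.histogram(params, bins=quantile_edges)[0]

print(spaced_counts)    # heavily concentrated in the middle buckets
print(quantile_counts)  # roughly uniform, ~6250 per bucket
```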

Combining Double Quantization with LoRA

The next ingredient is double quantization, which consists of quantizing the quantization constants. I know the word "quantize" appears way more than anyone would like in that sentence, but let's break it down step by step. Consider a simple quantization strategy: say we have an array of numbers X represented in 32-bit, and we want to translate it into an 8-bit representation whose values live between -127 and 127.

Essentially, we're quantizing by whole numbers and forcing the values to live in the range -127 to 127. A simple way to do that is to rescale all the values in the array by the absolute maximum value in the array, multiply by the new maximum value (127 in our quantized range), and then round so there are no decimal points. That's a very simple way to quantize an arbitrary 32-bit array into an 8-bit integer representation.

To simplify further, we can fold that prefactor into a single constant encoded in 32-bit. This simple quantization strategy isn't exactly how it's done in practice (with equally sized buckets it isn't just the linear transformation shown here), but it illustrates the point: any time you do quantization, there is some memory overhead involved in the computation. In other words, these constants take up precious memory in your system.
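Here's a small sketch of that absolute-maximum (absmax) quantization step, just to make the rescale-and-round recipe concrete:

```python
import numpy as np

def absmax_quantize(x_fp32: np.ndarray):
    """Quantize a float32 array to int8 in [-127, 127] (simple absmax scheme)."""
    c = 127.0 / np.max(np.abs(x_fp32))      # the 32-bit quantization constant
    x_int8 = np.round(c * x_fp32).astype(np.int8)
    return x_int8, c

def absmax_dequantize(x_int8: np.ndarray, c: float) -> np.ndarray:
    """Approximate recovery of the original values."""
    return x_int8.astype(np.float32) / c

x = np.array([0.01, -0.02, 0.003, 1.5], dtype=np.float32)
q, c = absmax_quantize(x)
print(q)                        # [  1  -2   0 127]
print(absmax_dequantize(q, c))  # the small values come back crudely: the outlier-bias problem
```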

As an initial strategy, you might think: if we rescale all the parameters in the input tensor with a single new 32-bit constant for the whole model, what's the big deal? What's one extra number compared to 10 billion parameters? While that does have trivial memory implications, it may not be the best way to quantize the model parameters, because it's very sensitive to extreme values in the input tensor. If most of the parameters are close to zero but one parameter sits way out in the tails, that one value is the absolute max, and it introduces a lot of bias into the quantization process.

Practical Example: Fine-Tuning Mistral 7B Instruct

Using Google Colab for Fine-Tuning

The standard quantization approach minimizes memory, but it comes with maximum potential for bias. An alternative strategy is as follows: take the input tensor, reshape it, split it into blocks, and then do the rescaling process within each block. This significantly reduces the odds of one extreme value skewing all the model parameters during quantization. This is called blockwise quantization.

Although blockwise quantization comes with a greater memory footprint, it has a lot less bias. To mitigate the memory cost of the blockwise approach, we can employ double quantization: we do the blockwise quantization, and then we run the quantization process once again on all the constants that pop up from it.

So we repeat the simple strategy from before: now we have an array of 32-bit constants rather than a single one, and we quantize them into a lower-bit format in the same way. That's double quantization: we are indeed quantizing the quantization constants. While it might be an unfortunate name, it's a pretty straightforward process.
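A minimal sketch of blockwise quantization plus the second-level quantization of the constants (the block size and the int8 second level here are illustrative; the QLoRA paper uses an 8-bit float and a block size of 256 for the second quantization):

```python
import numpy as np

def blockwise_absmax_quantize(x_fp32: np.ndarray, block_size: int = 64):
    """Quantize each block independently, so one outlier only skews its own block."""
    blocks = x_fp32.reshape(-1, block_size)
    constants = 127.0 / np.max(np.abs(blocks), axis=1)      # one 32-bit constant per block
    q_blocks = np.round(blocks * constants[:, None]).astype(np.int8)
    return q_blocks, constants

x = np.random.default_rng(0).normal(size=(4096,)).astype(np.float32)
q_blocks, constants = blockwise_absmax_quantize(x)

# Double quantization: the per-block constants are themselves an array of
# 32-bit numbers, so quantize them too (crudely to int8 here, for illustration).
c2 = 127.0 / np.max(np.abs(constants))
constants_int8 = np.round(constants * c2).astype(np.int8)
print(q_blocks.shape, constants.shape, constants_int8.dtype)
```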

Ingredient three is paged optimizers. All we're doing here is looping your CPU into the training process. Say we have a small model like Phi-1, which has 1.3 billion parameters; based on the same calculations we saw earlier, fully fine-tuning it would require about 21 GB of memory. The dilemma is that although we have enough memory across the GPU and CPU combined for all 21 GB, this isn't something that just works out of the box: these are independent modules on your machine, and typically the training process is restricted to your GPU. With a paged optimizer, instead of restricting training to what fits on your GPU, you can move memory from the GPU to the CPU as needed, and then bring it back onto the GPU when it's needed again. Here's roughly what that looks like.



You start model training, and pages of memory (a page being a fundamental unit or block of memory on the GPU or CPU) start accumulating until GPU memory gets full. At that point, with the paged optimizer approach, you can start moving pages of memory over to the CPU to make room for new memory needed for training. If a page that was moved to the CPU is needed back on the GPU, you make room for it there and move it back over. That's the basic idea.

Honestly, I don't know exactly how this all works under the hood; I'm not a hardware person and don't fully understand the computer architecture involved, so this is my high-level understanding as a data scientist. If you want to learn more, check out the QLoRA paper, where they talk a bit more about it and provide some additional references.

The final ingredient of QLoRA is LoRA, which stands for low-rank adaptation. I talked about LoRA in depth in a previous article on fine-tuning, so here I'll just give a brief, high-level description of how it works; if you want more details, check out that previous article or the LoRA paper linked in the description below. What LoRA does is fine-tune a model by adding a small number of trainable parameters. We can see how this works by contrasting it with the standard full fine-tuning approach.

Say this is our model: a neural network with an input layer, a hidden layer, and an output layer. Full fine-tuning consists of retraining every single parameter in this model. Considering one layer at a time, we have a weight matrix W corresponding to all the connections in that layer, consisting of all the parameters for that particular layer, and all of them are trainable. That's not a big deal when it's just the six parameters of this shallow example network.

But with a large language model, these matrices get pretty big, and you have a lot of them because you have a lot of layers. LoRA, on the other hand, instead of fine-tuning every single parameter in your model, freezes every parameter in the model and adds a small set of trainable parameters, which you then fine-tune. The way this works is you take the same hidden layer and add a small set of trainable parameters through a matrix Delta W. Looking at this, you might think: how does this help, since Delta W is the same size as W and doesn't look like a smaller set of trainable parameters?

The trick with LoRA is that Delta W is actually the product of two small matrices, B and A, whose dimensions are chosen to make the math work out. Visually, you have your weight matrix W, and then B and A, which together have far fewer parameters than W but whose product has the proper shape to make all the matrix operations work. W is frozen, so those parameters aren't trained, and the parameters housed in B and A are the trainable ones.

The result of training the model this way is that you can get a 100x to even 1,000x reduction in trainable parameters: instead of having to train 10 billion parameters, you're only training something like 50 to 100 million. The sketch below shows where those savings come from.
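A minimal sketch of the parameter counting behind LoRA (the layer dimensions and rank below are illustrative, not taken from any specific model):

```python
# LoRA replaces a trainable d x k weight update (Delta W) with two small
# matrices: B (d x r) and A (r x k), where the rank r is much smaller than d and k.
def lora_param_counts(d: int, k: int, r: int):
    full = d * k            # trainable params in full fine-tuning of this layer
    lora = d * r + r * k    # trainable params with LoRA (B and A)
    return full, lora, full / lora

# Example: a 4096 x 4096 attention weight matrix with rank r = 8.
full, lora, ratio = lora_param_counts(4096, 4096, 8)
print(full, lora, round(ratio))  # 16,777,216 vs 65,536 trainable params (~256x fewer)
```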

Loading Quantized Models from Hugging Face

Bringing the four ingredients together, let's first look at the standard fine-tuning approach as a baseline. Say we have our base model represented in FP16: we have the memory footprint of the base model, a larger memory footprint from the optimizer states, and no adapters, because adapters only come in when doing LoRA or another parameter-efficient fine-tuning method. We do the forward pass on the model,

the result goes to the optimizer, the optimizer does the backward pass, and the model parameters get updated. This is the same standard fine-tuning approach we talked about earlier, so a 10-billion-parameter model requires about 160 GB of memory. Another thing we could do is use LoRA to get that 100x to 1,000x savings in the number of trainable parameters. The model is still represented in 16-bit, but instead of fine-tuning every single parameter we only have a small number of trainable parameters, and only those have associated optimizer states, which significantly reduces the memory footprint: a 10-billion-parameter model now requires only about 40 GB of memory.

While that's a tremendous (roughly 4x) savings in memory, 40 GB is still a lot to ask of consumer hardware, so let's see how QLoRA helps even further. The key thing is that instead of using the 16-bit representation, we can use ingredient one and encode the base model as a 4-bit NormalFloat. We keep the same small number of trainable parameters from LoRA, use ingredient two, double quantization, to shave down the quantization overhead, and use ingredient three, paged optimizers, to avoid any out-of-memory errors that might come up during training.
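For reference, this is roughly how those ingredients are typically expressed with the Hugging Face bitsandbytes integration. This is a hedged sketch: the example later in this article loads a GPTQ-quantized model from TheBloke instead, so this exact configuration isn't what the Colab uses.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Ingredient 1: 4-bit NormalFloat; ingredient 2: double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Any causal LM on the Hub works the same way; Mistral 7B Instruct shown here.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
```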

With that, QLoRA lets us fine-tune a 10-billion-parameter model with just about 12 GB of memory, which easily fits in consumer hardware and can even run using the free resources available on Google Colab. So let's see a concrete example: fine-tuning Mistral 7B Instruct to respond to YouTube comments.

Preparing the Model and Environment

Importing Required Libraries and Handling Compatibility Issues

This example is available in the Google Colab notebook associated with this article. The model and dataset are freely available on Hugging Face, and there is also a GitHub repo with all the resources put together, including the code used to generate the training dataset.

The first thing we need to do is import some libraries. Everything here comes from Hugging Face: the Transformers library; the PEFT library (parameter-efficient fine-tuning), which is what allows us to do QLoRA; and the Datasets library, because I uploaded the training dataset to the Hugging Face Hub. Finally, we import the Transformers library itself. A few other packages act as sub-dependencies to ensure these modules work (I believe it's mainly prepare_model_for_kbit_training that relies on them); you don't need to import them, but they do need to be installed in your environment. This was a pain because bitsandbytes only works on Linux and Windows with Nvidia hardware.

GPTQ, the format used to encode the model here, also doesn't run on Mac, which as a Mac user was frustrating; after lots of trial and error I wasn't able to get it working on my machine locally. So if anyone gets it running on an M1, M2, or even M3, send me your example code or any resources you found helpful; I would love to get a version working on my machine. Since Colab provides a Linux environment with Nvidia hardware, the code here works fine there.
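A hedged sketch of the imports described above (the exact list in the notebook may differ slightly):

```python
# Hugging Face libraries used throughout this example.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
import transformers  # for Trainer and TrainingArguments later on

# Not imported directly, but must be installed for GPTQ loading and k-bit training:
# pip install auto-gptq optimum bitsandbytes
```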

Loading and Tokenizing Models and Datasets

Here we're going to grab a quantized version of Mistral 7B Instruct from TheBloke. If you're not familiar with TheBloke, he has quantized and shared thousands of large language models completely for free on the Hugging Face Hub. We can import this model using the from_pretrained method: we specify the model name on the Hub; device_map set to "auto", which lets the Transformers library figure out the optimal way to spread the load between the GPU and CPU when loading the model; and trust_remote_code set to False,

which means it won't allow a custom model file to run on your machine; this is just a way to protect your machine when downloading code from the Hub. Finally, revision set to "main" says we want the main version of the model available at this repo. Again, GPTQ, the format used here, does not run on Mac.

There are some other options for Mac, but I wasn't able to get them working on my machine. Once we have the quantized model loaded, we can load the tokenizer, again using the from_pretrained method: we just specify the model name and set the use_fast argument to True. With those two simple blocks of code we can use the base model. One other thing we do is put the model into evaluation mode, which deactivates the dropout modules.
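A sketch of those loading steps (the repo name follows TheBloke's naming convention; check the Colab for the exact identifier used):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"  # assumed repo name

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",        # let Transformers split the load across GPU/CPU
    trust_remote_code=False,  # don't execute custom model code from the Hub
    revision="main",          # use the main branch of the repo
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

model.eval()  # evaluation mode: deactivates dropout
```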

Next, we can craft our prompt. Say we have a comment from YouTube that says "Great content, thank you!" and we put it into the proper prompt format. Mistral 7B Instruct is an instruction-tuned model, so it expects the prompt in a very particular format, namely wrapped in the [INST] and [/INST] instruction special tokens. We set that up easily: we dynamically take the comment variable and stick it into the prompt. Once we have that, we pass the prompt to the tokenizer, translating it from a string into an array of numbers, and then pass that array into our model to generate more text.
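A sketch of that flow, from comment to generated reply (generation settings like max_new_tokens are illustrative):

```python
comment = "Great content, thank you!"

# Mistral 7B Instruct expects the prompt wrapped in [INST] ... [/INST] tokens.
prompt = f"[INST] {comment} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt")   # string -> token IDs
outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    max_new_tokens=140,                           # illustrative cap on reply length
)
print(tokenizer.batch_decode(outputs)[0])         # token IDs -> text
```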

Prompt Engineering for Improved Model Responses

Crafting Effective Prompts for YouTube Comments

Once we do that, we get the outputs and pass them back into the tokenizer, which decodes the vector of token IDs back into English. For the "Great content, thank you" comment, the output (truncated here) is:

"...you might have about the content I've already provided. Just let me know which article or blog post you're referring to, and I'll do my best to provide you with accurate and up-to-date information. Thanks for reading! I look forward to helping you with any questions you may have." While this is a fine response, there are a few issues with it. First, it's very long; I would never respond to a YouTube comment like this.

Utilizing Feedback to Enhance Model Performance

Second, it kind of just repeats itself: it says "glad you found it helpful, feel free to ask," then "happy to answer questions you have," then "happy to provide you with accurate, updated information," and finally "look forward to helping you with questions," so it says the same thing in different words several times. And it signs off with "thanks for reading," which doesn't fit a YouTube comment reply. One thing we can do to improve model performance is so-called prompt engineering.

I have an in-depth guide on prompt engineering, with seven tricks for improving your prompts, in a previous article in this series, so feel free to check that out if you're interested. The prompt I ended up using here is something I arrived at through trial and error, using a website called Together (linked in the description below). Together has a chat interface, kind of like ChatGPT, but for a lot of open-source models, including Mistral 7B Instruct v0.2.

There I was able to test a lot of prompt ideas, get feedback, eyeball which gave the best performance, and settle on the one I use here. The instructions read: "icoversai GPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature 'icoversai GPT'. icoversai GPT will tailor the length of its responses to match the viewer's comments, providing concise acknowledgements to brief expressions of gratitude or feedback,

thus keeping the interaction natural and engaging." Then there's the instruction "Please respond to the following comment." I use a lambda function that, given a comment, pieces together the instruction string and the comment inside the [INST] special tokens that the model expects, so I can just pass a comment to the prompt template and generate a new prompt. What that looks like is the instruction special tokens, well formatted, followed by the instructions, "Please respond to the following comment," and the comment "Great content, thank you."
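A sketch of that prompt template (the instruction text is abbreviated here; see above for the full wording):

```python
instructions_string = (
    "icoversai GPT, functioning as a virtual data science consultant on YouTube, "
    "communicates in clear, accessible language, escalating to technical depth upon request. "
    "... Please respond to the following comment."
)

# Lambda that wraps the instructions and comment in Mistral's [INST] tokens.
prompt_template = lambda comment: f"[INST] {instructions_string} \n{comment} [/INST]"

comment = "Great content, thank you!"
prompt = prompt_template(comment)
print(prompt)
```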

Implementing Parameter-Efficient Fine-Tuning (PEFT)

Configuring LoRA for Efficient Training

Now, using this new prompt instead of passing the comment directly to the model, we get this response: "Thank you for your kind words! I'm glad you found the content helpful. icoversai GPT." This is really good; it's actually already pretty close to how I typically respond to YouTube comments, and it appropriately signed off as icoversai GPT, so people know the reply came from an AI and not from me personally. Maybe we could just stop here; it's arguably good enough to start using as the comment responder.

But let's see how we can use QLoRA to improve this model even further with fine-tuning. First we need to prepare the model for training, so we put it from eval mode into training mode. We enable gradient checkpointing, which isn't something I've talked about and isn't really part of the QLoRA technique; it's a pretty standard memory-saving trick that clears specific activations and recomputes them during the backward pass. Then we need to enable quantized training:

the base model is in 4-bit and its parameters are frozen, but we still want to do training in higher precision with LoRA, so we need to make sure we enable this quantized-training option.
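A sketch of those preparation steps:

```python
from peft import prepare_model_for_kbit_training

model.train()                          # switch from eval to training mode
model.gradient_checkpointing_enable()  # standard memory-saving trick
model = prepare_model_for_kbit_training(model)  # enable training on a quantized base model
```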

Understanding Parameter Reduction and Memory Savings

Next, we set up LoRA using a LoRA config. I talk more about LoRA in the fine-tuning article, so just briefly: we set the rank to 8 and alpha to 32, target the query modules in the model, set the dropout to 0.05, use no bias values, and set the task type to causal language modeling.

With the config, we pass the model and the config into the get_peft_model method, which creates a LoRA-trainable version of the model, and then we can print the number of trainable parameters. Doing that, we see a significant saving: less than 1% of the original number of trainable parameters. One point of confusion for me personally is that it shows Mistral 7B Instruct as having 264 million parameters here.

From some quick research, it seems that when you do quantization there may be some terms you can drop, but honestly I don't fully understand why we went from 7 billion parameters to just 264 million, so if anyone knows, please drop it in the comments; I'm very curious. The main point, though, is that we're only training about 0.8% of the original number of trainable parameters, so there are huge memory savings from using LoRA.
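A sketch of the LoRA setup described above (the target module name is an assumption for Mistral-style models):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                        # rank of the update matrices B and A
    lora_alpha=32,              # scaling factor
    target_modules=["q_proj"],  # target the query modules (assumed module name)
    lora_dropout=0.05,          # dropout value described above
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # e.g. "trainable params: ... || trainable%: 0.8..."
```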

Training the Model: Configuration and Execution

Setting Hyperparameters for Fine-Tuning

Next, we load the dataset of YouTube comments and responses, which is freely available on the Hugging Face Hub; the code to generate it is in the GitHub repo if you're curious about the formatting. Here's an example from the dataset: we have the beginning-of-string and end-of-string special tokens, the instruction start and end tokens, the same set of instructions as before, and then the comment, which is a real comment from the YouTube channel.

After the instruction string comes the actual response: the reply I left to that comment, with the icoversai GPT sign-off appended so the model learns the appropriate format and style to respond with. The dataset has 59 of these examples, so it's not a huge dataset at all.

Next, we need to pre-process the text, very similarly to how I did it in the previous fine-tuning article. We define a tokenize function that truncates any example longer than the max length of 52 tokens and returns the result as NumPy values. Then we apply this tokenize function to every example in the dataset using the map method: we take the dataset, pass in the tokenize function, and set batched equal to true so it processes examples in batches instead of one by one.
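A sketch of the dataset loading and tokenization steps (the dataset identifier and text column name are placeholders; the article also returns NumPy values, while this sketch keeps the default Python lists for simplicity):

```python
from datasets import load_dataset

data = load_dataset("your-username/youtube-comments")  # placeholder dataset name

def tokenize_function(examples):
    # Truncate anything longer than the max length described above.
    return tokenizer(
        examples["example"],  # assumed text column name
        truncation=True,
        max_length=52,
    )

tokenized_data = data.map(tokenize_function, batched=True)
```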

Managing Training Process and Monitoring Performance

Truncation handles examples that are too long, but there's one more thing to deal with: when training, every example in a batch needs to be the same size so that you can actually do the matrix multiplication. For that, we create a data collator. Given multiple examples of different lengths in a batch, say four of them, the data collator dynamically pads each example

so they all have the same length. To do that, we define a pad token, which I set to the end-of-string token, and create the data collator, setting mlm (masked language modeling) equal to False because we're doing causal language modeling, not masked language modeling. Now we're ready to set up the training process.
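A sketch of the padding and collator setup:

```python
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token  # pad with the end-of-string token

data_collator = DataCollatorForLanguageModeling(
    tokenizer,
    mlm=False,  # causal language modeling, not masked language modeling
)
```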

Evaluating and Refining Model Outputs

Analyzing Model Responses and Identifying Issues

Here we set the hyperparameters: the learning rate, the batch size, and the number of epochs, along with the output directory for the model. Weight decay is set to 0.01. The logging, evaluation, and save strategies are all set to every epoch, which means every epoch we print the training loss, evaluate and print the validation loss, and save a checkpoint of the model.

That way, if something goes wrong, we can load the best model at the end; maybe the best model was at the eighth epoch and it got worse on the ninth. Gradient accumulation steps are set to four and warm-up steps to two. I talk a lot about gradient accumulation and weight decay in the previous article on training a large language model from scratch, so if you're curious about what's going on there, check that out.

Next, we set fp16 equal to true, so we use 16-bit values for training, and we enable the paged optimizer by setting the optimizer option to paged_adamw_8bit; this is ingredient three from before. That's a lot of hyperparameters, and of course you could spend your whole life tuning and tweaking them.
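A sketch of those training arguments (values not stated in the article, such as the learning rate, batch size, and epoch count, are assumptions and marked as such):

```python
import transformers

training_args = transformers.TrainingArguments(
    output_dir="icoversai-gpt-ft",   # placeholder output directory
    learning_rate=2e-4,              # assumed value; not stated in the article
    per_device_train_batch_size=4,   # assumed value; not stated in the article
    num_train_epochs=10,             # assumed value; not stated in the article
    weight_decay=0.01,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    fp16=True,
    optim="paged_adamw_8bit",        # ingredient three: the paged optimizer
)
```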

Once we have that, we can run the training job: we initialize the Trainer, giving it the model, our training and validation datasets, the training arguments we just defined, and the data collator. We also silence some warnings; this is what I saw in a Hugging Face example introducing bitsandbytes, so I did the same here. Then we run the training process. It took about 10 minutes on Google Colab, so it's actually pretty quick, and this is what gets printed: the training loss and validation loss, which both decrease smoothly and monotonically, implying stable training, which is good.
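A sketch of the final training step (the dataset split names are assumptions):

```python
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_data["train"],  # assumed split name
    eval_dataset=tokenized_data["test"],    # assumed split name
    args=training_args,
    data_collator=data_collator,
)

model.config.use_cache = False  # silence warnings during training (re-enable for inference)
trainer.train()
```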

Techniques for Improving Response Quality

Once it's all said and done, we have our fine-tuned model and we can use it. If we pass in that same test comment, "Great content, thank you," we get the response "Glad you enjoyed it! icoversai GPT," and it even adds a disclaimer noting "I am an AI language model. I can't feel emotions or read articles;

I'm here to answer questions and provide explanations." This is good; I feel like this is exactly how I would respond to this comment. If I wanted to remove the disclaimer, I could easily do that with some string manipulation, keeping only the text before the sign-off or something like that. The point is that the fine-tuning process, at least on this one example, seemed to work pretty nicely.

Exploring Retrieval-Augmented Generation (RAG)

Benefits of Incorporating Specialized Domain Knowledge

Let's try a different comment, something more technical like "What is fat-tailedness?" The model's response is actually similar to what we saw in the previous article, when we fine-tuned the OpenAI model and asked it the same question: it gives a good, concise explanation of fat-tailedness.

The only issue is that it didn't explain fat-tailedness the way I explain it in my article series on the topic. This brings up one of the limitations of fine-tuning: it's great for capturing style, but it's not always the optimal way to incorporate specialized knowledge into model responses.

Future Directions for Enhancing Model Responses

This brings us to what's next. Instead of trying to give the model even more examples in an attempt to bake in this specialized knowledge, a simpler approach is to improve the model's responses to these kinds of technical questions by providing it with specialized domain knowledge directly.

The way we can do that is with a so-called RAG system, which stands for retrieval-augmented generation. Right now, we take the comment, pass it into the model with the appropriate prompt, and it spits out a response. With a RAG system, we instead use the comment to retrieve a piece of relevant information from a knowledge base, incorporate that information into the prompt, and then pass the augmented prompt into the model so it can generate a response.
