How to Fine Tune Large Language Models? A Comprehensive Guide - icoversai

Discover the power of model fine-tuning for large language models through insightful analysis and practical examples. Learn about approaches like self-supervised learning, supervised learning, and reinforcement learning. Explore how techniques like Low-Rank Adaptation (LoRA) can optimize model performance with fewer parameters. Dive into implementation steps, evaluation metrics, and considerations for overcoming overfitting. Unlock the potential of fine-tuning for various applications like sentiment analysis, and gain valuable insights for your own model fine-tuning endeavors.

Table of Contents:

  • Introduction to Model Fine-Tuning
  • Understanding the Concept of Fine-Tuning
  • Comparison: Base Model vs. Fine-Tuned Model
  • Benefits and Applications of Model Fine-Tuning
  • Approaches to Model Fine-Tuning
  • Example: Fine-Tuning a Language Model for Sentiment Analysis
  • Implementation: Using Low Rank Adaptation (LoRA)
  • Evaluation and Performance Metrics
  • Overfitting Considerations in Fine-Tuning
  • Conclusion and Future Directions

Introduction to Model Fine-Tuning

Hey everyone, I'm Shaw, and this is the fifth article in a larger series on how to use large language models (LLMs) in practice. In the previous article we talked about prompt engineering, which is concerned with using large language models out of the box. Prompt engineering is a very powerful approach and can handle a lot of LLM use cases, but in practice, for some applications, it just doesn't cut it. For those cases, we can go one step further and fine-tune an existing large language model for a specific use case. So the natural question is: what is model fine-tuning? The way I like to define it is taking a pre-trained model and training at least one internal model parameter, meaning the internal weights or biases inside the neural network. What this typically looks like is taking an existing pre-trained model, like GPT-3, and fine-tuning it for a particular use case.

For example, a chatbot. To use an analogy, GPT-3 is like a raw diamond right out of the earth: it's a diamond, but it's a bit rough around the edges. Fine-tuning takes this raw diamond and transforms it into something more practical, something you can put on a diamond ring. The process of taking the raw base model GPT-3 and transforming it into a fine-tuned model such as GPT-3.5 Turbo is what gives us applications like ChatGPT and the other incredible applications of large language models we're seeing these days. To get a more concrete sense of the difference between a base model like GPT-3 and a fine-tuned model, let's look at a particular example.


We have to keep in mind that foundation large language models like GPT-3, Llama 2, or whatever your favorite large language model is are strictly trained to do word prediction: given a sequence of words, predict the next word. When you train one of these large language models on a huge corpus of text, documents, and web pages, what it essentially becomes is a document completer. What that translates to in practice is that if you give a base model like GPT-3 the prompt "Tell me how to fine-tune a model," a typical completion might look something like this.

It just lists out more questions, like something you might see in a Google search or a homework assignment. When I prompted GPT-3 to tell me how to fine-tune a model, the completion was: "How can I control the complexity of a model? How do I know when my model is done? How do I test a model?" While this might be a reasonable completion for GPT-3 given the data it was trained on, it isn't very practical. Now let's look at the fine-tuned model's completion. Here we have text-davinci-003, one of the many fine-tuned models based on GPT-3 from OpenAI, and we give it the same prompt.


Understanding the Concept of Fine-Tuning

"Tell me how to fine-tune a model," and this is the completion: "Fine-tuning a model involves adjusting the parameters of a pre-trained model in order to make it better suited for a given task. There are generally three steps involved in fine-tuning a model: select a base model, adjust parameters, train the model." While this completion may not be perfect, it's much more aligned with what we were hoping to get out of the language model compared to the base model's completion. If you want to learn more about how OpenAI did their fine-tuning, their alignment tuning, and their instruction tuning, check out the references in the description and comment section below. As we saw when comparing a base model to a fine-tuned model, the fine-tuned model can generate completions that are much more aligned and desirable for our particular use case. Beyond raw performance, there's a deeper reason you might want to fine-tune: the observation that a smaller fine-tuned model can often outperform a larger base model.

This was demonstrated by OpenAI with their InstructGPT model, where the smaller, 1.3-billion-parameter fine-tuned InstructGPT model generated completions that were preferred over GPT-3's, even though GPT-3 had about 100 times as many internal parameters. This is one of the biggest upsides of fine-tuning: you don't have to rely on a massive, general-purpose large language model to get good performance on a particular use case or application. Now that we have a better understanding of what fine-tuning is and why it's so useful, let's look at three possible ways to fine-tune an existing large language model. The first is via self-supervised learning. This is the same way the base, foundation large language models are trained: you take a training corpus of text and train the model in a self-supervised way. In other words, you take a sequence of text, like "listen to your," and feed it into the model.

You have it predict a completion; if we feed in "listen to your," it might spit out "heart." What differentiates fine-tuning with self-supervised learning from training a base model with self-supervised learning is that you can curate your training corpus to align with whatever application you're going to use the fine-tuned model for. For example, if I wanted to fine-tune GPT-3 to write text in my likeness, I might feed it a bunch of my Towards Data Science blog posts, and the resulting fine-tuned model might then generate completions that are more in my style. The second way we can fine-tune a model is via supervised learning. This is where we have a training dataset consisting of inputs and their associated outputs, or targets.

Comparison: Base Model vs. Fine-Tuned Model

For example, if we have a set of question-answer pairs, such as "Who was the 35th President of the United States?" with the answer "John F. Kennedy," we can use those pairs to fine-tune an existing model to learn how to better answer questions. The reason this helps is that, as we saw before, if we just feed "Who was the 35th President of the United States?" into a base model, the completion it generates might be more questions: "Who was the 36th President of the United States? Who was the 40th President of the United States? Who is the Speaker of the House?" and so on. By training on question-answer pairs, we can teach the model to actually answer questions. But there's a little trick here: these language models are, again, document completers, so we have to massage these input-output pairs a bit before feeding them into the large language model for training. One simple way to do this is via prompt templates, as in the sketch below.
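For instance, here is a minimal sketch of such a template in Python; the exact wording of the template is illustrative rather than the one used later in the article:

```python
# Hypothetical prompt template for turning question-answer pairs into training text
prompt_template = """Please answer the following question.
Q: {question}
A: {answer}"""

example = prompt_template.format(
    question="Who was the 35th President of the United States?",
    answer="John F. Kennedy",
)
print(example)
```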


With a template like "Please answer the following question:", the question goes in as the input and the answer goes in as the target. Through this process we can translate our training dataset into a set of prompts, generate a training corpus, and then fall back on the self-supervised approach. The final way to fine-tune an existing model is via reinforcement learning. While there are many ways one could do this, I'm going to focus on the approach outlined by OpenAI in creating their InstructGPT models, which consisted of three steps. The first was supervised fine-tuning, essentially what we just described as the second way to fine-tune a model, which consists of two steps: (1) curate your training dataset and (2) fine-tune the model.

The next step was to train a reward model. This is essentially a model that can assign a score to a language model's completion: a good completion gets a high score and a bad completion gets a low score. For InstructGPT this looked as follows. You start with a prompt and pass it into your supervised fine-tuned model, and you don't do it just once; you generate multiple completions for the same prompt. Then you have human labelers rank the responses from worst to best, and you use that ranking to train the reward model. The final step is to do reinforcement learning with your favorite reinforcement learning algorithm.

In the case of InstructGPT, they used proximal policy optimization, or PPO for short. What this looks like is that you take the prompt, pass it into your supervised fine-tuned model, and then pass that completion to the reward model; the reward model gives feedback to the fine-tuned model, and this is how you update the model parameters and eventually end up with a model that's fine-tuned even further. I know this was a ton of information, but if you want to dive deeper into any of these approaches, check out my blog on Towards Data Science where I go into a bit more detail on each of them. To keep things relatively simple, for the remainder of the article we'll focus on the supervised learning approach to model fine-tuning.


Benefits and Applications of Model Fine-Tuning

I break that process down into five steps. First, choose your fine-tuning task: it could be text summarization, text generation, binary or multi-class text classification, whatever it is you want to do. Second, prepare your training dataset; if you're doing text summarization, for example, you would want input-output pairs of text and the desired summary, and you would then take those input-output pairs and generate a training corpus, using prompt templates for example. Third, choose your base model: there are many foundation large language models out there, as well as many existing fine-tuned large language models, and you can choose either as your starting point.

Fourth, fine-tune the model via supervised learning, and fifth, evaluate model performance. There are certainly a lot of details in each of these steps, but here I'm just going to focus on step four: fine-tuning the model with supervised learning. I want to talk about three different options we have when it comes to updating the model parameters. The first option is to retrain all the parameters: given our neural network, our language model, we go in and tweak every parameter. Perhaps obviously, this comes with a downside, because when you're talking about billions, tens of billions, or hundreds of billions of internal model parameters,

the computational cost of training explodes. Even with the most efficient tricks to speed up the training process, retraining billions of parameters is going to be expensive. Another option is transfer learning, where instead of retraining all the parameters of our language model we freeze most of them and only fine-tune the head, namely the last few layers, where the model's internal representations (embeddings) are translated into the output layer. While transfer learning is a lot cheaper than retraining all the parameters, there is still another approach, so-called parameter-efficient fine-tuning. Here, instead of freezing just a subset of the weights, we freeze all of them; we don't change any internal model parameters. Instead, we augment the model.

- Additional trainable parameters are added, so the model can be fine-tuned with only a small set of new parameters.
- Low-rank adaptation (LoRA) is a popular approach for this kind of fine-tuning.
- LoRA adds new trainable parameters to the model while the original weights stay frozen.
- In a neural network, each layer maps its inputs to a hidden representation.
- That mapping can be represented by a weight matrix.
- Without LoRA, every parameter in the weight matrix is trainable.
- This leads to a large number of trainable parameters.
- LoRA reduces the number of trainable parameters by introducing a small set of additional, low-rank parameters.

Approaches to Model Fine-Tuning

What that looks like mathematically: before, we had h(x) = W0·x, the hidden representation produced by the frozen weight matrix W0, as described above. Now we add an additional term, ΔW·x, where ΔW is another weight matrix with the same shape as W0. Looking at this, you might think: how does that help? We just doubled the number of parameters; even if we keep W0 frozen, we still have ΔW, with the same number of parameters, to deal with. But suppose we define ΔW to be the product of two matrices, B and A. In that case the hidden layer becomes h(x) = W0·x + B·A·x.

W0 is the same weight matrix we saw before, but B and A have far fewer entries than W0 does. Through matrix multiplication, B·A produces a matrix of the proper size, namely ΔW; this is added to W0, the sum is multiplied by x, and that gives h(x). In terms of dimensions, W0 and ΔW live in the same space: they are d × k matrices, B is a d × r matrix, A is an r × k matrix, and h(x) is d × 1.
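Written out compactly, the decomposition described above is:

```latex
h(x) = W_0 x + \Delta W\,x = W_0 x + B A x,
\qquad W_0, \Delta W \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
h(x) \in \mathbb{R}^{d \times 1}
```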

Approach 1

The key quantity here is the number r, what the authors of this method call the intrinsic rank. The reason this works, and the reason we get efficiency gains, is that r is much smaller than d and k. Unlike before, where W0 was trainable, W0's parameters are now frozen and only B and A are trainable, and B and A contain far fewer entries than W0. To make this concrete, say d = 1,000, k = 1,000, and the intrinsic rank r = 2. That translates to 4,000 trainable parameters, as opposed to the 1,000,000 trainable parameters we had before. This is the power of LoRA: it lets you fine-tune a model with far fewer trainable parameters. If you want to learn more about LoRA, check out the paper linked in the description below, or, for something more accessible, the blog on Towards Data Science where I talk about it a bit more.
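To sanity-check that arithmetic, here is a quick back-of-the-envelope computation (Python used purely as a calculator, with the numbers from the example above):

```python
# Trainable parameters for a single d x k weight matrix, with and without LoRA
d, k, r = 1000, 1000, 2   # layer dimensions and intrinsic rank from the example
full = d * k              # without LoRA: 1,000,000 trainable parameters
lora = d * r + r * k      # with LoRA: B (d x r) plus A (r x k) = 4,000 trainable parameters
print(full, lora)         # 1000000 4000
```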


Now let's dive into some example code and see how we can use LoRA to fine-tune a large language model. Here I'm going to use the Hugging Face ecosystem, pulling from the datasets, transformers, peft, and evaluate libraries, which are all Hugging Face Python libraries, and also importing PyTorch and NumPy for a few extras. With our imports in place, the next step is to choose a base model. Here I use distilbert-base-uncased, a base model available on the Hugging Face model repository; from the model card we can see it has only about 67 million parameters, along with plenty of other information about the model. We're going to take distilbert-base-uncased and fine-tune it to do sentiment analysis: it will take in some text and generate a label of either positive or negative based on the sentiment of the input.
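A minimal sketch of the imports and base model choice; the class and library names are the standard Hugging Face ones, though the exact set of imports in the original notebook may differ slightly:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, TaskType, get_peft_model
import evaluate
import torch
import numpy as np

# 67M-parameter base model from the Hugging Face model repository
model_checkpoint = "distilbert-base-uncased"
```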

Approach 2

To do that, we first need to define some label maps: 0 means negative and 1 means positive, and, in the other direction, negative maps to 0 and positive maps to 1. We can then take these label maps and our model checkpoint and plug them into the nifty AutoModelForSequenceClassification class from the transformers library, which very easily loads the base model specifically set up for binary classification. The way this works is that Hugging Face hosts these base models in many versions, where the head of the model is swapped out for different tasks; we can get a better sense of this from the transformers documentation.
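A sketch of the label maps and model instantiation described above:

```python
# Map class ids to human-readable sentiment labels and back
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

# Base model with a fresh sequence-classification head for binary sentiment analysis
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
```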


You can see that AutoModelForSequenceClassification can build on top of many base models. Here we're using DistilBERT, a smaller version of BERT, but there are several models you could choose from; the reason I went with DistilBERT is that it has only 67 million parameters and can actually run on my machine. The next step is to load the dataset. I've made the dataset available on the Hugging Face dataset repository, so you should be able to load it pretty easily. It's called imdb-truncated, a dataset of IMDb movie reviews with an associated positive or negative label. If we print the dataset, we see it has two parts, a train split and a validation split, and both have 1,000 rows.
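Loading it might look like the following; the exact repository id is an assumption here, so substitute whichever copy of the truncated IMDb data you are using:

```python
# Hypothetical dataset id; replace with the actual repository name if yours differs
dataset = load_dataset("shawhin/imdb-truncated")
print(dataset)
# DatasetDict({
#     train:      Dataset({features: ['label', 'text'], num_rows: 1000})
#     validation: Dataset({features: ['label', 'text'], num_rows: 1000})
# })
```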


Example: Fine-Tuning a Language Model for Sentiment Analysis

This is another great thing about model fine-tuning: while training a large language model from scratch may require a training corpus of a trillion tokens or more, fine-tuning requires far fewer examples; here we'll use only a thousand. The next step is to preprocess the data. The most important thing is that we need to create a tokenizer. If you've been keeping up with this series, you know that tokenization is a critical step when working with large language models, because neural networks do not understand text, they understand numbers, so we need to convert the text we pass into the large language model into a numerical form it can work with. We can use the AutoTokenizer class from transformers to grab the tokenizer for the particular base model we're working with.

Next we create a tokenization function, which defines how we take each example from our training dataset and translate it from text to numbers. It takes in examples coming from the training dataset and extracts the text. Recall from the printed dataset above that the training data has two features, a label and a piece of text, so each row has some text and a label associated with it; the examples object here is essentially a row from that dataset, and we're grabbing the text from it.

The next thing we do is define which side we want to truncate from. Truncation matters because the examples we pass into the model for training need to be the same length; we can achieve this by truncating long sequences, by padding short sequences to a predetermined fixed length, or by a combination of the two. Here we set the truncation side to left, and then we tokenize the text: we call the tokenizer defined above, pass in the text, return NumPy tensors, and apply the truncation

we just configured, along with a maximum length, and the function returns the tokenized inputs. If the tokenizer does not have a pad token, a special token that can be appended to a sequence and is essentially ignored by the large language model, we add one and then update the model to handle the additional token we just created. Finally, we apply this tokenize function to all the data in our dataset using the map method: we take the dataset, call map, pass in the tokenize function, and it outputs a tokenized version of the dataset. The result is another dataset dictionary with train and validation splits, but now with additional features.
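Putting the preprocessing steps together, a sketch might look like this (the maximum length of 512 is an assumption, and DistilBERT's tokenizer normally ships with a [PAD] token already, so the pad-token step is guarded and shown only to mirror the description above):

```python
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Add a pad token only if the tokenizer lacks one, then resize the model's embeddings
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    model.resize_token_embeddings(len(tokenizer))

def tokenize_function(examples):
    text = examples["text"]              # grab the raw review text
    tokenizer.truncation_side = "left"   # truncate long reviews from the left
    return tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512,                  # assumed maximum sequence length
    )

# Apply the tokenize function to every split of the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
```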


Implementation: Using Low Rank Adaptation (LoRA)

We no longer have only the text and the label; we also have input IDs and an attention mask. One other thing we can do at this point is create a data collator, which dynamically pads the examples in a given batch to the length of the longest sequence in that batch. For example, if we have four examples in a batch and the longest sequence has 500 tokens while the others are shorter, it pads the shorter sequences to match the longest one. The reason this is helpful is that padding dynamically per batch is a lot more computationally efficient than padding every one of the 1,000 training examples to a fixed length; you might have just one very long sequence at 512 tokens, and padding everything to match it creates unnecessary data to process. Next, we want to define an evaluation metric, which is how we will monitor the performance of the model during training.
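The collator itself is a one-liner using the standard transformers class:

```python
# Pads each batch on the fly to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```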

I did something simple: I import the accuracy metric from the evaluate Python library and package the evaluation strategy into a function I call compute_metrics. We're not restricted to a single evaluation metric, or to accuracy in particular, but to keep things simple I stick with accuracy. The function takes a model output and unpacks it into predictions and labels. The predictions here are the logits, so each one has two elements, one associated with the negative class and one with the positive class; all the function does is check which element is larger, and whichever one is larger becomes the label. If the zeroth element is larger, the argmax returns 0 and that becomes the model prediction; if the first element is larger, it returns 1 and that becomes the prediction. We then compute accuracy by comparing the model's predictions to the ground-truth labels.
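Packaged as a function, that might look like:

```python
# Load accuracy from the evaluate library and wrap it for the Trainer
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred               # predictions are the raw logits
    predictions = np.argmax(predictions, axis=1)  # larger logit wins: 0 = negative, 1 = positive
    return accuracy.compute(predictions=predictions, references=labels)
```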

Before training our fine-tuned model, we can evaluate the performance of the base model out of the box. We'll use a short list of examples: "It was good," "Not a fan, don't recommend," "Better than the first one," "This is not worth watching even once," and "This one is a pass."

Evaluation and Performance Metrics

For each piece of text in this list, we tokenize it and compute the logits, that is, we pass it into the model and take the logits out, and then we convert the logits to a label, either 0 or 1. A minimal sketch of that loop appears below, followed by the output.
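Here is that loop as a sketch, using the sentences listed above:

```python
text_list = [
    "It was good.",
    "Not a fan, don't recommend.",
    "Better than the first one.",
    "This is not worth watching even once.",
    "This one is a pass.",
]

print("Untrained model predictions:")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt")  # tokenize one example
    with torch.no_grad():
        logits = model(inputs).logits                      # scores for [negative, positive]
    prediction = torch.argmax(logits)                      # pick the larger logit
    print(f"{text} -> {id2label[prediction.item()]}")
```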

The untrained model's predictions look like this: "It was good" is labeled negative; "Not a fan, don't recommend" is labeled negative, which is correct; "Better than the first one" is labeled negative, even though it's probably positive; "This is not worth watching even once" is labeled negative, which is correct; and "This one is a pass" is also labeled negative. As you can see, it got two out of five right. Essentially this model is as good as chance, as good as flipping a coin; it's right about half the time, which is what we'd expect from an un-fine-tuned base model. Now let's see how we can use LoRA to fine-tune this model and hopefully get better performance. The first thing we need to do is define our LoRA configuration parameters. The first is the task type: we're doing sequence classification.

Next we define the intrinsic rank of the trainable weight matrices; that's the small number r that lets B and A have far fewer parameters than W0 alone. Then we define the LoRA alpha value, a scaling parameter that acts somewhat like a learning rate when using the Adam optimizer. Then we define the LoRA dropout, which is just the dropout probability: the probability with which internal parameters are randomly zeroed out during training. Finally, we define which modules we want to apply LoRA to, and here we apply it only to the query layers.
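A sketch of that configuration; the specific values for the rank, alpha, and dropout are illustrative choices rather than the only reasonable ones, and "q_lin" is DistilBERT's name for its query projection layers:

```python
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification
    r=4,                         # intrinsic rank of the trainable update matrices
    lora_alpha=32,               # scaling factor, loosely analogous to a learning rate
    lora_dropout=0.01,           # probability of zeroing parameters during training
    target_modules=["q_lin"],    # apply LoRA only to the query layers
)
```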

Then we use these configuration settings to update our model and get a new one that is ready to be fine-tuned with LoRA. That's easy: we call get_peft_model, passing in our original model and the config from above. We can then print the number of trainable parameters and see that it's about one million out of the 67 million in the base model. In other words, we'll be fine-tuning less than two percent of the model parameters, a huge cost saving, with roughly 50 times fewer trainable parameters than full-parameter fine-tuning.
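Wrapping the model and checking what is actually trainable:

```python
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# prints roughly: trainable params ~1M || all params ~67M || trainable% < 2
```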


Overfitting Considerations in Fine-Tuning

Next we define our hyperparameters and training arguments. Here I set the learning rate to 0.001, the batch size to 4, and the number of epochs to 10. Then we say where we want the model to be saved; I dynamically create a name from the model checkpoint plus "-lora-text-classification." The learning rate and batch size are the values we just defined, and the weight decay is 0.01. We set the evaluation strategy to epoch, so the evaluation metrics are computed every epoch, and the save strategy to epoch as well, so the model parameters are saved every epoch.

We also set it to load the best model at the end, so when training finishes it returns the best version of the model. Then we plug everything into the Trainer class: it takes the model, the training arguments, our training and validation datasets, the tokenizer, the data collator, and our evaluation metrics. We put all of that into the Trainer and train the model using its train method. A sketch of this setup is shown below; during training, evaluation metrics are generated each epoch.
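Putting the training setup together as a sketch (the argument names follow the standard transformers Trainer API; very recent transformers releases rename evaluation_strategy to eval_strategy):

```python
lr = 1e-3        # learning rate
batch_size = 4
num_epochs = 10

training_args = TrainingArguments(
    output_dir=model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # compute evaluation metrics every epoch
    save_strategy="epoch",        # save a checkpoint every epoch
    load_best_model_at_end=True,  # return the best checkpoint when training finishes
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```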

For each epoch we can see the training loss, the validation loss, and the accuracy. The training loss is decreasing, which is good, and the accuracy is increasing, which is also good, but the validation loss is increasing; this is a sign of overfitting, which I'll comment on in a bit. Now we have our fine-tuned model in hand.

We can evaluate its performance on the same five examples we evaluated before fine-tuning. It's basically the same code copied and pasted, but the output is different: "It was good" is now correctly classified as positive; "Not a fan, don't recommend" is correctly classified as negative; "Better than the first one" is correctly classified as positive; "This is not worth watching even once" is correctly classified as negative; and "This one is a pass" is classified as positive, though that one is a little tricky.

Conclusion and Future Directions

Even though we don't get perfect performance on these five toy examples, we do see that the model is performing a bit better. Returning to the overfitting problem: this example is meant to be more instructive than practical. In practice, before jumping to LoRA, one thing we might have tried is simple transfer learning, to see how close that gets us to a model that does sentiment analysis well; after the transfer learning, we might then use LoRA to fine-tune the model even further. Either way, I hope this example was instructive and gave you an idea of how to start fine-tuning your very own large language models. If you enjoyed this content, please consider liking, subscribing, and sharing it with others. If you have any questions or suggestions for future content, feel free to drop them in the comment section below. As always, thank you so much for your time, and thanks for reading.
