Learn how to compress large language models with techniques like quantization, pruning, and knowledge distillation. Explore Python code examples and practical strategies to reduce model size while maintaining performance.
Table of Contents:
- Introduction: The Power and Challenge of Large Language Models (LLMs)
- What Makes LLMs So Powerful?
- The Problem with Scaling LLMs
- Why Compress Large Language Models?
- The Growing Costs of Bigger Models
- Environmental and Financial Impacts of Large Models
- How Model Compression Can Help
- Overview of Model Compression Techniques
- What is Model Compression?
- Key Benefits of Compressing LLMs
- Method 1: Quantization – Making Models More Efficient
- What is Quantization?
- Post-Training Quantization vs. Quantization-Aware Training
- How to Implement Quantization in Python
- Method 2: Pruning – Removing Unnecessary Model Components
- Unstructured Pruning vs. Structured Pruning
- Implementation of Pruning Techniques
- Method 3: Knowledge Distillation – Learning from a Larger Model
- What is Knowledge Distillation?
- Soft Targets and Synthetic Data in Model Distillation
- Practical Example of Knowledge Distillation in Python
- Combining Compression Techniques for Maximum Efficiency
- Why Combine Quantization, Pruning, and Knowledge Distillation?
- Python Code Example: Compressing an LLM Using Knowledge Distillation and Quantization
- Step-by-Step Guide to Implementing Model Compression
- Loading and Tokenizing Data for Compression
- Evaluation Metrics and Results
- Quantizing a Distilled Model for Even Greater Efficiency
- How to Quantize a Compressed Model
- Comparing Model Sizes and Performance Gains
- Post-Quantization Performance Evaluation
- Conclusion: Achieving Optimal Model Performance with Compression
- Benefits of Compressing LLMs for Real-World Applications
- Final Thoughts and Further Reading
Introduction: The Power and Challenge of Large Language Models (LLMs)
What Makes LLMs So Powerful?
Large language models have demonstrated impressive performance across various use cases. This is largely due to their immense scale. However, deploying these massive models to solve real-world problems can be challenging. In this article, I'll discuss how we can overcome these challenges by compressing LLMs. I'll start with a high-level overview of key concepts and then dive into a specific example with Python code.
The Problem with Scaling LLMs
If you're new here, welcome! I'm Icoversai, and I write articles about data science and entrepreneurship. If you enjoy this content, please consider subscribing; it's a great no-cost way to support my work.
The mantra in AI seemed to be "bigger is better," where the equation for creating better models was more data plus more parameters plus more compute. And we can see that, over time, large language models just kept getting larger and larger.
Why Compress Large Language Models?
The Growing Costs of Bigger Models
This is illustrated by a figure from reference 11: over time, models kept getting bigger and bigger. In 2018, "large" meant something around 100 million parameters. In 2019, with GPT-2, we were up to 1 billion parameters. Then came GPT-3, which was around 100 billion parameters, and more recent language models have a trillion parameters or more. There's no doubt that this equation actually works: GPT-4 is objectively better than GPT-3 and everything that came before it.
However, there's a problem with creating bigger and bigger models: simply put, bigger models come with higher costs. To put this into computational terms, a 100-billion-parameter model stored at 16-bit precision (2 bytes per parameter) takes up about 200 GB of storage.
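As a quick back-of-the-envelope check, here is that arithmetic in Python (the bytes-per-parameter figures are standard; the 100-billion count is just the example from above):

```python
# Rough storage footprint of a model at different numeric precisions.
n_params = 100e9  # 100 billion parameters (example from above)

bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    size_gb = n_params * nbytes / 1e9
    print(f"{dtype}: ~{size_gb:,.0f} GB")

# fp16: ~200 GB -- roughly the figure quoted above.
```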
Environmental and Financial Impacts of Large Models
If you want to use this model, you have to fit this massive thing into the memory of your machine. Needless to say, this comes with high compute costs; it's probably not something that will run on your laptop. You're going to need a lot more computing power than that, which means higher financial costs and, of course, a higher environmental cost.
How Model Compression Can Help
What if there was a way we could make these large-scale models a lot smaller? This is the motivation behind model compression, which aims to reduce the size of a machine learning model without sacrificing its performance.
Overview of Model Compression Techniques
What is Model Compression?
If we're able to pull this off, taking a massive model and shrinking it down to a smaller one, we could run these models on a laptop or even on devices like cell phones and other edge hardware. Not only does this foster greater accessibility for the technology, it also promotes user privacy, because the models can run on-device and user information does not need to be sent to a remote server for inference. It also means lower financial costs and, of course, a much smaller environmental impact.
Key Benefits of Compressing LLMs
Here, I'm going to talk about three different ways we can compress these models: the first is quantization, the second is pruning, and the third is knowledge distillation.
Method 1: Quantization – Making Models More Efficient
What is Quantization?
Starting with quantization: although this might sound like a scary, sophisticated word, it's a very simple idea. Quantization consists of lowering the precision of model parameters. You can think of this as taking a high-resolution photo and converting it to a lower-resolution one that still captures the main features of the image. To put this into computational terms:
you might take a model whose parameters are represented in FP32, using 32 bits each, and translate those parameters into INT8. Just to get a sense of this: representing the number seven in FP32 means the computer has to keep track of 32 binary digits just to encode a single parameter, whereas that same value represented in INT8 takes only 8 bits.
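To make that comparison concrete, here's a minimal NumPy sketch (NumPy is used here purely for illustration; it isn't part of the article's example code):

```python
import numpy as np

x32 = np.float32(7.0)  # 32 bits (4 bytes) per value
x8 = np.int8(7)        # 8 bits (1 byte) per value

print(x32.nbytes, "bytes for fp32")  # 4
print(x8.nbytes, "bytes for int8")   # 1

# Scaled up to a full weight matrix, int8 is a quarter of the fp32 footprint.
w32 = np.random.randn(1024, 1024).astype(np.float32)
print(w32.nbytes / 1e6, "MB in fp32")                   # ~4.2 MB
print(w32.astype(np.int8).nbytes / 1e6, "MB in int8")   # ~1.0 MB (naive cast; real quantization also stores scales)
```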
That's a quarter of the memory footprint. If you want more details on how quantization works under the hood, I talk more about it in a previous article in this series on QLoRA, which I'll link here for those who are interested. So how is quantization applied to large language models in practice?
Post-Training Quantization vs. Quantization-Aware Training
There are two popular categories of approaches. The first is called post-training quantization, and the basic idea is that you train your model and then quantize it. The key upside is that this lets you take models that other people have trained and quantize them without any additional training or data curation. Using this approach, you can take off-the-shelf models that might be encoded in FP32 and convert the parameters to 8-bit or even 4-bit.
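As a minimal sketch of what post-training quantization can look like in practice, here's how an off-the-shelf model can be loaded in 8-bit via the Transformers bitsandbytes integration (the model ID below is just a placeholder, and this assumes a CUDA GPU with the bitsandbytes package installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-350m"  # placeholder; swap in any causal LM you have access to

# Quantize the pretrained weights to 8-bit at load time -- no retraining needed.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

print(model.get_memory_footprint() / 1e6, "MB")  # noticeably smaller than the fp32 footprint
```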
How to Implement Quantization in Python
However, post-training quantization typically leads to a degradation in model performance if you want to compress beyond 4-bit. For situations where even more compression is needed, we can turn to another set of approaches known as quantization-aware training. This essentially flips the order: one quantizes the model first and then trains it. Training models in lower precision is a powerful way to get compact models that still perform well.
This means parameters can be encoded with even fewer than four bits. For example, in reference 6 the authors were able to create a one-bit model that matched the performance of the original LLaMA model. The downside of quantization-aware training, of course, is that it is significantly more involved than post-training quantization, because one has to train the quantized model from scratch.
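To make the "train with quantization in the loop" idea concrete, here is a minimal eager-mode QAT sketch using PyTorch's built-in torch.ao.quantization tooling on a toy network. This is not the one-bit method from reference 6, just the general workflow, and QAT on an actual LLM is far more involved:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert
)

class TinyNet(nn.Module):
    """Toy classifier standing in for a real model."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # converts fp32 inputs to int8 at inference time
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 2)
        self.dequant = DeQuantStub()  # converts int8 outputs back to fp32

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = TinyNet().train()
model.qconfig = get_default_qat_qconfig("fbgemm")
prepare_qat(model, inplace=True)   # inserts fake-quantization ops

# ... run a normal training loop here, so the weights adapt to quantization noise ...

model.eval()
int8_model = convert(model)        # swap in real int8 modules
```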
Method 2: Pruning – Removing Unnecessary Model Components
Unstructured Pruning vs. Structured Pruning
The second compression approach is pruning, which consists of removing unnecessary components from a model. An analogy: pruning is like clipping dead branches off a tree; it reduces the tree's size without harming it. In terms of parameters, we might start with a 100-billion-parameter model and, through pruning, reduce it down to 60 billion parameters. While there is a wide range of pruning approaches out there,
they can be broadly classified into two categories. The first is called unstructured pruning, which consists of removing individual weights or parameters from the model. Visually, if we picture the original network, unstructured pruning might consist of zeroing out a couple of individual weights.
The key benefit of unstructured pruning is that, because it operates at the granular scale of individual weights, it can result in a significant reduction in the number of non-trivial model parameters. However, there's a key caveat: since we're just taking model weights and turning them into zeros, this results in sparse matrix operations when getting predictions from the model.
In other words, the matrix multiplications involved in generating a prediction will contain a lot of zeros, and this isn't something that ordinary hardware can do any faster than dense matrix operations. This means one needs specialized hardware, designed to optimize sparse matrix operations, to realize the benefits of unstructured pruning.
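Here is a minimal sketch of unstructured pruning using PyTorch's torch.nn.utils.prune utilities on a single linear layer, which stands in for one weight matrix inside a larger model:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Zero out the 30% of individual weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zeroed weights: {sparsity:.2f}")  # ~0.30

# Make the pruning permanent (folds the mask into the weight tensor).
prune.remove(layer, "weight")
```

Note that the zeroed weights are still stored in a dense tensor; as discussed above, you need sparse-aware kernels or hardware to turn this sparsity into actual speedups.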
On the other hand, we have structured pruning, which, instead of removing individual weights, removes entire structures from the model. These can be things like attention heads, neurons, or even entire layers. Visually, if we start from the original network, we might remove an entire neuron, which does not result in the sparse matrix operations we see with unstructured pruning.
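And here is the structured counterpart, zeroing whole output neurons (rows of the weight matrix) rather than scattered individual weights; in practice you would then physically slice out those neurons, heads, or layers to actually shrink the model:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Remove entire output neurons: zero out the 25% of rows with the smallest L2 norm.
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

rows_removed = (layer.weight.abs().sum(dim=1) == 0).sum().item()
print(f"output neurons zeroed: {rows_removed} / {layer.out_features}")  # 256 / 1024
```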
Implementation of Pruning Techniques
While this does result in fewer opportunities for model reduction, it allows one to completely remove parameters from the model. If you want to explore specific unstructured and structured pruning techniques, check out reference 5, which provides a nice survey of these approaches.
Method 3: Knowledge Distillation – Learning from a Larger Model
What is Knowledge Distillation?
The final way we can compress an LLM is via knowledge distillation, where we transfer knowledge from a larger model into a smaller one. This is just like how we learn at school: a teacher with much more experience in a particular subject transfers their knowledge to the students. In the context of large language models, the teacher model might have 100 billion parameters, which are then distilled into a student model with just 50 billion parameters. There are two common ways to do this.
Soft Targets and Synthetic Data in Model Distillation
The first is using soft targets, which consists of training the student model using the logits from the teacher model. What does that mean? Let's say our teacher model performs sentiment analysis: given a chunk of text, it labels that text as either positive or negative sentiment. The way these models work, the raw outputs aren't just a positive or negative prediction but a score for each class, known as a logit.
For example, let's say the logit for the positive class is 0.85 and the logit for the negative class is -0.85. This indicates that the input text is more likely to be positive sentiment than negative sentiment. This is exactly how text generation models like Llama 3.1 or GPT-4 work under the hood, except that instead of two output logits, these models have tens of thousands of output logits, one for each token in their vocabulary. These logits are converted into probabilities, and those probabilities can be sampled to generate text one token at a time.
We can use these logits to do knowledge distillation. The way that works is we take our smaller student model, have it generate predictions, and then compare those predictions to the teacher model's predictions for the same input text.
The reason these are called soft targets is that the student model's predictions aren't compared to a zero-or-one ground truth, but rather to a softer, fuzzier probability distribution. This turns out to be an effective strategy because using all the output logits from the teacher model provides richer information for the student model to learn from. Another way to achieve knowledge distillation, instead of using logits to train the student model, is to use synthetic data generated by the teacher model.
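To see what a soft target actually looks like numerically, here is a tiny PyTorch snippet using the sentiment logits from above and a temperature of 2 (the full distillation loss appears later in the code walkthrough):

```python
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([0.85, -0.85])  # positive / negative logits from the example above

soft_targets = F.softmax(teacher_logits / 2.0, dim=-1)  # temperature T = 2 softens the distribution
print(soft_targets)  # ~tensor([0.70, 0.30]) -- fuzzier than a one-hot [1, 0] label
```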
Practical Example of Knowledge Distillation in Python
A popular example of this was the Alpaca model, which took synthetic data generated by the original ChatGPT and used it to perform instruction tuning on LLaMA 7B. In other words, ChatGPT was used to generate input-output pairs, input prompts from users and output responses from the model, which were then used to endow LLaMA 7B with the ability to follow instructions and respond to user prompts.
Now, with a basic understanding of the key concepts behind model compression, let's see what this looks like in code. As always, the example code is freely available on GitHub. Additionally, all the models derived here and the dataset used for training are freely available on the Hugging Face Hub. We'll use Python and PyTorch for this example. The idea is to take a text classifier and compress it using knowledge distillation and quantization.
Combining Compression Techniques for Maximum Efficiency
Why Combine Quantization, Pruning, and Knowledge Distillation?
One thing I forgot to mention is that these three compression approaches, quantization, pruning, and knowledge distillation, are largely independent of one another, which means we can combine multiple approaches to achieve maximum model compression. Here, I'm going to combine knowledge distillation with quantization to achieve roughly a 7x reduction in model size.
Python Code Example: Compressing an LLM Using Knowledge Distillation and Quantization
Step-by-Step Guide to Implementing Model Compression
The first step is to do some imports. Many of these are Hugging Face libraries: datasets comes from Hugging Face, and we'll import a few things from Transformers. We'll also import some things from PyTorch and, finally, some evaluation metrics from scikit-learn. I import the dataset with one line of code; it's something I've made available on the Hugging Face Hub. It consists of training, testing, and validation splits with a 70-15-15 split, so 2,100 examples in the training set and 450 examples each in the testing and validation sets. The dataset consists of two columns.
The first column is website URLs and the second column is a binary label indicating whether the URL is a phishing website or not. This is actually a very practical use case for email providers or cybersecurity folks who want to ensure that links are safe before presenting them to end users. With the data loaded, we'll bring in our teacher model. To speed things up, I used the freely available GPU on Google Colab, so I'm setting that GPU as the device here.
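Concretely, the setup described so far might look like this (a sketch: the dataset repo ID below is a placeholder, not the article's actual Hub name):

```python
import torch
from datasets import load_dataset

# Use the free Colab GPU if it's available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder repo ID -- the real dataset has train/test/validation splits
# of website URLs plus a binary phishing label.
data = load_dataset("your-username/phishing-url-classification")
print(data)  # DatasetDict with 2,100 train examples and 450 each for test/validation
```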
Loading and Tokenizing Data for Compression
Next, I load the teacher model, a model I previously trained on this phishing classification task. We can load the model's tokenizer and the model itself with two lines of code, then move the model onto the GPU. After that, we can set up our student model. Here we create a model from scratch: I copy the DistilBERT architecture to initialize it, but I drop four of the attention heads from each layer and two of the layers from the model.
In other words, each attention layer in DistilBERT has 12 attention heads, so I reduce that to 8, and the original architecture has 6 layers, which I reduce to 4. Then I use the DistilBertForSequenceClassification class. What does that do?
It loads the DistilBERT architecture with these modifications and adds a classification head on top. In other words, instead of generating text, the model performs text classification. We also load the student model onto the GPU. Just to get a sense of the scale: the teacher model has 109 million parameters and takes up 438 MB of memory, while the student model has 52.8 million parameters and takes up 211 MB of memory.
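Loading the teacher and building the shrunken DistilBERT student might look like this (the teacher repo ID is a placeholder; n_heads and n_layers are DistilBERT's config names for heads per layer and number of transformer layers):

```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DistilBertConfig,
    DistilBertForSequenceClassification,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Teacher: a classifier already trained on the phishing task (placeholder repo ID).
teacher_id = "your-username/bert-phishing-classifier-teacher"
tokenizer = AutoTokenizer.from_pretrained(teacher_id)
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_id).to(device)

# Student: DistilBERT architecture with 8 heads per layer (down from 12)
# and 4 layers (down from 6), plus a 2-class classification head.
student_config = DistilBertConfig(n_heads=8, n_layers=4, num_labels=2)
student_model = DistilBertForSequenceClassification(student_config).to(device)

n_params = sum(p.numel() for p in student_model.parameters())
print(f"student parameters: {n_params / 1e6:.1f}M")  # ~53M with default DistilBERT dimensions
```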
The reason I'm using relatively small models by today's standards is that this is what I can easily run on the free GPU in Colab. But if you have beefier GPUs or more compute at your disposal, you can take this code, plug in bigger models, and it should work just fine.
The dataset we loaded consists of plain text with a label. Before we can use this data, we need to tokenize it. Here, I define a simple preprocessing strategy: each URL is converted into a sequence of tokens, the tokens are truncated so they're not too long, and within each batch of examples the shorter sequences are padded so all examples have the same length. This is important because it lets us convert the data into PyTorch tensors and do the computation efficiently on the GPU. That's the preprocessing function; the actual transformation happens in a single line of code, where we take the dataset and map it into tokenized data, making sure we use batches, and then convert it into a PyTorch format with columns for the tokens, the attention mask, and the target labels.
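A sketch of that preprocessing, reusing the tokenizer and data objects from the earlier snippets (the column names "text" and "labels", the 128-token cap, and the fixed-length padding are my simplifications; the article pads dynamically per batch):

```python
def preprocess_function(examples):
    # Tokenize the URLs, truncating long ones; pad to a fixed length for simplicity.
    return tokenizer(
        examples["text"], truncation=True, padding="max_length", max_length=128
    )

tokenized_data = data.map(preprocess_function, batched=True)

# Expose PyTorch tensors for the columns the model and loss need.
tokenized_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"]
)
```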
Evaluation Metrics and Results
Another thing we need to do is define an evaluation function, which will allow us to compute evaluation metrics during model training. A lot is happening here. First, we put the model into eval mode instead of training mode. We initialize two lists: one for model predictions and another for labels. Then we disable gradient calculations and, batch by batch, do the following.
First, we load all the data onto the GPU: the input tokens, the attention mask, and the labels. Then we perform the forward pass, computing the model outputs and extracting the logits. This logits variable consists of two numbers per example: one corresponds to the score that the URL is phishing, the other to the score that it is not phishing.
In other words, the URL either is or is not phishing. We take the argmax of the logits, and that is our prediction. We append the predictions and the ground-truth labels to the lists we initialized earlier. Once we've done that for all the batches, we compute the accuracy, precision, recall, and F1 score for all the data in one go.
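A sketch of that evaluation function (the name evaluate_model and the batch keys are my assumptions, not necessarily the article's exact code):

```python
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_model(model, dataloader, device):
    model.eval()                      # switch off dropout etc.
    all_preds, all_labels = [], []

    with torch.no_grad():             # no gradients needed for evaluation
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1)  # phishing vs. not phishing

            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())

    accuracy = accuracy_score(all_labels, all_preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        all_labels, all_preds, average="binary"
    )
    return accuracy, precision, recall, f1
```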
Next, we define a custom loss function that uses both soft targets (the logits from the teacher model) and the ground-truth labels. We compute a distillation loss as well as a hard loss, then combine them into a final loss. To get the distillation loss, we first compute the soft targets, i.e., the teacher's logits, and then convert those logits into probabilities.
To generate probabilities from the teacher's logits, we use the softmax function, and it's common practice to divide the teacher logits by a temperature parameter, which increases the entropy of the probability distribution. We generate one probability distribution corresponding to the teacher's prediction and one corresponding to the student's prediction. Now that we have these two distributions, one from the teacher and one from the student, we can compare their difference using the KL divergence; PyTorch has a built-in function for that, so we can compute it in a single line of code. Then we compute the hard loss.
Instead of comparing the student model's predictions to the teacher's predictions, here we compare them to the ground-truth labels using the cross-entropy loss. Finally, we combine these losses by adding them together, weighted by an alpha parameter that controls how much weight we give to the distillation loss versus the hard loss.
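A sketch of that combined loss (the function name and the temperature-squared scaling are my additions; the article combines the KL term and the cross-entropy term with an alpha weight as described):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: temperature-scaled probabilities from the teacher.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # Distillation loss: KL divergence between teacher and student distributions.
    # The temperature**2 factor is a common convention; drop it for the plain KL term.
    distill_loss = F.kl_div(
        student_log_probs, soft_targets, reduction="batchmean"
    ) * (temperature ** 2)

    # Hard loss: ordinary cross-entropy against the ground-truth labels.
    hard_loss = nn.CrossEntropyLoss()(student_logits, labels)

    # Weighted combination of the two losses.
    return alpha * distill_loss + (1 - alpha) * hard_loss
```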
Next, we define the hyperparameters. I use a batch size of 32, a learning rate of 0.001, five epochs, a temperature of 2 for the loss function, and an alpha of 0.5, so the distillation loss and the hard loss get equal weight. Then we define our optimizer (Adam) and create two data loaders, one to control the flow of batches for the training data and one for the testing data. Then we train the model with PyTorch: we put the student model into train mode and train it.
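In code, that setup might look like this, reusing the names from the earlier sketches (student_model, tokenized_data); the split names "train" and "test" are assumptions about the dataset:

```python
from torch.optim import Adam
from torch.utils.data import DataLoader

batch_size = 32
lr = 1e-3
num_epochs = 5
temperature = 2.0
alpha = 0.5

optimizer = Adam(student_model.parameters(), lr=lr)

train_loader = DataLoader(tokenized_data["train"], batch_size=batch_size, shuffle=True)
test_loader = DataLoader(tokenized_data["test"], batch_size=batch_size)
```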
We have two for loops: one over epochs and one over batches. It's similar to what we saw in the evaluation function. We load each batch onto the GPU and compute the outputs of the teacher model; since we're not training the teacher, there's no need to calculate gradients, so we wrap that step in a no-grad context. Then we pass the batch through the student model to generate its outputs and extract its logits.
We compute the loss value using the distillation loss we defined earlier and then perform backpropagation. Once we make it through every batch, we print the performance metrics after each epoch: the accuracy, precision, recall, and F1 score for the teacher model, and the same metrics for the student model. We then make sure to put the student model back into train mode, because the evaluation function we defined earlier puts it into eval mode.
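Putting the pieces together, the training loop might look roughly like this (a sketch reusing teacher_model, student_model, distillation_loss, evaluate_model, and the loaders from the earlier snippets):

```python
student_model.train()

for epoch in range(num_epochs):
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Teacher forward pass -- no gradients, since the teacher is frozen.
        with torch.no_grad():
            teacher_logits = teacher_model(input_ids, attention_mask=attention_mask).logits

        # Student forward pass.
        student_logits = student_model(input_ids, attention_mask=attention_mask).logits

        loss = distillation_loss(student_logits, teacher_logits, labels, temperature, alpha)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Evaluate both models on the test set after each epoch.
    teacher_metrics = evaluate_model(teacher_model, test_loader, device)
    student_metrics = evaluate_model(student_model, test_loader, device)
    print(f"epoch {epoch + 1}: loss={loss.item():.4f}")
    print("  teacher (acc, prec, rec, f1):", teacher_metrics)
    print("  student (acc, prec, rec, f1):", student_metrics)

    student_model.train()  # evaluate_model() put the student into eval mode
```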
I know this was a ton of code, maybe way more than you were hoping for, but here are the results of the training. Over the five epochs we can see the loss going down, which is a good sign; it bumped up in epoch 4 but dropped back down in epoch 5, which is very normal. We can also compare the performance of the teacher and student models. Since we're not updating the teacher model, its accuracy stays the same across all epochs, but the student model's performance gets better and better with each epoch. By epoch five, the student model is actually performing better than the teacher across all evaluation metrics.
Comparing Model Sizes and Performance Gains
Next, we evaluate the performance of the teacher and student models on the independent validation dataset. The training set is used to update the model parameters, the testing set is used for tuning the hyperparameters, and the validation set wasn't touched, so it gives us a fair evaluation of each model.
Here we again see that the student model performs better than the teacher across all evaluation metrics. This is one of the other upsides of model compression: if your base (teacher) model is overparameterized, meaning it has far more internal parameters than the task requires, compressing the model not only reduces the memory footprint but can also lead to better performance, because it removes a lot of the noisy and redundant structure in the model. We can go one step further: we've done knowledge distillation,
Post-Quantization Performance Evaluation
so let's see how we can quantize this model. First, I push the student model to the Hugging Face Hub and then load it back in using the bitsandbytes integration in the Transformers library. We use a BitsAndBytesConfig to load it in 4-bit with the normal float (NF4) data type described in the QLoRA paper. This is a clever way of doing the quantization that takes advantage of the fact that model parameters tend to be normally distributed, so you can be a bit smarter about how you quantize the values; I talk more about that in the QLoRA article I mentioned earlier.
Next, we set the compute data type to bfloat16 and, finally, enable double quantization, which is another technique described in the QLoRA paper. Once the config is set up, we can simply load our student model from the Hugging Face Hub with it. The result: we still have the same number of parameters, 52.8 million, but we've reduced the memory footprint from 211 MB down to 62.7 MB. Compare that to our original teacher model:
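A sketch of that 4-bit loading step (the Hub repo ID is a placeholder for wherever the distilled student was pushed):

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization
    bnb_4bit_quant_type="nf4",              # normal-float data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
    bnb_4bit_use_double_quant=True,         # double quantization (also from QLoRA)
)

# Placeholder repo ID -- load the distilled student back from the Hub, quantized.
quantized_student = AutoModelForSequenceClassification.from_pretrained(
    "your-username/distilled-phishing-classifier",
    quantization_config=bnb_config,
    device_map="auto",
)

print(quantized_student.get_memory_footprint() / 1e6, "MB")
```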
we started by cutting the number of model parameters roughly in half, and then we reduced the memory footprint by about 7x overall. But we're not done yet: just because we reduced the model size doesn't mean we maintained the performance.
So now let's evaluate the performance of the quantized model. Here we see that we actually get another performance gain post-quantization. Intuitively, we can understand this through Occam's razor, the principle that simpler models are often better, which might indicate that there's even more opportunity for knowledge distillation on this specific task.
Conclusion: Achieving Optimal Model Performance with Compression
Benefits of Compressing LLMs for Real-World Applications
That brings us to the end. If you enjoyed this article and want to learn more, check out the blog on Towards Data Science. Although it's a member-only story, like all my other articles, you can access it completely for free using the friend link in the description below.
Final Thoughts and Further Reading
Additionally, if you enjoyed this article, you may enjoy the other articles in my LLM series, which you can check out via the playlist linked here. As always, thank you so much for your time, and thanks for reading.