This article delves into the intricate process of building large language models, covering key aspects such as financial considerations, data curation, model architecture, training techniques, and evaluation strategies. From understanding the costs involved to optimizing training processes and assessing model performance, it provides a comprehensive overview for developers and researchers venturing into the realm of large language model development.
Table Of Contents
Introduction to Building Large Language Models
- Overview of the evolution of large language model development.
- Importance of understanding considerations for building large language models.
Financial Costs and Considerations
- Breakdown of the financial costs associated with training large language models.
- Comparison of computational costs for different model sizes.
- Options for renting or buying hardware for training.
Data Curation: The Backbone of Model Building
- Importance of high-quality training data in large language model development.
- Sources of training data, including web scraping, public datasets, and private data sources.
- Challenges and strategies in preparing and curating training data.
Model Architecture: Transforming Inputs to Outputs
- Overview of Transformer architecture as the state-of-the-art for large language models.
- Explaining encoder-only, decoder-only, and encoder-decoder Transformer models.
- Key design choices within the Transformer architecture, including residual connections, layer normalization, and activation functions.
Training at Scale: Techniques and Optimization
- Overview of training techniques to handle the computational challenges of large language models.
- Explanation of mixed precision training, 3D parallelism, and training stability strategies like checkpointing, weight decay, and gradient clipping.
- Discussion on hyperparameters such as batch size, learning rate, optimizer, and dropout.
Evaluation: Assessing Model Performance
- Introduction to model evaluation and its importance in assessing model effectiveness.
- Overview of benchmark datasets like ARC, HellaSwag, MMLU, and TruthfulQA.
- Strategies for evaluating large language models, particularly for multiple-choice tasks.
Considerations for Model Deployment and Application
- Considerations for deploying large language models in real-world applications.
- Implications of Large Language Models on Ethics and Society
- Future directions and challenges in the field of large language model development.
Introduction to Building Large Language Models
Hey everyone, I'm Shaw, and this is the sixth video in a larger series on how to use large language models in practice. In this video, I'm going to review key aspects and considerations for building a large language model from scratch. If you had Googled this topic even just one year ago, you'd probably have seen something very different from what we see today: building large language models was a very esoteric and specialized activity reserved mainly for cutting-edge AI research. But today, if you Google "how to build an LLM from scratch" or "should I build a large language model,"
you'll see a much different story. With all the excitement surrounding large language models post-ChatGPT, we now have an environment where a lot of businesses, enterprises, and other organizations have an interest in building these models. Perhaps one of the most notable examples comes from Bloomberg with BloombergGPT, a large language model that was specifically built to handle tasks in the finance space. However, the way I see it, building a large language model from scratch is often not necessary: for the vast majority of LLM use cases, something like prompt engineering or fine-tuning an existing model is going to be much better suited than building a large language model from scratch.
Overview of the evolution of large language model development.
With that being said, it is valuable to better understand what it takes to build one of these models from scratch and when it might make sense to do so. Before diving into the technical aspects of building a large language model, let's do some back-of-the-napkin math to get a sense of the financial costs we're talking about here.
Taking as a baseline Llama 2, the relatively recent large language model put out by Meta, consider the computational costs associated with the 7-billion and 70-billion-parameter versions of the model. Llama 2 7B took about 180,000 GPU hours to train, while the 70B model, ten times as large, required about ten times as much compute: 1.7 million GPU hours. Now let's do what physicists love to do and work in orders of magnitude.
Importance of understanding considerations for building large language models.
Based on the Llama 2 numbers, we'll say a 10-billion-parameter model takes on the order of 100,000 GPU hours to train, while a 100-billion-parameter model takes on the order of a million GPU hours. So how can we translate this into a dollar amount? We have two options. Option one is to rent the GPUs and compute we need via any of the big cloud providers; an Nvidia A100, which is what was used to train Llama 2, runs on the order of $1 to $2 per GPU per hour. Doing some simple multiplication,
that means the 10-billion-parameter model is going to be on the order of $150,000 just to train, and the 100-billion-parameter model on the order of $1.5 million. Alternatively, instead of renting the compute, you can always buy the hardware. In that case, we just have to take into account the price of the GPUs: say an A100 is about $10,000 and you want to form a GPU cluster of about 1,000 GPUs; the hardware costs alone are going to be on the order of $10 million.
Financial Costs and Considerations
But that's not the only cost. When you're running a cluster like this for weeks, it consumes a tremendous amount of energy, so you also have to take the energy cost into account. Let's say training a 100-billion-parameter model consumes about 1,000 megawatt-hours of energy, and the price of energy is about $100 per megawatt-hour; then the marginal energy cost of training a 100-billion-parameter model is going to be on the order of $100,000.
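To make this arithmetic easy to play with, here is a minimal back-of-the-napkin sketch in Python; the GPU-hour figures, rental rate, hardware price, and energy numbers are the rough assumptions from this section, not quotes from any provider.

```python
# Back-of-the-napkin training cost estimates (all inputs are rough assumptions).
gpu_hours = {"10B": 100_000, "100B": 1_000_000}  # order-of-magnitude GPU hours
rental_rate = 1.5        # $ per A100 GPU-hour (assumed midpoint of the $1-$2 range)
a100_price = 10_000      # $ per A100 (assumed)
cluster_size = 1_000     # GPUs in a purchased cluster
energy_mwh = 1_000       # MWh to train a ~100B-parameter model (assumed)
energy_price = 100       # $ per MWh (assumed)

for size, hours in gpu_hours.items():
    print(f"{size} model, rented compute: ~${hours * rental_rate:,.0f}")

print(f"Purchased cluster hardware: ~${a100_price * cluster_size:,.0f}")
print(f"Marginal energy cost (100B model): ~${energy_mwh * energy_price:,.0f}")
```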
Breakdown of the financial costs associated with training large language models.
Comparison of computational costs for different model sizes.
So now that you've realized you probably won't be training a large language model anytime soon (or maybe you will, I don't know), let's dive into the technical aspects of building one of these models. I'm going to break the process down into four steps: one, data curation; two, model architecture; three, training the model at scale; and four, evaluating the model. Starting with data curation, I would assert this is the most important and perhaps most time-consuming part of the process, and that comes from the basic machine learning principle of garbage in, garbage out. Put another way, the quality of your model is driven by the quality of your data.
So it's super important to get the training data right, especially if you're going to be investing millions of dollars in this model. But this presents a problem: large language models require large training datasets. Just to get a sense of the scale, GPT-3 was trained on half a trillion tokens, Llama 2 was trained on two trillion tokens, and the more recent Falcon 180B was trained on 3.5 trillion tokens. If you're not familiar with tokens, you can check out the previous video in the series, where I talk more about what tokens are and why they're important.
Options for renting or buying hardware for training.
Here we can say that, as far as training data goes, we're talking about a trillion words of text, or in other words about a million novels or a billion news articles. Going through a trillion words of text and ensuring data quality is a tremendous effort and undertaking.
So a natural question is: where do we even get all this text? The most common place is the internet, which consists of web pages, Wikipedia, forums, books, scientific articles, code bases, you name it. Post-ChatGPT, there's a lot more controversy around this and copyright law. The risk with scraping the web yourself is that you might grab data you're not supposed to grab or don't have the rights to, and using it in a model for potential commercial use could come back and cause trouble down the line.
Alternatively, there are many public datasets out there. One of the most popular is Common Crawl, which is a huge corpus of text from the internet, along with more refined versions of it such as the Colossal Clean Crawled Corpus, also called C4. There's also Falcon RefinedWeb, which was used to train the Falcon 180B model mentioned above. Another popular dataset is The Pile, which tries to bring together a wide variety of diverse data sources into the training dataset, and which we'll talk a bit more about in a moment. Then we have Hugging Face, which has really emerged as a big player in the generative AI and large language model space and houses a ton of open-access datasets on its platform. Another option is private data sources; a great example is FinPile, which was used to train BloombergGPT. The key upside of private data sources is that you own the rights to the data and it's data no one else has, which can give you a strategic advantage if you're trying to build a model for some business application.
Data Curation: The Backbone of Model Building
That advantage matters most in competitive settings, where other players are also building their own large language models for business or other applications. Finally, and perhaps most interesting, is using an LLM to generate the training data. A notable example comes from the Alpaca model put out by researchers at Stanford: they trained an LLM, Alpaca, using structured text generated by GPT-3. In my cartoon version of it, you pass the prompt "make me training data" into your large language model
and it spits out the training data for you. Turning to the point of dataset diversity that I mentioned briefly with The Pile: one aspect of a good training dataset seems to be diversity, and the idea is that a diverse dataset translates to a model that can perform well on a wide variety of tasks; essentially, it translates into a good general-purpose model.
Here I've listed a few different models and the composition of their training datasets. You can see GPT-3 is mainly web pages but also some books; Gopher is mainly web pages, but with more books and also some code; Llama is mainly web pages, but also includes books, code, and scientific articles;
and PaLM is largely built on conversational data, but is also trained on web pages, books, and code. How you curate your training dataset will drive the types of tasks the large language model is good at. We're still far from an exact science or theory here, something that says "this particular dataset composition translates to this type of model" or "adding an additional 3% code to your training dataset will have this quantifiable effect on the downstream model."
While we're far from that level of understanding, diversity does seem to be an important consideration when putting together a training dataset. Another important question to ask is how we prepare the data. Again, the quality of our model is driven by the quality of our data, so one needs to be thoughtful with the text used to train a large language model. Here I'm going to talk about four key data preparation steps. The first is quality filtering, which is removing text that is not helpful to the large language model. This could be a bunch of random gibberish from some corner of the internet,
toxic language or hate speech found on some forum, or things that are objectively false, like "2 + 2 = 5," which you'll see in the book 1984; while that text exists out there, it is not a true statement. There's a really nice paper, I believe it's called "A Survey of Large Language Models," that distinguishes two types of quality filtering. The first is classifier-based: you take a small, high-quality dataset and use it to train a text classification model that can automatically score text as good or bad, low quality or high quality.
That precludes the need for a human to read a trillion words of text to assess its quality; the work can largely be offloaded to the classifier. The other approach they define is heuristic-based: using various rules of thumb to filter the text. This could be removing specific words, like explicit text, removing a sentence if a word repeats in it more than two times, or using various statistical properties of the text to do the filtering. And of course, you can combine the two: use the classifier-based method to distill down your dataset and then apply some heuristics on top of that, or vice versa, use heuristics to distill down the dataset and then apply your classifier.
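To make the heuristic-based approach concrete, here is a minimal sketch in Python; the blocklist contents, minimum length, and repetition threshold are made-up illustration values, not a recommended recipe.

```python
import re

BLOCKLIST = {"badword1", "badword2"}  # placeholder terms; a real filter would use a curated list

def passes_heuristics(text: str, min_words: int = 5, max_repeat_frac: float = 0.3) -> bool:
    """Rule-of-thumb quality filter: drop short, blocklisted, or highly repetitive text."""
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < min_words:                           # too short to be useful
        return False
    if any(w in BLOCKLIST for w in words):               # contains unwanted terms
        return False
    top_count = max(words.count(w) for w in set(words))  # crude repetition check
    if top_count / len(words) > max_repeat_frac:         # one word dominating suggests gibberish
        return False
    return True

corpus = ["the the the the the the the", "A short but reasonable sentence about language models."]
print([doc for doc in corpus if passes_heuristics(doc)])
```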
Importance of high-quality training data in large language model development.
Sources of training data, including web scraping, public datasets, and private data sources.
There's no one-size-fits-all recipe for quality filtering; rather, there's a menu of many different options and approaches one can take. Next is deduplication, which is removing multiple instances of the same or very similar text. This is important because duplicate text can bias the model and disrupt training; for example, if some web page exists on two different domains and one copy ends up in the training dataset while the other ends up in the testing dataset, that causes trouble in trying to get a fair assessment of model performance during training.
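Exact-duplicate removal can be as simple as hashing a normalized copy of each document, as in the sketch below; catching near-duplicates (for example, with MinHash) takes considerably more work than this.

```python
import hashlib

def dedupe(docs):
    """Remove exact duplicates by hashing a case- and whitespace-normalized copy of each doc."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Same page, two domains.", "same   page, two domains.", "A different page."]
print(dedupe(docs))  # the near-identical copies collapse to one entry
```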
Another key step is privacy redaction. Text grabbed from the internet in particular might include sensitive or confidential information, and it's important to remove it: if sensitive information makes its way into the training dataset, it could be inadvertently learned by the language model and exposed in unexpected ways. Finally, we have the tokenization step, which is essentially translating text into numbers.
Challenges and strategies in preparing and curating training data.
This is important because neural networks do not understand text directly; they understand numbers, so anything you feed into a neural network needs to come in numerical form. While there are many ways to do this mapping, one of the most popular is the byte pair encoding algorithm, which essentially takes a corpus of text and derives from it an efficient subword vocabulary: it figures out the best choice of subwords or character sequences from which the entire corpus can be represented.
For example, maybe the word "efficient" gets mapped to an integer and exists in the vocabulary, "sub" gets mapped to its own integer, "word" to its own integer, "vocab" to its own integer, and "ulary" to its own integer. So the string "efficient subword vocabulary" might be translated into five tokens, each with its own numerical representation: one, two, three, four, five. There are Python libraries out there that implement this algorithm, so you don't have to write it from scratch; namely, the SentencePiece library and the tokenizers library from Hugging Face (citation numbers and links are in the description and comment section below).
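Here is a minimal sketch of training a byte pair encoding tokenizer with the Hugging Face tokenizers library just mentioned; the tiny corpus and vocabulary size are toy values for illustration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["efficient subword vocabulary", "subword tokenization is efficient"]  # toy corpus

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("efficient subword vocabulary")
print(encoding.tokens)  # the learned subword pieces
print(encoding.ids)     # their integer ids, which is what the model actually sees
```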
Model Architecture: Transforming Inputs to Outputs
Moving on to step two: model architecture. In this step, we need to define the architecture of the language model, and as far as large language models go, Transformers have emerged as the state-of-the-art architecture. A Transformer is a neural network architecture that strictly uses attention mechanisms to map inputs to outputs.
So you might ask: what is an attention mechanism? Here I define it as something that learns dependencies between different elements of a sequence based on both position and content. This is based on the intuition that, with language, context matters. Let's look at a couple of examples. If we see the sentence "I hit the baseball with a bat," the appearance of "baseball" implies that "bat" is probably a baseball bat and not a nocturnal mammal; that's the picture
we have in our minds. This is an example of the content of the context of the word "bat": "bat" exists in the larger context of the sentence, and the content is the words making up that context. The content of the context drives what word comes next and what a given word means. But content isn't enough; the positioning of the words is also important. To see that, consider another example: "I hit the bat with a baseball." Now there's a bit more ambiguity about
what "bat" means. It could still mean a baseball bat, but people don't really hit baseball bats with baseballs, they hit baseballs with baseball bats, so one might reasonably think "bat" here means the nocturnal mammal. An attention mechanism captures both of these aspects of language; more specifically, it uses both the content of the sequence and the position of each element in the sequence to help infer what the next word should be.
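To make this concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation inside a Transformer. It only shows the content-based weighting; position information enters through the positional encodings discussed later in this section.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to each other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # content-weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))                      # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                   # (4, 8)
```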
While it might at first seem that Transformers lock us into one particular architecture, we actually have an incredible amount of freedom and choices to make as developers of a Transformer model. At a high level, there are three types of Transformers, which follow from the two modules that exist in the Transformer architecture: the encoder and the decoder.
Overview of Transformer architecture as the state-of-the-art for large language models.
Explaining encoder-only, decoder-only, and encoder-decoder Transformer models.
We can have an encoder by itself, that's one architecture; we can have a decoder by itself, that's another; and we can have the encoder and decoder working together, which is the third type of Transformer. Let's take a look at these one by one. The encoder-only Transformer translates tokens into a semantically meaningful representation, and these models are typically good for text classification tasks,
or if you're just trying to generate an embedding for some text. Next, we have the decoder-only Transformer, which is similar to an encoder in that it translates text into a semantically meaningful internal representation, but decoders are trying to predict the next word, that is, to predict future tokens, and for this reason decoders do not allow self-attention with future elements.
This makes them great for text generation tasks. To get a bit more intuition for the difference between the encoder and decoder self-attention mechanisms: in an encoder, any part of the sequence can interact with any other part of the sequence, so if we were to zoom in on the weight matrices generating these internal representations, you'd see that none of the attention weights are forced to zero. A decoder, on the other hand, uses so-called masked self-attention.
Key design choices within the Transformer architecture, including residual connections, layer normalization, and activation functions.
Any weight that would connect a token to a future token is set to zero; it doesn't make sense for the decoder to see into the future if it's trying to predict the future, as that would essentially be cheating. Finally, we can combine the encoder and decoder to create a third choice of model architecture; this was actually the original design of the Transformer model,
and it's what's depicted here. What you can do with the encoder-decoder model that you can't do with the others is so-called cross-attention: instead of being restricted to self-attention in the encoder or masked self-attention in the decoder, the encoder-decoder model allows the embeddings from the encoder (which form one sequence) to interact with the internal embeddings of the decoder (another sequence).
Training at Scale: Techniques and Optimization
These two sets of representations share an attention weight matrix, so the encoder representations can communicate with the decoder representations. This tends to be good for tasks such as translation, which was the original application of the Transformer model. While we have three options to choose from when making a Transformer, the most popular by far is the decoder-only architecture, where you're only using the decoder part of the Transformer to do the language modeling. This is also called causal language modeling, which basically means that, given a sequence of text, you want to predict the text that comes next.
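A minimal sketch of the masking just described: score entries that would let a token attend to a future token are set to negative infinity before the softmax, so their attention weights come out as zero. This builds on the self-attention sketch above.

```python
import numpy as np

seq_len = 4
# Upper-triangular positions (above the diagonal) correspond to "future" tokens.
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))  # stand-in for Q @ K.T / sqrt(d)
scores = np.where(causal_mask, -np.inf, scores)                    # block attention to the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # zeros above the diagonal: no token attends to a future token
```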
Beyond this high-level choice of model architecture, there are a lot of other design choices and details that one needs to take into consideration. First is the use of residual connections, which are connections in your model architecture that allow intermediate values to bypass various hidden layers. To make this more concrete (this is from reference 18, linked in the description and comment section below): you have some input and a hidden layer, which is some stack of operations.
Overview of training techniques to handle the computational challenges of large language models.
Explanation of mixed precision training, 3D parallelism, and training stability strategies like checkpointing, weight decay, and gradient clipping.
Instead of strictly feeding the input into the hidden layer, you allow it to go both into the hidden layer and around it, and then you aggregate the original input and the output of the hidden layer in some way to generate the input for the next layer. Of course, there are many different ways to do this, given all the details that can go into a hidden layer: you can add the input and the output of the hidden layer together and then apply an activation to the sum; you can add them and then do some kind of normalization followed by the activation; or you can simply add the original input and the output of the hidden layer together.
Discussion on hyperparameters such as batch size, learning rate, optimizer, and dropout.
You really have a tremendous amount of flexibility and design choice when it comes to these residual connections. In the original Transformer architecture, the way they did it was similar to this: the input bypasses the multi-head attention layer and is added to, and then normalized with, the output of that multi-head attention layer, and the same thing happens for the subsequent feed-forward layer.
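Here is a minimal PyTorch sketch of that post-layer-norm "Add & Norm" pattern; the dimensions are arbitrary, and the sublayer is a stand-in for a multi-head attention or feed-forward block.

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization (post-LN, as in the original Transformer)."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))   # input bypasses the sublayer, then add & normalize

d_model = 16
block = AddAndNorm(d_model, sublayer=nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU()))
x = torch.randn(2, 5, d_model)                   # (batch, sequence, embedding)
print(block(x).shape)                            # torch.Size([2, 5, 16])
```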
Next is layer normalization, which is rescaling values between layers based on their mean and standard deviation. When it comes to layer normalization, there are two considerations. One is where you normalize: you can normalize before the layer, called pre-layer normalization, or after the layer, called post-layer normalization.
The other consideration is how you normalize. One of the most common ways is via LayerNorm: you take your input x, subtract the mean of the input, divide by the square root of the variance plus a small epsilon term, multiply by a gain factor, and optionally add a bias term. An alternative is the root mean square norm, or RMSNorm, which is very similar except that it doesn't have the mean term in the numerator and it replaces the denominator with just the RMS of the input.
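Written out in code, the two normalizations described here look roughly like the NumPy sketch below; the epsilon placement and the optional bias follow common implementations, and exact details vary between libraries.

```python
import numpy as np

def layer_norm(x, gain=1.0, bias=0.0, eps=1e-5):
    """(x - mean) / sqrt(variance + eps), then scale by a gain and shift by a bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def rms_norm(x, gain=1.0, eps=1e-5):
    """Like LayerNorm but with no mean subtraction; divide by the root mean square instead."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))
print(rms_norm(x))
```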
Evaluation: Assessing Model Performance
While you have a few different options for how you do layer normalization, the most common, based on that survey of large language models I mentioned earlier (reference 8), seems to be pre-layer normalization combined with the vanilla LayerNorm approach. Next, we have activation functions, which are non-linear functions included in the model that, in principle, allow it to capture complex mappings between inputs and outputs. There are several common choices for large language models, namely GeLU, ReLU, Swish, SwiGLU, and GeGLU, and I'm sure there are more, but GeLUs seem to be the most common for large language models.
Introduction to model evaluation and its importance in assessing model effectiveness.
Overview of benchmark datasets like ARC, HellaSwag, MMLU, and TruthfulQA.
Another design choice is how we do position embeddings. Position embeddings capture information about token positions. The way this was done in the original Transformer paper was with sine and cosine basis functions, which added a unique value to each token position to represent its position; in the original Transformer architecture, the positional encodings were simply added to the tokenized input, for both the encoder input and the decoder input.
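Here is a minimal NumPy sketch of those sine and cosine positional encodings; each position gets a unique vector that is simply added to the token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even embedding dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

token_embeddings = np.random.default_rng(0).normal(size=(10, 16))   # toy (seq_len, d_model) input
inputs = token_embeddings + sinusoidal_positional_encoding(10, 16)  # position info is just added
print(inputs.shape)
```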
More recently, there's the idea of relative positional encodings: instead of adding some fixed positional encoding before the input is passed into the model, relative positional encodings bake position information into the attention mechanism itself. I won't dive into the details here, but I'll point to the reference "Self-Attention with Relative Position Representations" (citation 20).
The last consideration I'll talk about when it comes to model architecture is how big to make the model. This is important because if a model is too big or trained too long, it can overfit; on the other hand, if a model is too small or not trained long enough, it can underperform. Both of these are relative to the training data, so there's a relationship between the number of parameters, the number of computations (or training time), and the size of the training dataset.
There's a nice paper by Hoffmann et al. that analyzes compute-optimal training for large language models, and I've grabbed a table from that paper summarizing the key findings. It says a 400-million-parameter model should undergo on the order of 2 x 10^19 floating-point operations and be trained on about 8 billion tokens,
a 1-billion-parameter model should use about 10 times as many floating-point operations and be trained on about 20 billion tokens, and so on. My takeaway is that you should have about 20 tokens per model parameter; it's not going to be very precise, but it's a good rule of thumb. And for every 10x increase in model parameters, there's about a 100x increase in floating-point operations. If you're curious about this, check out the paper linked in the description below.
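These rules of thumb are easy to turn into a quick estimate. The sketch below uses the roughly 20 tokens per parameter heuristic along with the common approximation of about 6 FLOPs per parameter per training token; that 6ND approximation is an outside assumption, not a figure from this section.

```python
def compute_optimal_estimate(n_params: float, tokens_per_param: float = 20.0):
    """Rough 'Chinchilla-style' estimate: training tokens and total training FLOPs."""
    tokens = tokens_per_param * n_params
    flops = 6.0 * n_params * tokens     # ~6 FLOPs per parameter per token (common approximation)
    return tokens, flops

for n in (400e6, 1e9, 10e9):
    tokens, flops = compute_optimal_estimate(n)
    print(f"{n/1e9:>5.1f}B params -> ~{tokens/1e9:.0f}B tokens, ~{flops:.1e} FLOPs")
```

With these assumptions, a 10x increase in parameters indeed comes out to roughly a 100x increase in floating-point operations.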
Even if this isn't the optimal approach in every case, it's a good starting place and rule of thumb for training these models. Now we come to step three, which is training these models at scale. Again, the central challenge of large language models is their scale: when you're training on trillions of tokens and the model has billions, tens of billions, or hundreds of billions of parameters, there's a lot of computational cost involved, and it's basically impossible to train one of these models without employing some computational tricks and techniques to speed up the training process.
Strategies for evaluating large language models, particularly for multiple-choice tasks.
Here I'm going to talk about three popular training techniques. The first is mixed precision training, which is essentially using both 32-bit and 16-bit floating-point numbers during model training, such that you use 16-bit floats wherever possible and 32-bit floats only where you have to. You can read more on mixed precision training in that survey of large language models, and there's also nice documentation from NVIDIA linked below.
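Here is a minimal sketch of what mixed precision looks like with PyTorch's automatic mixed precision utilities; the model, data, and optimizer are stand-ins, and it needs a CUDA-capable GPU to run as written.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()                      # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                    # scales the loss so fp16 gradients don't underflow

for _ in range(10):                                     # stand-in training loop
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                     # ops run in reduced precision where safe, fp32 elsewhere
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```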
Next is 3D parallelism, which is the combination of three different parallelization strategies; I'll go through them one by one. First is pipeline parallelism, which distributes the Transformer layers across multiple GPUs and does an additional optimization of putting adjacent layers on the same GPU to reduce the amount of cross-GPU communication that has to take place.
Next is model parallelism, which decomposes the matrix multiplications that make up the model into smaller matrix multiplies and distributes those across multiple GPUs. And finally, there's data parallelism, which distributes the training data across multiple GPUs.
One of the challenges with parallelization, though, is that redundancies start to emerge, because model parameters and optimizer states need to be copied across multiple GPUs; some portion of each GPU's precious memory ends up devoted to storing information that's copied in multiple places. This is where the Zero Redundancy Optimizer, or ZeRO, is helpful: it reduces that redundancy by partitioning the optimizer state, the gradients, and the parameters.
This was just a surface-level survey of these three training techniques. These techniques and many more are implemented by the DeepSpeed Python library, and of course DeepSpeed isn't the only library out there; there are a few others, such as Colossal-AI and Alpa, which I talk about in the blog associated with this video.
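As a rough illustration, a DeepSpeed run is driven by a config like the sketch below; the keys shown are my best understanding of common options, so treat them as assumptions and check the DeepSpeed documentation, since options and defaults change between versions and training is normally launched with the deepspeed launcher.

```python
import deepspeed
import torch.nn as nn

model = nn.Linear(512, 512)          # stand-in for a real Transformer

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},                 # mixed precision training
    "zero_optimization": {"stage": 2},         # ZeRO: partition optimizer state and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4, "weight_decay": 0.1}},
    "gradient_clipping": 1.0,
}

# deepspeed.initialize wraps the model and optimizer with the parallelism/ZeRO machinery
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```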
Another consideration when training these massive models is training stability, and it turns out there are a few things we can do to help ensure the training process goes smoothly. The first is checkpointing, which takes a snapshot of model artifacts so training can resume from that point. This is helpful because, say, your training loss is going down nicely, but then after training for a week you get a sudden spike in the loss that blows up training
and you don't know what happened. Checkpointing allows you to go back to when everything was okay, debug what could have gone wrong, and maybe adjust the learning rate or other hyperparameters so you can try to avoid that loss spike the next time around. Another strategy is weight decay, which is essentially a regularization strategy that penalizes large parameter values.
Considerations for Model Deployment and Application
I've seen two ways of doing this: either adding a term to the objective function, like ordinary regularization, or changing the parameter update rule directly. And finally, we have gradient clipping, which rescales the gradient of the objective function if it exceeds a pre-specified value; this helps avoid the exploding gradient problem, which can blow up your training process.
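Here is a minimal PyTorch sketch showing where those three knobs typically sit in a training loop: weight decay via the optimizer, gradient clipping just before the optimizer step, and periodic checkpointing; all of the specific values are placeholder assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                                                    # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)  # weight decay regularization

for step in range(1, 1001):                                                    # stand-in training loop
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)           # gradient clipping
    optimizer.step()

    if step % 500 == 0:                                                        # periodic checkpointing
        torch.save(
            {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
            f"checkpoint_{step}.pt",
        )
```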
The last thing I want to talk about when it comes to training is hyperparameters. While these aren't specific to large language models, my goal here is just to lay out some common choices for these values. First, we have batch size, which can be either static or dynamic. If it's static, batch sizes are usually pretty big, on the order of 16 million tokens.
But it can also be dynamic; for example, for GPT-3 they gradually increased the batch size from 32,000 tokens to 3.2 million tokens. Next, we have the learning rate, which can also be static or dynamic, but dynamic learning rates seem to be much more common for these models. A common strategy goes as follows: the learning rate increases linearly until it reaches some specified maximum value, and then it decays via a cosine schedule until the learning rate is about 10% of its maximum value.
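A minimal sketch of that schedule, linear warmup followed by cosine decay down to 10% of the peak learning rate; the peak value, warmup length, and step counts are arbitrary placeholders.

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay down to min_ratio * max_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return max_lr * (min_ratio + (1 - min_ratio) * cosine)

for s in (0, 1000, 2000, 50_000, 100_000):
    print(s, f"{lr_at_step(s):.2e}")   # ramps up, peaks at 3e-4, ends near 3e-5
```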
Next, we have the optimizer: Adam or Adam-based optimizers are most commonly used for large language models. And finally, we have dropout; typical dropout values are between 0.2 and 0.5, from the original dropout paper by Hinton et al.
Finally, step four is model evaluation. Just because you've trained your model, and spent millions of dollars and weeks of your time (if not more) doing it, doesn't mean the work is over; in many ways, having a model in hand is really just the starting place. Next, you have to see what this thing actually does and how it works in the context of the desired use case or application, and this is where model evaluation becomes important. For this, there are many benchmark datasets out there; here I'm going to restrict the discussion to the Open LLM Leaderboard, which is a public LLM benchmark
that is continually updated with new models on Hugging Face's models platform. The four benchmarks used in the Open LLM Leaderboard are ARC, HellaSwag, MMLU, and TruthfulQA. While these are only four of many possible benchmark datasets, the evaluation strategies we can use for them port easily to other benchmarks.
First, let's look at ARC, HellaSwag, and MMLU, which are multiple-choice tasks. A bit more about these: ARC and MMLU are essentially grade-school-style questions on subjects like math, history, and common knowledge, posed as a question with a multiple-choice response of A, B, C, or D. An example is: "Which technology was developed most recently? A: cell phone, B: microwave, C: refrigerator, D: airplane." HellaSwag is a little different; these are questions that computers specifically tend to struggle
with. An example from the blog associated with this video goes like this: "A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She... A: rinses the bucket off with soap and blow-dries the dog's head. B: uses a hose to keep it from getting soapy. C: gets the dog wet, then it runs away again. D: gets into a bathtub with the dog." This is a very strange question.
But intuitively, humans tend to do very well on these tasks while computers do not. While these are multiple-choice tasks, and we might think it should be pretty straightforward to evaluate model performance on them, there is one hiccup: these large language models are typically text generation models, so they take some input text
and output more text. They're not classifiers; they don't generate responses like A, B, C, or D, or class 1, class 2, class 3, class 4. They just generate text completions, so you have to do a little trick to get these large language models to perform multiple-choice tasks, and this is essentially done through prompt templates. For example, if you have the question "Which technology was developed most recently?", instead of just passing the question and the choices to the large language model and hoping it figures out to respond with A, B, C, or D, you can use a prompt template like the one shown here.
Considerations for deploying large language models in real-world applications.
Implications of Large Language Models on Ethics and Society
You can additionally prepend the prompt template with a few-shot examples, so the language model picks up that it should return just a single token, and that it should be one of these four tokens. If you pass this into the model,
you'll get a distribution of probabilities over every possible token. Rather than evaluating all the tens of thousands of possible tokens, you just pick the four tokens associated with A, B, C, and D, see which one is most likely, and take that to be the predicted answer from the large language model. So while there is the extra step of creating a prompt template, you can still evaluate a large language model on these multiple-choice tasks in a relatively straightforward way.
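Here is a hedged sketch of that trick using the Hugging Face transformers library; the model name is a placeholder, and the exact prompt template and answer-token handling (leading spaces, letters that split into multiple tokens) depend on the tokenizer you use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the model you actually want to evaluate
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Question: Which technology was developed most recently?\n"
    "A. cell phone\nB. microwave\nC. refrigerator\nD. airplane\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]          # next-token logits right after "Answer:"

choices = {}
for letter in "ABCD":
    # note the leading space; keep only the first id if the letter splits into multiple tokens
    token_id = tokenizer.encode(" " + letter, add_special_tokens=False)[0]
    choices[letter] = logits[token_id].item()

print(max(choices, key=choices.get))                # letter with the highest next-token score
```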
However, this is a bit trickier for open-ended tasks such as TruthfulQA, where there isn't one specific right answer but rather a wide range of possible right answers. Here there are a few different evaluation strategies
we can take. The first is human evaluation, where a person scores the completion based on some ground truth, some guidelines, or both; while this is the most labor-intensive approach, it may provide the highest-quality assessment of model completions. Another strategy is to use NLP metrics, that is, trying to quantify completion quality using metrics such as perplexity, BLEU score, ROUGE score, and so on.
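As one example, perplexity can be computed directly from a causal language model's average cross-entropy loss; here is a minimal sketch with the transformers library, again with a placeholder model name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder evaluation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

completion = "The woman rinses the dog off with the hose."
inputs = tokenizer(completion, return_tensors="pt")
with torch.no_grad():
    # passing labels makes the model return the average cross-entropy loss over the sequence
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity = {torch.exp(loss).item():.1f}")   # lower generally means more 'expected' text
```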
Using the statistical properties of the completion as a proxy for its quality is a lot less labor-intensive, but it's not always clear how a completion's statistical properties map to its actual quality. The third approach, which might capture the best of both worlds, is to use an auxiliary fine-tuned model to rate the quality of the completions; this was actually done in the TruthfulQA paper (reference 30).
Future directions and challenges in the field of large language model development.
There, the authors created an auxiliary model called GPT-judge, which takes model completions and classifies them as either truthful or not truthful; that helps reduce the burden of human evaluation when assessing model outputs. Okay, so what's next? You've created your large language model from scratch; what do you do now? Often this isn't the end of the story.
As the name "base model" suggests, base models are typically a starting point, not the final solution; they're really just a foundation on which to build something more practical, and there are generally two directions to go from here. One is prompt engineering, which is just feeding things into the language model and harvesting its completions for some particular use case.
The other direction is model fine-tuning, where you take the pre-trained model and adapt it for a particular use case. Prompt engineering and model fine-tuning both have their pros and cons; if you want to learn more, check out the previous two videos in this series,
where I do a deep dive into each of these approaches. If you enjoyed this content, please consider liking, subscribing, and sharing it with others. If you have any questions or suggestions for future content, please drop those in the comment section below. And as always, thank you so much for your time, and thanks for watching.