Table of Contents:
- Introduction to the Series on Full Stack Data Science
- What We’ve Covered So Far
- Overview of Today’s Article: From Data to AI Solutions
- Understanding the Differences Between Traditional Software Development and Machine Learning
- Explicit Logic vs. Data-Driven Learning
- Predictability and Interpretability in Software vs. ML
- Iterative Development in Machine Learning
- The Role of Experimentation in Machine Learning
- Why Experimentation is Key in Data Science
- The Experimentation Process: An Overview
- Data Collection and Preparation
- Iterative Development and Evaluation of Machine Learning Solutions
- Building a Semantic Search System: A Hands-On Example
- What is Semantic Search?
- Key Components of a Semantic Search System
- Design Choices in Developing a Semantic Search Tool
- Deciding What Text to Use: Title, Paragraphs, or Full Documents
- Chunking vs. Summarizing: Optimizing Text for Embeddings
- Choosing and Combining Embedding Models
- Measuring Query and Document Similarity
- Experimenting with Semantic Search: A Practical Project
- Setting Up the Experiment: Data and Tools
- Generating Text Embeddings: Methods and Models
- Evaluating Search Methods: Distance and Similarity Metrics
- Automating and Evaluating Multiple Configurations
- Creating and Testing Various Configuration Options
- Handling Different Embedding Models and Metrics
- Results and Analysis of Semantic Search Experiments
- Comparing Performance Across Configurations
- Interpreting Evaluation Metrics and Rankings
- Next Steps: Deploying Your Machine Learning Solution
- Moving from Experimentation to Production
- Deploying Containerized ML Solutions with APIs on AWS
- Conclusion and Future Directions
- Recap of Key Learnings
Introduction to the Series on Full Stack Data Science
What We’ve Covered So Far
This is the fourth article in a larger series on full stack data science. In the previous article of the series, I discussed how we can build data pipelines for machine learning projects. Here, I'll discuss the next stage in the ML pipeline: how we can use data to build AI solutions.
Understanding the Differences Between Traditional Software Development and Machine Learning
Explicit Logic vs. Data-Driven Learning
Several key differences are important to keep in mind. The first and most fundamental is that in traditional software development, the rules and logic that make up the program are explicitly written by the programmer. In machine learning, however, computers aren't told what to do explicitly; rather, the rules or instructions of the program are learned directly from data. This allows us to build ML solutions for things we could never write traditional software for, such as text generation or autonomous driving.
This indirect way of programming computers gives rise to a few other key differences. For one, the behavior of traditional software systems is typically predictable: given any input to a traditional software system, you can usually know what the output is going to be. On the other hand, the behavior of machine learning systems is a bit more unpredictable. You don't always know how the system will react to particular edge cases, no matter how many tests you come up with to evaluate it.
There will always be examples you couldn't take into consideration, because there are an infinite number of them. Another key difference is that traditional software systems are usually interpretable, meaning you can generally have an intuitive understanding of how the system takes any given input and generates a specific output.
Predictability and Interpretability in Software vs. ML
On the other hand, machine learning systems are often uninterpretable, or at least they're not interpretable in the same way that traditional software systems are. So even though a machine learning system can often deliver better performance than a traditional software system, that performance often comes at the cost of interpretability.
Iterative Development in Machine Learning
Then finally, traditional software development typically has a linear development cycle, or at least a clear one; in other words, projects can progress predictably. Developing machine learning systems, on the other hand, is often iterative, and progress tends to be made in a nonlinear way. These differences create several downstream consequences for how we should think about machine learning development as opposed to traditional software development.
The Role of Experimentation in Machine Learning
Why Experimentation is Key in Data Science
The main thing I want to focus on here is the role of experimentation. The way I see it, this is what makes data science closer to something a scientist might do rather than an engineer. More specifically, scientists typically have hypotheses that they test with experiments,
while engineers are typically implementing a given design. Of course, it's not always this black and white in practice, but experimenting with multiple potential solutions is a key part of a data scientist's role.
The Experimentation Process: An Overview
Data Collection and Preparation
What this typically looks like is represented by the flow chart here. We have the real world, which is full of things that are happening: some we care about, some we don't. When we want to build ML solutions, we collect data about some of the things we care about in the real world. Then we make that data available so that we can develop a machine learning solution with it.
Iterative Development and Evaluation of Machine Learning Solutions
Once we have a candidate solution, we can evaluate the efficacy, or the value, of that solution. Typically this results in a set of feedback loops: you might evaluate a solution and see that the performance isn't so great, so you go back, tweak some parameters, evaluate it again, tweak some more parameters, and keep going around this feedback loop. The loop might even be automated. You may exhaustively search a bunch of different parameters and still not get the results you want.
So you decide to go back and change the dataset you're using for your solution development, and perhaps repeat the whole process. Finally, you might realize that the data you have available isn't sufficient to develop your solution, so you go back to the real world and re-evaluate the data you need. This is why developing ML solutions is iterative and often nonlinear.
Building a Semantic Search System: A Hands-On Example
What is Semantic Search?
You might go through hundreds of iterations of your solution before finally realizing that you weren't collecting sufficient data, and then, once you grab one key variable and pass it into your model, you find that you finally get the performance you need and the value is generated.
Key Components of a Semantic Search System
To make this a bit more concrete, let's look at a specific example. Let's say we wanted to develop a semantic search system. This is something I've talked about in a couple of previous articles in the series, including the one on RAG and the one on text embeddings. But if you're not familiar with semantic search, the basic idea is that we start with a set of documents and generate numerical representations of them, which we call text embeddings. Then we can develop a search tool where a user types in a query.
We generate a numerical representation of that query, evaluate which documents are closest to it, and return those documents as search results. It's called semantic search because rather than matching specific keywords in the user's query, the meaning of the query and the meaning of the documents are captured by these numerical representations. Since I have an article all about text embeddings, I won't go into the details here.
I'll link that article in case you want to learn more. While this might seem like a pretty straightforward idea (take documents, generate text embeddings, and then compute some kind of similarity score between the query and all the different documents), several design choices come up.
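To make that pipeline concrete, here is a minimal sketch of the basic idea using the Sentence Transformers library. The model name and toy documents are just placeholders, not the project's actual configuration:

```python
from sentence_transformers import SentenceTransformer, util

# A toy document set; in practice these would be article titles or transcripts.
docs = [
    "An introduction to large language models",
    "Building data pipelines for machine learning",
    "Understanding fat-tailed distributions",
]

# Any sentence-transformers checkpoint works here; all-MiniLM-L6-v2 is a small, common default.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(docs)

# Embed the user's query and score it against every document.
query_embedding = model.encode("how do I build a semantic search system?")
scores = util.cos_sim(query_embedding, doc_embeddings)  # shape: (1, num_docs)

# Return documents sorted from most to least similar.
ranking = scores.argsort(descending=True)
print([docs[int(i)] for i in ranking[0]])
```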
Design Choices in Developing a Semantic Search Tool
Deciding What Text to Use: Title, Paragraphs, or Full Documents
When developing this kind of system, a first question is what text to use, because documents have a lot of text in them. For example, if these are blog articles, do we just want to use the title? Just the first paragraph of the blog? The entire blog?
Chunking vs. Summarizing: Optimizing Text for Embeddings
Another question is: should we summarize the text? If you have a long document, maybe you want to summarize it so you capture the key information before passing it into an embedding model. But of course, there's more.
Choosing and Combining Embedding Models
What embedding model do you want to choose? There are several readily available models, both open-source and closed-source. Also, should we embed multiple parts of a document? If you have an article, do you want to embed the title and the body separately and then combine them in some way (a small sketch of one way to do that follows below)? And then, when it comes to the search tool, how do you want to measure the distance between a query and all the different documents?
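As one illustration of the "combine them in some way" option, a simple approach is a weighted average of the title and body embeddings. This is just a sketch with an arbitrary 50/50 weighting, not necessarily what the project does:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

title = "Introduction to Text Embeddings"
body = "Text embeddings are numerical representations of text that capture its meaning."

# Embed the title and body separately, then blend them with an (arbitrary) equal weighting.
title_emb = model.encode(title)
body_emb = model.encode(body)
combined = 0.5 * title_emb + 0.5 * body_emb

# Re-normalizing keeps cosine similarity well-behaved after the weighted sum.
combined = combined / np.linalg.norm(combined)
```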
How should we filter results? If you have millions of documents, it might be a good idea to narrow down the candidates before applying the semantic search, since it's a bit more computationally expensive. And should we use meta tags? You might want to add tags to documents to help with this filtering process. All that to say: countless design choices come up when developing any machine learning solution, and everything I've discussed here is far from an exhaustive list.
So to make this even more concrete, let's look at a real-world example of building a semantic search system. Here I'm going to walk through a project I'm currently building to perform semantic search over all of my Website articles. This project has been the focus of this larger series; in the previous article, we built the data pipeline for it. We started with the data source, which was the Website API.
We saw how we could build a data pipeline for this project: I extracted information about all my Website articles from the Website API, did some light transformations, and then loaded the results into a data store, specifically a parquet file. In this article, I'm going to walk through the experimentation piece of building this semantic search tool.
So we're going to take that parquet file, which includes things like each article's ID, title, and transcript, generate text embeddings, and then build a search tool with a user interface. There are a few design choices that I will experiment with: specifically, whether we should base the search on the article's title, its transcript, or both, and which embedding model to pick from three open-source options.
Measuring Query and Document Similarity
Finally, we need to define the metric, that is, how we're going to measure the similarity between the query and all the different articles; there will be five options for that. So if we have three options times three options times five options, that gives us 45 different configurations for this semantic search system. Of course, these aren't things we're going to hardcode one by one; I'll show how we can automatically generate all of these solutions and objectively compare them to one another using an evaluation metric (a sketch of how that configuration grid can be generated is below).
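Rather than hardcoding the 45 configurations, they can be enumerated programmatically. Here is a rough sketch; the option names are placeholders, not necessarily the exact ones used in the project:

```python
from itertools import product

# Placeholder option lists: 3 text choices x 3 embedding models x 5 distance/similarity metrics.
text_options = ["title", "transcript", "title+transcript"]
model_options = ["model-a", "model-b", "model-c"]
metric_options = ["euclidean", "manhattan", "chebyshev", "cosine", "dot"]

# Every combination becomes one candidate configuration of the search system.
configs = list(product(text_options, model_options, metric_options))
print(len(configs))  # 45
```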
With that high-level overview of what we're going to do, I'm going to jump into the code, which is available at the GitHub link here. I'll also put it in the description and comment section below. Before jumping into the code, let's see what the final product looks like. By the end of this, we'll have a user interface like this, where we can type in a query and it will spit out responses. The formatting doesn't look great because it's just a proof of concept, but if I type in something like "LLM", it returns a bunch of articles from my channel, as well as links to them.
So that's pretty cool. Then we can search something else, "what are fat tails", and there we go: we get all my articles on fat-tailedness. Let's see, "how can I build a semantic search system?" This is the perfect article to return, because I literally walk through it in this article. We'll come back and play around with this a bit more, but for now I'm going to walk through three different notebooks, all available in the GitHub repository.
The first one is the experimentation piece, where we loop through all 45 different options and compare them to each other using an evaluation metric. Once we figure out which of the 45 options is best, we'll create an article index based on that configuration. Then, finally, we'll write the search function and create the user interface. Starting from the top:
First, I import Polars, which helps us handle the data structures. Polars, if you're unfamiliar, is basically like pandas, but it's much faster and is gaining popularity rapidly. This project was a good excuse for me to try out Polars, and so far I've enjoyed the experience. Then we import Sentence Transformers, which has a handful of open-source text embedding models we can use, and then we import some distance metrics from sklearn.
The distance metrics will allow us to evaluate how similar a user's query is to each article in the dataset. We import NumPy to work with the matrices we get from the search function, and then I import matplotlib, which I may or may not use. It's a great thing to have whenever you're doing any sort of experimentation with machine learning models, so you can plot things like histograms and scatter plots to compare the performance of different solutions.
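Roughly, the imports look something like this. I'm assuming scikit-learn's DistanceMetric is what provides the distance objects used later; treat that as a guess based on the description:

```python
import polars as pl                                     # fast DataFrame library, similar to pandas
import numpy as np                                      # for the embedding and distance matrices
import matplotlib.pyplot as plt                         # optional: plots for comparing configurations
from sentence_transformers import SentenceTransformer   # open-source text embedding models
from sklearn.metrics import DistanceMetric              # distance metrics (euclidean, manhattan, ...)
```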
Experimenting with Semantic Search: A Practical Project
Setting Up the Experiment: Data and Tools
First, we load the data, like in any other machine learning project. The way I do it here is with two datasets. One is a dataset of the transcripts, saved in the article-transcripts parquet file. It contains all of my Website articles and Website shorts, so I have all my article IDs, the dates they were posted, the title of the content, and the transcript. This is just the head; we can also look at the shape. I have 83 articles, a very small dataset by ML standards, but it took a long time to make those 83 articles.
Next, we have the evaluation dataset, which consists of two columns: one is an example query and the other is the ground-truth article associated with that query. The point of this evaluation dataset is to give us a way to objectively compare multiple potential solutions to one another. Whether you're training a model from scratch or using a model off the shelf, like we're doing in this example,
you need an evaluation dataset so you can effectively compare multiple candidate solutions. We can also look at the shape of this dataset, and we see we have 64 examples.
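Loading the two datasets with Polars might look like this; the file names are my guesses based on the description, not the repo's actual paths:

```python
# Article data: one row per article with ID, date, title, and transcript.
df = pl.read_parquet("data/article-transcripts.parquet")
print(df.head())
print(df.shape)       # 83 articles

# Evaluation data: example queries paired with their ground-truth article.
df_eval = pl.read_parquet("data/eval-queries.parquet")
print(df_eval.shape)  # 64 query/ground-truth pairs
```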
Next, I do some data preparation. I'm going to loop through each title and transcript in the original data frame, and for each of those two columns, loop through three different embedding models available in the Sentence Transformers library. So two different columns with three different models gives us six possible combinations in this chunk of code.
I loop through every possible combination: the title with each of the three models, and the transcript with each of the three models. For each combination, I generate the embeddings. What that looks like is a nested for loop: one for loop over the model names and one over the column names. I'm going to store everything in a dictionary, so I initialize that here. Now, walking through this code:
Generating Text Embeddings: Methods and Models
First, we define the embedding model we want to use: we set the model equal to the SentenceTransformer for that model name. Once we have the model, we can generate embeddings for a particular column. Here, I define a key so that we have a unique identifier for each element in the dictionary, and then, in this line of code, I use the model to generate the text embeddings for every piece of text in that column. For example, if we're encoding the title, this takes the title column of the data frame, converts it to a list, passes it into the encode function, and spits out an array of all the embeddings.
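Putting that together, the nested loop looks roughly like this sketch. The first and last model names are the ones mentioned later in the article; the middle one is an assumption standing in for whichever third open-source checkpoint the project actually uses:

```python
# Candidate models and text columns; every (model, column) pair gets its own embedding array.
model_names = [
    "all-MiniLM-L6-v2",            # smallest of the three (~80 MB, 384-dim embeddings)
    "multi-qa-distilbert-cos-v1",  # assumed third model; any sentence-transformers checkpoint slots in
    "multi-qa-mpnet-base-dot-v1",  # largest of the three (~420 MB, 768-dim embeddings)
]
column_names = ["title", "transcript"]

text_embedding_dict = {}
for model_name in model_names:
    model = SentenceTransformer(model_name)
    for column_name in column_names:
        # Unique key for this (model, column) combination, e.g. "all-MiniLM-L6-v2_title"
        key = f"{model_name}_{column_name}"
        # Encode every piece of text in the column -> array of shape (num_articles, embedding_dim)
        embedding_arr = model.encode(df[column_name].to_list())
        text_embedding_dict[key] = embedding_arr
```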
Evaluating Search Methods: Distance and Similarity Metrics
Finally, we store the key name and embedding array in the dictionary. The key name is just a unique ID: the model name combined with the column name, and the value is the embedding array for that combination. If we look at the embedding array, it's 83 by 768: we have 83 articles, and this model's text embeddings have 768 dimensions. That's where this number comes from, and of course, each embedding model will be different.
Another thing we can look at is the text embedding dictionary. Viewing that, we see the model name appended with the column that we're embedding, and then a NumPy array with all the numbers for each text embedding. If we look at this one specifically, we see it's a NumPy array, and checking its shape, it's 83 by 384. Notice that different embedding models have different embedding dimensions: this one, the small model, produces 384-dimensional embeddings, while the other one produces 768. Going back to the time function:
Automating and Evaluating Multiple Configurations
Creating and Testing Various Configuration Options
This is really handy when it comes to doing these experiments, because it automatically spits out the time it took to run the line of code. That's helpful because it gives us a rough idea of the computational cost of each of these configurations. We can see that generating embeddings for the transcripts tends to take longer than for just the titles, with this case being an exception (maybe there's some kind of startup cost from running the first one), and that the models tend to have different costs associated with them.
The reason is that they actually get bigger and bigger. Another thing I'll share is that if we go to the Sentence Transformers documentation, they have a handful of pre-trained models. Let's see: all-MiniLM-L6 is one we're using, and it's actually the smallest one at 80 MB, while the largest one we're using, multi-qa-mpnet, is more than five times as large at 420 MB. These are all important things to take into consideration: not just the performance of the solution, but also the computational cost associated with it, because that plays a role as well.
Going back to this code: it might be difficult to read or seem a little complicated, because we have these nested for loops and we don't immediately see the model names and column names; they're stored in lists. Some may have the inclination to hardcode all of these things: for example, taking the line of code that defines the model name and the line that generates the embedding array, copy-pasting them, tweaking them, repeating that for the next combination, and so on.
In a sense, that might seem simpler, but when it comes to doing experimentation across multiple potential solutions, it's an absolute nightmare. Say you take this to your team, or you read an article about how great some other model is, and you want to go back and change your code. It's a lot to keep track of, because now you've got to change it here, and then maybe two cells down you use the model name again, so you've got to keep track of that too. And if you're copy-pasting inputs like this, you're bound to make a typo, which will cause issues down the line.
That is the number one reason why I cannot recommend enough writing your code something like this: have one place where you define all the different options you want to play with, and then just let the code run its magic below and print out all the results you need to see.
Handling Different Embedding Models and Metrics
Manually going in and tweaking code blocks is inevitably going to lead to errors. This is something I learned the hard way in grad school: I would train a model, present it to the research group, and they'd say, "Oh, that's amazing, but what if you tweaked this, and what if you tried that?" So I'd go back, but my code wasn't written like this, so there was a lot of manual tweaking, I would mess things up, and things would stop running. Then I would finally get it working, take it back to the group, and they would come up with some other suggestions.
So writing it this way allows you to iterate much faster and helps you avoid a lot of headaches. That was a bit of a lecture, but it's super important. The next block of code does basically the same thing, but instead of embedding the titles and transcripts for each Website article, it embeds each of the queries in the evaluation dataset. This code is a bit simpler,
since we don't have to iterate through the column names, but otherwise it's exactly the same. Then we move on to evaluating the different search methods. Here, I define a handful of functions that we can skip for now; I'll come back to them as we encounter them in the code. I'm doing a similar thing as before: listing all the different ways
we can evaluate the similarity between the query and a particular article. I list three different distance metrics from scikit-learn and two different similarity metrics from the Sentence Transformers library. We're going to evaluate all possible combinations of model, column to embed, and distance metric or similarity score; again, that's 45 different combinations. Even if you could have hardcoded the earlier six combinations, do not hardcode 45 different configurations; just write the for loops. It's a similar situation here: we're going to loop through the models.
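The triple loop over models, columns, and metrics might look roughly like the sketch below. I'm assuming the distance objects come from scikit-learn's DistanceMetric and that the query embeddings are stored in a dict keyed by model name; the similarity scores from Sentence Transformers would need an extra branch, which I omit here:

```python
dist_names = ["euclidean", "manhattan", "chebyshev"]  # example scikit-learn metric names

eval_results = {}
for model_name in model_names:
    # (64, embedding_dim) array of query embeddings for this model
    query_embedding_arr = query_embedding_dict[model_name]
    for column_name in column_names:
        # (83, embedding_dim) array of article embeddings for this (model, column) pair
        embedding_arr = text_embedding_dict[f"{model_name}_{column_name}"]
        for dist_name in dist_names:
            dist = DistanceMetric.get_metric(dist_name)
            # Pairwise distances: rows = articles, columns = queries -> shape (83, 64)
            dist_arr = dist.pairwise(embedding_arr, query_embedding_arr)
            eval_results[f"{model_name}_{column_name}_{dist_name}"] = dist_arr
```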
Results and Analysis of Semantic Search Experiments
Comparing Performance Across Configurations
Here, I'm grabbing the text embeddings for all 64 queries in the evaluation dataset; I stored them all in the query embedding dict. If we look at this, we see it's a NumPy array with a row for each query and a column for each embedding dimension. We're then going to loop through all the text columns and pull the text embeddings for each particular column.
First, we start with the title. This pulls the text embeddings of the titles for every one of the articles. Looking at that, it's also a NumPy array, but the number of rows is 83, because I have 83 articles. Then, finally, we have a third for loop, because we loop through each of the distance metrics. This gives us a dist object, which we can use to compute pairwise distances between all the articles and all the queries. The final result is an array of distances.
We can look at the shape and notice that there are 83 rows, corresponding to the 83 articles, and 64 columns, corresponding to the 64 queries in the evaluation dataset. Each element of this array is the distance between the i-th article and the j-th query. For example, the very first element is the distance between the first article in our article index and the first query in our evaluation dataset.
We're going to use the argsort function from NumPy to sort each of the columns. If we go back to the distance array, we have 83 rows and 64 columns, so if we sort each column, we rank the articles from smallest to largest distance for each of the 64 queries. Since it's argsort, instead of returning the ordered values themselves, it returns the indices of the values in ascending order.
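In code, that ranking step is just a call to np.argsort along the article axis; a minimal sketch:

```python
# Sort each column (one column per query): article indices ordered from smallest to largest distance.
sorted_article_indices = np.argsort(dist_arr, axis=0)  # shape: (83, 64)

# The top search result for query j is the article at sorted_article_indices[0, j].
top_hits = sorted_article_indices[0, :]
```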
Interpreting Evaluation Metrics and Rankings
Next, I define a method name. This is essentially like what we did before, where we had a unique name for each combination of model and column, but here we combine the model name, the column name, and the distance name, so each of the 45 configurations of this search tool has a unique name. Then I use a function I defined called evaluate_true_rankings, which evaluates the ranking of the ground truth. In other words, for a given query we have 83 possible articles to return, but only one ground truth in the evaluation dataset. What this function does is return the ranking of the ground truth for each of the 64 queries. That function is defined here.
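I haven't reproduced the repo's evaluate_true_rankings function here, but a function with that behavior might look something like this sketch (my reconstruction, not the author's exact code):

```python
def evaluate_true_rankings(sorted_article_indices: np.ndarray,
                           ground_truth_indices: np.ndarray) -> np.ndarray:
    """For each query, return the rank (1 = best) of its ground-truth article.

    sorted_article_indices: (num_articles, num_queries) output of np.argsort on the distance array.
    ground_truth_indices:   (num_queries,) index of the correct article for each query.
    """
    num_queries = sorted_article_indices.shape[1]
    ranks = np.empty(num_queries, dtype=int)
    for j in range(num_queries):
        # Position of the ground-truth article in this query's sorted list (0-based), +1 for a 1-based rank.
        ranks[j] = int(np.where(sorted_article_indices[:, j] == ground_truth_indices[j])[0][0]) + 1
    return ranks

# Lower ranks are better; e.g. the mean rank across the 64 queries can be used to compare configurations.
```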
Next Steps: Deploying Your Machine Learning Solution
Moving from Experimentation to Production
Deploying Containerized ML Solutions with APIs on AWS
This brings me to the next article in this series, where I'll talk about what I call phase three of any machine learning project: deploying our ML solution into the real world. In the next article, I'm going to walk through three main things.
First, developing a real API, not just a pseudo API, that can access this search function. Second, containerizing the search function and its API to make that functionality much more portable. And finally, deploying that container of code onto AWS.