Learn how to implement an ETL process for machine learning projects using the YouTube API. Discover tips on data extraction, transformation, transcript retrieval, and creating text embeddings for semantic search systems.
Table of Contents:
- Introduction to Extract, Transform, Load (ETL) in Machine Learning Projects
- What is ETL and Why It’s Important for Machine Learning?
- Overview of the YouTube API and Its Role in Data Extraction
- Getting Started with Data Extraction Using YouTube API
- How to Set Up and Use the YouTube API for Data Retrieval
- Key Parameters in YouTube API Calls Explained
- Handling API Pagination and Data Extraction in Python
- Understanding Page Tokens and API Pagination
- Extracting Multiple Pages of Data Efficiently
- Processing and Storing YouTube Data for Analysis
- How to Extract Article Metadata Using Python
- Storing Extracted Data in Python Data Structures
- Extracting Transcripts from YouTube Videos Using Python
- Using YouTube Transcript API for Transcript Extraction
- Handling Missing Transcripts and Non-Speech Videos
- Transforming Extracted Data for Machine Learning
- Performing Data Quality Checks: Removing Duplicates and Ensuring Consistency
- Converting Data Types and Handling Special Characters
- Saving Transformed Data for Future Use
- How to Save Data Efficiently Using Parquet Files
- Why Parquet Files are Better Than CSV for Data Storage
- Introduction to Text Embeddings for YouTube Video Analysis
- What are Text Embeddings and Their Importance in ML?
- Preparing Data for Text Embedding Models
- Next Steps: Building a Semantic Search System for YouTube Content
- Selecting the Right Embedding Models for Your Use Case
- How to Implement a Search Function Over YouTube Videos
- Conclusion and Further Learning
- Summary of ETL Process in Machine Learning Projects
- Additional Resources for Learning More About ETL and ML
Introduction to Extract, Transform, Load (ETL) in Machine Learning Projects
What is ETL and Why It’s Important for Machine Learning?
When you think of machine learning, fancy algorithms and techniques probably come to mind. While these things definitely play a role, the most important part of an ML solution is the data used to develop it. In this article, I will discuss the most critical data engineering skills for building ML solutions end to end and walk through a concrete example with Python code.
If you're new here, welcome. I'm Icoversai. I make articles about data science and entrepreneurship, and if you enjoy this content, please consider subscribing; that's a great no-cost way to support me and all the content that I make.
Overview of the YouTube API and Its Role in Data Extraction
Data engineering is all about making data readily available for analytics and ML use cases. At a high level, we have raw data that comes from the real world, and data engineering takes that data and makes it use-case ready, so that downstream users can create spreadsheets from it, build dashboards, or even train machine learning models.
While data engineering consists of a wide range of skills, such as data modeling, creating schemas for databases, and even managing distributed systems for large-scale data sets, things look simpler in the context of building machine learning projects end to end.
Getting Started with Data Extraction Using YouTube API
How to Set Up and Use the YouTube API for Data Retrieval
In that context, it comes down to one key skill: building data pipelines. Basically, a data pipeline gets data from point A to point B. So if this is point A and this is point B, then the data pipeline is what connects the two. One necessary component of a data pipeline is an extraction process: the data needs to be pulled from the source in some way. Another necessary component is a loading process: the data needs to be loaded into the final destination, also called a data sink.
So the simplest data pipeline would just consist of extracting data from point A and directly loading it to point B. However, for a lot of machine learning applications, it makes sense to have a step in between the extraction and the loading processes that transforms the data in some way. These three steps of extract, transform, and load are the key elements of any data pipeline these days.
There are two prevailing paradigms for combining these three elements: ETL, which stands for extract, transform, and load, and which is what we just saw; and ELT, which stands for extract, load, and transform, and which I'll talk about in a minute.
Key Parameters in YouTube API Calls Explained
Starting with ETL: data are extracted from a source, transformed in some way, and then loaded, typically into a database or a data warehouse. This makes it easy for a business analyst or a data scientist to query the data for their particular use case. Although this is great, because when the data are loaded they're basically ready to go, this type of data pipeline has two key limitations.
One, ETL pipelines typically work best for data that can be represented in rows and columns. For example, if you're working with image data or text data, it needs to be reconfigured or processed in some way before it can be stored in a database. The other limitation is that ETL processes can only serve a relatively narrow set of use cases, because any transformation you do on a data set will naturally constrain the types of use cases it can support.
This is why the second type of pipeline, ELT, is becoming more popular. In this type of pipeline, the data are extracted and loaded with minimal adjustments, typically into a data warehouse or data lake, and then any transformations are done on a use-case basis.
Handling API Pagination and Data Extraction in Python
Understanding Page Tokens and API Pagination
The upside of this is that ELT processes can support all data formats. This includes tabular data, as in the ETL process, but also text documents, PDFs, and images, all of which can be stored in a data lake. Another upside is that minimal adjustments are made to the raw data, which gives more flexibility in the types of transformations that can be done for downstream tasks and, therefore, more use-case flexibility for ELT pipelines.
While ELT is becoming more and more common across enterprises, when we're talking about full-stack data science and building machine learning projects end to end, the ETL pipeline may be the natural choice for most projects. So let's go deeper into each key component of a data pipeline, starting with the extract process, which is acquiring data from its source.
These days, most data sources for most businesses are managed by third parties. This is great because it allows us to access data through APIs. An API is an application programming interface; it's essentially a way to interact with an application using code.
Extracting Multiple Pages of Data Efficiently
What this might look like is that we have an e-commerce business that uses HubSpot as their CRM, Shopify to host their website and make their sales, and then they promote their business on Facebook and Instagram. If we wanted to pull data from all these different data sources, we could use the respective APIs: HubSpot's API, Shopify's API, and Meta's APIs. This would allow us to get the lead data, sales data, and social media data to build some kind of customer loyalty model, build a dashboard, or anything like that.
However, there are situations where the data we want to access aren't available via APIs. In these situations, we will have to develop custom extraction processes. A few examples: scraping public web pages (of course, you've got to be careful with this; you don't want to break any copyright laws or the terms of use for a platform you're using),
pulling documents from a file system, whether internal or external, or collecting sensor data. This last one is something I did a lot in grad school; we were using environmental and physiological sensors and had to write a lot of code to pull that data.
Processing and Storing YouTube Data for Analysis
How to Extract Article Metadata Using Python
Next, we have the transform step, which is about translating data into a useful form. A big part of this is that when we extract data from a raw source, it is often either semi-structured or unstructured. Semi-structured data are things like text in a JSON format or a CSV file. JSON is very popular because, when you're working with APIs, the responses you receive will often be in JSON format. Or you may be working with unstructured data: say you want to build or fine-tune a language model on company documents; this can involve extracting text from docx files or PDF files. Translating semi-structured and unstructured data into a structured form, basically something you can put into a database, is a major part of the transformation step.
However, transformation involves much more than just structuring unstructured and semi-structured data. It can mean managing data types and ranges of specific variables; it can also mean de-duplicating your data and imputing missing values.
Another common task is handling special characters and special values, and finally feature engineering, that is, preparing your data so you can feed it into a machine learning model.
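As a tiny illustration of that structuring step (the JSON layout and field names here are made up for illustration, not taken from a real API), flattening a semi-structured response into rows and columns might look like this:

```python
# Flatten semi-structured JSON (like an API response) into a structured table.
import json
import polars as pl

raw = '[{"id": "a1", "snippet": {"title": "Intro", "publishedAt": "2024-01-01T00:00:00Z"}}]'

records = [
    {
        "video_id": item["id"],
        "title": item["snippet"]["title"],
        "datetime": item["snippet"]["publishedAt"],
    }
    for item in json.loads(raw)
]

df = pl.DataFrame(records)  # structured form, ready for a database or a Parquet file
```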
Storing Extracted Data in Python Data Structures
Finally, we have the load step, which is about making data available for machine learning training or inference. While the details of the load process will depend on the specific use case, here I'm going to walk through a handful of storage solutions and when it might make sense to use each of them. The simplest way to load your data is to save it in the same directory as your machine learning project.
Loosely speaking, this is appropriate when the data are of a megabyte scale, maybe come from a few sources, and only have the one use case of the specific project you're working on. Another option is to use a cloud storage solution such as S3, Google Drive, Dropbox, or OneDrive; the list goes on. These are great because they give you a very easy way to load and access your data and often come built in with redundancy and version tracking.
However, as your data and your use cases start to scale, simple storage solutions might start to fall apart. In those cases, you may want to start using a database. Popular database solutions are MySQL and PostgreSQL, but as your data and your organization scale, at some point you may want to transition to a data warehouse, which is a more modern and scalable storage solution.
Extracting Transcripts from YouTube Videos Using Python
Using YouTube Transcript API for Transcript Extraction
The key difference between data warehouses and databases is that data warehouses have distributed infrastructure, so they can really scale to the moon. Here I put terabyte scale, but most data warehouses can support even petabyte-scale data. And finally, if you have an overwhelming amount of data to manage and essentially endless use cases, you may want to consider a data lake, which uses the more modern ELT pipeline I described earlier.
Once we've defined our extract, transform, and load processes, the next step is to bring it all together. While simple data pipelines can live in a couple of Python scripts, as your pipelines become more sophisticated, this raises the need for orchestration tools.
The main idea behind an orchestration tool is to represent your data pipeline as a directed acyclic graph, or DAG for short. All that means is that we represent our tasks, like extract, transform, and load, as so-called nodes, and then we connect the tasks together with arrows, which represent the dependencies. So a very simple data pipeline might look like this: you have some trigger, maybe going off every day at midnight, and when it fires, an extraction process is kicked off.
Handling Missing Transcripts and Non-Speech Videos
The data will then be transformed in some way, and finally it will be loaded somewhere. While traditionally you may have used a combination of Python and command-line tools, these days there are a lot of tools for orchestrating data pipelines, such as Airflow (a very popular one), Dagster, Mage, and many more; a minimal example of such a DAG is sketched below.
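For concreteness, here is a minimal, hypothetical sketch of that daily trigger-extract-transform-load pipeline expressed as an Airflow DAG. The task functions are placeholders rather than the actual pipeline from this article:

```python
# A minimal, hypothetical Airflow DAG: a daily trigger runs extract -> transform -> load.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # placeholder: pull raw data from the source


def transform():
    ...  # placeholder: clean and reshape the data


def load():
    ...  # placeholder: write the data to its destination


with DAG(
    dag_id="example_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # the "trigger": run every day at midnight
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # the arrows define the dependencies between the nodes of the DAG
    extract_task >> transform_task >> load_task
```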
Another key part of orchestrating these data pipelines is observability. This basically gives you visibility into your pipelines, because these are often automated processes happening on their own, with no humans involved. Being able to monitor the performance and status of your data pipelines is very important, as is sending automated alerts if things start to go wrong. I believe all of these tools come built in with some observability features, so this has really never been easier to implement.
So with a basic understanding of data pipelines, let's walk through a concrete example. Here I'm going to build upon the case study from the previous article and walk through building a data pipeline to get video transcripts from all of my YouTube videos. To do this, we'll start by importing some libraries. The requests library allows us to make API calls, and the json library allows us to work with text in JSON format.
Transforming Extracted Data for Machine Learning
Performing Data Quality Checks: Removing Duplicates and Ensuring Consistency
Polars is basically a faster version of pandas. The next line imports my YouTube API key from an external file, and then we import the youtube-transcript-api library, which allows us to grab transcripts programmatically; the imports might look like the sketch below. With that in place, we're going to start with the extract process.
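Here is what those imports might look like (the name of the module holding the API key is a placeholder, not necessarily what the original project uses):

```python
# Libraries for the pipeline described in this article.
import requests                                           # make HTTP calls to the YouTube Data API
import json                                               # parse JSON text from API responses
import polars as pl                                       # fast DataFrame library, similar to pandas
from youtube_transcript_api import YouTubeTranscriptApi   # grab transcripts programmatically

from my_config import my_key                              # hypothetical local file exposing the API key
```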
First, I define my channel ID, then the URL for YouTube's search API, which is simply https://www.googleapis.com/youtube/v3/search. Next, I initialize a page token (we'll see why this is important shortly), and then I initialize a list to store data for each of the videos. The extraction loop is a pretty dense piece of code, but I'll explain it line by line; let's ignore the while loop for now.
The first step is to define the parameters for the API call, which we load into a dictionary. The things we need to send to the API to get a response are the YouTube API key, the channel ID, and what sort of search results we want to receive: 'snippet' includes a bunch of data about a particular video, and 'id', I believe, is just the video ID. 'order' is how we want the results ordered, so I ordered by date.
Converting Data Types and Handling Special Characters
The most recent video will be first. Next, we have maxResults, which sets the maximum number of search results to receive back, and finally we have the page token. You can think about making this API call as if we went to the YouTube website and made a search through the user interface.
What's different here is that instead of using the UI, we're doing the search programmatically. Just like when you do a search manually, you get a certain number of results, and if you want to see more you have to scroll down or go to the next page. That's exactly what's happening here, which is why we need to define the page token: any set of search results will be split across multiple pages, and you'll need to look at each page one by one to get all of the results.
If you're wondering why we can't just have all the search results on one page so we don't have to go through page by page, the reason is that the maximum number of results per page allowed by YouTube's API is 50. This is also why I initialized the page token to None: when we first run the search we don't know what the page token will be, so it defaults to the very first page, and then we can loop through each page one by one in the while loop.
Saving Transformed Data for Future Use
How to Save Data Efficiently Using Parquet Files
Once we've defined these parameters, we can make the GET request using the requests library; all we need to do is specify the URL and the dictionary of parameters. Then I have a function, get_video_records, which takes the API response and creates a list of dictionaries, where each item in the list corresponds to a specific video. I add this list to the existing video record list, and then I try to grab the next page token from the response: the response text contains the text of the API response, and there is a nextPageToken item we can extract from it.
However, if a next page doesn't exist, meaning we've reached the last page of search results, this line of code will fail, so execution jumps to the except block and the page token is set to zero. Going back up to the while loop: if we got a page token, we go through the loop again; if we didn't, the while loop ends and we move on to the next step. Put together, the loop might look something like the sketch below.
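Here is a sketch of that extraction loop, under the assumptions already noted (a placeholder channel ID, a helper named get_video_records that is described in the next section, and the parameter names of the YouTube Data API v3 search endpoint):

```python
# Paginated extraction loop over the YouTube search endpoint.
channel_id = "UC..."                                   # placeholder channel ID
url = "https://www.googleapis.com/youtube/v3/search"   # YouTube Data API v3 search endpoint
page_token = None                                      # None -> start with the first page
video_record_list = []

while page_token != 0:
    params = {
        "key": my_key,           # API key imported from the local config file
        "channelId": channel_id,
        "part": "snippet,id",    # snippet = video metadata, id = video ID
        "order": "date",         # most recent videos first
        "maxResults": 50,        # 50 is the API's per-page maximum
        "pageToken": page_token,
    }
    response = requests.get(url, params=params)

    # parse this page of results into a list of dicts (helper defined below)
    video_record_list += get_video_records(response)

    # grab the token for the next page; if there isn't one, end the loop
    try:
        page_token = json.loads(response.text)["nextPageToken"]
    except KeyError:
        page_token = 0
```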
The next thing is get_video_records, which is a user-defined function. This is a lot of what extraction is all about: you have raw text from an API, and you've got to pull out the information you actually care about. That's what I'm doing in this function.
Why Parquet Files are Better Than CSV for Data Storage
It takes in the API response and initializes a list so we can store data for each video. Walking through it, I loop over each item of the API response. Since our search returns anything to do with my channel, it will include videos, Shorts, and community posts, and since we don't really care about community posts,
we want to make sure we skip the extraction for those items. Unfortunately, the 'youtube#video' kind is the same for both regular videos and Shorts, so you'll need to find another way to separate out Shorts if that's something you want to do. Then we can just extract the relevant data: I initialize a dictionary and add information to it element by element, starting with the video ID.
Then come the date and time the video was posted and the title of the video. We append the dictionary to the list we initialized earlier and keep going through every item in the API response; a sketch of the function is shown below.
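Here is a sketch of that helper, assuming the JSON layout of the YouTube search endpoint (items with id and snippet objects); the function name follows the description above rather than the author's exact code:

```python
def get_video_records(response) -> list[dict]:
    """Parse one page of search results into a list of video metadata dicts."""
    video_record_list = []

    for raw_item in json.loads(response.text)["items"]:
        # skip anything that isn't a video or a Short (e.g. community posts)
        if raw_item["id"]["kind"] != "youtube#video":
            continue

        video_record = {
            "video_id": raw_item["id"]["videoId"],
            "datetime": raw_item["snippet"]["publishedAt"],
            "title": raw_item["snippet"]["title"],
        }
        video_record_list.append(video_record)

    return video_record_list
```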
So all we've done through this whole process is extract video IDs, the date each video was posted, and its title; we haven't actually gotten the transcripts yet. In order to do that, we've got to keep going. What I do is take the video record list and store it in a Polars DataFrame, which just makes it a little easier to access all the information.
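That step is a one-liner (shown here as a sketch):

```python
# Store the extracted records in a Polars DataFrame
df = pl.DataFrame(video_record_list)
print(df.head())
```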
Introduction to Text Embeddings for YouTube Video Analysis
What are Text Embeddings and Their Importance in ML?
We'll have the video ID as a column, the date and time as a column, and the title as a column. To grab the transcripts, we need the video ID: we loop through each of the video IDs and extract the transcript for each video. Again, I initialize a list to store the text of each transcript, and I define an index i for each row in the DataFrame. Then I try to extract the transcript using the youtube-transcript-api.
This library makes it super easy to grab the transcript of any video: all I have to do is provide the video ID and it returns the transcript. The transcript comes back, I believe, as a list of dictionaries, so I wrote another user-defined function to extract just the text from it, and I set the result equal to transcript_text.
However, if the library is not able to extract a transcript, this chunk of code will fail, and we'll just set the transcript text to 'n/a'. This happens when there's no speech in a video, so basically when there's no talking in it,
and I have a few videos like that. Finally, we append the transcript text to the transcript text list; a sketch of the loop is shown below. We can then double-click into the extract_text function to see what it looks like, and it's actually super simple: the transcript comes in a list format, and extract_text maps it to a string.
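Here is a sketch of that loop, assuming the classic YouTubeTranscriptApi.get_transcript interface, which returns a list of dictionaries with a 'text' field (df is the Polars DataFrame of video metadata built above):

```python
transcript_text_list = []

for i in range(len(df)):
    try:
        # fetch the transcript for the i-th video ID in the DataFrame
        transcript = YouTubeTranscriptApi.get_transcript(df["video_id"][i])
        transcript_text = extract_text(transcript)  # helper defined below
    except Exception:
        # no captions available (e.g. videos with no talking in them)
        transcript_text = "n/a"

    transcript_text_list.append(transcript_text)
```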
Preparing Data for Text Embedding Models
I do everything in one line of code: I take each element of the transcript list and extract the text from it. Each element of the list is a dictionary where one of the keys is 'text'; you can think of it as each line of the transcript being stored in a dictionary, with 'text' as one of its fields. So we're just extracting every single line and joining it all into one massive piece of text, roughly as in the sketch below.
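A sketch of that helper (the separator used when joining is an assumption):

```python
def extract_text(transcript: list[dict]) -> str:
    """Join the 'text' field of every transcript segment into one string."""
    return " ".join([segment["text"] for segment in transcript])
```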
Next, we can add the text from all the transcripts to our DataFrame as a new column (see the sketch below), and then we have a transcript column containing the text of each video. That was the extract process; it was definitely a lot of coding, but now we can move on to transforming.
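One way to attach that column in Polars (the column name is an assumption):

```python
# Attach the transcripts as a new column on the DataFrame
df = df.with_columns(pl.Series(name="transcript", values=transcript_text_list))
print(df.head())
```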
Transforming will involve a little bit of exploratory data analysis, or EDA for short. There are a few things I do for this specific use case. The first is checking for duplicate values: I want to make sure there are no identical rows, because we don't want repeated videos, and I also want to make sure there are no duplicates at the column level.
Beyond just transforming data into a usable form, this is simply a good thing to check to ensure data quality. The way I do it is to print the shape of the DataFrame, the number of unique rows, and the number of unique elements in each column. We see that the DataFrame has 84 rows and 4 columns, and 84 unique rows, which is good: it means there are no repeated rows.
We also have 84 unique video IDs, datetimes, and titles, which is good, because there shouldn't be two videos with the same ID, I've never posted two videos in one day, and no two videos should have the same title. When we get to the transcript column, though, we see there are only 82 unique transcripts. This is actually expected, because I have three videos that don't have any talking in them.
For those videos, I set the transcript to 'n/a' in the script earlier. Another thing to check is the data types. Looking at our DataFrame, we see everything is a string. While that is appropriate for the video ID, title, and transcript, it's not appropriate for the datetime, which should be stored as a datetime data type. This is super easy in Polars, just one line of code: we cast the datetime column to a Polars datetime type, and when we print the head again, we can see that the format of the column has changed. A sketch of these checks is shown below.
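A sketch of the quality checks and the type fix described above (column names follow the text; the exact casting call is an assumption):

```python
# Duplicate checks
print("shape:", df.shape)             # e.g. (84, 4)
print("unique rows:", df.n_unique())  # number of fully unique rows

# Number of unique values per column
for col in df.columns:
    print(col, df[col].n_unique())

# Cast the datetime column from string to a proper datetime type
df = df.with_columns(pl.col("datetime").str.to_datetime())
print(df.head())
```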
Next Steps: Building a Semantic Search System for YouTube Content
Selecting the Right Embedding Models for Your Use Case
The last thing I'll do for the transform step is handle special characters. This took a bit of manual skimming, but after a few minutes I found a few character strings that weren't correct. Specifically, some of the apostrophes and some of the ampersands were represented as HTML-escaped strings of text. And finally, there's something that isn't really a special character but more of a quality-control issue:
the automatic captions don't know who I am, so whenever I say 'Icoversai' it comes out misspelled. I just corrected the spelling, and then we can loop through each of these special strings and replace them in the title and transcript columns of the DataFrame, roughly as in the sketch below.
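Here is a sketch of that cleanup, followed by the save step that the next section builds on. The exact strings to replace were found by manually skimming the data, so the ones below are illustrative placeholders, and the output path is an assumption:

```python
# Illustrative special-string fixes (placeholders, not the exact strings found)
special_strings = ["&#39;", "&amp;", "i coversai"]
special_string_replacements = ["'", "&", "Icoversai"]

for raw, fixed in zip(special_strings, special_string_replacements):
    df = df.with_columns(
        pl.col("title").str.replace_all(raw, fixed, literal=True),
        pl.col("transcript").str.replace_all(raw, fixed, literal=True),
    )

# Finally, save the transformed data as a Parquet file for the next stage
df.write_parquet("data/video-transcripts.parquet")
```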
How to Implement a Search Function Over YouTube Videos
Conclusion and Further Learning
Summary of ETL Process in Machine Learning Projects
So in the next article of the series, we'll move on to what I call phase two of a machine learning project. This will consist of taking the Parquet files that we created in this example and generating text embeddings from them. The reason we want to do this, as you saw in the previous articles of this series, is that we can use these text embeddings to create a semantic search system over my YouTube videos. This will require a bit of iteration and experimentation to see which embedding models provide the best performance for this specific use case.
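Purely as a preview sketch (the model choice, file path, and column name are assumptions for illustration, not the setup the series settles on), generating embeddings from the saved Parquet file with the sentence-transformers library might look like this:

```python
import polars as pl
from sentence_transformers import SentenceTransformer

df = pl.read_parquet("data/video-transcripts.parquet")  # data saved in this article

model = SentenceTransformer("all-MiniLM-L6-v2")          # a small general-purpose model
embeddings = model.encode(df["transcript"].to_list())    # one vector per video

print(embeddings.shape)  # (number of videos, embedding dimension)
```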
Additional Resources for Learning More About ETL and ML
Once we've picked out an embedding model, we can create the search function. This whole process is going to be the focus of the next article, and that brings us to the end.
I hope you got some value out of this content. If you enjoyed it and want to learn more, check out my Medium blog linked in the description below. As always, thank you so much for your time, and thanks for reading.