Automating Data Pipelines with Python & GitHub Actions - icoversai

Learn how to automate data pipelines using GitHub Actions in this detailed guide. From setting up ETL processes to integrating workflows with GitHub, discover step-by-step instructions and best practices to streamline your data science projects efficiently.

Table of Contents:

  • Introduction to Automating Data Pipelines in Data Science
    • Overview of Full Stack Data Science Series
  • Why Automate? A Story from Grad School
    • The Grad School Ritual: Research and Relaxation
    • The Satisfaction of Automated Code
  • Two Main Approaches to Automating Data Pipelines
    • Using Orchestration Tools
    • Python Scripts with Triggers
  • Exploring Orchestration Tools for Automation
    • What is Airflow and Why Use It?
    • Alternatives to Airflow: Prefect, Dagster, Mage, and More
  • Python Scripts and Triggers: A Simpler Approach
    • Building a Basic ETL Pipeline with Python
    • How to Automate ETL Pipelines with Cron Jobs
  • Harnessing GitHub Actions for Workflow Automation
    • Introduction to GitHub Actions and CI/CD
    • Advantages of Using GitHub Actions for Automation
  • Step-by-Step Guide: Automating an ETL Pipeline with GitHub Actions
    • Setting Up Your ETL Python Script
    • Creating and Configuring Your GitHub Repository
    • Writing the YAML Workflow File for GitHub Actions
  • Integrating Secrets and Tokens for Secure Automation
    • Adding Repository Secrets for GitHub Actions
    • Creating and Managing Personal Access Tokens
  • Testing and Troubleshooting Your Automated Workflow
    • Running Your GitHub Action Workflow
    • Handling Common Issues and Errors
  • Final Integration: Connecting Your Data Pipeline to Machine Learning Applications
    • Deploying the Automated Pipeline to Google Cloud
    • Example Application: Semantic Search on Hugging Face
  • Conclusion: The Benefits of Automating Your Data Science Workflows
    • Recap of Key Points

Introduction to Automating Data Pipelines in Data Science

Overview of Full Stack Data Science Series

This is the sixth article in the larger series on full-stack data science. In this article, I'm going to talk about automating data pipelines. I'll start with a high-level overview of how we can do that and then dive into a concrete example using GitHub Actions, which gives us a free way to automate workflows in our GitHub repos.

If you're new here, welcome. I'm icoversai. I make articles about data science and entrepreneurship, and if you enjoy this content, please consider subscribing; that's a great no-cost way you can support me in all the articles that I make.

Why Automate? A Story from Grad School

The Grad School Ritual: Research and Relaxation

I'll start with a story about a friend of mine back in grad school. Just about every Friday, to cope with our existence as physics grad students, I and many other grad students would find ourselves at a bar close to campus. There was one grad student in particular who would show up after doing research for several hours, and over a beer he'd say something like, "Technically, I'm working right now, because my code is running."

The Satisfaction of Automated Code

Of course, we would always laugh about it, but this is a sentiment I find a lot of data scientists and other developers share: a sense of satisfaction when some piece of software they wrote is off and running all on its own while they are off doing something else they enjoy, like having a beer with fellow grad students.

Two Main Approaches to Automating Data Pipelines

Using Orchestration Tools

That, at least, is one motivation for automating data pipelines and other aspects of the machine learning pipeline.

Here, I'm going to talk at a high level about two ways we can automate data pipelines. The first way is via an orchestration tool, which is something I mentioned in the previous article of this series on data engineering. That includes tools like Airflow, Dagster, Mage, and many more.

Python Scripts with Triggers

The second way is what I call "Python plus triggers": Python scripts that run when particular criteria are met, such as it being a particular time of day or a file appearing in a directory.

Exploring Orchestration Tools for Automation

What is Airflow and Why Use It?

The first approach is an orchestration tool. I'll start with Airflow because, across the dozens of interviews I've had with data engineers and ML engineers, it seems to have emerged as a standard tool for building data and machine learning pipelines. One of the biggest benefits of Airflow is that it can handle very complicated workflows consisting of hundreds or even thousands of tasks. This is a major reason it has become an industry standard among data engineers managing enterprise-scale data pipelines.

Alternatives to Airflow: Prefect, Dagster, Mage, and More

One downside of Airflow, as someone who was trying to learn it specifically for this series and the project I've been working on, is that it can be pretty complicated to set up and maintain these workflows, and of course that comes with a steep learning curve. These challenges of getting set up with Airflow have created a market for Airflow wrappers such as Prefect, Dagster, Mage, and Astro. The upside of these wrappers is that you can tap into the power of Airflow with a potentially simpler setup and maintenance.

However, one downside of these Airflow wrappers is that their ease of use is often coupled with managed or paid services. To my knowledge, though, most of them have open-source versions if you want to go the self-managed route. Regardless of which orchestration tool you choose, for many machine learning applications with a relatively simple data pipeline these tools may be overkill and may overcomplicate your machine learning system, which motivates the second approach to automating data pipelines: Python plus triggers.

Python Scripts and Triggers: A Simpler Approach

Building a Basic ETL Pipeline with Python

You could say this is the old-fashioned way, because before we had these orchestration tools, if you wanted to build data and machine learning workflows, you had to build them from scratch. To make this a bit more concrete, let's look at a specific example. Say we wanted to implement a very simple ETL pipeline consisting of one data source and one target database. What this might look like is that we pull data from a web source.

We'll run an extraction process via a Python file called extract.py, transform the data in some way in a script called transform.py, and then load it into our database using another script called load.py. This is a simple ETL pipeline, which is something I discussed in the previous article in this series on data engineering. Given these three Python scripts, nothing is stopping us from creating yet another script that consolidates all these steps; we can call that script etl.py.
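As a sketch, assuming each of the three scripts exposes a single entry-point function (the function names here are hypothetical), etl.py could be as simple as:

    # etl.py -- consolidates the three pipeline steps into one script
    # (a sketch: assumes extract.py, transform.py, and load.py each expose
    #  a single entry-point function with these hypothetical names)
    from extract import extract_data
    from transform import transform_data
    from load import load_data

    if __name__ == "__main__":
        raw_data = extract_data()               # pull data from the web source
        clean_data = transform_data(raw_data)   # clean/reshape the data
        load_data(clean_data)                   # write it to the target database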

How to Automate ETL Pipelines with Cron Jobs

Now, instead of running three scripts consecutively, we can just run one. Although this streamlines the process, it is still not automated, because we still have to manually go to the command line and type python etl.py. This is where the idea of a trigger comes in. Let's say we wanted to run this ETL pipeline every single day. One thing we could do is set up a cron job. cron is a command-line tool that allows you to schedule the execution of processes.

Let's say we wanted to run our etl.py file every day at midnight. We would add a crontab entry like the sketch below; it would get saved on our system as a cron job, and every day at midnight the pipeline would run. But of course, if we do this, we'd be running the cron job on our local machine, which may not be something we want to do.
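For reference, a crontab entry for running etl.py every day at midnight could look like this (the project path is a placeholder; you'd add the entry by editing the crontab with crontab -e):

    # minute hour day-of-month month day-of-week  command
    0 0 * * * cd /path/to/project && python etl.py >> etl.log 2>&1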

Some alternatives would be running this process on a server that we manage or spinning up a cloud resource that runs it. But both of these options come with a bit of overhead and maintenance for setting up the compute resources.

Harnessing GitHub Actions for Workflow Automation

Introduction to GitHub Actions and CI/CD

It would be great if we could set up this trigger to execute our ETL pipeline without the extra effort of setting up those compute resources. This is where GitHub Actions is super helpful. GitHub Actions is GitHub's built-in CI/CD platform, where CI/CD stands for continuous integration and continuous delivery. The typical schematic of CI/CD is some kind of infinite loop of integrating new updates into your software system and delivering those updates in real time.

Hearing this, you might be thinking: icoversai, what does continuously deploying code have to do with data pipelines? Well, although data may not play a role in traditional software development, when it comes to building machine learning systems, data plays a central role. Put another way, in machine learning we use data to write our programs, i.e., our algorithms, which are manifested as a machine learning model.

Advantages of Using GitHub Actions for Automation

So if we're talking about continuously integrating and delivering a machine learning application, that will require us to continually integrate and update the data we feed into that system. That brings us to two key benefits of GitHub Actions. First, the compute to run these workflows is provided for free for public repositories. This is great for developers like me who just want to put out example code (like for this article) and for those building proofs of concept or portfolio projects; and of course, there are paid tiers for enterprises and businesses.

The second benefit is that we don't have to worry about setting up compute environments, whether that's an on-premise server or a cloud resource; all the setup happens by writing a YAML file. With that high-level overview, let's walk through a concrete example of automating an ETL pipeline that turns YouTube video transcripts into text embeddings.

Step-by-Step Guide: Automating an ETL Pipeline with GitHub Actions

Setting Up Your ETL Python Script

If you're unfamiliar with text embeddings, I have a whole article where I talk about them, but understanding what they are is not necessary for this example. Here are the steps we're going to walk through:

  1. Create our ETL Python script.
  2. Create a new GitHub repo.
  3. Write our workflow via a YAML file.
  4. Add repository secrets to our GitHub repo. These are necessary so the GitHub Action can automatically push code to the repo and make calls to the YouTube API without exposing my secret API key.
  5. Finally, commit and push all our code to the repo.

Let's start with the first step: creating our ETL Python script. Here I have a fresh Visual Studio Code window. I've created a new folder, and now I'm going to create a new file, which I'll call data_pipeline.py. I've already pre-written the code, so I'll just paste it in and explain it piece by piece.

The first thing to handle is the imports. In the first line, we import a set of functions from another Python script called functions.py, which we'll write in a second, and then we import the time module and the datetime module. The reason we want time and datetime is that they let us print when this Python script is run and how long each of the following steps takes to run.

Toward that end, I first print the time that the pipeline starts running, which we can do using the datetime module. Next, I start adding the different steps of our data pipeline. The first step is the extraction: it pulls the video ID of every single video on my YouTube channel. That whole process is baked into a get video IDs function, which will live in the functions.py script we'll create in a second. This is something I walked through in the previous article of the series on data engineering. Some other fanciness happening here is that I capture the time just before and just after this step runs.

That way, I can print how long it took to run this function, which is helpful for debugging and observability. Once we have the video IDs extracted, we can use a Python library to extract the transcript of each video given its video ID, using a similar process. This whole get video transcripts process is abstracted as another function, and again we capture how long that step took to run.

Next, I have a transform data function, which just does some data cleaning: ensuring the data have the proper types and handling any special character strings. Then finally, in step four, we take the titles and the transcripts from all the YouTube videos and make the text embeddings. With this file in place, we'll go ahead and create another file, the functions.py file. Here we will put all the different steps that we just used in our data pipeline.

There's a lot here, actually more functions than are used in the data pipeline script. I'm not going to walk through all of it, because I've talked about it in previous articles, but of course this code is freely available on GitHub, so you can check that out if you're interested. One thing I will point out is that at each of these key steps in the data pipeline, we're not passing data structures or data frames between steps; the data are actually being saved to a data directory. This is something we'll have to create, so we'll create a folder called data.
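To make the overall structure concrete, here's a minimal sketch of what a data_pipeline.py along these lines might look like. The helper names are assumptions standing in for the functions described above (the real ones live in functions.py), and they read from and write to the data folder rather than returning data frames:

    # data_pipeline.py -- a sketch; the helper names below are assumed
    from functions import get_video_ids, get_video_transcripts, transform_data, create_embeddings
    import time
    import datetime

    print("Pipeline started at:", datetime.datetime.now())

    # Step 1: extract the video IDs of every video on the channel (writes to data/)
    t0 = time.time()
    get_video_ids()
    print("Step 1 (video IDs) took", round(time.time() - t0, 2), "seconds")

    # Step 2: extract the transcript for each video ID (writes to data/)
    t0 = time.time()
    get_video_transcripts()
    print("Step 2 (transcripts) took", round(time.time() - t0, 2), "seconds")

    # Step 3: clean the data (fix types, handle special character strings)
    t0 = time.time()
    transform_data()
    print("Step 3 (transform) took", round(time.time() - t0, 2), "seconds")

    # Step 4: embed the video titles and transcripts
    t0 = time.time()
    create_embeddings()
    print("Step 4 (embeddings) took", round(time.time() - t0, 2), "seconds")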

Creating and Configuring Your GitHub Repository

One last thing you'll notice is that we're importing a ton of libraries here, so I'll go ahead and create another file called requirements.txt, where we can put all the different requirements for this data pipeline: polars, which I'm using to handle all the data frames; youtube-transcript-api, the Python library that allows us to pull a YouTube transcript given a video's ID; the sentence-transformers library to create the embeddings; and the requests library to make API calls to the YouTube API.
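Based on the libraries just mentioned, a minimal requirements.txt might look like this (versions unpinned here; pin them as needed):

    polars
    youtube-transcript-api
    sentence-transformers
    requests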

Now that we've written all the code for our ETL pipeline, let's create a fresh GitHub repo we can push the code to. We can create a GitHub repo from the command line or from the web interface; I'll do it from the web. To do that, you just go to your Repositories tab and click New. I'll give it a name (I'll call it data-pipeline-demo) and a description, something like "demo data pipeline via GitHub Actions." We'll keep it as a public repo.

One benefit of making it a public repo is that GitHub won't charge you for the compute costs of the GitHub Actions. I'll add a README file, I'll also add a .gitignore using the Python template, and then I'll choose a license; I'll just do Apache 2.0. All right, we now have our repo. Next, we'll want to clone it: open a terminal in whatever folder you want to store the repo in, run git clone with the repo's URL, and the newly created repo will be downloaded. We can see we have the license and the README.

Writing the YAML Workflow File for GitHub Actions

Now that we have our repo, we can go ahead and write our YAML file, which will define the workflow that automates the execution of our data pipeline. The next thing we want to do is add all the code we just wrote to this new repo. The code we wrote earlier is stored in that temporary data pipeline demo folder, so I'm just going to copy and paste it over. If we go back to the terminal and hit ls, we see that all those files are here. Then we want to create a new folder called .github, and inside that .github folder we'll create another folder called workflows.

The reason we do this is that GitHub looks in the .github/workflows subdirectory for all the workflows we want to run as GitHub Actions. With that directory created, we can open up our GitHub repo locally (it's called data-pipeline-demo), and we can see all the code is here along with our .github/workflows folder. What we want to do is create a new file in this workflows folder. We'll call it data_pipeline.yml, and here is where we'll define our GitHub Action.

There are a few key elements of a GitHub Action. The first is its name, which we define like this. If you're not familiar with YAML files, they're essentially like Python dictionaries: they consist of key-value pairs that can additionally be nested. The simplest of these elements is the name of the workflow, which I'm just calling "data pipeline workflow."

The next element of the workflow is the trigger, which defines when this workflow is run; it's specified under the on key, and from here we have a few different options. We can make the workflow run any time new code is pushed to the repository by adding push. Another option is to allow the workflow to be triggered manually, using workflow dispatch. We'll see later that setting this option makes a button appear under our repo's Actions tab that allows us to manually run the workflow.

The one we probably care most about is the option that allows us to schedule the workflow as a cron job, which is what we saw in an earlier slide. To make this a cron job, we simply add a schedule entry with a cron expression as its argument. What I want is for this workflow to run every night at 12:35 a.m., and the syntax for that is 35 0 * * *. If you're not familiar with cron syntax, there's a great website called crontab.guru that has a guide for scheduling cron jobs. If you paste in what we wrote in our YAML file, it says the job runs at 00:35, and it even tells you the next time the job would run. Alternatively, we could use all asterisks, which means the job runs every minute.

However, when it comes to GitHub Actions, every 5 minutes is the fastest a scheduled workflow can run, so you would write "every fifth minute" with */5 in the minute field. I tried this, and even if you specify a run every 5 minutes, it still won't run that fast; it runs closer to every 15 minutes. So there are some limitations on how quickly you can run a workflow using GitHub Actions. But here we don't want to do anything crazy, and running it once a day is fine, because I'm only posting new content every week.

So running this workflow every day is more than sufficient. Now that we have the name of our workflow and we've specified when it will run, the next thing is to define the workflow itself. Workflows consist of jobs, which have names; the name of the job is specified like this. We can then specify what system this job will run on. We'll say ubuntu-latest, which just means it runs on the latest version of Ubuntu, a Linux operating system. Jobs, in turn, consist of steps, which we can specify like this.
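Putting those top-level pieces together, the skeleton of data_pipeline.yml might look roughly like this (a sketch: the job name is an assumption, and the individual steps are described in the next section):

    # .github/workflows/data_pipeline.yml -- top-level structure (sketch)
    name: data pipeline workflow

    on:
      push:                   # run whenever new code is pushed to the repo
      workflow_dispatch:      # adds a manual "Run workflow" button in the Actions tab
      schedule:
        - cron: '35 0 * * *'  # every night at 00:35 (GitHub runs schedules in UTC)

    jobs:
      run-data-pipeline:
        runs-on: ubuntu-latest
        steps:
          # first step shown here; the remaining steps are filled in below
          - name: checkout repo content
            uses: actions/checkout@v4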

Integrating Secrets and Tokens for Secure Automation

Adding Repository Secrets for GitHub Actions

Here we're going to have a handful of steps to run this whole workflow. The first step, which we'll call checkout repo content, can make use of a pre-built GitHub Action provided by GitHub, called checkout, version 4. What this step of the workflow does is pull all the code from our GitHub repo. If you're curious about this specific action, you can Google it; there's a repo with a lot more information about it at github.com/actions/checkout. Then I'll do one more thing here.

I'll add a with block and specify a token. What I'm doing here is giving the workflow access to one of the repository secrets, which will give it read and write access. Since it's a public repo, it doesn't need any special permissions to pull the code onto the Ubuntu instance that gets spun up, but later in the workflow we're going to push code to this repo, which does require special access. That's the reason I'm adding this token: to give the action the proper permissions.
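As a sketch, that first step might look something like this in the YAML, where PERSONAL_ACCESS_TOKEN is just an assumed name for the repository secret we'll create shortly:

      - name: checkout repo content
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.PERSONAL_ACCESS_TOKEN }}  # lets later steps push to the repo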

We'll create this token in a second, and I'll show you how to make it a repository secret. Okay, so that's step one of our job: all we did was spin up a fresh Ubuntu instance and pull the code from our repo. The next thing we're going to do is set up Python. I'll call this step of the workflow setup python, and I'll use another pre-built action called setup-python, version 5. Similarly, if you're curious about it, you can just type it into Google and pull up the repository for setup-python.

It'll show you the basic usage and have more information, but it does exactly what it sounds like it does. Then we can add a with block and specify the Python version; here I'll use 3.9. I'll also add an option to cache the libraries we install with pip, so the workflow doesn't have to install these libraries from scratch every single time it runs; it can install them once and then reuse them from the cache.

So in that second step we set up Python. Now we'll add another step to install the dependencies. Here we're going to use a different command, the run key, which is essentially like running a command from the command line. We'll just do pip install -r requirements.txt, just as if we had opened up a terminal and typed it on the command line. The reason we can do that is that we've just installed Python on the machine, so we can now run pip on it. With the libraries installed, we can now run the data pipeline.

I'll define another step and call it run data pipeline. I need to do one thing here, which is to pass in an environment variable holding my YouTube API key. This will be another repository secret, defined later, and it's needed to run the get video IDs function. Once we've done that, we can run our data pipeline: we use the run command again and just type python data_pipeline.py, just as if we were typing it into the command line.
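A sketch of these three steps, with YT_API_KEY as an assumed name for both the environment variable and the repository secret that holds the YouTube API key:

      - name: setup python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
          cache: 'pip'        # reuse pip-installed libraries between runs

      - name: install dependencies
        run: pip install -r requirements.txt

      - name: run data pipeline
        env:
          YT_API_KEY: ${{ secrets.YT_API_KEY }}  # assumed secret/variable name
        run: python data_pipeline.py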

Now we've run the Python script on the machine that just got spun up. What we can do next is check whether any changes were made to the git repository after running the data pipeline. We give this step an ID and then run another command. I'll use the vertical bar syntax, which allows me to run multiple lines of commands, and we can go through them one by one. What's happening here is that first we configure the git account, then we stage all the local changes, and then this line checks the difference between the staged changes and the last commit.

The quiet option sets the exit status of that command to zero if there are no changes and to one if there are changes. If the exit status is one, we create a new variable called changes, set it equal to true, and store it in the GitHub environment. That allows us to do one last step, which we'll call commit and push if changes. Here we can basically use an if statement: if the changes environment variable we just created equals true, we run this command. Again, we use the vertical bar for a multi-line command, and we commit the staged changes and then push them to the repository.
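Those last two steps might look roughly like this (the git user name, email, and commit message are placeholders):

      - name: check for changes
        id: check-changes
        run: |
          git config --global user.name 'github-actions'
          git config --global user.email 'github-actions@users.noreply.github.com'
          git add .
          # --quiet exits with status 0 if nothing changed and 1 if something did
          if ! git diff --staged --quiet; then
            echo "changes=true" >> $GITHUB_ENV
          fi

      - name: commit and push if changes
        if: env.changes == 'true'
        run: |
          git commit -m "updated data"
          git push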

This is where we need the special permissions from the personal access token we referenced earlier in the checkout step. One last thing we need to do is add a placeholder file to the data folder, because git won't track an empty folder, so it would never get pushed to the repo. We can just put a file in there that says something like "data goes here." With that, we've created our workflow in this data_pipeline.yml file.

So we have our data pipeline, all the functions that make it up, a requirements file with all the libraries needed to run the pipeline, and a data folder in which we can store all the data produced by the pipeline. In this example we can get away with storing the data on GitHub because it's super small; it's not going to be more than 100 megabytes or so. But if you're working with data sets much bigger than that, gigabytes or even more, you'll probably want to store the data in a remote data store, and those are changes you would make in the data pipeline itself.

For example, instead of reading and writing to a local directory, you would read and write to an S3 bucket, to Google Drive, or to a database; that will depend on your specific use case. Before we can push all these local changes to our GitHub repository, we need to create two secret variables that are available to the GitHub Actions for this pipeline to run successfully. To create a repository secret, we go to Settings, scroll down to Secrets and variables, and click Actions. We'll see a screen with a Repository secrets section, which allows us to create repository secrets that will be accessible to our GitHub Actions.

Creating and Managing Personal Access Tokens

The first one we want to create is for the personal access token, but we need to actually create the personal access token itself first. To do that, I'll click on my profile, scroll down to Settings and open that in a new tab, then scroll down to Developer settings, click on Personal access tokens, and then click on Tokens (classic). You can see that I've already created some personal access tokens.

What I'll do is create a new one. It asks for a note, so I'll call it something like data pipeline demo PAT. We'll leave the expiration at 30 days, and then we can select the scope. We only need the repo scope here, because all we want is for our GitHub Action to be able to push code to our public repo; that gives it read and write access, and we can leave everything else unchecked. Then we just hit Generate token and the token will appear. You shouldn't share this token with anyone; I'm sharing it here because I'm going to delete it right after this demo. We can copy it, come back over to the repository secrets page, and paste it into our secret.

Testing and Troubleshooting Your Automated Workflow

Running Your GitHub Action Workflow

Now this personal access token will be accessible as an environment variable to our workflow. We just hit Add secret, and then I'll do a similar thing for my YouTube API key, which I will not share. If you're adding your own YouTube API key, or any kind of API key, it's important to paste the raw string and not put quotation marks around it. I'll add that and hit Add secret, and now we have two repository secrets.

These are the personal access token and the YouTube API key. Now, with the secrets in place, we can commit and push all our changes to the GitHub repo and watch the workflow run. We'll stage our local changes, commit them, and then push them to the repo. But before we do that, let's add one thing to our .gitignore file: if you're on a Mac, you'll notice that your GitHub repos always pick up this .DS_Store file. I don't want that, so I'm going to include it in the .gitignore, and then we'll run git push.
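From the root of the local repo, that sequence of commands looks something like this (the commit message is just an example):

    echo ".DS_Store" >> .gitignore   # keep macOS metadata files out of the repo
    git add .
    git commit -m "add ETL pipeline and GitHub Actions workflow"
    git push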

Now we've pushed everything to the repo, and I guess I didn't properly remove that .DS_Store file, so I'll just go ahead and delete it manually. We can see all our code has been pushed to the repo, and a little pending dot appears; if we click it, we can see that our workflow is running. To watch it, we can click on the Actions tab. If we click on the run and go to run data pipeline, we can see that run data pipeline was the name of our job in the data_pipeline.yml file, and these are all the steps that we defined:

set up job, checkout repo content, setup python, install dependencies, run data pipeline, check for changes, and commit and push if changes. The remaining steps are automatically generated from the pre-built setup-python and checkout actions. So we'll wait for the dependencies to install. All right, the dependency install took about 2 minutes. Now it's going to run the data pipeline, which will actually take longer than the dependency install, because it's making API calls to the YouTube API to grab all the video IDs.

Then there's another step that grabs the transcript for each YouTube video, and finally we have to generate the text embeddings for those transcripts and the titles of those YouTube videos, so this might take a few minutes. We can see that everything we printed in our data pipeline is showing up here, and then it checks for changes and finds that there were changes. So it tries to push the code, but the push failed, probably because I deleted that .DS_Store file after this workflow got kicked off.

This is a great opportunity to go to our workflow and use the manual trigger; this is the workflow dispatch option that we configured. We can run it manually like this, and once we click that, we can see that it's running again and going through the whole process all over again. All right, this time the data pipeline ran successfully and it committed and pushed the changes. We can see that it added these three files, and then it just does some post-job cleanup for the setup-python and checkout steps while everything finishes up.

If we go to our data folder, we can see that the data from our workflow is here. These files can now be used in downstream tasks. Okay, so we've successfully automated the data pipeline.

Final Integration: Connecting Your Data Pipeline to Machine Learning Applications

Deploying the Automated Pipeline to Google Cloud

The next natural step is to integrate this automated data pipeline into the final machine learning application. I actually do that in this repo here, which I'll also link in the description below. If we click into it, we can see that we have our workflows folder with the data_pipeline.yml file, the same thing we just wrote, and then we have our data pipeline here.

These were the files we defined earlier, and the rest of this is similar to what we saw in the previous article of this series, where we created a search API endpoint using FastAPI and Docker. However, here, instead of deploying the API endpoint to AWS, I deployed it on Google Cloud Platform, specifically using the Cloud Run service. The reason I did that is that it has a free tier and it supports continuous deployment.

Example Application: Semantic Search on Hugging Face

Any time a change is made to this GitHub repo, it redeploys the API running on Google Cloud. The front end is publicly available; I'm hosting it on Hugging Face, and I'll also put that link in the description below. It's live now, so you can go and see the fruits of all the labor in this article series. The first run is going to be slow because it has to wake up the container on Google Cloud, but eventually it spins up and returns search results. So if I type in LLMs, it brings up all the videos on LLMs. I can try something else, like data freelancing; it shouldn't take as long the second time because the container's already awake, and we get the results for data freelancing.

Conclusion: The Benefits of Automating Your Data Science Workflows

Recap of Key Points

We can, of course, search for this series on full-stack data science and see the search results come back much faster now that the container is awake; you can see all the different videos relevant to full-stack data science. Well, that brings us to the end of the series. We've come a long way, talking about the four hats of a full-stack data scientist, starting with that of a project manager.

If you're curious to learn more about full-stack data science or the details behind this semantic search app, check out the other articles in the series. As always, thank you so much for your time, and thanks for reading.

