What is Principal Component Analysis (PCA) | Introduction & Example (Python) Code - icoversai

Discover how Principal Component Analysis (PCA) simplifies data by reducing dimensionality. Learn its mathematical foundations, real-world applications, and how PCA can be used for stock market analysis, including building an S&P 500 index fund.

Table of Contents:

  • Introduction to Principal Component Analysis (PCA)
    • What Is Principal Component Analysis?
    • Why PCA Matters in Data Science and Finance
  • Understanding the Core Concept of PCA
    • The Band Analogy for Simplifying Dimensionality
    • How PCA Reduces Dimensionality in Data Sets
  • The Mathematics Behind PCA: A Simple Explanation
    • Key Mathematical Concepts in PCA
    • How Does PCA Maximize Variance?
  • Step-by-Step Process of PCA
    • Data Preparation: Why Scaling and Centering Are Crucial
    • Solving the PCA Optimization Problem
  • Eigenvalues and Eigenvectors in PCA
    • The Role of Eigenvalues and Eigenvectors in PCA
    • How to Extract Principal Components Using Eigenvalues
  • Key Features of Principal Component Analysis
    • PCA and Variable Redundancy Reduction
    • How to Determine the Appropriate Number of Principal Components
  • Real-World Applications of PCA
    • Using PCA for Clustering, Correlation Analysis, and Outlier Detection in Data Science & Finance
  • Using PCA to Analyze the Stock Market: A Fun Example
    • Applying PCA to S&P 500 Data
    • Creating a PCA-Based S&P 500 Index Fund
  • Comparison of PCA-Based Portfolio to the S&P 500
    • How Did the PCA-Based Index Perform in 2020?
    • PCA for Portfolio Optimization: A Visual Comparison
  • Conclusion and Next Steps
    • Key Takeaways from PCA in Stock Market Analysis
    • What’s Next: Independent Component Analysis (ICA)

Introduction to Principal Component Analysis (PCA)

What Is Principal Component Analysis?

Hey guys, welcome back! I'm back with another series. If you missed the first one, it's available on my website: it covered time series signals, the Fourier transform, and the wavelet transform. In this new series, I'll be talking about two things: one, principal component analysis, and two, independent component analysis.

Why PCA Matters in Data Science and Finance

So principal component analysis, or PCA, is the topic of this article. I'll give you a little intuition, share some math, and then finish with a concrete example of how you can use PCA to analyze the stock market. So let's get right into it with the analogy.

Understanding the Core Concept of PCA

The Band Analogy for Simplifying Dimensionality

The way I like to think of PCA: imagine a massive rock band with 20 members in the ensemble. You have two drummers and several guitarists, several keyboardists or pianists, a string section and a horn section, vocalists and a percussionist, the whole works. You have this 20-person band, and that's not a big deal.

That's the kind of band made for huge arenas and stadiums. But if a band like this is just getting started, they're going to have a hard time fitting into smaller venues like coffee shops and restaurants. A natural solution to this problem is to reduce the number of players at specific performances.

So instead of a keyboardist, a pianist, and so on, you could have just one person on the keyboard. Instead of multiple guitars, you could have one person on an acoustic guitar. Instead of two drummers and a percussionist, you could have someone banging on the bongos, and so on.

How PCA Reduces Dimensionality in Data Sets

In a lot of ways, this is basically what PCA does: you take the big band before PCA and boil it down to its core elements so the same band can play at the coffee shop. But instead of a band, think of PCA applying to a data set. Instead of the musicians or players in the band, think of the variables in your data set, and instead of the song or the music, think of what your data set is representing.

A bit more concretely, principal component analysis (PCA) reduces input dimensionality and redundancy. Think of two variables x and y. These could be something like hot dogs sold and hot dog buns sold, which are directly correlated but in a lot of ways contain redundant information. It may be practical to represent this underlying information through just one variable instead of two, and that is an application of PCA.
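To make that concrete, here is a minimal sketch using scikit-learn; the hot dog numbers are invented purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical weekly counts: hot dogs sold and hot dog buns sold (strongly correlated)
hot_dogs = np.array([120, 150, 90, 200, 170, 110], dtype=float)
buns = np.array([118, 155, 92, 195, 168, 112], dtype=float)
X = np.column_stack([hot_dogs, buns])

# Auto-scale the two columns, then keep a single principal component
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pc1 = PCA(n_components=1).fit_transform(X_scaled)

print(pc1.ravel())  # one number per week instead of two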

The Mathematics Behind PCA: A Simple Explanation

Key Mathematical Concepts in PCA

We can transform our axes from the x and y axes to a new set of axes, which we'll call PC1 and PC2. Then, if you want to take it a step further, you can remove PC2 and operate with one variable. So if we choose to drop PC2, we've reduced the dimensionality from two variables, x and y, to just one, PC1. Okay, so how does it work? Here's the basic idea.

PCA aims to reduce input variable redundancy by creating a new set of variables, where the variance along each successive new variable is maximized.

In the previous example, we saw pictorially that we changed from a set of two variables, hot dogs sold and hot dog buns sold, to a new pair of variables we called PC1 and PC2, and essentially PC1 contained all the relevant information we needed. The way we got PC1 was basically by rotating the axes to lie along the linear slope of the points defined by the hot dog and hot dog bun sales. What does that translate to mathematically?

How Does PCA Maximize Variance?

We can think of this setup: we have X, a matrix of data where the rows are data records and the columns are variables; we have w, a vector of weights; and we have t, a score vector, which is what I'm going to be calling a principal component. So t is what we're interested in. We have our data X, and we're trying to find a w that creates this principal component for us.

Here's the magic of PCA, the trick to it all. The goal is to maximize the variance of t subject to the constraint that the norm squared of w, i.e. w-transpose times w, is equal to one. Variance is defined in the usual way: for every element, subtract the mean of the variable and square the result, add this up over all the elements in the set of numbers, and divide by the number of elements minus one.
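In symbols (my own compact restatement of the problem just described, with t = Xw as the score vector):

\[
t = Xw, \qquad \max_{w}\ \operatorname{Var}(t) = \frac{1}{n-1}\sum_{i=1}^{n}\left(t_i - \bar{t}\right)^2 \quad \text{subject to} \quad \lVert w \rVert^2 = w^\top w = 1.
\]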

Step-by-Step Process of PCA

Data Preparation: Why Scaling and Centering Are Crucial

One really important thing when doing PCA: you want to auto-scale your data. What does that mean? For each number in each column of your matrix, you subtract the column's average and divide by its standard deviation. If we do that, the mean of the principal component turns out to be zero, which allows us to drop the mean term in the variance. It then turns out that the variance is just equal to the norm squared of t divided by the number of elements minus one.
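As a minimal sketch of that auto-scaling step in NumPy (the numbers here are made up purely for illustration):

import numpy as np

# Hypothetical data matrix: rows are records, columns are variables
X = np.array([[10.0, 12.0],
              [14.0, 15.0],
              [18.0, 19.0],
              [22.0, 24.0]])

# Auto-scale: subtract each column's mean and divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(X_scaled.mean(axis=0))         # effectively zero for every column
print(X_scaled.std(axis=0, ddof=1))  # exactly one for every column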

Solving the PCA Optimization Problem

So what does that mean? It means we can rewrite this optimization problem: instead of maximizing the variance, we can just maximize the norm squared of t, because the vector w that maximizes the norm squared of t is the same vector w that maximizes the variance of t. We can then rewrite the optimization problem using our expression for t above, and it turns out to be a pretty straightforward problem to solve, so don't be intimidated by the matrices and vectors.

We can use a very well-known and common technique from calculus, the method of Lagrange multipliers, which allows us to rewrite an optimization problem with constraints (a constrained optimization problem) as an optimization problem without constraints (an unconstrained optimization problem).

If none of that makes sense, don't worry: we just need the relevant expressions. We can write out the Lagrangian, the L term here, for our PCA optimization problem, and then we get the associated equations. This is the exciting part.
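As a sketch of what those expressions look like in symbols (standard Lagrange-multiplier bookkeeping in my own notation, with the constant 1/(n-1) factor dropped since it doesn't change the maximizer):

\[
\mathcal{L}(w,\lambda) = w^\top X^\top X\, w \;-\; \lambda\left(w^\top w - 1\right),
\]
\[
\frac{\partial \mathcal{L}}{\partial w} = 2\,X^\top X\, w - 2\,\lambda\, w = 0,
\qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = -\left(w^\top w - 1\right) = 0.
\]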

Eigenvalues and Eigenvectors in PCA

The Role of Eigenvalues and Eigenvectors in PCA

The first equation, if we rearrange it, is just an eigenvalue problem, which is a standard problem in linear algebra. The second equation is just a restatement of our original constraint. Writing it out explicitly, we can solve for the eigenvalue lambda and the vector of weights w using standard eigenvalue approaches.

If you're doing this in a programming language, every language like R, Python, and MATLAB is going to have built-in functions that let you solve this problem. Once we have this vector of weights, we have everything we need: we just multiply it by X and we get our principal component. This naturally extends to multiple components; so far we've only been looking for a single component.
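For instance, a minimal NumPy sketch of that route (the data matrix here is a random stand-in, and the variable names are my own):

import numpy as np

# Hypothetical auto-scaled data matrix: rows = records, columns = variables
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))

# Solve the eigenvalue problem for X^T X with a built-in routine
eigvals, eigvecs = np.linalg.eigh(X.T @ X)

# The eigenvector with the largest eigenvalue is the weight vector w we were after
w = eigvecs[:, np.argmax(eigvals)]
t = X @ w  # first principal component (score vector)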

When you solve the eigenvalue problem, if you have n columns in your matrix X, you're going to end up with n eigenvalues and n corresponding eigenvectors. You can then sort these eigenvalue-eigenvector pairs from the largest eigenvalue down to the smallest.

How to Extract Principal Components Using Eigenvalues

Each corresponding eigenvector w is a set of weights that defines a principal component, and the principal components associated with larger eigenvalues contain more information than the components associated with smaller eigenvalues. So you can define some threshold, like in the first example, where we could have just dropped PC2 because it wasn't giving us much additional information.
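Continuing the NumPy sketch from above, one common way to pick that cutoff is a cumulative-variance threshold; the 95% figure below is just an arbitrary illustrative choice:

# Sort eigenvalue/eigenvector pairs from largest to smallest
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of total variance carried by each component, plus the running total
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Keep just enough components to cross the (arbitrary) 95% threshold
n_keep = int(np.searchsorted(cumulative, 0.95)) + 1
T = X @ eigvecs[:, :n_keep]  # truncated set of principal components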

Key Features of Principal Component Analysis

PCA and Variable Redundancy Reduction

You can do the same thing and truncate your variables after a certain amount of information has been captured by your principal components. So just as a recap: principal component analysis reduces input dimensionality and redundancy.

How to Determine the Appropriate Number of Principal Components

Some key points: the new variables are created as linear combinations of the input variables. That's what we saw earlier, where a matrix multiplied by a vector of weights is equivalent to a linear combination of your input variables. And each subsequent new variable contains less information.

We saw that once you sort your eigenvalues from largest to smallest, the principal components associated with the larger eigenvalues contain more information, and the principal components corresponding to smaller eigenvalues contain less.

Real-World Applications of PCA

Using PCA for Clustering, Correlation Analysis, and Outlier Detection in Data Science & Finance

There are a lot of applications for PCA. You can use it to relate variables together: if two variables get clumped together, like hot dog buns sold and hot dogs sold, there's some underlying correlation there. You can use it for clustering, where you transform your data from your original input space into a new PCA space and then run a clustering algorithm like k-means. You can also do some outlier identification: plot all your points in the principal component space and visually inspect whether there are any outliers.
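As a hedged sketch of that clustering workflow with scikit-learn (the data and the choice of three clusters are made up for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical auto-scaled data set
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))

# Transform from the original input space into a two-component PCA space
scores = PCA(n_components=2).fit_transform(X)

# Run k-means in the reduced space; outliers can be spotted by plotting the scores
labels = KMeans(n_clusters=3, n_init=10).fit_predict(scores)
print(labels[:10])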

All right, so here's a fun example. I'll say at the outset that I'm not a financial advisor and I've never taken a finance class, so in no way is this a recommendation of how you should invest your money.

Using PCA to Analyze the Stock Market: A Fun Example

Applying PCA to S&P 500 Data

So here we're going to use PCA to create an S&P 500 index fund. An index fund is basically a set of investments that are meant to follow or track a specific market. The example code is on GitHub, so I'll probably just fly through this. I used the Yahoo Finance module to get real, actual stock data (this is all real data, not made up), and then I used pandas and numpy for all the number crunching. I wrote some code to pull the ticker names from Wikipedia; Graham Guthrie had a nice Medium post on how to grab all these S&P 500 names, so I borrowed some code from that post and made some edits.

I pulled S&P 500 data for 2020, dropped NaNs to get a pandas data frame of just close prices (as opposed to all the other available information), and got a list of the ticker names of all the companies in the data frame. So we have 253 rows and 499 columns.
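A rough sketch of that data pull, assuming the yfinance and pandas packages; the exact helpers and cleanup steps in my repository may differ slightly:

import pandas as pd
import yfinance as yf

# Grab the ticker symbols from the Wikipedia list of S&P 500 companies
table = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")[0]
tickers = table["Symbol"].tolist()  # note: some symbols may need '.' swapped for '-' for Yahoo

# Pull daily prices for 2020 and keep only the close prices
data = yf.download(tickers, start="2020-01-01", end="2020-12-31")
prices = data["Close"].dropna(axis=1)  # drop companies with missing values

print(prices.shape)  # roughly (trading days, surviving tickers), e.g. 253 x 499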

Creating a PCA-Based S&P 500 Index Fund

Here, I guess the comments in the code aren't updated, so I apologize for that. But what we're doing is initializing PCA with 10 components, applying PCA to our data set, and printing the explained variance. You can see that with just the first three components you're already at more than 90% of the explained variance, if you sum up the first three elements of that array. Then we can create an index fund.
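Before building the fund, here's a hedged sketch of that PCA step with scikit-learn, continuing from the prices data frame above; standardizing each column first is my assumption about the preprocessing:

from sklearn.decomposition import PCA

# Standardize each company's close-price series so no single stock dominates the variance
X = (prices - prices.mean()) / prices.std()

# Initialize PCA with 10 components, fit it, and inspect the explained variance
pca = PCA(n_components=10)
scores = pca.fit_transform(X)

print(pca.explained_variance_ratio_)            # per-component explained variance
print(pca.explained_variance_ratio_[:3].sum())  # first three components cover most of it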

Comparison of PCA-Based Portfolio to the S&P 500

How Did the PCA-Based Index Perform in 2020?

There are countless ways you can do this. I just arbitrarily took the weights defining the first three principal components, summed them together, and then only included the top 61 weights. We can represent the overall portfolio of this index fund with a bar plot; it's a natural way to do it. The y-axis is the relative weight, which you can also think of as the relative number of dollars you're going to invest in each specific company, and the x-axis is just the individual ticker names. Then we can see how our index fund compares to the actual S&P 500 over 2020.
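Here's one way to sketch that portfolio construction and bar plot, continuing from the fitted pca object above; the three components and 61 names follow the text, while details such as normalizing the weights are my own choices:

import matplotlib.pyplot as plt
import pandas as pd

# Sum the weight vectors that define the first three principal components
combined = pd.Series(pca.components_[:3].sum(axis=0), index=prices.columns)

# Keep only the 61 largest weights and rescale them so they sum to one
portfolio = combined.nlargest(61)
portfolio = portfolio / portfolio.sum()

# Bar plot: relative weight (relative dollars) per ticker
portfolio.plot(kind="bar", figsize=(14, 4))
plt.ylabel("Relative weight")
plt.show()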

PCA for Portfolio Optimization: A Visual Comparison

Visually, it doesn't do such a bad job approximating the index. There are some discrepancies along the way, but what everyone really cares about is the percent return. If you had bought one share of every single stock in the S&P 500 at the beginning of 2020 and then sold those same shares at the beginning of 2021, you would have made a roughly 20% return.
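And a rough sketch of that comparison, continuing from the prices and portfolio objects above, under the simplifying assumption that both portfolios are bought at the first close of 2020 and sold at the last:

# One share of every stock: percent change in the total cost of the whole basket
equal_share_return = prices.iloc[-1].sum() / prices.iloc[0].sum() - 1

# PCA index fund: weight each stock's percent change by its portfolio weight
stock_returns = prices.iloc[-1] / prices.iloc[0] - 1
pca_return = (stock_returns[portfolio.index] * portfolio).sum()

print(f"One share of every S&P 500 stock: {equal_share_return:.1%}")
print(f"PCA index fund:                   {pca_return:.1%}")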

Conclusion and Next Steps

Key Takeaways from PCA in Stock Market Analysis

If you had instead followed the investing strategy of this particular index fund derived from PCA, you would have made a roughly 25% return. So that was the article on principal component analysis. I hope that clears things up. I have provided a link to my blog post on Medium on the topic. Stay tuned for the next article.

What’s Next: Independent Component Analysis (ICA)

I'll be talking about a similar but different technique: independent component analysis. If you enjoyed this article, be sure to like, comment, subscribe, and share it with your friends and family so they too can learn about principal component analysis. Thanks for reading.
