Learn how the Mapper algorithm in topological data analysis (TDA) simplifies complex datasets into interactive graphs for exploratory data analysis. This guide covers key steps, real-world applications, and a hands-on example using S&P 500 data.
Table of Contents:
- Introduction to Topological Data Analysis (TDA)
- What is TDA and How It Can Transform Your Data Analysis?
- The Mapper Algorithm: A Powerful Tool in TDA
- Understanding the Mapper Algorithm in TDA
- How the Mapper Algorithm Converts Data into a Graph
- Applications of the Mapper Algorithm in Data Science
- Step-by-Step Breakdown of the Mapper Algorithm
- Step 1: Starting with the Dataset
- Step 2: Projecting Data into a Lower-Dimensional Space
- Step 3: Defining a Cover for the Data Subsets
- Step 4: Clustering and Creating the Mapper Graph
- Implementing the Mapper Algorithm: A Code Walkthrough
- Importing Modules for Mapper and Data Analysis
- Fetching and Preprocessing S&P 500 Data
- Mapper Algorithm in Action: S&P 500 Exploratory Data Analysis
- Dimensionality Reduction with UMAP and Isomap
- Clustering and Visualizing High-dimensional Financial Data
- Customizing the Mapper Algorithm: Cover, Clustering, and More
- Fine-Tuning Your Graph with Clustering and Cover Settings
- Visualizing the Mapper Graph: An Interactive Experience
- Exploring Financial Data with the Mapper Algorithm
- Uncovering Hidden Stock Clusters in S&P 500 Through Percent Return Analysis
- Interactive Graphs for Exploratory Data Analysis
- Visualizing and Interpreting Financial Trends with Mapper
- How to Navigate the Interactive Mapper Graph
- Enhancing Your TDA: Key Insights and Practical Tips
- Experimenting with Projection Strategies and Clustering Algorithms
- How Mapper Can Help You Uncover New Patterns in Complex Data
- Conclusion: What’s Next in Topological Data Analysis
- Exploring the Next Step in TDA: Persistent Homology
- Join the Discussion: Share Your Thoughts and Feedback
Introduction to Topological Data Analysis (TDA)
What is TDA and How It Can Transform Your Data Analysis?
Hey folks, welcome back. This is the second article in a three-part series on topological data analysis, or TDA for short. In this article, I'll discuss a technique under the umbrella of TDA called the Mapper algorithm.
The Mapper Algorithm: A Powerful Tool in TDA
This approach allows you to translate your data into an interactive graphical representation, enabling exploratory data analysis and helping you find new patterns in your data. I'll start by discussing how the algorithm works before diving into a concrete example with code. Let's get into the article. In the previous article, I discussed a famous problem in math called the Seven Bridges of Königsberg.
I won't go into all the details of the problem, but it was eventually solved by the famous mathematician Leonhard Euler. The way he solved it was by drawing a picture, and this picture is what we now call a graph. A graph consists of dots connected by lines. In more technical terms, the dots are called vertices and the lines are called edges.
Understanding the Mapper Algorithm in TDA
How the Mapper Algorithm Converts Data into a Graph
Equivalently, instead of calling this thing a graph, we can call it a network, call the dots nodes, and call the lines links. I'll use these terms interchangeably in this article. Graphs, or networks, typically represent something from the real world.
In this case, Euler drew a graph representing Königsberg, where each node represented a land mass and each line connecting two nodes represented a bridge. This boils the problem down to its essential elements, which is what allowed Euler to famously solve it, as I mentioned in the previous article.
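To make the nodes-and-edges idea concrete, here is a minimal Python sketch of the Königsberg graph. The land-mass labels A to D are placeholders I made up, and the degree-counting check is the standard textbook statement of Euler's result rather than something derived in this article.

```python
from collections import Counter

# Four land masses (nodes) and seven bridges (edges).
# The labels A-D are illustrative placeholders, not historical names.
bridges = [
    ("A", "B"), ("A", "B"),  # two bridges between A and B
    ("A", "C"), ("A", "C"),  # two bridges between A and C
    ("A", "D"),
    ("B", "D"),
    ("C", "D"),
]

# Degree of a node = number of bridge ends touching that land mass.
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd_nodes = sorted(node for node, d in degree.items() if d % 2 == 1)

# Euler's observation: in a connected graph, a walk that crosses every
# edge exactly once needs zero or two nodes of odd degree.
walk_exists = len(odd_nodes) in (0, 2)

print(dict(degree))
print(odd_nodes, walk_exists)
```

Everything about the city except which land masses are joined by which bridges has been thrown away, and that is all the check needs.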
Applications of the Mapper Algorithm in Data Science
What are we doing when we do topological data analysis? We are translating data from the real world into its essential elements or, in other words, into its underlying shape. One way of doing this is via the Mapper algorithm, which is the main topic of this article. The Mapper algorithm allows us to translate data into a graph. One key application of the Mapper algorithm is exploratory data analysis: it allows us to take a data set and generate an engaging, interactive visualization.
Another application is compressing and visualizing very high-dimensional data. Imagine trying to visualize a 500-dimensional data set. With the Mapper algorithm, we can take our data set, compress it into a two-dimensional graph, and then visualize it to highlight its structure.
Step-by-Step Breakdown of the Mapper Algorithm
Step 1: Starting with the Dataset
At a super high level, the Mapper algorithm takes data and translates it into a graph, but how exactly does it work? I've broken the algorithm down into five steps. I apologize in advance because it's a bit sophisticated. The first step is to start with our data set. Here we have a two-dimensional data set because we have two variables, x1 and x2.
Step 2: Projecting Data into a Lower-Dimensional Space
The second step is to project our data into a lower-dimensional space. Here we're going from two dimensions down to one. We can do this with any dimensionality reduction strategy. We could do something standard like PCA, or something more sophisticated, as we will see in the example later. Another popular strategy is to use basic statistics to project down to one dimension.
In other words, consider our two variables x1 and x2. Each point has a corresponding x1 and x2 value. You could take the average of those two and place each point on a one-dimensional axis. You could take the max. You could take the min. There are all these different strategies. Either way, we've gone from two dimensions down to one, so nothing too fancy yet.
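The basic-statistics projections can be sketched in a few lines of NumPy. The four points below are made-up values, not data from the example later in the article.

```python
import numpy as np

# Toy 2-D data set: each row is one point (x1, x2). Values are made up.
X = np.array([
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 3.5],
    [4.0, 1.0],
])

# Each projection maps a 2-D point to a single number (one dimension).
lens_mean = X.mean(axis=1)  # average of x1 and x2
lens_max = X.max(axis=1)    # larger of x1 and x2
lens_min = X.min(axis=1)    # smaller of x1 and x2

print(lens_mean)
print(lens_max)
print(lens_min)
```

Each choice orders the points differently along the one-dimensional axis, which is why the projection has such a strong influence on the final graph.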
Step 3: Defining a Cover for the Data Subsets
The next step is to define something called a cover. Basically, this means we're going to define two subsets, indicated by this red circle and this green circle, and have them overlap. We can see here that the red subset and the green subset do indeed overlap; the shared points are the yellow ones in the center of this picture.
So that's what we mean by a cover: a collection of overlapping subsets that together include the entire data set. We could also use more than two subsets. We could have three subsets, four subsets, and so on. For this toy example, I chose two because it makes it easy to see what's going on.
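Here is one way a cover of a one-dimensional projection could be built by hand. It is a minimal sketch assuming equal-length intervals; the helper `interval_cover` and its `overlap` parameter are hypothetical names of mine, not part of any library.

```python
import numpy as np

def interval_cover(lens, n_intervals=2, overlap=0.25):
    """Split the range of a 1-D lens into equal-length overlapping intervals.

    `overlap` is the fraction of each interval shared with its neighbour.
    """
    lo, hi = float(lens.min()), float(lens.max())
    # Pick the length so the overlapping intervals span exactly [lo, hi].
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]

# Projected values of eight toy points (made up).
lens = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
cover = interval_cover(lens, n_intervals=2, overlap=0.25)

# Indices of the points that fall inside each subset of the cover.
subsets = [np.where((lens >= a) & (lens <= b))[0] for a, b in cover]

print(cover)
print([s.tolist() for s in subsets])
```

In this sketch, points 3 and 4 land in both subsets. They play the role of the yellow points in the picture.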
Step 4: Clustering and Creating the Mapper Graph
The fourth step is to cluster the pre-image. There's a lot of jargon here, so I'm going to break it down. If we look at step three, we have red points, green points, and yellow points, but remember that each of these points has a corresponding point in our original data set. What's being shown in the picture for step four is our original data set, with the points colored based on which subset they appear in from step three.
Implementing the Mapper Algorithm: A Code Walkthrough
Importing Modules for Mapper and Data Analysis
Next, we iterate through all our subsets. We only have two: a red subset and a green subset. To each one we apply our favorite clustering algorithm. We'll start with the red subset. In other words, we look at the red and yellow points only and run a clustering algorithm on them. Let's say the result looks something like this. Then we go to our next subset, which is the green and yellow points. We cluster those, and let's say we get something like this. Now we have these four clusters defined, with some overlap between them.
Fetching and Preprocessing S&P 500 Data
Now we're set up to create a graph. The nodes of the graph are these clusters, so four clusters give us four nodes, and two nodes are connected by an edge if their clusters share members. Here, the middle cluster shares members with the other three. That's what's being shown here. This is just a toy example.
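The clustering and graph-building steps can be sketched end to end on a toy data set. This is not the example from the pictures: I am assuming points on a rectangular loop, a hand-picked cover of three intervals instead of two, and a simple single-linkage clusterer (`cluster` is a hypothetical helper) standing in for "your favorite clustering algorithm".

```python
import numpy as np
from itertools import combinations

def cluster(points, indices, eps=1.1):
    """Single-linkage clustering: chains of points within eps are grouped.

    Returns a list of sets holding the original row indices.
    """
    unvisited = set(int(i) for i in indices)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        members, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited
                    if np.linalg.norm(points[i] - points[j]) <= eps}
            unvisited -= near
            members |= near
            frontier.extend(near)
        clusters.append(members)
    return clusters

# Step 1: toy data, points on a rectangular loop (values are made up).
points = np.array(
    [(x, 0) for x in range(7)] + [(x, 2) for x in range(7)] + [(0, 1), (6, 1)],
    dtype=float,
)

# Step 2: project to one dimension by keeping the first coordinate.
lens = points[:, 0]

# Step 3: a hand-picked cover of three overlapping intervals.
cover = [(0.0, 2.5), (1.5, 4.5), (3.5, 6.0)]

# Step 4: cluster the pre-image of each interval in the original space.
nodes = []
for a, b in cover:
    subset = np.where((lens >= a) & (lens <= b))[0]
    nodes.extend(cluster(points, subset))

# Step 5: connect two nodes whenever their clusters share a member.
edges = [(i, j) for i, j in combinations(range(len(nodes)), 2)
         if nodes[i] & nodes[j]]

print(len(nodes), edges)
```

The middle interval splits into a top and a bottom cluster, so the four nodes end up connected in a ring, which mirrors the loop shape of the data.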
I hope that made it somewhat clear what's going on, but I'm going to try to make things more concrete with an example with code. In this example, we're going to do exploratory data analysis of S&P 500 data.
Mapper Algorithm in Action: S&P 500 Exploratory Data Analysis
Dimensionality Reduction with UMAP and Isomap
Our first step is to import some modules. We have the Yahoo Finance module (yfinance), which allows us to get the stock data. We have the kmapper module (KeplerMapper), which allows us to do the Mapper algorithm stuff. We're importing the umap module, sklearn, and the manifold module from sklearn. We're using these for our dimensionality reduction. Then we have numpy and matplotlib to do some standard math and visualization stuff.
Clustering and Visualizing High-dimensional Financial Data
The first step, as with any data science project, is to get your data. This is pretty straightforward. You define your ticker names and the date range you want, and then you can pull all that data with one line of code. The code is available on GitHub. Once we have our data, we do some more preparation to get it ready for analysis. First, we keep only the adjusted close prices. What you can imagine now is a table whose columns correspond to ticker names and whose rows correspond to days the market was open.
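For reference, the download step might look roughly like the sketch below. The function name, tickers, and dates are placeholders of mine, and the original code on GitHub may differ. I am assuming the `yfinance` package, whose `download` function returns a "Adj Close" column group when `auto_adjust=False` is passed.

```python
def fetch_adjusted_close(tickers, start, end):
    """Download daily prices and keep only the adjusted close columns.

    Returns a DataFrame with one row per trading day and one column per ticker.
    """
    # Imported here so the rest of a script still loads without yfinance.
    import yfinance as yf

    # auto_adjust=False keeps a separate "Adj Close" column group.
    raw = yf.download(tickers, start=start, end=end, auto_adjust=False)
    return raw["Adj Close"]

# Hypothetical usage (needs network access, placeholder tickers and dates):
# prices = fetch_adjusted_close(["AAPL", "MSFT", "GOOG"], "2020-01-01", "2022-01-01")
```

The usage line is commented out because it needs network access and depends on which tickers and dates you choose.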
Next, we convert this pandas DataFrame into a NumPy array and standardize each of the columns. That means we take a column, compute its mean and standard deviation, subtract the mean from each value in the column, and divide the result by the standard deviation.
The last step here is a transpose, because later this will allow us to compare tickers to each other as opposed to days. We could also skip the transpose, and then the analysis wouldn't be about comparing different tickers but about comparing days the market was open. Finally, we compute the percent return of each ticker, because later, when we generate the interactive network, we can color its nodes based on the percent return of the tickers they contain.
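These preparation steps can be sketched with NumPy on a small made-up price table. The array below stands in for the converted DataFrame, and the percent return formula (last price relative to first) is my assumption, since the article does not spell it out.

```python
import numpy as np

# Stand-in for df.to_numpy(): rows are trading days, columns are tickers.
# The prices are made up for illustration.
prices = np.array([
    [100.0, 50.0, 20.0],
    [102.0, 49.0, 21.0],
    [101.0, 51.0, 23.0],
    [105.0, 48.0, 24.0],
])

# Standardize each column: subtract its mean, divide by its standard deviation.
standardized = (prices - prices.mean(axis=0)) / prices.std(axis=0)

# Transpose so each row is a ticker, which is what we want to compare.
data = standardized.T

# Percent return per ticker over the whole window (assumed definition).
percent_return = (prices[-1] - prices[0]) / prices[0] * 100

print(data.shape)
print(percent_return)
```

After the transpose, `data` has one row per ticker, so the Mapper graph built from it groups tickers rather than days.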
So after all this talking and explaining, we still haven't really gotten into any topological data analysis. If we think back to the visual overview from earlier, this is all still step one: we're still getting our data. Now we can finally get into the Mapper algorithm stuff. First, we initialize the mapper object.
Customizing the Mapper Algorithm: Cover, Clustering, and More
Fine-Tuning Your Graph with Clustering and Cover Settings
Next, we do step two in the process, which is to project our data into a lower-dimensional space. We actually have 495 tickers here, and we're going to project the data down into two dimensions. We do this in two stages. First, we use Isomap from the manifold module in sklearn, which takes us from 495 dimensions down to 100. Then we use UMAP, which takes us from 100 dimensions down to two. The nice thing about this syntax is that we can define a custom pipeline for our dimensionality reduction. We can see that the projection keyword is set to a list, and each element of this list is a projection step applied in order.
The first element of the list is manifold.Isomap with all its input arguments, and the second element is UMAP with all its input arguments. We could have gone further. We could have added a third element, say PCA, taking us from two components down to one, or we could have used a completely different pipeline.
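A rough sketch of the initialization and projection steps, assuming the `kmapper`, `umap-learn`, and `scikit-learn` packages. The wrapper function is a hypothetical helper of mine. The `n_components` values mirror the 100 and 2 mentioned above, and every other argument is left at its default, which may not match the original code.

```python
def project_to_2d(data):
    """Reduce `data` (one row per ticker) to two dimensions in two stages."""
    # Local imports: these are optional third-party packages.
    import kmapper as km
    import umap
    from sklearn import manifold

    mapper = km.KeplerMapper(verbose=0)
    projected = mapper.fit_transform(
        data,
        projection=[
            manifold.Isomap(n_components=100),  # first stage: down to 100
            umap.UMAP(n_components=2),          # second stage: down to 2
        ],
    )
    return mapper, projected
```

Swapping the pipeline is then just a matter of editing the list, for example appending a third reducer or replacing both stages.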
You can already start to see that you have a lot of flexibility when using the Mapper algorithm in practice. Next, we essentially combine steps three, four, and five from the overview earlier into one line of code. Defining a cover, clustering the pre-image, and generating the output graph are all compressed into a single function call.
Visualizing the Mapper Graph: An Interactive Experience
We pass in the projected data from the previous step and the original data set, and we define the clustering strategy we want to use. Here we use DBSCAN with a cosine metric. You can also customize the details of the cover, but here we're just using the default values. In less than a second, it generates the graph.
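The single call that covers, clusters, and builds the graph might look like the sketch below. I am assuming KeplerMapper's `map` method and scikit-learn's `DBSCAN`; the wrapper function is hypothetical, and DBSCAN's other settings are left at their defaults, which may differ from the original code.

```python
def build_graph(mapper, projected, data):
    """Cover the projection, cluster each pre-image, and return the graph."""
    # Local import: scikit-learn is an optional third-party package.
    from sklearn.cluster import DBSCAN

    graph = mapper.map(
        projected,                          # lens: output of the projection step
        data,                               # original (standardized) data set
        clusterer=DBSCAN(metric="cosine"),  # run inside each subset of the cover
    )
    return graph
```

To customize the cover instead of using the default, KeplerMapper accepts a `cover` argument, for example `km.Cover(n_cubes=10, perc_overlap=0.2)`. Those particular numbers are only illustrative.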
Exploring Financial Data with the Mapper Algorithm
Uncovering Hidden Stock Clusters in S&P 500 Through Percent Return Analysis
In the next step, I define a file ID, which isn't strictly necessary. I just like to do it because every time I use the Mapper algorithm, I try different choices of cover, different projection strategies, different clustering algorithms, and so on. I typically have these running in a for loop, and I don't want the output graphs to get overwritten. So I define this file ID, which automatically generates a unique file name for each output graph.
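The article doesn't show how the file ID is built, so the following is only one possible scheme: encode the settings of each run in the file name. All setting names and values here are hypothetical.

```python
from itertools import product

# Hypothetical settings to sweep over; the names are illustrative only.
projections = ["isomap-umap", "pca"]
clusterers = ["dbscan", "kmeans"]
n_cubes_options = [10, 20]

file_names = []
for projection, clusterer, n_cubes in product(projections, clusterers, n_cubes_options):
    # Encode the settings in the file ID so no run overwrites another.
    file_id = f"projection={projection}_clusterer={clusterer}_n_cubes={n_cubes}"
    file_names.append(f"mapper_{file_id}.html")

print(len(file_names), file_names[0])
```

A side benefit of this scheme is that each HTML file documents the settings that produced it.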
Interactive Graphs for Exploratory Data Analysis
Visualizing and Interpreting Financial Trends with Mapper
The last step is to visualize the network. You just pass in the graph and define a file name. You can give the graph a title. You can set custom tooltips, which are the labels for each of the members. Here, these are our ticker names. We can also define color values, which we set to the log percent returns.
We can give a name to the color function, and we have multiple options for how the color values are aggregated within a node. We could do a simple average, or we could compute the standard deviation, the sum, the max, the min, and so on.
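A sketch of the visualization call, assuming KeplerMapper's `visualize` method with the parameter names I know from recent releases (`path_html`, `custom_tooltips`, `color_values`, `color_function_name`, `node_color_function`). The wrapper function, title, and color label are placeholders of mine.

```python
def save_visualization(mapper, graph, ticker_names, returns, file_name):
    """Write the interactive Mapper graph to an HTML file."""
    import numpy as np

    html = mapper.visualize(
        graph,
        path_html=file_name,
        title="S&P 500 Mapper graph",              # placeholder title
        custom_tooltips=np.asarray(ticker_names),  # label shown for each member
        color_values=np.asarray(returns),          # one value per ticker
        color_function_name="Percent return",      # placeholder label
        node_color_function=["mean", "std", "sum", "max", "min"],
    )
    return html
```

Passing a list to `node_color_function` is what I would expect to populate a drop-down of aggregation options in the generated page.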
How to Navigate the Interactive Mapper Graph
The output of the Mapper algorithm looks something like this. It actually generates a web page that allows you to interact with the graph, which lends itself very well to exploratory data analysis, which is what we're doing right now. The code we just walked through generates an HTML file, which we can go ahead and open. At first look, this does not look like the network I showed earlier, but if we go to the help menu, we will find different viewing options.
We can press a key on our keyboard to apply a tight layout, and already it's starting to look a bit nicer. Then we can press p for print mode, which gives the graph a white background. Next, we can click on any node we like, and it will start to radiate a glow. Then we can go over to the cluster details and click on the plus sign. Remember that the nodes in this network are actually clusters of data points. Given the way we set up the analysis here, they are clusters of tickers or, in other words, stocks.
Enhancing Your TDA: Key Insights and Practical Tips
Experimenting with Projection Strategies and Clustering Algorithms
Down here, you'll see the names of the members of this cluster listed. This list was generated from the custom tooltip option in that last function call we made. Here we also have a histogram showing the distribution of the log percent returns of the members of the selected cluster. Right now, the weighted average of the log percent returns of each cluster is what generates each node's color.
We could use other statistics. So if we go over here to this node color function, we can click this drop-down menu. We could do the standard deviation which doesn't look too exciting. We could also do the sum which also looks pretty uniform, but then we could also do max.
How Mapper Can Help You Uncover New Patterns in Complex Data
Now we're starting to see some variation. We might then be curious about the clusters that contain members with high returns. We can click on this yellow node here, look at its ticker names, and maybe do some further analysis. I'm no financial expert, so I don't have much intuition to offer here, but when working with data you are familiar with, you may immediately start to see interesting patterns just by jumping around. You can really do this all day.
Conclusion: What’s Next in Topological Data Analysis
Exploring the Next Step in TDA: Persistent Homology
You can click on a particular node to see what members are in that cluster, and then click on adjacent nodes to see what members are in those clusters. Then you can go back, try out different projection strategies and different clustering algorithms to generate new graphs, and repeat this whole process.
All right, that's basically it. Again, the code for this example is freely available on GitHub. If you want to learn more, check out the other articles in the series. In the next article, I will discuss another specific TDA technique called persistent homology.
Join the Discussion: Share Your Thoughts and Feedback
If you enjoyed this content, please consider liking, subscribing, and sharing this article. Like many of you, I am still learning, so I would also appreciate your questions, concerns, and feedback in the comments section below. As always, thanks for reading.