Persistent Homology Explained | Introduction & Python Example Code - icoversai

Discover how Persistent Homology in Topological Data Analysis (TDA) uncovers hidden patterns in multi-dimensional data. Explore concepts, real-world financial market analysis, and Python code examples to enhance your understanding of this cutting-edge technique.


Table of Contents:

  • Introduction to Persistent Homology in Topological Data Analysis (TDA)
    • What is Persistent Homology? Understanding the Core Concepts
    • How Persistent Homology Helps Uncover Hidden Patterns in Data
  • The Building Blocks: Understanding Simplexes and Simplicial Complexes
    • From Triangles to Simplexes: Generalizing Shapes for Multi-Dimensional Data
    • What is a Simplicial Complex? Key to Persistent Homology
  • Homology and Holes: Characterizing Shapes in Data
    • How to Compare Shapes Using Homology Groups
    • The Importance of Holes in Persistent Homology
  • Step-by-Step Guide: Applying Persistent Homology to a Point Cloud
    • Constructing Simplicial Complexes from Point Clouds
    • Tracking Changes in Shape: The Circle Growing Process
  • Persistence Diagrams: A Tool for Visualizing Significant Features
    • What is a Persistence Diagram? Interpreting Data Shapes
    • Using Persistence Diagrams to Differentiate Noise from Significant Features
  • Real-World Example: Using Persistent Homology in Market Data Analysis
    • Applying Persistent Homology to Financial Markets Data
  • Exploring Market Shape Changes with Python and Scikit-TDA
    • Python Code Example: Calculating Homology from Financial Data
    • How to Use Wasserstein Distance to Measure Homology Changes
  • Interpretation of Results: Did Persistent Homology Predict the 2020 Market Crash?
    • Visualizing Homology Changes Over Time: What the Results Tell Us
    • Limitations of Persistent Homology in Financial Data Analysis
  • Key Takeaways from the Persistent Homology Example
    • How Persistent Homology Offers a New Perspective on Data Analysis
    • Future Research Directions in Applying Persistent Homology
  • Conclusion: Wrapping Up the Three-Part Series on TDA
    • Why Topological Data Analysis is a Growing Field in Data Science
    • Learn More About TDA: Resources and Further Reading


Introduction to Persistent Homology in Topological Data Analysis (TDA)

What is Persistent Homology? Understanding the Core Concepts

Hey folks! Welcome back. This is the final article in a three-part series on topological data analysis, or TDA for short. In this article, I'll be talking about another technique under the umbrella of TDA called persistent homology. The big idea behind persistent homology is to find the core topological features of your data that are, hopefully, robust to noise.

I'll start with a brief discussion of the key points surrounding persistent homology and then dive into a concrete example, with code, of how to use it. With that, let's get into the article.

How Persistent Homology Helps Uncover Hidden Patterns in Data

There are many layers to persistent homology, so I'll try to start super simple and build things up in a way that hopefully makes some sense, as I've mentioned throughout this series.

So let's go back to preschool and talk about shapes, or more precisely, polygons. Not all polygons are equal, though. One of them is special, because it is the simplest polygon we can construct: the triangle.

The Building Blocks: Understanding Simplexes and Simplicial Complexes

From Triangles to Simplexes: Generalizing Shapes for Multi-Dimensional Data

One neat thing about triangles is that we can use them to make any other polygon. For example, a square is really just two triangles stuck together, and a pentagon can be made from four triangles. A star is just the same pentagon with five triangles coming out of it. So one thought is: if we want to analyze the shape of our data, maybe we can break it down into a bunch of triangles as well. As it turns out, this is essentially what we do in persistent homology, but with one technical detail.

Since most data sets live in more than two dimensions, that is to say we have more than two variables, flat two-dimensional triangles may not capture the full richness of our data's shape. Don't worry: like most things, mathematicians have generalized the notion of a triangle to any number of dimensions, and they call these generalized triangles simplexes. So the triangle that we know and love is called a 2-simplex, since it lives in two dimensions.

A line segment is the simplest shape we can construct in one dimension, and it's called a 1-simplex. Similarly, a tetrahedron is called a 3-simplex, a point is a 0-simplex, and so on for all the other dimensions. So just like a collection of triangles can make any two-dimensional polygon, a collection of simplexes can approximate just about any complicated high-dimensional shape that may underlie our data.
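To make the naming a bit more concrete, here is a minimal Python sketch that treats a simplex as nothing more than a tuple of vertex labels. The function names `dimension` and `faces` are my own for illustration and don't come from any TDA library.

```python
from itertools import combinations


def dimension(simplex):
    """A k-simplex has k + 1 vertices."""
    return len(simplex) - 1


def faces(simplex, k):
    """All k-dimensional faces of a simplex, each as a sorted tuple of vertices."""
    return [tuple(face) for face in combinations(sorted(simplex), k + 1)]


triangle = ("a", "b", "c")    # a 2-simplex
tetrahedron = (0, 1, 2, 3)    # a 3-simplex

print(dimension(triangle))          # 2
print(faces(triangle, 1))           # the three edges, i.e. 1-simplexes
print(len(faces(tetrahedron, 2)))   # a tetrahedron has four triangular faces
```

The point of the sketch is that every simplex is built out of lower-dimensional simplexes: a triangle contains three edges and three points, a tetrahedron contains four triangles, and so on.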

What is a Simplicial Complex? Key to Persistent Homology

Since you'll probably see it elsewhere: the technical name for a collection of simplexes glued together along their shared faces is a simplicial complex. This is a key concept in persistent homology. It also gives us a clue as to how we can take unstructured point clouds, in other words data sets, and translate them into shapes.

Homology and Holes: Characterizing Shapes in Data

How to Compare Shapes Using Homology Groups

So now, let's talk about how we might compare shapes, no matter how different or complicated they may seem. One way to do this is by looking at holes.

For example, take three objects: a torus, a loop, and a coffee mug. While these may appear to be very different shapes, they have something fundamental in common: they all have a hole. This is like the joke that a topologist looks at a coffee mug and a donut and sees the same thing. The reason is that one can be continuously transformed into the other. For the aficionados out there, this is called a homeomorphism.

The Importance of Holes in Persistent Homology

The fundamental thing here is the number of holes. So one way we can characterize and group shapes together is by counting holes, and just like before, when we generalized triangles into simplexes, we can generalize holes as well. We can think of cavities as the 3D version of a hole.

At the other end, we can think of connected components as the lowest-dimensional version of a hole, with ordinary loops sitting in between. These generalized holes form the basis of what are called homology groups (H0 for connected components, H1 for loops, H2 for cavities), and these give us a formal way to characterize different shapes. So when we talk about homology, we are essentially just talking about holes.
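Counting holes can be done mechanically once a shape is written down as a simplicial complex. The sketch below is one way to do it, under the assumption that we work over the two-element field (a common simplification), and with helper names (`closure`, `rank_gf2`, `betti_numbers`) that are mine rather than from any library. It returns the so-called Betti numbers: the number of connected components, loops, cavities, and so on.

```python
from itertools import combinations


def closure(top_simplexes):
    """All faces of the given simplexes, grouped by dimension."""
    by_dim = {}
    for simplex in top_simplexes:
        verts = sorted(simplex)
        for size in range(1, len(verts) + 1):
            for face in combinations(verts, size):
                by_dim.setdefault(size - 1, set()).add(face)
    return {dim: sorted(found) for dim, found in by_dim.items()}


def rank_gf2(rows):
    """Rank of a 0/1 matrix over GF(2); each row is an int bitmask."""
    basis = {}  # leading-bit position -> reduced row
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]
    return len(basis)


def betti_numbers(top_simplexes):
    """Betti numbers b_0, b_1, ... of the complex spanned by the simplexes."""
    cx = closure(top_simplexes)
    max_dim = max(cx)
    # boundary_rank[k] is the rank of the map from k-simplexes to (k-1)-simplexes
    boundary_rank = {0: 0, max_dim + 1: 0}
    for k in range(1, max_dim + 1):
        index = {face: i for i, face in enumerate(cx[k - 1])}
        rows = []
        for simplex in cx[k]:
            mask = 0
            for face in combinations(simplex, k):
                mask |= 1 << index[face]
            rows.append(mask)
        boundary_rank[k] = rank_gf2(rows)
    return [
        len(cx[k]) - boundary_rank[k] - boundary_rank[k + 1]
        for k in range(max_dim + 1)
    ]


hollow_triangle = [(0, 1), (1, 2), (0, 2)]
filled_triangle = [(0, 1, 2)]
hollow_tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]

print(betti_numbers(hollow_triangle))      # [1, 1]: one component, one loop
print(betti_numbers(filled_triangle))      # [1, 0, 0]: the loop is filled in
print(betti_numbers(hollow_tetrahedron))   # [1, 0, 1]: one component, one cavity
```

The hollow tetrahedron is a crude stand-in for a sphere, which is why it shows exactly one cavity and no loops.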

Step-by-Step Guide: Applying Persistent Homology to a Point Cloud

Constructing Simplicial Complexes from Point Clouds

Now that we've talked about constructing shapes with generalized triangles and characterizing those shapes via generalized holes, we can put the two together.

The first step in persistent homology is to convert data into a simplicial complex. To see this, consider a data set, i.e. a point cloud. One way we can construct a simplicial complex out of it is by drawing n-dimensional balls around each point. Since our data here is two-dimensional, we just draw circles around each point, so at the center of each circle we have a data point. We can then form 1-simplexes, i.e. line segments, by connecting the data points whose corresponding circles overlap.
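The "connect points whose circles overlap" step is short enough to sketch directly. This is a minimal illustration on a made-up point cloud (not the one from the figures), and `overlap_edges` is a hypothetical helper name.

```python
from itertools import combinations
from math import dist


def overlap_edges(points, radius):
    """Pairs of point indices whose circles of the given radius overlap.

    Two circles of equal radius overlap when their centres are at most
    2 * radius apart.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(points[i], points[j]) <= 2 * radius
    ]


# Hypothetical toy point cloud: three nearby points and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]

print(overlap_edges(points, 0.5))    # [(0, 1), (0, 2)]
print(overlap_edges(points, 0.75))   # [(0, 1), (0, 2), (1, 2)]
```

Each returned pair is a 1-simplex. At the larger radius all three nearby points are pairwise connected, which is exactly the situation where a triangle (2-simplex) gets filled in.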

Tracking Changes in Shape: The Circle Growing Process

Now we have two shapes. We have our original point cloud, which is indeed a simplicial complex where each point is a 0-simplex, and the shape we just constructed, made up of both 0-simplexes and 1-simplexes. We can compare these two shapes by looking at their homology, more specifically by counting the number of connected components, which corresponds to the H0 homology group that we talked about earlier.

In our first shape, the bare point cloud, we have 20 separate connected components, while in the second we have 13 connected components. But there's nothing special about this radius, ε1, so let's do this again with bigger circles. Now we start to see 2-simplexes, in other words triangles, appear, and the number of connected components decreases. Still, there's nothing special about ε2, so let's go even bigger, and now we see 3-simplexes appear.

However, there is a special radius value here, which is when every circle overlaps with every other circle. We are just left with one big connected component, and this is a natural limit to the process. As we can see with each of these simplicial complexes, the shape of our data is evolving, and its evolution is captured and quantified by the number of connected components, in other words, by the change in its homology.

Although I've only described four different choices of radius, corresponding to four different shapes, we can do this for every choice of radius between zero and the limit I mentioned earlier. This gives us a way to suss out which topological features of our data are significant, based on how long they persist during this circle-growing process.
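The circle-growing process can be mimicked by counting connected components at a few radii. Below is a rough sketch using a small union-find structure. The data is a made-up toy point cloud and `count_components` is a name I chose for illustration.

```python
from itertools import combinations
from math import dist


def count_components(points, radius):
    """Number of connected components when circles of the given radius are
    drawn around every point and overlapping circles are joined."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    components = len(points)
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= 2 * radius:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j
                components -= 1
    return components


# Hypothetical toy data: two tight clusters that are far apart.
points = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0)]

for radius in (0.0, 0.5, 0.75, 5.0):
    print(radius, count_components(points, radius))
```

On this toy data the count drops from five components to two almost immediately, stays at two for a long stretch of radii, and only reaches one when the circles are large enough to bridge the two clusters. That long stretch is the kind of persistence we are after.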

Persistence Diagrams: A Tool for Visualizing Significant Features

What is a Persistence Diagram? Interpreting Data Shapes

In other words, the holes that persist over a large increase in radius are more significant than the ones that persist over just a short one. So how can we track the persistence of these holes? One good way to do this is by using a persistence diagram. As an example, take the persistence diagram of a hollow sphere, in which each of the blue, orange, and green points corresponds to a topological feature, in other words, a hole.

In blue, we have the H0 homology group, which are the connected components. In orange, we have the H1 homology group, which are closed loops. In green, we have the H2 homology group, in other words, cavities.

The x-axis of this plot indicates the radius at which a hole appeared in the evolution of the data's shape, in other words, in the circle-growing process described earlier, and the y-axis indicates the radius at which that hole disappeared.
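For connected components (H0), the appear and disappear radii can be computed by hand: every point appears at radius zero, and a component disappears at the radius where it first merges into another one. Here is a sketch under those assumptions, again on made-up data and with a hypothetical function name. Note that some libraries report the pairwise distance rather than the circle radius, which would rescale the values by a factor of two.

```python
from itertools import combinations
from math import dist, inf


def h0_persistence(points):
    """(birth, death) radii of connected components under growing circles.

    Every point is born as its own component at radius 0. When two components
    first touch (radius = half the distance between their closest points), one
    of them dies. The last surviving component never dies.
    """
    if not points:
        return []
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process pairs from closest to farthest, as the circles would reach them.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    diagram = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            diagram.append((0.0, length / 2))
    diagram.append((0.0, inf))
    return diagram


# Hypothetical toy data: two tight clusters that are far apart.
points = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0)]
for birth, death in h0_persistence(points):
    print(birth, death)
```

Each printed pair is one point of the persistence diagram: the first number is its x-coordinate and the second its y-coordinate. Loops and cavities need more machinery than this, which is what dedicated libraries take care of.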

Using Persistence Diagrams to Differentiate Noise from Significant Features

Therefore, a point that sits near the black dashed y = x line corresponds to a hole that disappeared soon after it appeared. Conversely, points that sit far away from this line represent holes that disappeared long after they appeared. So the two key things to read off a persistence diagram are that points close to the y = x line are noise, while points relatively far from this line are significant.

In this example, we have two points that are far from this line: a blue one in the top left and a green one. We can ignore the blue one, because it corresponds to the single connected component that is left when every n-dimensional ball overlaps with every other ball.

So the significant topological feature of this data is captured by the green point, which tells us that the data is characterized by one cavity. This makes sense, since the data for this example are organized on the surface of a sphere. Up until this point, I've discussed only toy examples, meant to give you an idea of what's going on with persistent homology.
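In code, "far from the y = x line" boils down to a large value of death minus birth. Here is a small sketch that splits a diagram on that quantity. The diagram values and the threshold are made up for illustration, and where to put the threshold is a judgment call rather than something the diagram hands you.

```python
def split_by_persistence(diagram, threshold):
    """Split (birth, death) pairs into significant features and likely noise.

    Persistence is death - birth, i.e. how far the point sits above y = x.
    """
    significant, noise = [], []
    for birth, death in diagram:
        if death - birth >= threshold:
            significant.append((birth, death))
        else:
            noise.append((birth, death))
    return significant, noise


# Hypothetical diagram values, made up for illustration.
diagram = [(0.10, 0.15), (0.20, 0.22), (0.30, 1.40), (0.05, 0.12)]

significant, noise = split_by_persistence(diagram, threshold=0.5)
print(significant)   # [(0.3, 1.4)]
print(noise)         # the three short-lived points
```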

Real-World Example: Using Persistent Homology in Market Data Analysis

Applying Persistent Homology to Financial Markets Data

Now, we'll switch gears to an example with real-world data. In this example, we'll walk through how one could use persistent homology to analyze market data. I suppose it's worth mentioning that this example is not meant as financial advice. I'm a physicist, not a trader, and I've never taken a finance class in my life.

Exploring Market Shape Changes with Python and Scikit-TDA

Python Code Example: Calculating Homology from Financial Data

However, I hope this example gives you an idea of what an analysis using persistent homology might look like and inspires ideas for analyses using data that you might be working with. Similar to the last article, we start by importing Python libraries. The notable libraries here are yfinance, which gives us an API to grab market data, and the ripser and persim modules, which are part of the same scikit-tda ecosystem from the last article.

Next, we load market data over four years using yfinance. Here we are grabbing four major market indexes, namely the S&P 500, Dow Jones, Nasdaq, and Russell 2000. We have daily prices for these indexes organized in a pandas data frame. So you can imagine four columns, one for each market index, and many rows corresponding to each day that the markets were open over these four years. Then we convert this pandas data frame into a numpy array and compute the log daily returns of each index.
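The log daily return of an index is just the natural log of today's price divided by yesterday's. The sketch below spells that out in plain Python on made-up prices. It is not the numpy code from the original example, and the function names are hypothetical.

```python
from math import log


def log_returns(prices):
    """Log daily returns ln(P_t / P_{t-1}) for one price series."""
    return [log(today / yesterday) for yesterday, today in zip(prices, prices[1:])]


def log_returns_table(rows):
    """Apply log_returns column by column to rows of [index_1, index_2, ...] prices."""
    columns = [log_returns(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Hypothetical closing prices for two indexes over three days.
prices = [
    [100.0, 50.0],
    [110.0, 50.0],
    [99.0, 55.0],
]

returns = log_returns_table(prices)
for row in returns:
    print(row)
```

Note that we lose one row in the process: three days of prices give two days of returns. Each row of the result is one point in the point cloud, with one coordinate per index.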

This choice of data prep follows the procedure used in the paper by Gidea and Katz, which was the inspiration for this example. You can find it on arXiv. So now we get into the TDA stuff. In this analysis, we want to track changes in the shape of the markets by looking at how the homology of the market changes over time. To do this, we start by initializing the object that constructs simplicial complexes from data.

Next, we define a time window size, which will allow us to grab a chunk of data to analyze the homology of. Here we're setting this window size to 20 days. Next, we define the total number of these chunks we will have, and finally we create a numpy array to keep track of a number that quantifies changes in homology.

Next, we go down to the for loop and do some persistent homology. First, we take the first 20 rows of data and create a persistence diagram. That is, we grow four-dimensional balls around each point, where each choice of radius creates a simplicial complex, and we track the holes that appear and disappear using a persistence diagram. We do all of that with just one line of code. Then we do the same thing for another set of 20 rows, specifically the second row all the way down to the 21st row.

How to Use Wasserstein Distance to Measure Homology Change

So now, we have two persistence diagrams corresponding to two overlapping 20-day windows in which the market was open. Next, we can quantify the change in the overall homology between these two persistence diagrams using something called the Wasserstein distance, which is essentially a distance measure between two persistence diagrams.

At the end of this whole process, we get a single number and store it in the numpy array we created earlier. Then we repeat this whole process for all the rows in our data set. After all of that, we have a set of values that quantify the changes in homology between consecutive days that the market was open.
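To get a feel for what this distance measures, here is a brute-force sketch for very small diagrams. It assumes the order-1 version with straight-line (Euclidean) distances between points, where any point may also be matched to the y = x line. Library implementations such as the one in persim may differ in details and are far more efficient. The function names here are my own.

```python
from itertools import permutations
from math import dist, sqrt


def diagonal_cost(point):
    """Euclidean distance from a (birth, death) point to the line y = x."""
    birth, death = point
    return (death - birth) / sqrt(2)


def wasserstein_brute_force(dgm_a, dgm_b):
    """Order-1 Wasserstein distance between two small persistence diagrams.

    Each point is matched either to a point in the other diagram or to the
    diagonal. Every matching is tried, so this only works for a handful of
    finite points.
    """
    n, m = len(dgm_a), len(dgm_b)
    size = n + m  # pad each side with diagonal slots for the other side's points

    def cost(i, j):
        if i < n and j < m:
            return dist(dgm_a[i], dgm_b[j])
        if i < n:
            return diagonal_cost(dgm_a[i])   # point of A sent to the diagonal
        if j < m:
            return diagonal_cost(dgm_b[j])   # point of B sent to the diagonal
        return 0.0                           # diagonal matched to diagonal

    return min(
        sum(cost(i, j) for i, j in enumerate(assignment))
        for assignment in permutations(range(size))
    )


# Hypothetical diagrams, made up for illustration.
print(wasserstein_brute_force([(0.0, 1.0)], [(0.0, 3.0)]))   # 2.0
print(wasserstein_brute_force([(0.0, 2.0)], [(0.0, 2.0)]))   # 0.0
```

Identical diagrams are at distance zero, and the more the significant points move (or appear and disappear), the larger the number gets. Points with an infinite death radius would need to be dropped before using this sketch.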
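The overall loop can be sketched without committing to any particular library by passing in the two key steps as functions. In the setup described above those roles are played by ripser (building the diagram) and persim (the distance). The stand-ins below are deliberately crude placeholders so the sketch runs on its own, and all names are hypothetical.

```python
def homology_change_series(rows, window, make_diagram, diagram_distance):
    """Distance between the diagrams of consecutive, overlapping windows.

    Window i covers rows[i : i + window] and is compared with the window that
    starts one row later.
    """
    n_pairs = len(rows) - window  # number of (window, next window) pairs
    distances = []
    for i in range(n_pairs):
        dgm_now = make_diagram(rows[i : i + window])
        dgm_next = make_diagram(rows[i + 1 : i + 1 + window])
        distances.append(diagram_distance(dgm_now, dgm_next))
    return distances


# Placeholder stand-ins, NOT real persistent homology: the "diagram" is just
# the spread of each column, and the "distance" compares those spreads.
def toy_diagram(chunk):
    return [(0.0, max(column) - min(column)) for column in zip(*chunk)]


def toy_distance(dgm_a, dgm_b):
    return sum(abs(a[1] - b[1]) for a, b in zip(dgm_a, dgm_b))


# Hypothetical one-column data with a single jump.
rows = [[0.0], [0.0], [0.0], [5.0], [0.0], [0.0]]
print(homology_change_series(rows, 3, toy_diagram, toy_distance))   # [5.0, 0.0, 0.0]
```

The output spikes at the step where the jump first enters the window, which is the same kind of signal we look for when plotting the real distances over time.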

Interpretation of Results: Did Persistent Homology Predict the 2020 Market Crash?

Visualizing Homology Changes Over Time: What the Results Tell Us

We can just plot this as a time series. In the resulting plot, the Wasserstein distances are shown as a blue line, and there is a clear peak near the middle of the time series. For some context, we also have scaled S&P 500 close prices plotted in orange just above. A vertical red line indicates when the crash of 2020 occurred, and as it turns out, the peak in the Wasserstein distance time series seems to correspond very closely with when this crash occurred.

Limitations of Persistent Homology in Financial Data Analysis

So did homology changes predict the crash of 2020? Well, I wouldn't go that far, but this is indeed interesting. One idea to investigate this further is to try using these Wasserstein distances to predict future market index prices. If past distance values predict future index prices, then maybe there's something here.

Key Takeaways from the Persistent Homology Example

How Persistent Homology Offers a New Perspective on Data Analysis

As you may be able to see from this example, there is a lot of room for creativity when using persistent homology in practice, and in some sense this is more art than science.

Future Research Directions in Applying Persistent Homology

So that brings us to the end of our three-part series on topological data analysis. TDA is a young field with a lot of untapped potential, so I hope this series helped you get a better idea of what it's all about. If you'd like to learn more, check out the other articles in this series linked in the description below.

Conclusion: Wrapping Up the Three-Part Series on TDA

Why Topological Data Analysis is a Growing Field in Data Science

There's also a corresponding Medium article for this article and the others in this series, which you can find in the description. If you enjoyed this content, please consider liking, subscribing, or sharing this article.

Learn More About TDA: Resources and Further Reading

Like many of you, I am indeed still learning. So if you have thoughts, questions, or concerns, please feel free to share those in the comments section below, and as always, thanks for reading.
