Dimensionality Reduction & Segmentation with Decision Trees | Python Code - icoversai

Discover how to compare machine learning models, select features, and optimize performance using logistic regression and decision trees. Learn practical strategies for handling class imbalance and creating data segments to improve model interpretability and accuracy.


Table of Contents:

  • Introduction to Model Comparison in Machine Learning
    • Understand the Basics of Model Performance Evaluation
  • Feature Selection and Logistic Regression: A Step-by-Step Guide
    • How to Train a Logistic Regression Model with Incremental Features
    • How to Interpret a Logistic Regression Model's Coefficients
  • Comparing Logistic Regression with Random Forest Models
    • When Logistic Regression Outperforms More Complex Models
    • Visualizing Model Performance with AUC Values
  • The Importance of Model Interpretability in the Machine Learning Domain
    • Balancing Accuracy and Interpretability in Model Selection
  • Handling Class Imbalance: The Role of Resampling Techniques
    • Why Synthetic Oversampling Affects Model Output Probabilities
  • Segmentation of Predictor Variables for Improved Model Performance
    • Decision Trees for Predicting Sepsis Survival: A Case Study
  • Creating Age-Based Segments for Better Decision Tree Performance
    • How to Manually Segment Data Based on Domain Expertise
    • Using Decision Trees to Automate Age Segmentation
  • Training a Decision Tree Model for Optimized Age Segmentation
    • Step-by-Step Guide to Preparing Data for Decision Tree Segmentation
    • Understanding the Output of Decision Tree-Based Segmentation
  • Advantages of Decision Trees in Defining Data Segments
    • How Decision Trees Optimize Age Buckets for Sepsis Prediction
  • Conclusion and Further Learning Resources
    • Implementing These Techniques in Your Own Machine Learning Projects


Introduction to Model Comparison in Machine Learning

Understand the Basics of Model Performance Evaluation

Hey everyone, welcome back. I'm Icoversai, and in this article I'm going to continue the series on decision trees and talk about a couple of applications. In the last two articles of the series, we talked about how to train predictive models using decision trees: the first article covered models built from a single decision tree, and the second expanded that idea to tree ensembles. If you haven't already, be sure to check those out, because we're going to build upon those ideas here.

The whole point of today's discussion is that we can use machine learning models, specifically decision trees, for more than just making predictions. This is what I'll call "next-level" uses of decision trees, not because they're anything profound or groundbreaking, but because they go beyond the obvious task of using a machine learning model to make a prediction. For those just getting started in data science, it may be easy to think that all there is to it is getting some data, training a model, and making predictions.

Somehow, through that process, there's supposed to be immediate real-world impact and value, but the reality is that it's not so straightforward. What I really like about data science is the critical thinking and creativity required to use these tools and techniques to solve real-world problems and provide value. That's all I mean by "next level".

I'll talk about two ways we can use decision trees for more than just making predictions: the first is reducing the predictor count, and the second is predictor segmentation.

Feature Selection and Logistic Regression: A Step-by-Step Guide

How to Train a Logistic Regression Model with Incremental Features

Starting with the first one, reducing the predictor count: this goes back to the previous article in the series, where we talked about tree ensembles, in which we stitch together a bunch of decision trees to make our machine learning model more robust. Decision tree ensembles give us many great things, as I discussed in that article, but all of them come at a cost, which is that tree ensembles are a bit of a black box. We know what we put into the tree ensemble and we can see what comes out of it.

However, we don't have to use our tree ensemble to make the final predictions. Instead, we can take the feature importance ranking from the tree ensemble and use it to inform a simpler set of predictor variables.

Not only does this help us interpret what the model is doing, but in many cases it can actually lead to an improvement in predictive performance. Walking through what this might look like: we take our tree ensemble and have it spit out a feature importance ranking. Then we take the top predictor and train a machine learning model on it alone.

That model could be a decision tree, a logistic regression model, a linear model, or a neural network; it really doesn't matter what kind of model we use. We develop a model from one predictor and assess its performance. Then we do the exact same thing for the top two predictors and grab the performance metrics, then three predictors, four predictors, and so on.

Once we go through this process, we can plot a chart that looks something like this: on the x-axis we have the number of variables included in the model, and on the y-axis we have some performance measure (here I used AUC just as an example). Each point corresponds to a different model, and we can see the gain in predictive performance as variables are added.

For example, we see pretty big gains until we hit about three variables. What this tells us is that maybe we don't need all six variables; we can get away with using three of them without a major loss in predictive performance. The upside, as I mentioned earlier, is that a model with three input variables is a little easier to interpret than a model with 6 or 60 input variables.

Now I'm going to walk through some example code for doing this. I'm going to use the same data set that I used in the previous article in this series, so there's a bit of overlap between the code here and the code from that article. I won't spend too much time on recurring details, but if you want to learn more, check out that article. The code is also available at the GitHub link here and in the description below.

The first step, as always, is importing modules. This is a lot of the same stuff as the example code from the previous article, with the only addition being the import that brings in the logistic regression model from sklearn. Then, as always, we load in our data.
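Here's a rough sketch of what that setup might look like. I'm assuming scikit-learn's built-in breast cancer data set here (which matches the benign/malignant labels described below); your data source and exact imports may differ:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression  # the one new import
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

# Load the data as a DataFrame of predictors plus a target Series
data = load_breast_cancer(as_frame=True)
X = data.data    # 30 predictor columns
y = data.target  # 0 = malignant, 1 = benign (sklearn's default encoding)
```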

Next, we do some data prep, which is very similar to what we saw in the previous article, with one new step. Since y is a Boolean variable, meaning it can only take values of zero or one, all I'm doing is switching the meaning of zero and one. Originally, zero meant the tumor was malignant and one meant it was benign; after the transformation, one means the tumor is malignant and zero means it is benign.
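A minimal sketch of that flip, assuming sklearn's encoding of 0 = malignant and 1 = benign:

```python
# Flip the labels so that 1 = malignant (the event we care about) and 0 = benign
y = 1 - y
```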

How to Interpret a Logistic Regression Model's Coefficients

When it comes to interpreting what the coefficients of our logistic regression model mean, it's just a bit more intuitive to talk about things in terms of risk of breast cancer rather than the opposite, which would be something like safety from breast cancer. The next part should also be a review: we're using SMOTE to balance our imbalanced data set, since we have far more benign cases than malignant ones. All SMOTE does is synthetically oversample the minority class. Then we use the train_test_split function to create our training and testing data sets.
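A sketch of the resampling and split; values like test_size and random_state are illustrative rather than taken from the original code:

```python
# Synthetically oversample the minority (malignant) class so the classes are balanced
# (recent versions of imbalanced-learn keep the DataFrame format, so column names survive)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# Split the resampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=0
)
```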

Comparing Logistic Regression with Random Forest Models

When Logistic Regression Outperforms More Complex Models

Now we can train our random forest, one of the tree ensemble methods we saw in the previous article; we can fit that model with just a couple of lines of code. Next, we have something new. Everything up to this point we basically did in the previous article, but now we're getting into some novelty: all we're doing here is pulling the feature importances from our random forest model and sorting them in descending order.
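Something like this, with hyperparameters left at their defaults (which may not match the original code exactly):

```python
# Fit a random forest classifier on the training data
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# Pull the feature importances and sort them in descending order
feature_importances = pd.Series(rf.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values(ascending=False)
print(feature_importances.head())
```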

What this looks like is a list of our feature names along with numbers that quantify their relative importance. Now we can run exactly the process I described before: we train a model using the top predictor and assess its performance, then train another model using the top two predictors and assess its performance, then the top three, and so on.

In code, this could look something like the following, where we initialize lists to store our classifiers and our different performance measures. (You can ignore the i = 0 in the original code; it's just leftover from an earlier version.) Then, for each position in the feature importance series, we go through one by one and run the following block of code.
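A condensed sketch of that loop (I've dropped the leftover i = 0 and bumped up max_iter just to avoid convergence warnings, so it's not a line-for-line copy of the original):

```python
clf_list, auc_train_list, auc_test_list = [], [], []

for i in range(len(feature_importances)):
    # Take the top i+1 most important feature names
    top_features = feature_importances.index[: i + 1]

    # Train a logistic regression model on just those features
    clf = LogisticRegression(max_iter=5000)
    clf.fit(X_train[top_features], y_train)
    clf_list.append(clf)

    # Append the AUC for the training and testing sets
    auc_train_list.append(
        roc_auc_score(y_train, clf.predict_proba(X_train[top_features])[:, 1])
    )
    auc_test_list.append(
        roc_auc_score(y_test, clf.predict_proba(X_test[top_features])[:, 1])
    )
```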

Visualizing Model Performance with AUC Values

What we're doing here is listing the feature names up to index i plus one, training our logistic regression model on those features, and then appending things to the lists from before: we append the classifier to the classifier list, the AUC value for the training data set, and the AUC value for the testing data set. If this seems confusing or complicated, don't worry. What matters is the final result, which is just like what we saw before: the number of variables plotted on the x-axis and the performance of the models on the y-axis.

The red dashed line is the AUC value for the random forest model we trained originally, a tree ensemble model that uses all 30 predictor variables. What's really remarkable is that once we hit five variables, the logistic regression model actually outperforms it.
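A sketch of that plot, drawing the random forest's test AUC as a horizontal reference line:

```python
# Test AUC of the full 30-feature random forest, for reference
rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])

n_vars = range(1, len(auc_test_list) + 1)
plt.plot(n_vars, auc_train_list, marker="o", label="Train AUC")
plt.plot(n_vars, auc_test_list, marker="o", label="Test AUC")
plt.axhline(rf_auc, color="red", linestyle="--", label="Random forest (all 30 features)")
plt.xlabel("Number of variables in model")
plt.ylabel("AUC")
plt.legend()
plt.show()
```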

The Importance of Model Interpretability in the Machine Learning Domain

Balancing Accuracy and Interpretability in Model Selection

That more sophisticated model uses six times as many variables, and you can see that after five variables the logistic regression models just keep getting better and better. Let's say that, for our purposes, we really value being able to interpret what the model is doing as well as the model's accuracy.

Once we beat our random forest model, we're satisfied, so that's the model we're going to use. And since logistic regression is a linear model, we can easily interpret the relationship between our predictor variables and the target variable by looking at the model coefficients. The bars here show the coefficient values; looking at worst perimeter, whose coefficient is about 0.3, the way to interpret it is as follows.

A unit increase in worst perimeter translates to a 0.3 increase in the log odds that the tumor is malignant. I know that was a mouthful, so to make it a bit more qualitative: as worst perimeter increases, the probability that the tumor is malignant also increases.
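If the log-odds phrasing feels abstract, one standard trick (not specific to the original code) is to exponentiate the coefficients to get odds ratios:

```python
# Take the chosen model, e.g. the 5-feature logistic regression from the loop above
clf = clf_list[4]
top5 = feature_importances.index[:5]

# exp(coefficient) = multiplicative change in the odds of malignancy per unit increase
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=top5)
print(odds_ratios)  # e.g. a coefficient of 0.3 corresponds to an odds ratio of exp(0.3) ≈ 1.35
```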

Handling Class Imbalance: The Role of Resampling Techniques

Why Synthetic Oversampling Affects Model Output Probabilities

Now we have a concrete quantification of the relationship between our predictor variables and the probability that the tumor is malignant. There's a small technical detail here that I don't want to spend too much time on, but I talk about it more in the blog, and it has to do with the resampling.

Since we used SMOTE to synthetically oversample the minority class, we can't immediately translate our logistic regression model's outputs into probabilities. That's because the y-intercept of our logistic regression model is biased due to the oversampling. There's a simple fix, though: we can adjust the y-intercept so it's no longer biased, and then everything works as expected. If you want to learn more about that, check out the blog.
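For the curious, one common version of that fix (often called prior correction; I'm not claiming it's exactly the adjustment used in the blog) shifts the intercept by the log of the ratio between the resampled class odds and the true class odds:

```python
# Fraction of positives in the (oversampled) training data vs. the original data
p_sample = y_train.mean()  # roughly 0.5 after SMOTE
p_true = y.mean()          # true malignant rate before resampling

# Shift the intercept so predicted probabilities reflect the true base rate
clf.intercept_ = clf.intercept_ - np.log(
    (p_sample / (1 - p_sample)) / (p_true / (1 - p_true))
)
calibrated_probs = clf.predict_proba(X_test[top5])[:, 1]
```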

Segmentation of Predictor Variables for Improved Model Performance

Decision Trees for Predicting Sepsis Survival: A Case Study

Next, we have predictor variable segmentation, which actually goes back to the first article in this series, where we used a decision tree model for sepsis survival prediction. The final decision tree we ended up with looked like this, and what's interesting is that even though we had three predictor variables, the vast majority of the splits only use age.

The initial split is on age less than or equal to 58.5 years, followed by splits at 44.5, 78.5, 56.5, 67, and 86. This is really interesting: what it indicates is that, when it comes to sepsis survival, age is the most important risk factor we have in our data set. The other predictors were the sex of the patient and the number of previous sepsis episodes. Sometimes, in cases like this where one predictor variable has an outsized impact on the target, it can make sense to do segmentation on that predictor variable.

Creating Age-Based Segments for Better Decision Tree Performance

How to Manually Segment Data Based on Domain Expertise

What that means is that we take the continuous variable, age, and partition it into discrete sections. Looking at this visually, let's say the ages in our data set range from zero to one hundred. All segmentation does is split these ages into some number of subcategories; say we want to split it into five subcategories, then the result looks like this.

What you can do now is, instead of training one decision tree on all of your data, train separate decision trees for each age group. This can translate into better model performance, especially if there are systematic differences between the age groups that call for separate model development. So the question is: how do we come up with these segments? We can definitely do it manually: we just look at the data and say, okay, let's do this age group and that age group, or use some kind of subject matter expertise, as in the sketch below.
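For the manual route, something like pandas' cut function does the bucketing. The bin edges here are just placeholder cut points, and I'm assuming the sepsis records sit in a DataFrame df with an age_years column (the actual loading is shown below):

```python
import pandas as pd

# df: DataFrame of sepsis records with an age_years column (loaded later in the walkthrough)
manual_bins = [0, 20, 40, 60, 80, 100]  # hand-picked, evenly spaced buckets
df["age_group"] = pd.cut(df["age_years"], bins=manual_bins, include_lowest=True)

# You could then develop a separate model for each age group
for group, df_group in df.groupby("age_group", observed=True):
    print(group, len(df_group))
```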

Using Decision Trees to Automate Age Segmentation

But another way to do it is with a decision tree. The picture here shows how we can come up with these segments using a decision tree: notice that age is being split into different sections based on the sepsis outcome of dead or alive.

We might not want to use this decision tree directly, though, because it has other variables involved in the splits. So what we can do is train another decision tree model, but instead of using the three predictors of age, sex, and number of sepsis episodes, we use only the one variable we care about, which is age.

Training a Decision Tree Model for Optimized Age Segmentation

Step-by-Step Guide to Preparing Data for Decision Tree Segmentation

Now I'm going to walk through what that looks like. I'm going to use the same data set from the first article of this series, and as always, we start by importing our modules; these shouldn't be anything new. Next, we load our data just like we did in the first article and then do some data prep. Here, all we're doing is keeping the age variable and the sepsis survival flag.
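A sketch of that setup; the file path and column names are my guesses based on the public sepsis survival data set, so adjust them to match your copy:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Load the sepsis records (path is illustrative)
df = pd.read_csv("sepsis_survival_primary_cohort.csv")

# Keep only the age variable and the survival flag
df = df[["age_years", "hospital_outcome_1alive_0dead"]]
```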

Now we do a bit more data prep: we group the data by age. You can imagine that we have all these different patients, and there can be multiple patients of the same age. All we're doing is reshaping the data so it has only unique age values, and for each unique age value we compute the percentage of patients that are alive.

We name this column percent alive, and on the flip side, we take one minus the percent alive to create a new column called percent not alive. The result is a data frame that looks like this: we now have only unique age values, starting from zero and going all the way up to one hundred, and for each age value we have the percentage of patients that are alive and the percentage that are dead.
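A sketch of that aggregation, using the same assumed column names as above:

```python
# For each unique age, compute the fraction of patients who survived
df_grouped = (
    df.groupby("age_years")["hospital_outcome_1alive_0dead"]
    .mean()
    .reset_index()
    .rename(columns={"hospital_outcome_1alive_0dead": "percent_alive"})
)

# The complementary fraction: patients who did not survive
df_grouped["percent_not_alive"] = 1 - df_grouped["percent_alive"]
```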

Understanding the Output of Decision Tree-Based Segmentation

Next, we grab the variable names and create separate data frames for our input and target variables. Here, the predictor variable is age and the target variable is percent not alive. As a first pass at the relationship, we can just plot them against each other: on the x-axis we have age, and on the y-axis we have percent not alive.
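Sketching the input/target split and the first-pass plot:

```python
# Predictor and target as separate objects
X_seg = df_grouped[["age_years"]]       # 2-D, as scikit-learn expects
y_seg = df_grouped["percent_not_alive"]

# First pass: just plot them against each other
plt.scatter(df_grouped["age_years"], y_seg)
plt.xlabel("Age (years)")
plt.ylabel("Percent not alive")
plt.show()
```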

As the percent not alive goes up, that's an indication that the risk of not surviving sepsis increases. You can see that around midlife there's a clear uptrend in the percentage of patients not surviving their sepsis episode, but before that, the risk is relatively low and stable. Just looking at this plot, we could probably chop the data up into any number of segments based on this risk.

Maybe we would do zero to 40, 40 to 60, 60 to 80, and 80 to 100, or whatever. But that's just us eyeballing it, and it will be interesting to compare this intuition to what the decision tree spits out. We can now train our decision tree model. Here, we define the number of bins by controlling the maximum number of leaf nodes in our decision tree regressor.

The reason this works is that, as we saw in the first article, a fully grown decision tree on this data is just massive, so you can pick virtually any number of bins you like and it will work. Finally, we fit the decision tree to our data, and with the decision tree in hand, we can go in and grab all the split values in an automated way.
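Here's a sketch of the fit plus a simplified way to pull the split points out of the fitted tree (the version in the repo is more involved than this):

```python
# max_leaf_nodes controls how many age buckets we end up with
n_bins = 5
tree = DecisionTreeRegressor(max_leaf_nodes=n_bins)
tree.fit(X_seg, y_seg)

# Internal (non-leaf) nodes hold the age thresholds the tree split on;
# leaf nodes are marked with a feature index of -2
is_split_node = tree.tree_.feature >= 0
age_cutoffs = sorted(tree.tree_.threshold[is_split_node])
print(age_cutoffs)  # the learned bucket boundaries (n_bins - 1 of them)
```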

Advantages of Decision Trees in Defining Data Segments

How Decision Trees Optimize Age Buckets for Sepsis Prediction

That code is a bit involved, so I won't spend too much time on it, but for those who are curious, you can take a look at it here; it's also available in the GitHub repository linked here. The final result looks something like this: the same plot from before, with age in years plotted against percent not alive. It's qualitatively pretty similar to what we were eyeballing earlier; maybe we would have put one boundary here and then adjusted the rest.

This is actually a tricky problem, because if I shift this border from here to here to make the first bin look a little better, now the next bin may not look as good: you end up mixing lower-risk patients with higher-risk patients, which from a treatment standpoint may not make a whole lot of sense. That's one of the upsides of using a decision tree and leveraging its greedy search to define these bins: it already does that tricky optimization for us.

As a final note, take all of this with a grain of salt. Just because the decision tree spits out these optimal age buckets doesn't mean they will translate well into treatment strategies. Rather than taking the output as gospel, treat it as a starting place; it may still serve better than arbitrarily drawing lines between age groups.

Conclusion and Further Learning Resources

Implementing These Techniques in Your Own Machine Learning Projects

If you want to learn more, be sure to check out the blog published on Medium and linked in the description below. Feel free to steal the code from the GitHub repository and apply it to use cases or projects that you're working on.

If you enjoyed this content, please consider liking, subscribing, and sharing your thoughts in the comments section below. I do read all the comments, and I find the questions and feedback I receive very valuable. As always, thank you for your time, and thanks for reading.
