10 Decision Trees are Better Than 1 | Random Forest & AdaBoost - icoversai

Explore the world of decision tree ensembles with our comprehensive guide. Learn about bagging, boosting, and practical implementations using Python. Discover how ensemble methods enhance model performance, reduce overfitting, and provide robust feature importance rankings. Dive into practical examples and understand the benefits of using these advanced techniques in machine learning.



Table of Contents:

  • Introduction to Decision Tree Ensembles
    • What Are Decision Tree Ensembles?
    • Why Use Decision Tree Ensembles?
  • Understanding Decision Trees
    • Basics of Decision Trees
    • How Decision Trees Make Predictions
  • Types of Decision Tree Ensembles
    • Introduction to Bagging and Boosting
  • Deep Dive into Bagging
    • How Bagging Works
    • Example of Bagging: Random Forest
  • Exploring Boosting Techniques
    • Concept of Boosting Explained
    • How Boosting Improves Model Performance
    • Key Boosting Algorithms: AdaBoost, Gradient Boosting, and XGBoost
  • Benefits of Using Decision Tree Ensembles
    • Robustness to Overfitting
    • Enhanced Feature Importance Rankings
    • Quantifying Prediction Confidence
  • Practical Example: Breast Cancer Prediction
    • Setting Up the Example with sklearn
    • Code Walkthrough: Implementing Decision Tree Ensembles
    • Comparing the Performance of Different Models
  • Analyzing Model Performance
    • Performance Metrics for Decision Tree Ensembles
    • Understanding Overfitting and Model Robustness
  • Further Exploration and Next Steps
    • Feature Importance Rankings and Their Significance
    • Exploring Uncertainty Estimates in Model Predictions
  • Conclusion and Further Reading
    • Summary of Key Takeaways
    • Recommended Resources and Further Reading


Introduction to Decision Tree Ensembles

What Are Decision Tree Ensembles?

Hey everyone, I'm Icoversai. In this article, I'm going to continue the series on decision trees and talk about decision tree ensembles. Instead of relying on just one decision tree, a tree ensemble combines a collection of decision trees into a single model. So with that, let's get into the article.

If you recall from the previous article, we saw that decision trees are a way to make predictions through a series of yes-or-no questions. They look something like this: you start at the top node and follow the arrows based on your predictor variable values, which eventually lead you to your final prediction.

Why Use Decision Tree Ensembles?

A decision tree ensemble, on the other hand, looks something more like this. Instead of a single decision tree, we have multiple decision trees, each giving a prediction, and we combine these predictions to produce our final estimate. The key benefit of a decision tree ensemble is that it generally performs better than any single decision tree alone. We'll touch on why this is the case a little later in the article.

Understanding Decision Trees

Basics of Decision Trees

First, I'm going to talk about two different types of decision tree ensembles. The first is bagging, which is short for bootstrap aggregation, and the second is called boosting, which isn't short for anything. Let's start with bagging.

How Decision Trees Make Predictions

Here the idea is to train a set of decision trees, one at a time, by randomly sampling the training data with replacement. So what the heck does that mean? Let's walk through it one step at a time. Say we start with our training data set T0, where each block represents a different record (or example) in the data set: record one, then records two, three, four, and five.

What we do is create another training data set by randomly sampling T0 with replacement. That might look like this: we randomly pick five records from the original training data set, and notice that record three actually shows up twice. This just follows from sampling with replacement, which simply means that every time we pick a record from T0 for T1, we put it back before making the next pick. So now we have a new training data set, T1. We can do the same thing again to get T2; let's say this time the random sample happened to draw record two several times, and so on and so forth.

Then let's say the nth training data set looks something like this. Notice that instead of just a single training data set, we now have a collection of training data sets, which allows us to train a collection of decision trees. Just like we saw in that first picture, we can then combine the predictions from each of these decision trees to produce our final prediction.
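To make the bagging idea concrete, here is a minimal sketch of bootstrap sampling and aggregation with sklearn; the arrays `X`, `y`, and `X_new` and the number of trees are assumptions for illustration, not part of the original example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_trees = 10  # assumed number of trees, for illustration
trees = []

for _ in range(n_trees):
    # Bootstrap sample: draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate: average the trees' 0/1 predictions and round (majority vote)
votes = np.array([tree.predict(X_new) for tree in trees])
final_prediction = np.round(votes.mean(axis=0))
```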

Types of Decision Tree Ensembles

Introduction to Bagging and Boosting

Random forest is one of the most popular machine learning algorithms that uses bagging. I'm not going to get into all the details in this article, but for those interested, be sure to check out the blog published in Towards Data Science where I give a few more details about random forest. Additionally, there's a really nice paper by Breiman, the creator of the random forest algorithm, who is very well known for his work on decision trees and ensembles. I definitely recommend reading that paper if you're into that kind of stuff.

The second type of decision tree ensemble we're going to talk about uses something called boosting, which is completely different from bagging.

Deep Dive into Bagging

How Bagging Works

Here we sequentially train decision trees using an error-based re-weighting scheme. No worries if that doesn't make any sense yet; we're going to walk through what I mean one step at a time.

So again, imagine we start with the training data set T0, but now we introduce the concept of a weight, which is exactly what it sounds like: we can give different records in our data set more weight, i.e., more importance.

Example of Bagging: Random Forest

When it comes to developing our model, we start with T0 with all the weights equal, so every record is equally important, and then we use this training data set to train a decision tree.

We'll call it h0. Now we can create another training data set based on the performance of this decision tree. That might look something like this: the different colors show that records one and four were correctly classified in the binary classification problem we're trying to solve, while records two, three, and five were incorrectly classified. So what we do now is decrease the weights of records one and four and increase the weights of records two, three, and five. With that, we have a new training data set, T1, on which we can train a new decision tree, which we'll call h1. Then we repeat the process.

We evaluate the predictions of h1, see which records were correctly and incorrectly predicted, and update their weights accordingly.

Then we create another decision tree, and so on and so forth, for as long as we want. Once again, we end up with a collection of decision trees, and we can aggregate their predictions into a single estimate.
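Here is a minimal sketch of that re-weighting loop, assuming numpy arrays `X` and `y` and a simple multiplicative up-/down-weighting of records; the exact update AdaBoost uses is shown in the next section.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_rounds = 5                         # assumed number of boosting rounds
weights = np.ones(len(X)) / len(X)   # start with all weights equal
trees = []

for _ in range(n_rounds):
    # Train a shallow tree using the current record weights
    tree = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
    trees.append(tree)

    # Up-weight the records this tree got wrong, down-weight the rest
    wrong = tree.predict(X) != y
    weights[wrong] *= 2.0            # illustrative factors, not AdaBoost's exact update
    weights[~wrong] *= 0.5
    weights /= weights.sum()         # renormalize
```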

Exploring Boosting Techniques

Concept of Boosting Explained

The first technique that really introduced this idea of boosting is called AdaBoost, or adaptive boosting. When people talk about boosting, they're typically talking about a process similar to what we see in AdaBoost, which is essentially what I walked through above, just with a few more details. Basically, all AdaBoost does is combine these decision trees into a linear model, weighting each decision tree's prediction by an alpha value, where alpha is proportional to that decision tree's performance.

How Boosting Improves Model Performance

The specific re-weighting scheme used by AdaBoost works like this: in the standard formulation, incorrectly classified records have their weight multiplied by exp(alpha), while correctly classified records have their weight multiplied by exp(-alpha), after which the weights are renormalized. Since AdaBoost was introduced in the mid-90s, there have been two major innovations around this idea of boosting.
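As a concrete sketch of that standard AdaBoost update, here is the computation for a single boosting round; the weight vector and the set of misclassified records are made-up values for illustration.

```python
import numpy as np

# Assumed state from one boosting round: per-record weights and a mask of
# which records the current tree misclassified (illustrative values only)
weights = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
wrong = np.array([False, True, True, False, False])

# The tree's weighted error and its vote weight alpha
err = weights[wrong].sum() / weights.sum()
alpha = 0.5 * np.log((1.0 - err) / err)

# Misclassified records are up-weighted, correctly classified ones down-weighted
weights[wrong] *= np.exp(alpha)
weights[~wrong] *= np.exp(-alpha)
weights /= weights.sum()
print(alpha, weights)
```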

Key Boosting Algorithms: AdaBoost, Gradient Boosting, and XGBoost

The first is called gradient boosting. Instead of committing to the specific re-weighting scheme and details of AdaBoost, gradient boosting provides a more generalized framework where you can take any differentiable loss function, define its gradient, and derive a boosting strategy from it.

The second major innovation comes from a library called XGBoost, which makes the gradient boosting idea much more scalable and computationally efficient through a set of different heuristics. That's all I'm going to say about those here.

I talk a little bit more about gradient boosting and XGBoost in the blog associated with this article, and I also have the original references for those ideas there.
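For reference, here is a hedged sketch of fitting a gradient boosting model. It uses sklearn's GradientBoostingClassifier, with the separate XGBoost library shown commented out as a near drop-in alternative; the hyperparameter values and the `X_train`/`y_train`/`X_test`/`y_test` names (from the split later in this article) are assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier
# from xgboost import XGBClassifier  # separate library: pip install xgboost

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))  # mean accuracy on the held-out test set

# xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
# xgb.fit(X_train, y_train)
```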

Benefits of Using Decision Tree Ensembles

Robustness to Overfitting

Now, coming back to the question of why decision tree ensembles are better than single decision trees: if I were to summarize everything in a single picture, it would be this. We're essentially moving away from point estimates toward population estimates. What I mean is that instead of having a single number as our prediction from one decision tree, we now have a population of predictions from our decision tree ensemble, which has three main benefits.

Let's go through them now. The first key benefit is that decision tree ensembles are much more robust to the overfitting problem than single decision trees. As we saw in the previous article of this series, overfitting is when your machine learning model over-optimizes to a single training data set in such a way that it doesn't work as well when applied to new data. This turns out to be a pretty big problem for single decision trees, but in many cases it tends to go away when you start aggregating groups of decision trees together.

Enhanced Feature Importance Rankings

The second key benefit of decision tree ensembles is more robust feature importance rankings. Feature importance rankings are a critical output of any decision tree-based method, and they can be based on things like information gain, out-of-bag error (if we're talking about random forest), or any number of other ways we might want to define importance.

Some of the quantities we can use to define importance are only possible through tree ensemble-based approaches; one example is the out-of-bag error defined in the random forest algorithm. I won't get into all the details here (I talk a little about it in the blog), but the point is that tree ensemble approaches not only open up more ways of defining importance, they also bring us back to the idea of population estimates: we're not relying on feature importances from one view of the data, essentially from one decision tree, but through a wide collection of decision trees our importance rankings become more robust.
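As a quick sketch of what pulling importances out of an ensemble can look like in practice, here is the impurity-based ranking from a fitted random forest; the `X_train`/`y_train` names are carried over from the example later in this article, and the hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

# Impurity-based importances, averaged over all trees in the forest
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```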

Quantifying Prediction Confidence

Finally, the last key benefit of decision tree ensembles is that population estimates give us a pretty straightforward way to quantify our confidence, or uncertainty, in our model's predictions.

Anytime you want to use your model in the real world, where there are physical consequences to its predictions, it's good to have some measure of confidence or uncertainty so you know your exposure. With just a point estimate, you have no real way of knowing the confidence of your prediction; it could be zero uncertainty or it could be infinite uncertainty. So that's another case where population estimates are very beneficial.
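One hedged way to get at this with a random forest is to look at how much the individual trees disagree; the `rf` and `X_test` names below are assumptions carried over from the example that follows.

```python
import numpy as np

X_arr = np.asarray(X_test)

# Probability of the positive class from each individual tree in the forest
per_tree_probs = np.array([tree.predict_proba(X_arr)[:, 1]
                           for tree in rf.estimators_])

mean_prob = per_tree_probs.mean(axis=0)  # the ensemble's point estimate
spread = per_tree_probs.std(axis=0)      # tree-to-tree disagreement ~ uncertainty

print(mean_prob[:5], spread[:5])
```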

Practical Example: Breast Cancer Prediction

Setting Up the Example with sklearn

Now we're going to jump into some example code: breast cancer prediction using decision tree ensembles. As always, we're going to use the sklearn Python library, one of the most popular machine learning libraries there is. The data we're going to use for this example comes from the UCI Machine Learning Repository.

Code Walkthrough: Implementing Decision Tree Ensembles

The first step is to import our Python libraries, so let's run through them quickly. We have pandas to help wrangle our data, numpy to do some math, and matplotlib to make some nice visualizations. Then, from sklearn's datasets module, we grab the data set itself: while it's originally from the UCI Machine Learning Repository, sklearn has it readily available for us. This is my short apology for using a toy data set rather than wrangling a data set from the real world, but the point here is to focus on the tree ensembles, not the data preparation step.

Next, I imported SMOTE. This is optional and it's commented out in the results we're going to see here, but if you're interested, head over to the GitHub repo, uncomment that code block, and you'll be able to see what it does to the results. If you recall, we used SMOTE in the previous example to balance an imbalanced data set.

Comparing the Performance of Different Models

Finally, we import a whole bunch of other things from sklearn: the handy function that creates training and testing data sets, the decision tree classifier, and all the different tree ensemble approaches we've talked about, namely random forest, AdaBoost, and gradient boosting.

Analyzing Model Performance

Performance Metrics for Decision Tree Ensembles

Then we import three different evaluation metrics for our models. Basically, what we're going to do in this example is train four different models using these four approaches and compare their performance.
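Here is a hedged sketch of what that import block might look like; the commented-out SMOTE line assumes the imbalanced-learn package.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
# from imblearn.over_sampling import SMOTE  # optional, left commented out here

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import precision_score, recall_score, f1_score
```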

Sklearn makes it super easy to import this toy data set: with just one line of code, we have it in a pandas data frame. It's always good practice to plot histograms of your data; here are all the predictor variables we have at our disposal, and here is our target variable. This goes back to the imbalanced data set idea: there are a lot more cases where the breast tumor is benign than malignant. While we could apply SMOTE here to synthetically over-sample the minority class, we're not going to do anything about it and will just see how the four different models hold up.
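A minimal sketch of that loading-and-plotting step, using sklearn's `as_frame=True` option to get a pandas data frame directly (it relies on the imports sketched above):

```python
# Load the breast cancer data set as a pandas data frame
df = load_breast_cancer(as_frame=True).frame

# Histograms of all predictor variables plus the target
df.hist(figsize=(16, 12))
plt.tight_layout()
plt.show()

# The target is imbalanced: more benign (1) than malignant (0) cases
print(df['target'].value_counts())
```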

Understanding Overfitting and Model Robustness

Next, we define our predictor and target variables. One line grabs everything but the last variable name in our data frame, another grabs the very last variable name, and then we create two data frames based on those variable names. With that, we can easily create our training and testing data sets; here we use an 80/20 split, and sklearn makes this super easy (see the sketch below). A bit of a warning about the next block of code: because I inherently refuse to copy and paste code over and over again, I use what I've heard referred to as automatic code generation. Instead of explicitly writing a Python command out and then copy-pasting it and changing one thing over and over, the commands are generated and executed programmatically (more on this below).
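Here is a minimal sketch of the predictor/target definition and 80/20 split described above; it assumes the target is the last column of `df`, as in the toy data set, and the fixed random_state is only for reproducibility.

```python
# Everything but the last column is a predictor; the last column is the target
X = df[df.columns[:-1]]
y = df[df.columns[-1]]

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```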

Further Exploration and Next Steps

Feature Importance Rankings and Their Significance

The natural next steps are to look at the feature importance rankings for all four models and to compute some kind of uncertainty estimate for each of their predictions.

Coming back to the automatic code generation: the idea is that you define your Python command as a string and then use Python's handy exec function to execute it. While this might conceal what's going on a little, I've found it a much cleaner and more convenient way to write this kind of repetitive code. I'm sure some programmers out there will yell at me for doing this, but I haven't run into any major issues writing code this way, so I'd be curious to hear other people's thoughts.

To keep things transparent, I have everything printed out. Essentially, what's being dynamically written here is a single line of code per model: four different models are created using the four classes we imported from sklearn, where the decision tree classifier is our lone decision tree.

Exploring Uncertainty Estimates in Model Predictions 

Then we have the random forest classifier, AdaBoost, and gradient boosting. All of these models are initialized and each one is stored in a list, so now we have a list of models, as you can see here. The automatic code generation gets even worse after this because there's a lot of combinatorics happening.
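Here is a hedged sketch of that model-initialization step, written out plainly rather than generated with exec; the display names and default hyperparameters are assumptions.

```python
# Initialize the lone decision tree and the three ensemble models
models = [
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
]
model_names = ['Decision Tree', 'Random Forest', 'AdaBoost', 'Gradient Boosting']

# Fit each model on the training data
for model in models:
    model.fit(X_train, y_train)
```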

We have four different models, two different data sets, and three different performance metrics, and we want to cover each combination. This code block may not make a whole lot of sense on its own, but I printed everything that's being dynamically written. Looking at the first three lines: all that's happening is we go one model at a time through our list, apply it to the training data set, and get a prediction.

Then we compute the precision, recall, and F1 score for this model applied to the training data set; that's what those three lines of code do. We then do the exact same thing with the same model, but for the testing data set: get a prediction, compute the precision, recall, and F1 score, and append everything to the same list, and so on for each model.

The results get stored in a performance dict, just a dictionary we initialized earlier, where the keys are the different model names and the values are all the performance metrics relevant to that model. After all that, the dictionary is built up for all four models, all three evaluation metrics, and both data sets, and we can convert it to a pandas data frame. If this is all confusing, don't worry about it; what matters is the final output.
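Here is a hedged, plainly written sketch of that evaluation loop and the final data frame; it mirrors the idea rather than the exact generated code, and reuses the `models` and `model_names` lists assumed above.

```python
performance = {}
for name, model in zip(model_names, models):
    metrics = []
    for X_split, y_split in [(X_train, y_train), (X_test, y_test)]:
        y_pred = model.predict(X_split)
        metrics += [precision_score(y_split, y_pred),
                    recall_score(y_split, y_pred),
                    f1_score(y_split, y_pred)]
    performance[name] = metrics

df_perf = pd.DataFrame(performance,
                       index=['precision_train', 'recall_train', 'f1_train',
                              'precision_test', 'recall_test', 'f1_test'])
print(df_perf)
```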

Here we can simply look at all four of our models and all the performance metrics side by side. We can see that all four models performed perfectly on the training data set: the precision, recall, and F1 score on the training data are all one. But the real test is looking at the performance metrics on the testing data set.

Conclusion and Further Reading

Summary of Key Takeaways

In this context, we can use the rule of thumb that the difference in performance between the training and testing data sets is indicative of overfitting. Put more simply, since every model scored perfectly on the training data, the smaller a model's testing values are, the more that model is overfitting. Based on that heuristic, the decision tree classifier seems to be overfitting the most because it has the worst performance on the testing data.

On the other side of it, random forest and gradient boosting seem to have the best performance when looking at the F1 score, with AdaBoost a close second at an F1 score of 0.963. These results make sense: they agree with the story and intuition that tree ensembles are more robust to overfitting than single decision trees alone.

Recommended Resources and Further Reading

Two obvious things I did not explore in this example are the feature importance rankings and the uncertainty estimates mentioned earlier. So that's basically it. Feel free to steal the code from the GitHub repository referenced here. Please consider liking, subscribing, and sharing your thoughts in the comments section below, and as always, thank you for your time and thanks for reading.
