How to Do Data Cleaning (step-by-step tutorial on real-life dataset) - icoversai

Learn the essential steps of data cleaning, from handling missing values and outliers to filtering columns and standardizing data. This guide offers practical tips and techniques for effective data preprocessing in data science.



Table of Contents:

  • Introduction to Data Cleaning: A Crucial Step in Data Science
    • Why Data Cleaning Matters in Every Project
  • Data Exploration Recap: Identifying Problems for Cleaning
    • Review: How Data Exploration Unveils Data Issues
    • Link to Previous Post: Data Exploration Techniques
  • First Step in Data Cleaning: Removing Irrelevant Columns
    • Filtering Columns: Why Some Data Points Don’t Matter
  • Handling Non-Standard Categorical Values in Your Dataset
    • Standardizing Categorical Values for Cleaner Data
    • Example: Fixing Inconsistent Categorical Data
  • Dealing with Missing Data: How to Handle NaN Values
    • Understanding Missing Values in Your Dataset
    • Replacing NaN with 'Not Applicable' for Dead or Removed Trees
  • Using Masks to Accurately Clean Your Dataset
    • What Are Masks and How Do They Help in Data Cleaning?
    • Implementing a Mask to Handle Specific Data Issues
  • Advanced Data Cleaning: Handling Outliers and Extremes
    • Identifying Outliers in Your Dataset: A Step-by-Step Guide
    • Techniques to Clean Outliers: Removal vs. Modification
  • Using Grouping to Analyze and Clean Tree Diameter Data
    • Grouping by Tree Type for Diameter Analysis
    • Calculating Mean and Maximum Tree Diameters
  • Advanced Techniques: Adjusting Outliers Using Percentiles
    • Why Adjusting Extreme Values Based on Percentiles Can Be Useful
    • Implementing Upper and Lower Limits for Diameter Data
  • Automating Data Cleaning Processes for Efficiency
    • When to Automate Data Cleaning and When to Manually Fix Errors
  • Conclusion: The Data Scientist’s Approach to Data Cleaning
    • Key Takeaways on Effective Data Cleaning Techniques
    • The Iterative Nature of Data Science: Discovering and Solving Problems as You Go

Introduction to Data Cleaning: A Crucial Step in Data Science

Why Data Cleaning Matters in Every Project

Hey, welcome back. Now it's time to look into data cleaning. I hope you're excited to fix all the problems we discovered with the data in the previous article on data exploration. If you haven't read that one, I highly suggest you do; I'll leave the link somewhere here. And if you're excited about getting more training from me on data science, also go check out my course, Hands-On Data Science.

Again, I will leave the link in the description so that you can learn more about it. But for now, let's get started. Exploring the data is really where the heavy lifting happens, and I hope I gave you a good impression of how it normally goes.

It takes a while to get your bearings, because you are starting from scratch. Especially when you're working with a new dataset that you've never seen before, you need to take some time to make sure you're considering a lot of different options.

Data Exploration Recap: Identifying Problems for Cleaning

Review: How Data Exploration Unveils Data Issues

You need to make sure you're looking into what each column is and that you're taking the right steps. But once the exploration is done, cleaning the data is actually pretty simple, because at that point we've already seen the issues with the data.

Link to Previous Post: Data Exploration Techniques

We've already explored it and we understand it a little bit better. So from this point on, we just need to take the actions that the exploration steps showed us we need.

First Step in Data Cleaning: Removing Irrelevant Columns

Filtering Columns: Why Some Data Points Don’t Matter

You might remember that we already filtered out some of the columns from our dataset. We had some columns about the location of the trees, and we filtered them out; this can also be counted as cleaning, because we realized we do not need these columns and simply removed them from the working dataset. So that's one thing.
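
If you want to see what this step looks like in code, here is a minimal pandas sketch. The DataFrame and the location column names below are placeholders I made up for illustration, not the exact names from the census file:

import pandas as pd

# Illustrative stand-in for the tree dataset; column names are assumptions.
trees = pd.DataFrame({
    "tree_id": [1, 2, 3],
    "status": ["Alive", "Stump", "Dead"],
    "latitude": [40.72, 40.73, 40.74],
    "longitude": [-73.99, -73.98, -73.97],
})

# Drop the location columns we decided we do not need for this analysis.
trees_subset = trees.drop(columns=["latitude", "longitude"])
print(trees_subset.columns.tolist())  # ['tree_id', 'status']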

Handling Non-Standard Categorical Values in Your Dataset

Standardizing Categorical Values for Cleaner Data

Next are the categorical columns, for example whether a tree has a steward or not, how many stewards it has, or the curb location of the tree. These are the kinds of values we have to decide whether to change or not.

Example: Fixing Inconsistent Categorical Data

For example, as I mentioned before, if instead of a clean value like "one or two" the column contained several different ways of writing the same category, we would standardize them so they all map to a single value.
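
As a rough pandas sketch, assuming a hypothetical steward column with made-up inconsistent spellings, standardizing could look something like this:

import pandas as pd

# Hypothetical example of inconsistent spellings of the same category.
stewards = pd.Series(["1or2", "1 or 2", "one or two", "3or4", "None"])

# Map every variant onto one standard label.
standardize = {"1 or 2": "1or2", "one or two": "1or2"}
stewards_clean = stewards.replace(standardize)
print(stewards_clean.value_counts())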

Another thing that we've seen with this dataset is that there are some NaN values, and we realized these NaN values mostly occur for stump or dead trees: trees that have been cut down, or trees that are simply not alive anymore.

Dealing with Missing Data: How to Handle NaN Values

Understanding Missing Values in Your Dataset

One option is removal. Let's say this is the dataset you want to use; you could say that you do not want the data points where the status is set to stump, and do the same for the dead ones. Then you would effectively be removing these data points from your dataset.
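
A minimal sketch of that removal option, using a toy DataFrame with assumed column names ('status', 'health'):

import pandas as pd

trees = pd.DataFrame({
    "status": ["Alive", "Stump", "Dead", "Alive"],
    "health": ["Good", None, None, "Fair"],
})

# Keep only the rows whose status is neither 'Stump' nor 'Dead'.
alive_only = trees[~trees["status"].isin(["Stump", "Dead"])]
print(alive_only)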

Instead, I'm going to change them from NaN to 'Not Applicable', because we know they're stumps or dead trees, so health does not really apply to them. If we look at it now, the options for the health column are Good, Fair, or Poor. But that does not apply to a stump, which is just the tree cut down to its base.

Replacing NaN with 'Not Applicable' for Dead or Removed Trees

What I'm going to do is add a new value here: 'Not Applicable'. And what comes to my rescue to do this, of course, is my trusty filtering. I mostly want to work with the tree census subset, so let's go all the way down and work with that. Here we have the rows that are stumps and the rows that are dead trees. I could just call fillna here, but as you might remember, there is a catch with the counts.

The number of trees that are either stumps or dead does not equal the number of missing values that we have; there are one or two extra missing ones. If I remember correctly, the total was 31,615 or something like that. So if I fill all the NaN values blindly, I will also be filling some NaN values that belong to trees that are neither stumps nor dead.

There could be some other NaN values that I'm not covering yet. That's why I want to be really precise and only fill in the NaN values where a tree is either dead or a stump.
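
A tiny illustration of that count mismatch, with made-up data, could look like this:

import pandas as pd

trees = pd.DataFrame({
    "status": ["Alive", "Stump", "Dead", "Alive", "Alive"],
    "health": ["Good", None, None, None, "Fair"],
})

n_stump_or_dead = trees["status"].isin(["Stump", "Dead"]).sum()
n_missing_health = trees["health"].isna().sum()
print(n_stump_or_dead, n_missing_health)  # 2 vs 3: one NaN belongs to something else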

Using Masks to Accurately Clean Your Dataset

What Are Masks and How Do They Help in Data Cleaning?

Going back to the exploration for a moment, I want to see what other NaN values there are. What I'm going to do for that is create a mask, and you'll understand what that is in a second once you see how I'm using it. I'm going to ask the dataset to give me specific locations.

This is a specific way of selecting certain cells. By using the mask, I'm choosing certain rows: give me the rows where the status is equal to stump or dead. I will also specify the column that I want to update.

Implementing a Mask to Handle Specific Data Issues

Now I'm going to use the mask here: the first indexer specifies the rows and the second specifies the column, which is health. I want to keep the same values, but with the missing ones filled with 'Not Applicable'. Then I want to look at the data again to see whether it worked. It gives me a little warning here that what I'm doing is not ideal, but I feel like I'm doing exactly what it recommends, so I don't really understand the warning. This happens once in a while; just make sure that you're not using something that's about to be deprecated.

Otherwise you might use it again in a couple of weeks and your code might not work anymore. Most of the time the warning will tell you what you should use instead. Here it says to try using .loc with a row indexer and a column indexer, which is exactly what I'm using, so I don't really understand why there's a problem. Anyway, let's check: it looks like it worked.
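
A minimal sketch of the mask plus .loc approach, on a toy DataFrame with assumed column names, might look like this:

import pandas as pd

trees = pd.DataFrame({
    "status": ["Alive", "Stump", "Dead", "Alive"],
    "health": ["Good", None, None, "Fair"],
})

# Boolean mask: rows where the tree is a stump or dead.
mask = trees["status"].isin(["Stump", "Dead"])

# .loc[row_indexer, column_indexer] is the assignment form pandas recommends.
trees.loc[mask, "health"] = trees.loc[mask, "health"].fillna("Not Applicable")
print(trees)

If the subset you are writing into was itself sliced out of a bigger DataFrame, pandas may still show a SettingWithCopyWarning; making the subset an explicit .copy() beforehand usually silences it.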

Advanced Data Cleaning: Handling Outliers and Extremes

Identifying Outliers in Your Dataset: A Step-by-Step Guide

Now, actually, instead of going one by one here, I feel like it might still work even if I don't specify the column. So let me try this. It worked, okay. Basically, I didn't specify a column, so it looked at all the columns, and I know which columns have NaN values because I ran isna().sum() before, asking for the count of missing values in every column, and it showed me that these are the only ones.

So those were the only columns it could have affected, which is good to know. I still want to check whether there are any other missing values now that we've fixed this. Interesting, yes: there are still some left, 49 in one column, one in another, five in another, and one in the sidewalk column.
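
The check itself is a one-liner; here is a small sketch with made-up data:

import pandas as pd

trees = pd.DataFrame({
    "status": ["Alive", "Alive", "Alive"],
    "health": ["Good", None, "Fair"],
    "sidewalk": ["NoDamage", "Damage", None],
})

# Count the remaining missing values per column after the stump/dead fix.
print(trees.isna().sum())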

Techniques to Clean Outliers: Removal vs. Modification

Let's look into those and see what's up with them. This one is a live tree; it's just that its health value is missing, so it looks like the value simply wasn't recorded. It's kind of hard to know what to do with it. Since it's just one value, I think it's not unreasonable to remove it, but then you'd also be removing the steward information. It says steward is not missing, but there does seem to be a missing value there, so let me look at the steward values.

In this case, I think 'None' here is not like the NaN we've seen before, meaning not a number. I think 'None' here just means there is nothing there: there are no problems, or there is no steward. So that's not what we're looking for; it's not a missing value. Health is NaN though, so it's really hard to decide what to do in that case, and I would probably just remove this one. Let's look at the sidewalk one next. It's probably the same story, to be honest: the tree is alive, health is Good, but sidewalk is NaN, and the usual value for sidewalk is no damage.

Since these are just single values and they cannot skew the outcome much, and since I might want to know the actual count of live trees, I'd rather keep the rest of these rows than drop them. That actually makes me change my mind about the previous missing value too: I'll just fill each of them with whatever the most common value is.

I think no damage is the most common value, so instead of NaN there I'll just use that. Now let's also look at the next one. This is a little trickier because there are five of them, and also this column isn't about whether the tree is healthy or whether it has problems; it's the type of tree. So for the type of tree, I think I'll put something like 'couldn't observe' and fill the missing values with that.

Let's look at those rows, and then at the problems column. These are all live trees, and the problems value for some of them is NaN; it just wasn't entered. At this point I think it's fair to fill these NaNs in the problems column with 'None'. So instead of not-a-number, we fill in the value 'None', because the related indicators for problems with the root, the trunk, or the branches are all 'No' for these rows. So I think that's a fair fill.

Let me do all of these things; I'll meet you back here in a couple of seconds. I quickly checked what the most common values are, not for problems, sorry, but for health and the sidewalk, just using value_counts. It looks like the most common value for sidewalk is no damage, so that's what I assigned to its missing value, and for health the most common value was Good. As I said, these are not going to be perfectly valid data points, but it's one data point out of something like 60,000.

I think it just makes sense to keep the other values in our dataset; it's worth it. We're telling a small white lie, but it's one that helps more than it hurts. For the problems column, as we observed, it looks like there really were no problems, so we can safely put 'None'; I think that is accurate. And for the Latin name, I'm just creating a new value.

Instead of the name of the tree, I'm just going to say 'no observation', or something along those lines. Now, if I check how many missing values there are again, just to keep tabs on things: no more missing values. That's great. We have finally cleaned our data of missing values.
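
Putting those decisions into code, on a toy DataFrame with assumed column names like 'spc_latin', might look roughly like this:

import pandas as pd

trees = pd.DataFrame({
    "health": ["Good", None, "Fair", "Good"],
    "sidewalk": ["NoDamage", "NoDamage", None, "Damage"],
    "problems": [None, "None", "Stones", "None"],
    "spc_latin": ["Acer rubrum", None, "Quercus alba", "Acer rubrum"],
})

# Fill each column with the value we decided on:
# the most common value for health and sidewalk, an explicit label otherwise.
trees["health"] = trees["health"].fillna(trees["health"].mode()[0])
trees["sidewalk"] = trees["sidewalk"].fillna(trees["sidewalk"].mode()[0])
trees["problems"] = trees["problems"].fillna("None")
trees["spc_latin"] = trees["spc_latin"].fillna("No observation")

print(trees.isna().sum())  # all zeros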

Using Grouping to Analyze and Clean Tree Diameter Data

Grouping by Tree Type for Diameter Analysis

There is one last thing that we need to do, and that is to deal with the outliers we identified when we explored the data, the ones we saw in the diameter columns. So let's look at those. We already said these are the very big trees. I'm also going to include the stump values, so either the stump diameter is higher than 50 or the tree diameter is bigger than 60. I'll go check it again, but I think it was something like that.

Let's keep both at 60. So now we only have 245 such rows out of the total; by the way, you can see the number of rows and columns here, nearly 684 thousand rows. So not a lot of outliers. One thing you can do is simply say you're not interested in these rows and cut them out. That's a totally acceptable way of doing it, but there is another way that I think makes for an interesting exercise.

We can look at what species these trees are, get the average diameter (or maybe the maximum) for those species from the rest of the dataset, and replace the outlier diameters with that. To do this, what I want to do first is remove the trees that are a bit too big from the original dataset, so I'm going to require that the tree diameter be less than or equal to 60.

Calculating Mean and Maximum Tree Diameters

The same goes for the stump diameter: it needs to be less than or equal to 60. I am filtering the original dataset because, if I want to calculate the mean diameter for the stumps and the mean diameter for the trees, I don't want the outliers in there, since they would skew the mean I'm calculating. So let's do that.
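
A minimal sketch of that filter, assuming diameter columns named 'tree_dbh' and 'stump_diam' (my guesses, not necessarily the real names):

import pandas as pd

trees = pd.DataFrame({
    "status": ["Alive", "Alive", "Stump", "Dead"],
    "tree_dbh": [12, 250, 0, 0],
    "stump_diam": [0, 0, 14, 80],
})

# Keep only rows whose diameters are within the limits we chose,
# so the extreme values do not distort the per-species means later on.
no_outliers = trees[(trees["tree_dbh"] <= 60) & (trees["stump_diam"] <= 60)]
print(no_outliers)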

I also want to keep the alive trees separate from the stumps and dead trees because, as you can see, if a tree is alive then its stump diameter is recorded as zero, which isn't really the case; it just wasn't measured, because it's not a stump but an actual tree. So I want separate subsets for stump, dead, and live trees. Now that I have those, I'm going to group them by the tree species.

This gives me, for each tree species, the mean stump diameter, which is zero here because these are only the live trees, and the mean tree diameter. The tree ID shows up as well; as I said before, I should have converted it to a string at the very beginning. That would have been useful, so it wouldn't keep popping up in numeric summaries. So these are the values, and this is just the mean.
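
A small sketch of that grouping, again with assumed column names:

import pandas as pd

# Tiny stand-in for the outlier-free live-tree subset.
alive = pd.DataFrame({
    "spc_latin": ["Acer rubrum", "Acer rubrum", "Quercus alba"],
    "tree_dbh": [10, 14, 22],
})

# Mean diameter per species.
mean_by_species = alive.groupby("spc_latin")["tree_dbh"].mean()
print(mean_by_species)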

 

Advanced Techniques: Adjusting Outliers Using Percentiles

Why Adjusting Extreme Values Based on Percentiles Can Be Useful

If you want, you can also use the maximum, the minimum, or the 25th percentile value instead of the mean. I think a fuller summary of the grouped data will also give me all of that information at once.

I can even specify exactly which statistics I want. That's better. So for each species I get the mean, the standard deviation, which is not very high in the examples I'm seeing, which is good, the minimum value, which is a bit weird because it's zero, and the maximum value. I want to look at the ones that are zero, because that feels like it shouldn't be happening.
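
One way to ask for specific statistics per species, sketched on toy data with assumed column names:

import pandas as pd

alive = pd.DataFrame({
    "spc_latin": ["Acer rubrum"] * 4 + ["Quercus alba"] * 4,
    "tree_dbh": [4, 9, 10, 35, 6, 12, 15, 0],
})

def p25(s):
    # 25th percentile of a group's diameters.
    return s.quantile(0.25)

def p75(s):
    # 75th percentile of a group's diameters.
    return s.quantile(0.75)

summary = alive.groupby("spc_latin")["tree_dbh"].agg(
    ["mean", "std", "min", "max", p25, p75]
)
print(summary)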

I feel like, while trying to fix something we found during exploration, we've now actually found something new that is interesting. So at this point we have a crossroads in front of us. I have two options. The first is to go forward and only change the values that look a little too high and the values that look extremely low, like zero, which shouldn't exist.

Implementing Upper and Lower Limits for Diameter Data

The second option is to say that everything bigger than, say, the 75th percentile value for that species is too big, and shrink it down, and that everything below some lower limit is too small, and bring it up. So if the maximum value is 35, that's probably a mistake; it should be lower, say around nine, and everything lower than four is probably a mistake too, so I'd bring it up to four. I'm going to show you this second option just because it's a little more fun.
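
A sketch of how those per-species limits could be computed from percentiles, using toy data and assumed column names:

import pandas as pd

alive = pd.DataFrame({
    "spc_latin": ["Acer rubrum"] * 5 + ["Quercus alba"] * 5,
    "tree_dbh": [3, 5, 7, 9, 35, 6, 8, 10, 12, 21],
})

# Per-species lower and upper limits, here taken as the 25th and 75th
# percentile of the (already outlier-trimmed) diameters.
limits = alive.groupby("spc_latin")["tree_dbh"].quantile([0.25, 0.75]).unstack()
limits.columns = ["lower_limit", "upper_limit"]
print(limits)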

I will take these per-species statistics as a reference, and I'm not going to do this for the dead trees because, if you remember, the species names for those weren't even recorded; we filled them with 'not observed' or something like that. So we only need to do this for the live trees. The way I'm going to do it is by merging the new table I created with the one we already have.

After the merge we have the diameter of each tree alongside the minimum and maximum allowed diameter for its species. So what I'm going to do now is compare each diameter to those limits, and if it's lower or higher, bring it to that limit. For the lower one I've already done it, and we can express this as a mask too.

I'm saying: this is the condition, and whenever the condition holds true, meaning the diameter is lower than the lower limit, make the tree diameter equal to the lower limit. I'll do the same thing for the upper limit too. You can see an example here: this tree's diameter was three, but we decided the lowest it could be for this species is five.

After the change, this value is now five. Then I do the same thing for the upper limit, and let's see whether any value was too big. Maybe this one: it was 21, but we said the highest it could be is 12. It's the fourth row, and now it's 12. So it looks like we did a good job cleaning it. I've now cleaned the live trees.
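
Putting the merge and the two masks together, a rough sketch on toy data (assumed column names, made-up limits) could look like this:

import pandas as pd

alive = pd.DataFrame({
    "spc_latin": ["Acer rubrum", "Acer rubrum", "Quercus alba"],
    "tree_dbh": [3, 21, 10],
})
limits = pd.DataFrame({
    "spc_latin": ["Acer rubrum", "Quercus alba"],
    "lower_limit": [5, 6],
    "upper_limit": [12, 15],
})

# Attach each tree's species-specific limits, then pull extreme diameters
# back to the nearest limit using boolean masks.
alive = alive.merge(limits, on="spc_latin", how="left")

too_low = alive["tree_dbh"] < alive["lower_limit"]
alive.loc[too_low, "tree_dbh"] = alive.loc[too_low, "lower_limit"]

too_high = alive["tree_dbh"] > alive["upper_limit"]
alive.loc[too_high, "tree_dbh"] = alive.loc[too_high, "upper_limit"]

print(alive)  # 3 becomes 5 and 21 becomes 12, like the example in the text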

If you want, you can also clean the dead trees and stumps. You could argue that even if they're dead, their diameter cannot be zero, and raise those values, but the problem is that we don't have the Latin, that is, species, names for those trees. So one option is to delete the rows where the diameter is zero; or, if you really want to keep those data points, you could fill them with an overall average diameter and keep them in the dataset.

Automating Data Cleaning Processes for Efficiency

When to Automate Data Cleaning and When to Manually Fix Errors

One thing I want to mention is that if you do not have a lot of problematic data points, such as outliers, you can pay special attention to each one, especially if that data point is really valuable to you. Let's say we only had one outlier: everything else was more or less on the same level, and then one tree was listed at 500 inches. Then you can look into it individually. Maybe this tree's diameter is normally between, say, 25 and 55, so perhaps the person entering it was trying to type 50 and accidentally typed 500, and you change it to 50. So if your data points are very important to you and there are only a few problematic ones, say three or five, you can pay special attention to them and go one by one to find the most reasonable fix.
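
A tiny sketch of that kind of manual, one-off fix, using made-up IDs and values:

import pandas as pd

trees = pd.DataFrame({
    "tree_id": [101, 102, 103],
    "tree_dbh": [30, 500, 42],
})

# Look at the one suspicious record before deciding what to do with it.
print(trees[trees["tree_dbh"] > 100])

# If context suggests a typo (500 instead of 50), correct that single row.
trees.loc[trees["tree_id"] == 102, "tree_dbh"] = 50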

Conclusion: The Data Scientist’s Approach to Data Cleaning

Key Takeaways on Effective Data Cleaning Techniques

This has been a rather long-winded way of explaining how to do data cleaning, but these are basically all the steps you can take. When you have hundreds of problematic points, like we did here, and a lot of data points in general that aren't individually critical, I think the best approach is to automate the fixes. If you followed along the whole way and made it here, congratulations for keeping up.

The Iterative Nature of Data Science: Discovering and Solving Problems as You Go

With this article I just wanted to give you an explanation, or at least a feeling, of how a data scientist approaches these things. As you saw, I was also seeing this dataset for the first time, so things popped up while I was doing the analysis, and that's exactly how you're supposed to approach it.

You can't know everything from the beginning. You build things as you go and come up with solutions as you go. So let's end this article here. I hope you learned a lot, and I hope to see you around.
