Introduction to Machine Learning with Decision Trees
A machine learning model is a program that combs through data to find patterns and make predictions. A model is trained on a collection of examples called training data; paired with a learning algorithm, it can reason about and learn from that data. For example, suppose you want to build an application that can recognize whether a voice is male or female. You would train the model on various voices labeled male or female; the algorithm learns differences in pitch and speech patterns and can then classify new voices. While there are various models in machine learning, in this tutorial we will begin with one called the decision tree. Decision trees are the basic building block for some of the best models in data science, and they are easy to pick up.
Decision Tree
When it comes to decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision-making. It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. First, we need to understand decision nodes and leaves.
The leaf node gives the final outcome; the decision node is where the data splits. There are two main types of decision trees: classification and regression. Classification trees are used when the target variable takes a discrete set of values; in these trees, the leaves represent class labels and the branches represent the features that lead to those labels. Regression trees are used when the target variable takes continuous values, typically real numbers. For simplicity, we will begin with a fairly simple decision tree.
Making the model
Exploring the data
When beginning any machine learning project, the first step is familiarizing yourself with the data. For this we will use the pandas library, the primary tool data scientists use when exploring and manipulating data. It is brought in with the import pandas as pd command below. To follow along, you can run the code in a Jupyter notebook.
Demonstration of the import command.
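The import itself is a single line:

```python
import pandas as pd
```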
A vital part of the pandas library is the DataFrame, where data is represented in a tabular format similar to a sheet in Excel or a table in a SQL database. Pandas has powerful methods that will be useful for this kind of data. In this tutorial, we’ll look at a dataset that contains data for housing prices; you can find this dataset on Kaggle.
We will use the pandas function describe(), which gives us a summary of statistics for the columns in our dataset. The summary covers only the columns containing numerical values, which are easier to use with most machine learning models. Loading and understanding your data is a vital step in building a machine learning model.
We load and explore the data with the commands below:
Demonstration of the describe() command.
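A minimal sketch of the loading step, assuming the Kaggle housing data is saved locally as melb_data.csv (the file name is an assumption; point it at wherever you saved your download):

```python
import pandas as pd

# Path to the dataset; adjust to your local copy (file name is an assumption)
housing_file_path = "melb_data.csv"
housing_data = pd.read_csv(housing_file_path)

# Summary statistics for the numerical columns
housing_data.describe()
```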
The summary results for each column can be understood as follows:
Count shows how many rows have non-missing values for that column.
The mean is the average.
Std is the standard deviation, which measures how numerically spread out the values are.
To interpret the min, 25%, 50%, 75%, and max values, imagine sorting each column from the lowest to the highest value. The first (smallest) value is the min. If you go a quarter way through the list, you’ll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced “25th percentile”). The 50th and 75th percentiles are defined analogously, and the max is the largest number.
Selecting data for modeling
Datasets sometimes have a lot of variables that make it difficult to get an accurate prediction. We can use our intuition to pare down this overwhelming information by picking only a few of the variables. To choose variables/columns, we’ll need to see a list of all columns in the dataset. That is done with the command below.
Demonstration of the .columns command.
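With the data loaded into housing_data as above, listing the columns is one line:

```python
housing_data.columns
```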
There are many ways to select a subset of your data but we will focus on two approaches for now.
Dot notation, which we use to select the “prediction target”
Selecting with a column list, which we use to select the “features”
Selecting The Prediction Target
We can pull out a variable with dot-notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data. We’ll use the dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices is:
Demonstration of the variable `y` assignment to housing price.
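A sketch, assuming the sale price is stored in a column named Price (check the output of .columns for the exact name in your copy of the data):

```python
# Prediction target: the column the model should learn to predict
y = housing_data.Price
```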
Choosing Features
The columns chosen to go into the model, and later used to make predictions, are what are referred to as features. For this tutorial, they will be the columns that determine home prices. There are times when you may use all your columns as features, and other times fewer features are preferred.
Our model will use fewer features. We select multiple features by providing a list of column names inside brackets; each item in that list should be a string (with quotes).
It looks like this:
Demonstration of the house_features list.
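For example, with the Melbourne housing data the list might look like this (these column names are assumptions based on that dataset, where Lattitude and Longtitude really are spelled that way; substitute the names from your own data):

```python
# Columns used as model inputs (names are illustrative)
house_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
```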
Traditionally this data is called X.
Demonstration of the variable `X` assignment to the selected housing features.
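Selecting those columns then looks like this:

```python
X = housing_data[house_features]
```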
You can review the data in the features using the .head() command:
Demonstration of the .head() command.
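For example:

```python
# Show the first five rows of the feature data
X.head()
```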
The Model
When creating our model we will use the library scikit-learn, easily the most popular library for modeling the kind of data stored in DataFrames.
There are steps to building and using an effective model; they are as follows:
Define: there are various types of models other than the decision tree, and picking the right model and the parameters that go with it is key.
Train/Fit: this is when patterns are learned from the data.
Predict: the model makes predictions from the patterns it learned during training.
Evaluate: check the accuracy of the predictions.
Below is an example of a decision tree model defined with scikit-learn and fitted with the features and the target variable.
Example of a decision tree model.
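A minimal sketch using scikit-learn's DecisionTreeRegressor:

```python
from sklearn.tree import DecisionTreeRegressor

# Define the model; random_state makes the results reproducible between runs
housing_model = DecisionTreeRegressor(random_state=1)

# Fit the model to the features X and the target y
housing_model.fit(X, y)
```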
In code, the library is written as sklearn. Many machine learning models allow some randomness in model training, so specifying a number for random_state ensures you get the same results in each run; this is considered a good practice.
We can predict the first few rows of the training data using the predict function.
Demonstration of the .predict() function.
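For example:

```python
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are:")
print(housing_model.predict(X.head()))
```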
How good is our model?
Measuring the quality of our model is imperative to improving it. The relevant measure of a model’s quality is its predictive accuracy. There are many metrics used for summarizing model quality, but we’ll begin with Mean Absolute Error (MAE for short). For each house, the error is the difference between the actual and the predicted price; MAE converts each error to a positive number with the absolute value and then averages them. Simply put, on average our predictions are off by this value.
This is how to calculate the mean absolute error.
Calculating the mean absolute error.
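A sketch, reusing the model fitted above:

```python
from sklearn.metrics import mean_absolute_error

# Compare the model's predictions with the true prices it was trained on
predicted_home_prices = housing_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
```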
What we calculated above is called the “in-sample” score: we used the same sample of data both to build the model and to evaluate it. Since the patterns were learned from the training data, the model appears accurate on that same data. This is bad, because those patterns won’t necessarily hold when new data is introduced. Because a model’s practical value comes from making predictions on new data, we should measure performance on data that wasn’t used to build the model. The way to do this is to exclude some data from the model-building process and then use that held-out data to test the model’s accuracy on examples it hasn’t seen before. This data is called validation data.
The scikit-learn library has the function train_test_split to break up the data into two pieces. We’ll use some of that data as training data to fit the model, and we’ll use the other data as validation data to calculate mean_absolute_error. Here is the code:
Calculating mean_absolute_error.
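A sketch of the split-train-evaluate sequence:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Split features and target into training and validation sets;
# random_state makes the split reproducible
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define and fit the model on the training portion only
housing_model = DecisionTreeRegressor(random_state=1)
housing_model.fit(train_X, train_y)

# Evaluate on the held-out validation data
val_predictions = housing_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
```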
The validation MAE is typically much larger than the in-sample score; this gap is the difference between a model that looks almost exactly right and one that is unusable for most practical purposes.
Overfitting and Underfitting
Overfitting refers to when a model fits its training data too closely. Inaccuracies and random fluctuations in the training data are picked up as patterns and concepts by the model. The problem is that those patterns do not apply to new data, which negatively impacts the model’s ability to generalize.
Overfitting can be reduced by:
Increasing the training data
Reducing the models’ complexity
Using a resampling technique to estimate model accuracy.
Holding back validation data
Limiting the depth of the decision tree with parameters (see below)
There are various ways to control tree depth, but here we will look at the max_leaf_nodes argument, which caps the number of leaves and thereby limits how complex the tree can grow. Below we will create a function to compare MAE scores for different values of max_leaf_nodes:
Function to compare MAE scores.
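A sketch of such a helper:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree capped at max_leaf_nodes leaves and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)
```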
We can follow this by using a for-loop to compare the accuracy of the model with different values for max_leaf_nodes.
for-loop to compare the accuracy of the model.
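For example, with the split from the previous section (the particular candidate values are illustrative):

```python
# Compare validation MAE across a range of tree sizes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes} \t Mean Absolute Error: {my_mae:,.0f}")
```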
Of the values tried, 50 leaf nodes is optimal, since it yields the lowest MAE.
Underfitting is when a model cannot learn the patterns in the data. When this happens, it is because the model or algorithm does not fit the data well. It usually happens when you do not have enough data to build an accurate model.
Underfitting can be reduced by:
Increasing the models’ complexity
Increasing the number of features
Increasing the duration of training
While both overfitting and underfitting can lead to poor model performance, overfitting is the more common problem.
Conclusion
There are many ways to improve this model, such as experimenting to find better features or trying different model types. A random forest would also suit this problem well, as it has better predictive accuracy than a single decision tree and works well with default parameters. As you keep modeling, you will learn to use even more sophisticated models and the parameters that go with them.
Follow me here for more AI, Machine Learning, and Data Science tutorials to come!
References
https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/