In this blog, we introduce a powerful machine learning algorithm, the random forest, which is an extension of the decision tree. You should understand decision trees before reading this blog; click here to study them if you have not. We will explain the concepts behind the algorithm, followed by complete Python code applying a random forest to a real-world regression example.

Concepts

Structure

As you saw in our decision tree example, tree models are prone to overfitting the training data. Random forest is one way to address the overfitting and instability of a single decision tree. It is based on two ideas: bagging and subspace sampling.

Bagging (bootstrap aggregation) creates a new dataset by drawing observations from the original dataset with replacement, while subspace sampling selects a random subset of the original features. In other words, bagging controls which observations and subspace sampling controls which features go into each decision tree. The random forest then combines the predictions of the individual trees by averaging them.

The reason we sample with replacement in bagging is that sampling without replacement at the full sample size would give us the exact same dataset every time. Sampling with replacement gives each tree a different set of observations, which reduces the variance of the averaged prediction. On the other hand, we limit the number of features through subspace sampling so that a single dominant feature does not become the top split in every tree, which would make the trees produce nearly identical predictions. Both ideas are sketched in the code below.
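Here is a minimal sketch of the two ideas, assuming a dataset (X, y) stored as NumPy arrays. For simplicity it samples the feature subset once per tree, whereas a full random forest (e.g. scikit-learn's) re-samples features at every split; this is only meant to make bagging and subspace sampling concrete, not to replace a library implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_simple_forest(X, y, n_trees=100, n_features=None, seed=0):
    # Simplified random forest: feature subset drawn per tree, not per split.
    rng = np.random.default_rng(seed)
    n_samples, total_features = X.shape
    n_features = n_features or max(1, total_features // 3)
    trees = []
    for _ in range(n_trees):
        # Bagging: draw observations with replacement.
        rows = rng.integers(0, n_samples, size=n_samples)
        # Subspace sampling: pick a random subset of features.
        cols = rng.choice(total_features, size=n_features, replace=False)
        tree = DecisionTreeRegressor()
        tree.fit(X[rows][:, cols], y[rows])
        trees.append((tree, cols))
    return trees

def predict_simple_forest(trees, X):
    # Average the predictions of all trees.
    return np.mean([tree.predict(X[:, cols]) for tree, cols in trees], axis=0)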

Hyperparameters

Maximum tree depth, minimum leaf size, number of trees, number of features and sample size are the hyperparameters we can tune to optimize performance. For maximum tree depth and minimum leaf size, please visit the decision tree blog [LINK].

Number of trees controls how many trees we build and average. The more trees we build, the more complex and expensive the model becomes; however, there is a diminishing return on error reduction as the number of trees grows. Number of features controls how many features each tree may use (subspace sampling), while sample size controls how many observations each tree is trained on (bagging).
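If you use scikit-learn, these hyperparameters map roughly onto the arguments of RandomForestRegressor; the values below are placeholders rather than tuned settings.

from sklearn.ensemble import RandomForestRegressor

# Rough mapping of the hyperparameters above onto scikit-learn arguments.
model = RandomForestRegressor(
    n_estimators=100,      # number of trees
    max_depth=None,        # maximum tree depth (None = grow until leaves are pure)
    min_samples_leaf=1,    # minimum leaf size
    max_features=1.0,      # number (or fraction) of features considered per split
    max_samples=None,      # sample size drawn per tree when bootstrap=True (bagging)
    bootstrap=True,        # draw bootstrap samples for each tree
    random_state=1,
)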

In order to find the optimal hyperparameters, we split the dataset into three parts: train, validation and test. We build the regression forest on the training set, select the optimal hyperparameters on the validation set, and make predictions on the test set. We cannot select hyperparameters on the training set because training error keeps improving, and the model overfits, as hyperparameters such as maximum tree depth grow.
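As a sketch of this selection procedure (assuming the arrays x_train, y_train, x_val, y_val produced by the split later in this post), one could loop over candidate values of a single hyperparameter, such as maximum tree depth, and keep the value with the lowest validation error.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Sketch of validation-set tuning for one hyperparameter (maximum tree depth).
best_depth, best_mse = None, float("inf")
for depth in [2, 4, 6, 8, 10, None]:
    model = RandomForestRegressor(max_depth=depth, n_estimators=100, random_state=1)
    model.fit(x_train, y_train)
    mse = mean_squared_error(y_val, model.predict(x_val))
    if mse < best_mse:
        best_depth, best_mse = depth, mse
# The forest refit with best_depth would then be evaluated once on the test set.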

We will illustrate how to implement this on a real-world dataset.

Example

Dataset

Please refer to the linear regression page for dataset information: LINK

Let's load the dataset!

from matplotlib import pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import seaborn as sns
import pandas as pd
import numpy as np

# Load the Boston housing dataset as a Bunch with .data and .target attributes.
df = load_boston(return_X_y=False)

# First split: hold out 20% of the data as the test set.
x_train, x_test, y_train, y_test = train_test_split(df.data, df.target,
                                                    test_size=0.2, random_state=1)
# Second split: carve 25% of the remaining data (20% of the total) out as the validation set.
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train,
                                                  test_size=0.25, random_state=1)
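An optional sanity check confirms that the two splits above leave roughly 60% of the observations for training and 20% each for validation and testing.

# Optional: verify the approximate 60/20/20 split.
print(x_train.shape, x_val.shape, x_test.shape)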

Analysis

Please refer to the linear regression page for a brief analysis: LINK

Model Fitting