In a previous post [LINK], we discussed how to implement a decision tree to solve a regression problem. Today, we introduce another decision tree algorithm, the classification tree, which solves classification problems. The same framework applies: we first walk through the mathematics behind the classification tree, then follow with complete Python code that applies it to a real-world example.

Mathematics

Structure

Please refer to the regression tree page for the structure: LINK

Approach

The Gini index is one of the mathematical approaches used to split data at the decision nodes of a classification tree. It measures how mixed the class proportions are within a node, and that measure decides which feature and value each decision node is split on.

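$Gini = 1 - \sum_{i=1}^n (p_i)^2$

Here $p_i$ is the proportion of samples in the node that belong to class $i$, and $n$ is the number of classes. A Gini index of 0 means the node contains a single class; the more evenly the classes are mixed, the higher the value.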
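To make the formula concrete, here is a minimal sketch that computes the Gini index of a list of class labels. The function name `gini_index` and the convention of returning 0 for an empty node are assumptions made for illustration, not part of any library.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        # Assumed convention: treat an empty node as pure.
        return 0.0
    counts = Counter(labels)
    # p_i is the share of samples belonging to class i.
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


print(gini_index(["a", "a", "a", "a"]))  # pure node -> 0.0
print(gini_index(["a", "a", "b", "b"]))  # evenly mixed two-class node -> 0.5
```

A pure node scores 0, and a two-class node split 50/50 scores 0.5, the maximum for two classes.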

Here are the steps to split using the Gini index:

  1. At the decision node, try one feature and one value as the condition to split the data into a left and a right node
  2. Calculate the Gini index of each node and combine them, weighted by each node's share of the samples
  3. Loop over all feature and value combinations and select the one with the lowest weighted Gini index
  4. Repeat steps 1-3 until either the pre-selected maximum tree depth or the pre-selected minimum leaf size is reached
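Steps 1 to 3 can be sketched for a single decision node as follows. This is an illustrative sketch in plain Python rather than a full tree implementation: the names `gini_index` and `best_split`, the `<=` split condition, and the toy data are all assumptions, and the recursion in step 4 is left out.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted_gini, feature_index, threshold) of the best split, or None."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        # Step 1: try each observed value of this feature as the split condition.
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # skip splits that leave one side empty
            # Step 2: weight each node's Gini index by its share of the samples.
            weighted = (len(left) / n) * gini_index(left) \
                + (len(right) / n) * gini_index(right)
            # Step 3: keep the combination with the lowest weighted Gini index.
            if best is None or weighted < best[0]:
                best = (weighted, feature, threshold)
    return best


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 5.0], [4.0, 4.0]]
labels = ["a", "a", "b", "b"]
print(best_split(rows, labels))  # (0.0, 0, 2.0)
```

On the toy data, splitting feature 0 at 2.0 gives two pure nodes, so the weighted Gini index is 0. A full tree would apply `best_split` recursively to the left and right nodes until the depth or leaf-size limit is hit.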

Hyperparameters

Please refer to the regression tree page for the hyperparameters: LINK

Example

Dataset

In this example we will use data from the scikit-learn package.

Download from the scikit-learn package: Link

This dataset contains information about different iris plants.

Input variables