Decision Tree Tutorial

A decision tree is a non-parametric supervised learning approach that can be applied to both regression and classification problems. Following the tree analogy, decision trees implement a sequential decision process: starting from the root node, a feature is evaluated and one of the two branches (child nodes) is selected.

Each node in the tree is essentially a decision rule. This procedure is repeated until a final leaf is reached, which normally represents the target. Decision trees are also attractive models if one cares about interpretability.

[Figure: decision tree]
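As a minimal sketch of this sequential decision process (using scikit-learn's bundled iris data purely as an illustration; the dataset and the max_depth value are my own choices, not part of the text above), the snippet below fits a small tree and prints its learned decision rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small, well-known dataset purely for illustration
iris = load_iris()
X, y = iris.data, iris.target

# A shallow tree keeps the sequence of decisions easy to read
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Each printed line corresponds to one node's decision rule
print(export_text(clf, feature_names=iris.feature_names))
```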

Common algorithms

There are several algorithms for creating decision trees:

  • ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e., greedily) the categorical feature that will produce the greatest information gain for the categorical targets. Trees are grown to their maximum size, then a pruning step is usually applied to improve the tree's ability to generalize to unseen data.
  • C4.5, developed in 1993 by Ross Quinlan, is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numeric variables) that partitions the continuous attribute values into a discrete set of intervals. C4.5 converts the trained trees (i.e., the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which the rules should be applied. Pruning is performed by removing a rule's precondition if the rule's accuracy improves without it.
  • C5.0 is Quinlan's latest version, released under a proprietary license. It uses less memory and creates smaller rule sets than C4.5 while being more accurate.
  • CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numeric target variables (regression) and does not compute rule sets. CART builds binary trees using the feature and threshold that yield the greatest information gain at each node.

scikit-learn uses an optimized version of the CART algorithm.
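As a brief sketch of how this surfaces in scikit-learn (the parameter values here are arbitrary choices for illustration), the CART-style implementation is exposed through DecisionTreeClassifier and DecisionTreeRegressor, and the criterion parameter selects the splitting measure:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: 'gini' (the default) or 'entropy' as the splitting criterion
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)

# Regression tree: numeric targets are supported, e.g. with the
# 'squared_error' criterion (the default in recent scikit-learn versions)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)
```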

Terminology

In keeping with the tree analogy, the terminology is borrowed from the vocabulary of trees.

[Figure: decision tree]

  • Root node: the first node of a decision tree.
  • Splitting: the process of dividing a node into two or more sub-nodes, starting with the root node.
  • Node: a sub-node that results from splitting the root node (or another node) and that can itself be split into further sub-nodes.
  • Leaf or terminal node: a node that cannot be divided any further; it marks the end of a branch.
  • Pruning: a technique for reducing the size of the decision tree by removing sub-nodes. The goal is to reduce complexity in order to improve predictive accuracy and avoid overfitting.
  • Branch/Sub-tree: a sub-section of the entire tree is called a branch or sub-tree.
  • Parent and child node: a node divided into sub-nodes is called the parent node of those sub-nodes, while the sub-nodes are the children of the parent node.
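As a small sketch of how this terminology maps onto a fitted scikit-learn tree (the dataset and depth are arbitrary choices for illustration), the tree_ attribute exposes the nodes, and leaves are the nodes that have no children:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

tree = clf.tree_
print("total nodes:", tree.node_count)             # root + internal + leaf nodes
print("root splits on feature:", tree.feature[0])  # node 0 is the root node

# A node is a leaf (terminal node) when it has no left/right children
is_leaf = tree.children_left == -1
print("number of leaves:", is_leaf.sum())
```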

General operation

Consider the following example where a decision tree predicts a baseball player's salary:

[Figure: decision tree predicting a baseball player's salary]

We use the Hitters dataset to predict a baseball player's (log-transformed) salary based on Years (the number of years he has played in the major leagues) and Hits (the number of hits he made in the previous year).

Based on the features, the decision tree model learns a series of splitting rules, starting from the top of the tree (root node).

  • The root node is split by the rule Years < 4.5: observations that satisfy it go to the left branch. Players in the dataset with Years < 4.5 have a mean log salary of 5.107, so we predict a salary of e^5.107 thousand dollars, that is, about $165,174, for these players.
  • Players with Years >= 4.5 are assigned to the right branch; this group is then subdivided by Hits < 117.5, with a mean log salary of 6.00.
  • Players with Years >= 4.5 and Hits >= 117.5 have a mean log salary of 6.74.

In this case, we can see that the decision tree partitions the feature space into three regions, each of which determines a predicted salary for the baseball players in it; the boundaries between these regions form the decision boundary.

[Figure: the three regions formed by the decision tree]

These three regions can be written as:

  • R1 = {X | Years < 4.5}
  • R2 = {X | Years >= 4.5, Hits < 117.5}
  • R3 = {X | Years >= 4.5, Hits >= 117.5}

This intuition shows how a decision tree is split, step by step, to form regions capable of predicting the salary of baseball players.
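As a rough sketch of how such a regression tree could be fit with scikit-learn (the file name hitters.csv is a hypothetical path, and the column names Years, Hits, and Salary are assumed to match the dataset described above), capping the depth at 2 yields a small tree with a handful of regions like R1, R2 and R3:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical path: assume the Hitters data is available locally as a CSV
# with at least the columns Years, Hits and Salary (in thousands of dollars)
hitters = pd.read_csv("hitters.csv").dropna(subset=["Salary"])

X = hitters[["Years", "Hits"]]
y = np.log(hitters["Salary"])  # model the log-transformed salary, as above

# max_depth=2 keeps the tree small, producing a few rectangular regions
reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)

# Print the learned splitting rules (thresholds depend on the actual data)
print(export_text(reg, feature_names=["Years", "Hits"]))
```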

Splitting

In order to split the nodes on the most informative features, the decision tree algorithm starts at the root of the tree and splits the data on the feature that results in the greatest information gain (IG). The objective function is therefore to maximize the information gain at each split, which we define as follows:

IG(Dp, f) = I(Dp) − Σ_j (Nj / Np) · I(Dj)

Here f is the feature on which the split is performed, Dp and Dj are the datasets of the parent node and the j-th child node, I is our impurity measure, Np is the total number of samples at the parent node, and Nj is the number of samples in the j-th child node.

As we can see, the information gain is simply the difference between the impurity of the parent node and the sum of the impurities of the child nodes: the lower the impurity of the child nodes, the greater the information gain. However, for simplicity and to reduce the combinatorial search space, most libraries (including scikit-learn) implement binary decision trees. This means that each parent node is split into two child nodes, D_left and D_right:

IG(Dp, f) = I(Dp) − (N_left / Np) · I(D_left) − (N_right / Np) · I(D_right)
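As a small illustrative sketch (the function names and the use of Gini impurity here are my own choices, not a particular library's API), the information gain of a binary split can be computed directly from the class labels of the parent node and of its two children:

```python
import numpy as np

def gini_from_labels(y):
    """Gini impurity of a node, computed from the class labels it contains."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, y_left, y_right):
    """IG = I(Dp) - (N_left / Np) * I(D_left) - (N_right / Np) * I(D_right)."""
    n = len(y_parent)
    return (gini_from_labels(y_parent)
            - len(y_left) / n * gini_from_labels(y_left)
            - len(y_right) / n * gini_from_labels(y_right))

# Example: a parent node with two classes, split into two pure children
y_parent = np.array([0, 0, 1, 1])
print(information_gain(y_parent, y_parent[:2], y_parent[2:]))  # 0.5
```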

As mentioned above, most libraries implement binary decision trees, and the three impurity measures, or splitting criteria, that are commonly used in binary decision trees are the Gini impurity (I_G), entropy (I_H) and the classification error (I_E).
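As a minimal sketch of these three criteria (the function names are my own, not from any library), each can be computed from the class proportions p(i|t) at a node: Gini impurity is 1 − Σ p², entropy is −Σ p·log2(p), and classification error is 1 − max p:

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum_i p_i^2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum_i p_i * log2(p_i), treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def classification_error(p):
    """Classification error: 1 - max_i p_i."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.max(p)

# Example: a node where 40% of the samples belong to class 0 and 60% to class 1
proportions = [0.4, 0.6]
print(gini(proportions))                  # 0.48
print(entropy(proportions))               # ~0.971
print(classification_error(proportions))  # 0.4
```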