A decision tree is an efficient non-parametric method for supervised learning. It combines a hierarchical data structure with a divide-and-conquer algorithm, in which a problem is recursively broken down into two or more sub-problems of the same or a related type, until all sub-problems are simple enough to be solved directly.

Generally, decision trees have the following advantages over many other methods of machine learning:

Versatility:

: they can be used for a wide range of machine learning tasks, such as classification, regression, clustering and feature selection.

Self-explanatory:

: they are easy to follow and understand.

Flexibility:

: they are able to handle different types of input data (e.g. nominal, numeric and textual).

Adaptability:

: they can process datasets containing errors or missing values.

High predictive performance:

: with a relatively small computational effort they usually perform very well.

The decision tree is a directed tree with a node called the "root" (i.e. it has no incoming edges). All other nodes have exactly one incoming edge and zero or more outgoing edges. A node with outgoing edges is called an "internal" or "test" node. A node without outgoing edges is called a "leaf" (also known as a "terminal" or "decision" node).
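The node types above can be sketched as a minimal data structure; this is an illustrative sketch, and the class, field, and feature names (`Node`, `children`, `"outlook"`, etc.) are assumptions, not taken from the text:

```python
class Node:
    """A node of a decision tree."""

    def __init__(self, feature=None, children=None, label=None):
        self.feature = feature          # feature tested at an internal node
        self.children = children or {}  # feature value -> child Node (outgoing edges)
        self.label = label              # class label, set only at a leaf

    def is_leaf(self):
        # A leaf has no outgoing edges, i.e. no children.
        return not self.children


# The root has no incoming edge; internal nodes test a feature,
# and each outgoing edge corresponds to one value of that feature.
root = Node(feature="outlook", children={
    "sunny": Node(label="no"),
    "rainy": Node(label="yes"),
})
```

Classifying an instance then amounts to following edges from the root, by the tested feature's value, until a leaf is reached.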

The decision tree partitions the instance space according to the feature values. For this purpose it chooses, at each node, the one feature of the data that most effectively splits the corresponding instance space. There are different measures that can be used for splitting the instance space. Two frequently used measures are the Gini index and information gain, which are explained as follows:

Gini index:

: The Gini index for a given feature \(f\) is:
$$GINI(f) = 1 - \sum_{C}(p(C|f))^2 \label{a:eq:GINI}$$
where \(p(C|f)\) is the relative frequency of class \(C\) given feature \(f\). The Gini index is minimal (i.e. \(0.0\)) when the feature corresponds to only one class, implying the most interesting information. The Gini index reaches its maximum (i.e. \(1-\frac{1}{|C|}\)) when the feature is equally distributed among all classes, implying the least interesting information.

The Gini index alone cannot measure the quality of the split at each node of the tree. A metric that measures this quality is \(GINI_{split}\), which is defined as: $$GINI_{split} = \sum_{i=1}^k \frac{n_i}{n}GINI(i) \label{a:eq:GINISPLIT}$$ where \(n_i\) is the number of instances at child node \(i\), and \(n\) is the number of instances at the parent node.
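The two Gini formulas above can be sketched directly in code; this is a minimal illustration, and the function names are assumptions:

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels: 1 - sum over C of p(C)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(partitions):
    """Quality of a split: sum over children of (n_i / n) * GINI(i)."""
    n = sum(len(part) for part in partitions)
    return sum(len(part) / n * gini(part) for part in partitions)


# A pure node has the minimum Gini (0.0); with |C| = 2 equally
# distributed classes the index reaches its maximum 1 - 1/|C| = 0.5.
print(gini(["a", "a", "a"]))                  # 0.0
print(gini(["a", "a", "b", "b"]))             # 0.5
print(gini_split([["a", "a"], ["b", "b"]]))   # 0.0 (both children pure)
```

A split is preferred when its weighted \(GINI_{split}\) is lower, i.e. when the child nodes are purer than the parent.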

Information gain:

: The expected information gain is the change in entropy from a prior state to a state that incorporates some information: $$IG\left(f\right) = H\left(C\right) - H_{f}\left(C\right) \label{a:eq:IG}$$ where \(f\) is the feature, \(C\) its corresponding class, and entropy is defined as follows: $$H\left(C\right) = - \sum_{i} P\left(C_{i}\right)\log_{2}P\left(C_{i}\right)$$

$$H_{f}(C) = \sum_{f} \frac{|C_{f}|}{|C|}H\left(C_{f}\right)$$ As with the Gini index, if a feature takes a large number of distinct values, the information gain is not a good measure of the quality of a split. In such cases the information gain ratio is used instead. The information gain ratio for a feature is calculated as follows: $$IGR\left(f\right) = \frac{IG\left(f\right)}{SInf\left(C\right)} \label{a:eq:IGR}$$

$$SInf(C) = - \sum_{i} \frac{|C_{i}|}{|C|}\log_{2}\frac{|C_{i}|}{|C|}$$
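The entropy and information-gain definitions above can be sketched in the same style; again the function names are illustrative assumptions:

```python
import math
from collections import Counter


def entropy(labels):
    """H(C) = - sum over i of P(C_i) * log2 P(C_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, partitions):
    """IG(f) = H(C) - H_f(C), where H_f(C) is the size-weighted
    entropy of the child partitions induced by feature f."""
    n = len(labels)
    h_f = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(labels) - h_f


labels = ["yes", "yes", "no", "no"]
# A split that separates the classes perfectly gains the full
# entropy of the parent (1 bit for two balanced classes).
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

A split that leaves each child with the same class mixture as the parent gains nothing, which is why the tree greedily picks the feature with the highest gain (or gain ratio) at each node.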