The Gini Index

“The concept of a criterion depending on a node impurity measure has already been introduced. Given a node t with estimated class probabilities p(j|t), j=1, …, J, a measure of node impurity given t:

i(t) = psi(p(1|t), …, p(J|t))

is defined and a search made for the split that reduces node, or equivalently tree, impurity. As remarked earlier, the original function selected was:

psi(p1, …, pJ) = -Sum(j)(pj * log(pj)).
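As an aside to the quoted passage, the entropy-style function above can be sketched in a few lines of Python. The function name `entropy_impurity` is an illustrative assumption, as is the choice of natural logarithm and the convention that a class with zero probability contributes nothing to the sum.

```python
import math

def entropy_impurity(probs):
    """Return -Sum(j)(pj * log(pj)) for a list of class probabilities.

    Assumption: terms with pj == 0 are skipped, i.e. 0 * log(0) is taken as 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

# A pure node has zero impurity; an evenly split two-class node has log(2).
pure = entropy_impurity([1.0, 0.0])
even = entropy_impurity([0.5, 0.5])
```

Under these assumptions the impurity is smallest for a node containing a single class and largest when the classes are equally represented.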

“In later work the Gini diversity index was adopted. This has the form:

Sum (j!=i) (p(i|t)p(j|t)).

“The Gini index has an interesting interpretation. Instead of using the plurality rule to classify objects in a node t, use the rule that assigns an object selected at random from the node to class i with probability p(i|t). The estimated probability that the item is actually in class j is p(j|t). Therefore, the estimated probability of misclassification under this rule is the Gini index:

Sum (j!=i) (p(i|t)p(j|t)).
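As an aside to the quoted passage, the misclassification interpretation can be checked numerically. The sketch below is only an illustration: the names `gini_pairwise` and `simulate_random_rule`, the example probabilities, and the seed are all assumptions, not anything from the book. It computes the sum over ordered pairs i != j directly, then simulates the randomized rule by drawing a true class and an assigned class independently from the same probabilities.

```python
import random
from itertools import permutations

def gini_pairwise(probs):
    """Sum of p(i|t) * p(j|t) over all ordered pairs with i != j."""
    return sum(p_i * p_j for p_i, p_j in permutations(probs, 2))

def simulate_random_rule(probs, n_draws=100_000, seed=0):
    """Estimate the error rate of the rule that assigns class i with probability p(i|t).

    Assumption: the true class and the assigned class are drawn independently
    from the same node probabilities.
    """
    rng = random.Random(seed)
    classes = range(len(probs))
    true_cls = rng.choices(classes, weights=probs, k=n_draws)
    assigned = rng.choices(classes, weights=probs, k=n_draws)
    errors = sum(1 for a, b in zip(true_cls, assigned) if a != b)
    return errors / n_draws

probs = [0.5, 0.3, 0.2]          # hypothetical node probabilities
exact = gini_pairwise(probs)      # pairwise sum from the formula
estimate = simulate_random_rule(probs)
```

With enough draws the simulated error rate should land close to the pairwise sum, which is the sense in which the Gini index is the estimated probability of misclassification under the randomized rule.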

“Another interpretation is in terms of variances (see Light and Margolin, 1971). In a node t, assign all class j objects the value 1, and all other objects the value 0. Then the sample variance of these values is p(j|t)(1-p(j|t)). If this is repeated for all J classes and the variances summed, the result is:

Sum(j) (p(j|t)(1-p(j|t))) = 1 - Sum(j) (p^2(j|t)).
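As an aside to the quoted passage, the variance interpretation can be verified on a small example. This is a sketch under stated assumptions: the labels are made up, the function names are hypothetical, and "sample variance" is read as the variance with divisor n (computed here with `statistics.pvariance`), which is the reading under which a 0/1 indicator has variance exactly p(1-p).

```python
from collections import Counter
from statistics import pvariance

def gini_from_variances(labels):
    """Sum, over classes, the variance of the 0/1 indicator for that class."""
    total = 0.0
    for cls in set(labels):
        indicator = [1 if y == cls else 0 for y in labels]
        total += pvariance(indicator)  # divisor n, so this equals p * (1 - p)
    return total

def gini_from_squares(labels):
    """Compute 1 - Sum(j) p^2(j|t) from the class proportions in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

labels = ["a", "a", "b", "c"]     # hypothetical node with proportions 0.5, 0.25, 0.25
by_variance = gini_from_variances(labels)
by_squares = gini_from_squares(labels)
```

For these labels both routes give 0.625, matching the identity in the text.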

The Gini index is simple and quickly computed. It can also incorporate symmetric variable misclassification costs in a natural way.” (pp. 103-104)
