Information gain in decision trees

In information theory and machine learning, information gain is a synonym for Kullback–Leibler divergence: the amount of information gained about a random variable or signal from observing another random variable. In the context of decision trees, however, the term is sometimes used synonymously with mutual information, which is the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of that variable given the other.

The information gain of a random variable X obtained from an observation of a random variable A taking value ${\displaystyle A=a}$ is defined as

${\displaystyle IG_{X,A}{(X,a)}=D_{\text{KL}}{\left(P_{X|A}{(x|a)}\|P_{X}{(x|I)}\right)},}$
i.e. the Kullback–Leibler divergence of the prior distribution ${\displaystyle P_{X}{(x|I)}}$ for x from the posterior distribution ${\displaystyle P_{X|A}{(x|a)}}$ for x given ${\displaystyle A=a}$.

The expected value of the information gain is the mutual information ${\displaystyle I(X;A)}$ of X and A – i.e. the reduction in the entropy of X achieved by learning the state of the random variable A.

In machine learning, this concept can be used to define a preferred sequence of attributes to investigate so as to most rapidly narrow down the state of X. Such a sequence (which depends at each stage on the outcomes of the previously investigated attributes) is called a decision tree, and is applied in the area of machine learning known as decision tree learning. Usually an attribute with high mutual information should be preferred to other attributes.
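As a minimal sketch of the pointwise definition above, the information gained from a single observation ${\displaystyle A=a}$ is the Kullback–Leibler divergence of the prior from the posterior. The distributions below are illustrative only, not from any dataset:

```python
import math

def kl_divergence(posterior, prior):
    """Kullback-Leibler divergence D_KL(posterior || prior), in bits."""
    return sum(p * math.log2(p / q)
               for p, q in zip(posterior, prior) if p > 0)

# Hypothetical prior over two states of X, and the posterior after
# observing A = a; the numbers are purely illustrative.
prior = [0.5, 0.5]
posterior = [0.9, 0.1]

ig = kl_divergence(posterior, prior)  # information gained from A = a
print(ig)
```

Note that the divergence is zero exactly when the observation leaves the distribution of X unchanged, i.e. when nothing is learned.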

General definition

In general terms, the expected information gain is the change in information entropy Η from a prior state to a state that takes some information as given:

${\displaystyle IG(T,a)=\mathrm {H} {(T)}-\mathrm {H} {(T|a)},}$

where ${\displaystyle \mathrm {H} {(T|a)}}$  is the conditional entropy of ${\displaystyle T}$  given the value of attribute ${\displaystyle a}$ .
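The entropy term can be made concrete with a small helper; `entropy` is an illustrative name for a function computing Shannon entropy in bits from a sequence of class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A toy label sequence with three positives and two negatives.
print(entropy(["T", "T", "T", "F", "F"]))  # ~ 0.971 bits
```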

Formal definition

Let ${\displaystyle T}$ denote a set of training examples, each of the form ${\displaystyle ({\textbf {x}},y)=(x_{1},x_{2},x_{3},...,x_{k},y)}$ where ${\displaystyle x_{a}\in vals(a)}$ is the value of the ${\displaystyle a^{\text{th}}}$ attribute or feature of example ${\displaystyle {\textbf {x}}}$ and y is the corresponding class label. The information gain for an attribute ${\displaystyle a}$ is defined in terms of Shannon entropy ${\displaystyle \mathrm {H} (\cdot )}$ as follows. For a value ${\displaystyle v}$ taken by attribute ${\displaystyle a}$, let

${\displaystyle S_{a}{(v)}=\{{\textbf {x}}\in T|x_{a}=v\}}$

be defined as the set of training inputs of ${\displaystyle T}$  for which attribute ${\displaystyle a}$  is equal to ${\displaystyle v}$ . Then the information gain of ${\displaystyle T}$  for attribute ${\displaystyle a}$  is the difference between the a priori Shannon entropy ${\displaystyle \mathrm {H} (T)}$  of the training set and the conditional entropy ${\displaystyle \mathrm {H} {(T|a)}}$ .

${\displaystyle \mathrm {H} (T|a)=\sum _{v\in vals(a)}{{\frac {|S_{a}{(v)}|}{|T|}}\cdot \mathrm {H} \left(S_{a}{\left(v\right)}\right)}.}$

${\displaystyle IG(T,a)=\mathrm {H} (T)-\mathrm {H} (T|a)}$

The mutual information is equal to the total entropy for an attribute if, for each of the attribute values, a unique classification can be made for the result attribute. In this case, the relative entropies subtracted from the total entropy are 0. In particular, the values ${\displaystyle v\in vals(a)}$ define a partition of the training set data ${\displaystyle T}$ into mutually exclusive and all-inclusive subsets, inducing a categorical probability distribution ${\textstyle P_{a}{(v)}}$ on the values ${\textstyle v\in vals(a)}$ of attribute ${\displaystyle a}$. The distribution is given by ${\textstyle P_{a}{(v)}:={\frac {|S_{a}{(v)}|}{|T|}}}$. In this representation, the information gain of ${\displaystyle T}$ given ${\displaystyle a}$ can be defined as the difference between the unconditional Shannon entropy of ${\displaystyle T}$ and the expected entropy of ${\displaystyle T}$ conditioned on ${\displaystyle a}$, where the expectation is taken with respect to the induced distribution on the values of ${\displaystyle a}$.

${\displaystyle {\begin{alignedat}{2}IG(T,a)&=\mathrm {H} (T)-\sum _{v\in vals(a)}{P_{a}{(v)}\mathrm {H} \left(S_{a}{(v)}\right)}\\&=\mathrm {H} (T)-\mathbb {E} _{P_{a}}{\left[\mathrm {H} {(S_{a}{(v)})}\right]}\\&=\mathrm {H} (T)-\mathrm {H} {(T|a)}.\end{alignedat}}}$
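The formal definition can be sketched directly in Python: partition the training set by the attribute's values, then subtract the expected conditional entropy from the prior entropy. The data structure (a list of `(features, label)` pairs) and all names here are illustrative choices, not part of any standard API:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    """IG(T, a) = H(T) - sum over v of P_a(v) * H(S_a(v)).
    `examples` is a list of (features, label) pairs, where `features`
    is a dict and `attr` is one of its keys."""
    prior = entropy([y for _, y in examples])
    # Partition T into the subsets S_a(v), one per observed value v.
    partition = {}
    for x, y in examples:
        partition.setdefault(x[attr], []).append(y)
    conditional = sum(len(s) / len(examples) * entropy(s)
                      for s in partition.values())
    return prior - conditional

# Hypothetical toy data with a single binary attribute "a".
examples = [({"a": "T"}, "F"), ({"a": "T"}, "T"),
            ({"a": "F"}, "T"), ({"a": "F"}, "F"), ({"a": "F"}, "T")]
print(round(information_gain(examples, "a"), 3))
```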

Drawbacks

Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values. For example, suppose that one is building a decision tree for some data describing the customers of a business. Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has a high mutual information, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before (overfitting).

To counter this problem, Ross Quinlan proposed instead choosing the attribute with the highest information gain ratio from among the attributes whose information gain is average or higher.[1] This biases the decision tree against considering attributes with a large number of distinct values, while not giving an unfair advantage to attributes with very low information value, since the information value is greater than or equal to the information gain.[2]
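The penalizing denominator in the gain ratio is the information value (also called split information or intrinsic value): the entropy of the partition that the attribute induces on the training set. A sketch under illustrative names; an identifier-like attribute that splits n examples into n singleton subsets has information value log2(n), which shrinks its ratio:

```python
import math

def split_information(value_counts, total):
    """Information value IV(a): entropy of the partition induced by
    attribute a, given the size of each subset S_a(v)."""
    return -sum((c / total) * math.log2(c / total) for c in value_counts)

def gain_ratio(info_gain, value_counts, total):
    """Quinlan's gain ratio: information gain divided by IV(a)."""
    return info_gain / split_information(value_counts, total)

# An ID-like attribute splits 8 examples into 8 singleton subsets,
# so IV = log2(8) = 3 bits, heavily discounting its raw gain.
print(split_information([1] * 8, 8))  # 3.0
```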

Example

Let’s use this table as a dataset and use information gain to classify whether a patient is sick with a disease. Patients classified as True (T) are sick and patients classified as False (F) are not. We are currently at the root node of the tree and must consider all possible splits using the data.

Training Dataset
Patient   Symptom A   Symptom B   Symptom C   Classification
1         T           T           T           F
2         T           F           T           T
3         F           F           T           T
4         F           T           T           F
5         F           T           F           T

Candidate splits are determined by considering each variable that describes a patient and the states it can take. In this example, every symptom can be either True (T) or False (F).

Candidate Splits
Split   Child Nodes
1       Symptom A = T, Symptom A = F
2       Symptom B = T, Symptom B = F
3       Symptom C = T, Symptom C = F

Now for split #1, we determine the entropy before the split, which is computed from the classification of each patient.

${\displaystyle H(T)=-3/5\log _{2}(3/5)-2/5\log _{2}(2/5)=0.971}$

The conditional entropy of split #1 is determined by finding the entropy of each state of symptom A and combining them.

${\displaystyle H(T|a)=2/5(-1/2\log _{2}(1/2)-1/2\log _{2}(1/2))+3/5(-2/3\log _{2}(2/3)-1/3\log _{2}(1/3))=2/5(1)+3/5(0.918)=0.951}$

Information gain can then be determined by taking the difference between the prior entropy and the conditional entropy.

${\displaystyle IG(T,a)=H(T)-H(T|a)=0.971-0.951=0.02}$
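The three quantities above can be checked numerically; this is a direct transcription of the formulas, with `H` an illustrative helper taking a probability vector:

```python
import math

def H(ps):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

prior = H([3/5, 2/5])                              # H(T)   ~ 0.971
cond = 2/5 * H([1/2, 1/2]) + 3/5 * H([2/3, 1/3])   # H(T|a) ~ 0.951
ig = prior - cond                                  # IG     ~ 0.020
print(prior, cond, ig)
```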

[Figure: Example of splitting the root node]

These steps are repeated for all candidate splits to get their information gain. All candidate splits for a node use the same value for ${\displaystyle H(T)}$ .

Candidate Split Information Gains
Split   Information Gain
1       0.020
2       0.419
3       0.171

Candidate split #2 has the highest information gain, so it will be the most favorable split for the root node. Depending on the confidence of the child nodes' classifications, information gain can be applied to the child nodes as well, but the same candidate split cannot be reused.
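As a cross-check, the whole comparison of candidate splits can be reproduced with a short script over the training dataset; small rounding differences in the last printed digit are expected:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Dataset transcribed from the training table:
# (Symptom A, Symptom B, Symptom C, Classification).
patients = [
    ("T", "T", "T", "F"),
    ("T", "F", "T", "T"),
    ("F", "F", "T", "T"),
    ("F", "T", "T", "F"),
    ("F", "T", "F", "T"),
]

prior = entropy([p[3] for p in patients])
gains = {}
for i, name in enumerate(["Symptom A", "Symptom B", "Symptom C"]):
    # Group the class labels by the value of the i-th symptom.
    groups = {}
    for p in patients:
        groups.setdefault(p[i], []).append(p[3])
    cond = sum(len(g) / len(patients) * entropy(g) for g in groups.values())
    gains[name] = prior - cond
    print(f"{name}: IG = {gains[name]:.3f}")
```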