[Figure: a typical decision tree]

Why Is Entropy Used for Making Decisions in a Decision Tree?

Desik DP
Jun 16, 2021

If you are a beginner in machine learning, this question must have crossed your mind: “why is entropy crucial for making decisions in a decision tree?”. Let's dive into the topic.

What is Entropy?

In machine learning, entropy is a measure of the impurity of a group. It is used in tree-based algorithms such as decision trees to decide where to split the data, and it helps check the homogeneity of the data. Entropy is calculated using the following formula:

E = -\sum_i p_i \log_2(p_i)

where p_i is the probability of the i-th class.
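As a quick illustration, here is a minimal Python sketch of this formula (the function name and the toy label counts are my own choices, not from the article):

```python
from math import log2

def entropy(counts):
    """Entropy of a group, given the number of items of each class."""
    total = sum(counts)
    # Classes with zero items contribute nothing (0 * log2(0) is taken as 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([5, 5]))   # maximally mixed two-class group -> 1.0
print(entropy([10, 0]))  # pure group -> 0.0
```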

The entropy used in machine learning is derived from information theory. In information theory, the entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent in the variable’s possible outcomes. Information entropy is analogous to entropy in statistical thermodynamics.

As we know, entropy is a measure of disorder or uncertainty, i.e. it is an indicator of how messy your data is.

Let's try to understand this with an example. Assume you walk into an untidy room, look at the things, and estimate how messy it is (not necessarily the same for everyone). The perceived level of tidiness may differ from person to person, but you know that the items should be on the shelves and that similar objects should probably be grouped together (books with books, clothes with clothes, etc.).

[Image: a messy bedroom]

Fortunately, for data, visual inspection can be replaced by a more mathematical approach. A mathematical function exists for estimating the mess among mathematical objects, and we can apply it to our data. The requirement for this function is that it takes its minimum value when all objects in the set are of the same kind, and its maximum value when the set is a uniform mix of objects with different labels (or categories).

Why entropy in decision trees?

In decision trees, the goal is to clean up the data. You try to separate your data and group the samples into the classes they belong to. You know their labels since you construct the tree from the training set. Each time you create a new node of the tree (meaning you cut your set in two), you maximize the purity of the resulting groups as much as possible. Of course, at the end of the tree, you want a clear answer.

The figure above depicts the splitting process: we have a set of green and purple circles. The decision starts by evaluating a feature's values inside the initial set; based on those values, the items are split into two sets. In this example, the result looks tidier after the split, as most of the green circles are in Set 1 and most of the purple circles are in Set 2.
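To make this concrete, here is a rough Python sketch of such a split (the data points, the 0.5 threshold, and the helper function are invented for illustration):

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * log2(p)
    return result

# Toy data: (feature value, colour) pairs -- purely illustrative.
points = [(0.2, "green"), (0.3, "green"), (0.4, "green"), (0.45, "purple"),
          (0.6, "purple"), (0.7, "purple"), (0.8, "purple"), (0.65, "green")]

set1 = [colour for x, colour in points if x < 0.5]   # mostly green
set2 = [colour for x, colour in points if x >= 0.5]  # mostly purple

print(entropy([colour for _, colour in points]))  # mixed initial set: 1.0
print(entropy(set1), entropy(set2))               # each subset is tidier: ~0.81 each
```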

So decision trees are here to tidy the dataset by looking at the values of the feature vector associated with each data point. Based on the values of each feature, decisions are made that eventually lead to a leaf and an answer.

At each step, at each branching, you want to decrease the entropy, so this quantity is computed before the cut and after the cut. If it decreases, the split is validated and we can proceed to the next step; otherwise, we must try to split on another feature or stop this branch.

Before and after the decision, the sets are different and have different sizes. Still, entropy can be compared between these sets, using a weighted sum, as we will see in the next section.

Mathematics behind entropy

Let us assume we have N items and these items fall into two categories: n have Label 1 and m = N - n items have Label 2. Let's introduce the probabilities of the two labels:

p = n/N, \qquad q = m/N = 1 - p

The entropy of the set is then given by the following equation:

E = -p \log_2(p) - q \log_2(q)

A set or node is clean if it contains similar items (i.e. items with the same label), and messy if it contains items with different labels. When there are no items with Label 1 in the set (p = 0), or when the set is full of items with Label 1 (p = 1), the entropy is zero. If half the items have Label 1 and half have Label 2 (p = 1/2), the entropy is maximal (equal to 1, since we use log base 2).
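Plugging these values into the binary entropy formula above (with the usual convention that 0 \cdot \log_2 0 = 0) confirms this:

E(1) = -1 \cdot \log_2 1 - 0 \cdot \log_2 0 = 0

E(1/2) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = \tfrac{1}{2} + \tfrac{1}{2} = 1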

This function quantifies the messiness of the data.

How is entropy evaluated?

In decision trees, at each branching, the input set is split in two. Let us understand how to compare the entropy before and after the split. Imagine you start with a messy set with entropy 1 (half/half, p = q). In the worst case, it could be split into two messy sets where, in each set, half of the items have Label 1 and the other half have Label 2. The entropy of each of the two resulting sets is then 1. In this scenario, the messiness has not changed, and we would like the entropy to be the same before and after the split. We cannot simply sum the entropies of the two sets. A solution often used in mathematics is to compute the mean entropy of the two sets; in this case, the mean is 1. However, in decision trees, a weighted sum of the entropies is computed instead (weighted by the sizes of the two subsets):

E_{split} = \frac{N_1}{N_1 + N_2} E_1 + \frac{N_2}{N_1 + N_2} E_2

where N_1 and N_2 are the numbers of items in each set after the split and E_1 and E_2 are their respective entropies. This gives more importance to the larger set (if any). The idea is that it is a bit better if the large set gets tidier, since it requires more effort to tidy. Imagine the worst case where a set of 1000 elements is split in two: a set of 999 elements and a set of 1 element. The latter has an entropy of zero since it contains only one element, hence one label. But this is not really important, as the vast majority of the data is still messy in the larger set. So the two sets should be given importance relative to their size.
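Here is a minimal sketch of this weighted comparison, reusing the worst case described above (the function names and counts are my own choices):

```python
from math import log2

def entropy(counts):
    """Entropy of a group, given the number of items of each label."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def weighted_entropy(counts1, counts2):
    """Size-weighted entropy of the two subsets produced by a split."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return (n1 / n) * entropy(counts1) + (n2 / n) * entropy(counts2)

# Worst case from the text: 1000 half/half items split into 999 + 1 items.
before = entropy([500, 500])                  # 1.0
after = weighted_entropy([500, 499], [0, 1])  # ~0.999: almost no improvement
print(before, after)
```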

If you have more than two labels, you can generalize the entropy formula as follows:

E = -\sum_{i=1}^{k} p_i \log_2(p_i)

where the p_i are the fractions of elements with each of the k labels in the set. It is quite straightforward!
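As a quick worked example (the proportions are my own choice): a set in which half the items carry one label and the rest are split evenly between two other labels (p = 1/2, 1/4, 1/4) has entropy

E = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5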

Conclusion

We have seen that entropy is not just a mathematical formula. It has a simple interpretation that everyone can understand. Now that you see what entropy is, you should have a clearer idea of what decision trees are doing.

By using entropy, decision trees tidy the data more than they classify it.
