Decision tree fundamentals - Gini impurity and entropy | ML foundations | ML in Julia [Lecture 15]

Gini impurity measures how often a randomly chosen element from the dataset would be incorrectly labeled if it were labeled at random according to the distribution of labels in the dataset.

Gini impurity = 1 - Σ p_i^2
where p_i is the probability of the i-th class.
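A minimal Julia sketch of this formula (the function name gini_impurity is my own, not from the lecture):

# Gini impurity of a node, given its vector of class probabilities
# (the p_i values are assumed to sum to 1).
gini_impurity(p::AbstractVector{<:Real}) = 1 - sum(p .^ 2)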

Compare this with entropy:
Entropy = -Σ p_i * log2(p_i)
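And a matching sketch for entropy (again my own naming; terms with p_i = 0 are skipped, since 0 * log2(0) is taken to be 0):

# Entropy of a node in bits, given its vector of class probabilities.
entropy(p::AbstractVector{<:Real}) = -sum(x * log2(x) for x in p if x > 0)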

When you have only one class, p_1 = 1.

Then Gini impurity = 0 = Entropy

If you have many classes (n classes with 1 item in each), then each p_i = 1/n, so
Gini impurity = 1 - 1/n → 1 (as n grows)
Entropy = log2(n)
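
These limiting cases can be checked with the sketch functions above (the values in the comments are what the formulas predict):

p_pure = [1.0]              # a single class
gini_impurity(p_pure)       # == 0.0
entropy(p_pure)             # == 0.0

n = 4
p_uniform = fill(1/n, n)    # n classes, one item in each
gini_impurity(p_uniform)    # 1 - 1/n = 0.75
entropy(p_uniform)          # log2(n) = 2.0 bits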

Both Gini impurity and entropy measure the same idea (how mixed the classes in a node are), but Gini impurity is cheaper to compute because it avoids the logarithm.

If all data points in a node belong to the same class, its Gini impurity = 0. Such nodes are called “pure leaves”.
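
A quick sketch of this check on raw labels (node_gini is a hypothetical helper, not code from the lecture):

# Gini impurity of a node from its raw label vector: estimate the class
# probabilities from the label counts, then apply 1 - Σ p_i^2.
function node_gini(labels)
    n = length(labels)
    p = [count(==(c), labels) / n for c in unique(labels)]
    return 1 - sum(p .^ 2)
end

node_gini(["cat", "cat", "cat"])         # 0.0 -> a pure leaf
node_gini(["cat", "dog", "cat", "dog"])  # 0.5 -> maximally mixed for 2 classes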

So when to use Gini and when to use entropy?

⇒ Use Gini impurity when computational efficiency is critical, or when class proportions are imbalanced.
⇒ Use entropy when you want a deeper, information-theoretic view of the information gain, or when you are dealing with balanced data and prefer a probabilistic approach.
