Entropy and information gain are the measures a decision tree uses to decide how to split its data. Before we get into them, let us first understand the decision tree and its uses.

## Decision Tree

If you read my article on ‘Classification Algorithms in Machine Learning’, I defined a decision tree as: “The decision tree method classifies data in the form of a tree structure. A decision tree generates a set of rules that help in categorizing data given a set of attributes and their classes. It is easily understandable and is capable of dealing with both numerical and categorical data. It functions similarly to a flowchart.” This makes it one of the most powerful tools for classifying and predicting data.

## Uses

Many businesses employ decision trees to address problems. Decision trees can easily deal with complex datasets. Data analysts mainly use them to perform predictive analysis for tasks such as establishing company operations plans. We can also use a decision tree as a training algorithm for supervised learning in machine learning and artificial intelligence.

## How Does It Work?

### Terminology

*The following are some terms that you should be aware of:*

- **Splitting**: It is a process of dividing a node into two or more sub-nodes.
- **Branch**: A branch or sub-tree is a part of the entire tree.
- **Root Node**: A root node is the starting point from which all other decision, chance, and end nodes branch. It is the total population or sample.
- **Leaf Node**: A decision path’s end results are the leaf nodes. They don’t split or divide any farther.
- **Internal Node**: The nodes between the root and leaf nodes are the internal nodes. Decision and chance nodes are examples of this.
- **Pruning**: Complex decision trees give irrelevant data a lot of weight. Pruning can help prevent this by eliminating certain nodes.
- **Parent and Child Node**: A parent node is a node which splits into sub-nodes. Those sub-nodes are known as the child nodes.

### Types of Nodes

*A decision tree consists of 3 types of nodes:*

- **Decision node**: The shape of a square denotes a decision node. A decision node is a node in a situation where the flow splits into multiple pathways. A decision node assists us in making a choice.
- **Chance node**: A circle represents a chance node. A chance node displays the probability of various results.
- **End node**: A triangle represents the end nodes. An end node displays the results.

As a result of connecting these distinct nodes, we generate branches. Nodes and branches can be combined in several ways to form trees of increasing levels of complexity.

### Assumptions

*The following assumptions are made by a decision tree:*

- At the start, the entire training set is treated as the root.
- Before creating the model, continuous feature values need to be discretized. Ideally, the feature values are categorical.
- Records are distributed recursively down the tree on the basis of their attribute values.
- A statistical technique determines which attributes we should place as the tree’s root or internal nodes. The decision tree is based on the Sum of Product (SOP) representation, which is another name for disjunctive normal form.
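To make the Sum of Product idea more concrete, here is a minimal sketch in Python. The toy tree, its attribute names (`outlook`, `humidity`, `windy`) and the helper functions are all hypothetical and made up for illustration. Each root-to-leaf path becomes a product (AND) of attribute tests, and the paths that end in the same class are joined into a sum (OR).

```python
# Hypothetical toy tree (not from any real dataset).
# An internal node is a tuple: (attribute, {value: subtree}); a leaf is a class label.
toy_tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})


def paths_to(tree, target, prefix=()):
    """Yield every root-to-leaf path, as (attribute, value) tests, that ends in `target`."""
    if not isinstance(tree, tuple):  # reached a leaf
        if tree == target:
            yield prefix
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from paths_to(subtree, target, prefix + ((attribute, value),))


def as_dnf(tree, target):
    """Write the rules for `target` as an OR of AND-terms (disjunctive normal form)."""
    terms = [" AND ".join(f"{a}={v}" for a, v in path)
             for path in paths_to(tree, target)]
    return " OR ".join(f"({term})" for term in terms)


rule = as_dnf(toy_tree, "yes")
print(rule)
# (outlook=sunny AND humidity=normal) OR (outlook=overcast) OR (outlook=rainy AND windy=false)
```

Reading the tree this way is also why a decision tree is said to generate a set of rules: every AND-term is one rule.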

The major issue in a decision tree is determining which attribute to place at the root node and at each level below it. This is what we call “attribute selection”.

### Benefits

*A decision tree provides the following benefits:*

- Decision trees are simple and easy to understand.
- It has the ability to solve both classification and regression problems.
- A decision tree conducts classification without requiring a lot of processing power.
- It can work with continuous data, as well as categorical data.
- A decision tree can be used in combination with other decision-making tools with ease.
- It also assists in the prediction of outcomes.

### Limitations

*Some drawbacks of the decision tree are:*

- A decision tree is a weak learner. A single decision tree rarely produces excellent results. To create better ensemble models, several trees are frequently joined to form forests.
- They are prone to overfitting to the training data.
- A decision tree might be sensitive to outliers.
- They can lose information on continuous variables, because the values have to be discretized into ranges first.
- It’s possible for outcomes to be skewed in favor of the dominant class when using an uneven dataset.

## Classification of Decision Tree

### Entropy and Information Gain

#### Entropy

It is a metric used in information theory that evaluates the impurity or uncertainty in a set of data. It controls how a decision tree splits data. In simple words, entropy tells us how unpredictable the outcome of a random variable is. It enables us to determine how definite or unsure a random variable is, as well as how much knowledge we would acquire if we knew its value.

When all observations belong to the same class, the entropy is always zero.

There is no impurity in such a dataset. As a result, there is nothing left for a split to learn from it. On the other hand, if we have a dataset with two classes that are equally represented, the entropy will be one, which is the maximum for two classes. This type of dataset is where a good split is useful for learning.

Assume, for example, that my data contains repeated numbers, and I count how many times each number occurs. Now, if my data just included one number, “17,” and it was repeated 100 times, the entropy would be zero since all of the observations would fall under the number 17. On the other hand, if my data comprised the number “17” repeated 50 times and the number “6” repeated 50 times, the entropy would be one because the two values, 17 and 6, are equally likely. An uneven mix, such as 90 and 10, gives an entropy between zero and one.

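*The following is the formula for calculating entropy:*

$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

Here $p_i$ represents the probability of an element/class $i$ in the dataset $S$, estimated as its relative frequency (the maximum-likelihood estimate), and $c$ is the number of classes.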
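As a quick check of the numbers in the example above, here is a minimal Python sketch of entropy measured in bits (log base 2). The function name `entropy` and the sample lists are my own illustrative choices, and the probability of each class is assumed to be its relative frequency in the data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels.

    Each class probability p is estimated as count / total.
    The sum of p * log2(1 / p) is the same quantity as -sum(p * log2(p)).
    """
    total = len(labels)
    counts = Counter(labels)
    return sum((c / total) * log2(total / c) for c in counts.values())


pure = [17] * 100              # one value only: no uncertainty
balanced = [17] * 50 + [6] * 50  # two equally likely values
skewed = [17] * 90 + [6] * 10    # two values, uneven mix

print(entropy(pure))      # 0.0
print(entropy(balanced))  # 1.0
print(entropy(skewed))    # roughly 0.47
```

The skewed case shows that having two classes is not enough to reach an entropy of one; the classes also need to be evenly mixed.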

#### Information Gain

It is the measure of how much knowledge an attribute provides about a class, and it also aids in determining the order of attributes in decision tree nodes. It is a critical component in a decision tree because the attribute with the highest information gain is the first one to be tested, or split on, in the tree. The information gain helps in assessing how well the nodes in a decision tree split. Therefore, the decision tree will always seek to maximize information gain.

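*We use the following formula for calculation:*

$$IG(S, A) = E(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, E(S_v)$$

Here $E(S)$ is the entropy of the target over the whole dataset $S$, and $S_v$ is the subset of $S$ in which feature $A$ takes the value $v$. We can calculate the information gain of each feature from the entropy of the subsets it produces. In simple words, the information gain is the expected reduction in entropy that results from splitting the data on a feature.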

*To calculate the information gain:*

- Firstly, you need to calculate the **entropy of target**.
- Secondly, calculate the **entropy** for every single feature, that is, the weighted average entropy of the subsets produced by splitting on that feature.
- Thirdly, subtract this **entropy** from the **entropy of target**.
- Finally, we have calculated the information gain.
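The steps above can be sketched in Python as follows. This is only an illustration under my own assumptions: the four-row dataset, the column names (`outlook`, `windy`, `play`) and the function names are hypothetical, and the entropy of a feature is taken to be the weighted average entropy of the groups it creates.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Entropy in bits; p * log2(1 / p) is the same as -p * log2(p)."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    # Step 1: entropy of the target over all rows.
    target_entropy = entropy([row[target] for row in rows])

    # Step 2: group the target labels by attribute value, then take the
    # weighted average of the entropy of each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    split_entropy = sum(len(g) / len(rows) * entropy(g) for g in groups.values())

    # Step 3: the difference is the information gain.
    return target_entropy - split_entropy


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains)  # {'outlook': 1.0, 'windy': 0.0}
print(best)   # outlook
```

Since `outlook` has the highest information gain here, it is the attribute a tree would test first, which is the attribute selection step described earlier.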