Decision trees are a non-parametric, multi-class classification technique. They are among the oldest classification techniques and form the basis of more advanced methods such as:
- Random forests
- Gradient Boosting
Decision trees work on the principle of entropy minimization. Entropy means chaos: a decision tree tries to minimize chaos and organize the data as much as possible. At each step, it chooses the split that makes the resulting groups as homogeneous (pure) as possible, which leads to the most accurate result.
Some of the commonly used terms in Decision Trees:
1. Root Node: It represents the entire population or sample, and it further gets divided into two or more homogeneous sets.
2. Splitting: The process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf / Terminal Node: Nodes that do not split further are called leaf or terminal nodes.
5. Pruning: Removing the sub-nodes of a decision node is called pruning. You can think of it as the opposite of splitting.
6. Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.
The following diagram illustrates the terms given above.
Note: A is the parent node of B and C.
To determine which variable a node should split on (for example, which factors contribute to the profitability of a company), there are three distinct splitting techniques:
1. Gini Index.
2. Chi - Square.
3. Information Gain.
Steps to calculate the Gini index for a split (a short sketch follows the list):
- Calculate the Gini score for each sub-node as the sum of the squared probabilities of success and failure (p^2 + q^2).
- Calculate the Gini score for the split as the weighted average of the Gini scores of its sub-nodes; with this convention, the split with the higher score is preferred.
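A minimal sketch of the calculation above in plain Python. The success/failure counts are made-up numbers chosen only to illustrate the arithmetic.

```python
def gini_node(successes, failures):
    """Gini score of a node: p^2 + q^2 (higher means a purer node)."""
    total = successes + failures
    p = successes / total
    q = failures / total
    return p ** 2 + q ** 2

def gini_split(nodes):
    """Weighted Gini score of a split, given (successes, failures) per sub-node."""
    total = sum(s + f for s, f in nodes)
    return sum(((s + f) / total) * gini_node(s, f) for s, f in nodes)

# Example: a split producing two sub-nodes with 2/8 and 13/7 success/failure counts.
print(gini_split([(2, 8), (13, 7)]))   # the candidate split with the higher score is preferred
```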
Chi-Square
It is an algorithm for finding the statistical significance of the differences between the sub-nodes and the parent node. We measure it as the sum of the standardized differences between the observed and expected frequencies of the target variable (see the short sketch after the feature list below).
Features:
- It works with a categorical target variable such as "Success" or "Failure".
- It can perform two or more splits.
- The higher the value of chi-square, the higher the statistical significance of the difference between the parent node and a sub-node.
- It generates a tree called CHAID (Chi-Square Automatic Interaction Detector).
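A minimal sketch of the chi-square computation described above, in plain Python. The counts are illustrative only, and the expected frequencies assume each sub-node keeps the parent's success/failure ratio.

```python
import math

def chi_square_node(successes, failures, parent_success_rate):
    """Standardized difference between observed and expected counts for one sub-node."""
    total = successes + failures
    expected_success = total * parent_success_rate
    expected_failure = total * (1 - parent_success_rate)
    chi_success = math.sqrt((successes - expected_success) ** 2 / expected_success)
    chi_failure = math.sqrt((failures - expected_failure) ** 2 / expected_failure)
    return chi_success + chi_failure

def chi_square_split(nodes):
    """Total chi-square of a split: the sum of the per-node values."""
    total_success = sum(s for s, f in nodes)
    total = sum(s + f for s, f in nodes)
    parent_rate = total_success / total
    return sum(chi_square_node(s, f, parent_rate) for s, f in nodes)

# Example: two sub-nodes; the split with the higher chi-square is preferred.
print(chi_square_split([(2, 8), (13, 7)]))
```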
Information Gain – (Entropy Minimization)
Entropy, as mentioned earlier, means chaos, and decision trees work by minimizing it. The chart below illustrates this.

Entropy can be calculated using the following formula:
Entropy = -(p log2 p) - (q log2 q)
Here p and q are the probabilities of success and failure, respectively, for that node. Entropy is also used with a categorical target variable. The algorithm chooses the split with the lowest entropy compared to the parent node and the other candidate splits: the lower the entropy, the better.
Steps to calculate entropy for a split (a short sketch follows the list):
- Calculate the entropy of the parent node.
- Calculate the entropy of each individual node of the split, then take the weighted average over all sub-nodes in the split.
- The information gain of the split is the parent entropy minus this weighted average; the split with the highest gain (lowest weighted entropy) is chosen.
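A minimal sketch of the entropy calculation above in plain Python. The counts are illustrative; the split with the lower weighted entropy (higher information gain relative to the parent) is preferred.

```python
import math

def entropy_node(successes, failures):
    """Entropy = -(p log2 p) - (q log2 q) for a two-class node."""
    total = successes + failures
    result = 0.0
    for count in (successes, failures):
        if count:                      # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

def entropy_split(nodes):
    """Weighted average entropy of the sub-nodes produced by a split."""
    total = sum(s + f for s, f in nodes)
    return sum(((s + f) / total) * entropy_node(s, f) for s, f in nodes)

parent = (15, 15)                      # parent node: 15 successes, 15 failures
split = [(2, 8), (13, 7)]              # candidate split into two sub-nodes
information_gain = entropy_node(*parent) - entropy_split(split)
print(entropy_node(*parent), entropy_split(split), information_gain)
```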
Advantages of Decision Tree
Easy to understand: Decision tree output is very easy to understand, even for people from a non-analytical background. No statistical knowledge is needed to read or interpret it, and the trees can be visualized.
Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relationships between two or more variables. With decision trees, we can identify the features that have the most power to predict the target variable. For example, if we are working on a problem where information is available in hundreds of variables, a decision tree helps identify the most significant ones.
Less data cleaning required: It requires less data cleaning than some other modeling techniques and is, to a fair degree, not influenced by outliers. However, this module does not support missing values.
Data type is not a constraint: It can handle both numerical and categorical variables.
Uses a white box model: If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic, as the short sketch below shows. By contrast, in a black box model (e.g. an artificial neural network), results are more difficult to interpret.
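A short sketch of the "white box" idea using scikit-learn (assumed available); the Iris dataset is used purely as an example. The fitted tree can be printed as plain if/else rules.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints the learned splits as readable Boolean conditions.
print(export_text(tree, feature_names=load_iris().feature_names))
```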
Challenges
Overfitting: Overfitting is one of the most practical difficulties for decision tree models. It is addressed by setting constraints on the model parameters and by pruning, as sketched below.
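A minimal sketch, using scikit-learn (assumed available), of constraining the model and pruning to limit overfitting; the parameter values are illustrative, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,            # cap the depth of the tree
    min_samples_split=20,   # a node needs at least 20 samples to be split
    min_samples_leaf=10,    # every leaf must contain at least 10 samples
    ccp_alpha=0.01,         # cost-complexity pruning strength
)
# tree.fit(X_train, y_train) would then grow a smaller, less overfit tree.
```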
Not fit for continuous variables: While working with continuous numerical variables, a decision tree loses information when it categorizes the variables into different bins. Binning the continuous variables yourself beforehand might lead to better results.
Unstable: Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
Greedy algorithm: The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts. Consequently, practical decision tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement, as sketched below.
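A short sketch of the ensemble mitigation mentioned above: a random forest (scikit-learn, assumed available) trains many trees on bootstrap samples of the rows and random subsets of the features, then aggregates their votes.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees in the ensemble
    max_features="sqrt",    # each split considers a random subset of features
    bootstrap=True,         # each tree sees a bootstrap sample of the rows
    random_state=0,
)
# forest.fit(X_train, y_train) is typically far more stable than a single tree.
```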
Biased tree: Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting the decision tree.
Classification vs. Regression trees
Classification trees are used when the dependent variable is categorical.
Regression trees are used when the dependent variable is continuous.
In the case of a classification tree, the value (class) assigned to a terminal node is the mode of the training observations falling in that region. Thus, if an unseen observation falls in that region, we predict the mode value.
In the case of a regression tree, the value assigned to a terminal node is the mean response of the training observations falling in that region. Thus, if an unseen observation falls in that region, we predict the mean value.
Both types of tree divide the predictor space (the independent variables) into distinct, non-overlapping regions. For the sake of simplicity, you can think of these regions as high-dimensional boxes.
Both trees follow a top-down greedy approach known as recursive binary splitting. It is called top-down because it begins at the top of the tree, where all observations lie in a single region, and successively splits the predictor space into two new branches further down the tree.
It is called greedy because the algorithm looks only for the best variable available for the current split, without considering future splits that might lead to a better tree.
This splitting process continues until a user-defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node falls below 50.
In both cases, the splitting process grows the tree until the stopping criterion is reached, but a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. A short sketch contrasting the two tree types follows.
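A minimal sketch contrasting the two tree types with scikit-learn (assumed available): the classifier predicts the mode of the training observations in a leaf, the regressor predicts their mean. The data here is synthetic, and the stopping criterion mirrors the "fewer than 50 observations per node" example above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 2)

# Classification tree: categorical target, leaf prediction = mode of the leaf.
y_class = (X[:, 0] + X[:, 1] > 1).astype(int)
clf = DecisionTreeClassifier(min_samples_leaf=50).fit(X, y_class)

# Regression tree: continuous target, leaf prediction = mean of the leaf.
y_reg = X[:, 0] * 3 + rng.normal(scale=0.1, size=200)
reg = DecisionTreeRegressor(min_samples_leaf=50).fit(X, y_reg)

print(clf.predict(X[:3]), reg.predict(X[:3]))
```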
Learn Machine Learning using Python with palin Analytics.

