
Decision trees

Decision trees are a non-parametric, multi-class classification technique. They are among the oldest classification techniques and form the basis of more advanced methods such as: 
  • Random Forests 
  • Gradient Boosting 
They work on the principle of entropy minimization. 

Entropy means chaos. A decision tree tries to minimize chaos and organize the data as much as possible. 

A decision tree chooses the splits of the data that separate the classes most cleanly, to get the most accurate result. 

Some of the commonly used terms in Decision Trees: 

1. Root Node: It represents the entire population or sample, and it gets divided further into two or more homogeneous sets. 

2. Splitting: The process of dividing a node into two or more sub-nodes. 

3. Decision Node: When a sub-node splits into further sub-nodes, it is called a Decision Node. 

4. Leaf / Terminal Node: Nodes that do not split are called Leaf or Terminal Nodes. 

5. Pruning: Removing the sub-nodes of a Decision Node is called Pruning. It is the opposite of Splitting. 

6. Branch / Sub-Tree: A sub-section of the entire tree is called a Branch or Sub-Tree. 

7. Parent and Child Node: A node that is divided into sub-nodes is called the Parent Node of those sub-nodes; the sub-nodes are its children. 

The following diagram illustrates the terms given above. 
 Note: A is the Parent Node of B and C. 

To determine which features best split the data (for example, which factors contribute to the profitability of a company), there are three distinct splitting techniques: 

1. Gini Index 

2. Chi-Square 

3. Information Gain 

Steps to calculate the Gini Index for a split: 
  • Calculate Gini for each sub-node as the sum of the squared probabilities of success and failure: p^2 + q^2. 
  • Calculate Gini for the split as the weighted Gini score of each node of the split. 
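The two steps above can be sketched in a few lines of Python (the sample sizes and success probabilities below are hypothetical, chosen only to illustrate the arithmetic):

```python
def gini_node(p_success):
    # Gini score for a node: p^2 + q^2 (higher = purer node)
    q = 1 - p_success
    return p_success ** 2 + q ** 2

def gini_split(nodes):
    # nodes: list of (n_samples, p_success) for each sub-node.
    # The split score is the sample-weighted average of the node scores.
    total = sum(n for n, _ in nodes)
    return sum(n / total * gini_node(p) for n, p in nodes)

# Hypothetical split of 30 samples into two sub-nodes:
left  = gini_node(0.2)    # 0.2^2 + 0.8^2  = 0.68
right = gini_node(0.65)   # 0.65^2 + 0.35^2 = 0.545
split = gini_split([(10, 0.2), (20, 0.65)])  # (10/30)*0.68 + (20/30)*0.545 = 0.59
```

The candidate split with the highest weighted Gini score (under this p^2 + q^2 convention) is the purest one.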
Chi-Square 

It is an algorithm that finds the statistical significance of the differences between the sub-nodes and the parent node. It is measured as the sum of standardized differences between the observed and expected frequencies of the target variable. 

Features: 
  • It works with a categorical target variable, e.g. “Success” or “Failure”. 
  • It can perform two or more splits. 
  • The higher the value of Chi-Square, the higher the statistical significance of the difference between the parent node and a sub-node. 
  • It generates a tree called CHAID (Chi-Square Automatic Interaction Detector). 
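The "standardized difference" described above can be sketched as follows, taking each cell's score to be sqrt((observed - expected)^2 / expected) and computing expected counts from the parent node's class ratio. The class counts in the example are hypothetical:

```python
import math

def chi_square_split(nodes):
    # nodes: list of (observed_successes, observed_failures) per sub-node.
    # Expected counts assume each sub-node keeps the parent's class ratio.
    total_s = sum(s for s, _ in nodes)
    total_f = sum(f for _, f in nodes)
    total = total_s + total_f
    chi = 0.0
    for s, f in nodes:
        n = s + f
        exp_s = n * total_s / total  # expected successes in this sub-node
        exp_f = n * total_f / total  # expected failures in this sub-node
        chi += math.sqrt((s - exp_s) ** 2 / exp_s)
        chi += math.sqrt((f - exp_f) ** 2 / exp_f)
    return chi

# Parent node: 15 successes, 15 failures, split into two sub-nodes.
score = chi_square_split([(2, 8), (13, 7)])  # roughly 4.58
```

The split with the highest Chi-Square score differs most significantly from the parent, so it is the one chosen.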
Information Gain – (Entropy Minimization) 

Entropy, as mentioned earlier, means chaos. 

Decision trees minimize chaos. 

The chart below explains the same. 


Entropy can be calculated using the following formula: 

Entropy = -(p log2 p) - (q log2 q) 

Here p and q are the probabilities of success and failure for that node. Entropy is also used with a categorical target variable. The algorithm chooses the split with the lowest entropy relative to the parent node and the other candidate splits. The lower the entropy, the better. 

Steps to calculate entropy for a split: 
  • Calculate the entropy of the parent node. 
  • Calculate the entropy of each individual node of the split, then take the weighted average over all sub-nodes in the split. 
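The formula and the two steps above can be sketched together; information gain is then the drop in entropy from parent to split. The sample sizes and probabilities are hypothetical:

```python
import math

def entropy(p):
    # Entropy = -(p log2 p) - (q log2 q); a zero probability contributes 0.
    q = 1 - p
    e = 0.0
    if p > 0:
        e -= p * math.log2(p)
    if q > 0:
        e -= q * math.log2(q)
    return e

def split_entropy(nodes):
    # Weighted average entropy of the sub-nodes: list of (n_samples, p_success).
    total = sum(n for n, _ in nodes)
    return sum(n / total * entropy(p) for n, p in nodes)

# Parent node: 30 samples, 15 successes -> p = 0.5, entropy 1.0 (maximum chaos)
parent = entropy(15 / 30)
children = split_entropy([(10, 0.2), (20, 0.65)])
gain = parent - children  # information gain; the best split maximizes this
```

A pure node (p = 0 or p = 1) has entropy 0, and a perfectly mixed node (p = 0.5) has entropy 1, which matches the intuition that lower entropy means less chaos.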

Advantages of Decision Tree 

Easy to understand: decision tree output is very easy to understand, even for people from a non-analytical background. No statistical knowledge is needed to read or interpret it, and trees can be visualized. 

Useful in data exploration: a decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can identify the features that have the most power to predict the target variable. For example, in a problem with information spread across hundreds of variables, a decision tree helps identify the most significant ones. 

Less data cleaning required: it requires less data cleaning than some other modeling techniques, and it is fairly robust to outliers. However, many implementations do not support missing values. 

Data type is not a constraint: it can handle both numerical and categorical variables. 

Uses a white-box model: if a given situation is observable in the model, the condition is easily explained by Boolean logic. By contrast, in a black-box model (e.g. an artificial neural network), results are more difficult to interpret. 

Challenges 

Overfitting: overfitting is one of the most practical difficulties for decision tree models. It is addressed by setting constraints on model parameters and by pruning. 

Not fit for continuous variables: when working with continuous numerical variables, a decision tree loses information when it categorizes a variable into different categories. Binning the continuous variables might lead to better results. 

Unstable: decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble. 

Greedy algorithm: the problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics, such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement. 

Biased trees: decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting a decision tree. 

Classification vs. Regression trees 

Classification trees are used when the dependent variable is categorical. 

Regression trees are used when the dependent variable is continuous. 

In a classification tree, the value (class) assigned to a terminal node is the mode of the training observations falling in that region. Thus, if an unseen observation falls in that region, we predict the mode value. 

In a regression tree, the value assigned to a terminal node is the mean response of the training observations falling in that region. Thus, if an unseen observation falls in that region, we predict the mean value. 
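The mode-versus-mean distinction is just this (the leaf contents below are hypothetical examples):

```python
from statistics import mode, mean

# Training observations that fell into one terminal region (leaf):
leaf_classes = ["yes", "yes", "no", "yes"]  # classification tree leaf
leaf_values  = [4.2, 5.0, 4.6, 4.8]         # regression tree leaf

pred_class = mode(leaf_classes)  # "yes" — the most frequent class
pred_value = mean(leaf_values)   # 4.65 — the mean response
```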

Both trees divide the predictor space (the independent variables) into distinct, non-overlapping regions. For the sake of simplicity, you can think of these regions as high-dimensional boxes. 

Both trees follow a top-down greedy approach known as recursive binary splitting. It is called top-down because it begins at the top of the tree, when all observations belong to a single region, and successively splits the predictor space into two new branches down the tree. 

It is called greedy because the algorithm cares only about the current split (it looks for the best variable available now), not about future splits that might lead to a better tree. 

This splitting process continues until a user-defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node drops below 50. 

In both cases, the splitting process grows the tree until the stopping criterion is reached. A fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data.
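The whole recursive-binary-splitting procedure for a regression tree can be sketched in plain Python for one predictor. Everything here (the function names, the MIN_SAMPLES stopping criterion, the toy data) is an illustrative assumption, not a production implementation:

```python
from statistics import mean

MIN_SAMPLES = 2  # hypothetical stopping criterion: stop below this node size

def best_split(xs, ys):
    # Greedy step: try every threshold and keep the one with the lowest
    # total squared error of the two resulting regions (this split only).
    best = None
    for t in sorted(set(xs))[1:]:
        left  = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        err = sum((y - mean(left)) ** 2 for y in left) + \
              sum((y - mean(right)) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t)
    return best[1] if best else None

def grow(xs, ys):
    # Top-down recursion; a leaf predicts the mean response of its region.
    if len(ys) < MIN_SAMPLES or len(set(xs)) == 1:
        return mean(ys)
    t = best_split(xs, ys)
    left  = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return (t, grow(*zip(*left)), grow(*zip(*right)))

def predict(tree, x):
    # Walk down the tree until a leaf (a plain number) is reached.
    while isinstance(tree, tuple):
        t, l, r = tree
        tree = l if x < t else r
    return tree

tree = grow([1, 2, 3, 4], [1.0, 1.1, 3.0, 3.1])
```

Raising MIN_SAMPLES (e.g. to 50, as in the text) stops the recursion earlier and yields a shallower tree that is less likely to overfit; using the mode instead of the mean at the leaves would turn this sketch into a classification tree.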

