In this Python ML project, we will explore an individual DecisionTreeClassifier() and learn to use pre-pruning and post-pruning techniques. Pre-pruning techniques are generally easier to use and involve setting hyperparameters that limit the growth of our decision trees. Post-pruning techniques are a little harder to work with, but they are very important to understand before applying them to a random forest of decision trees. Let's practice with pre- and post-pruning techniques in Sklearn in this Python Machine Learning Project.
Part 1
Send Data Science Teacher Brandyn a message if you have any questions
Part 2
Part 3
Part 4
Part 5
We will use cost complexity pruning techniques in the post-pruning of our DecisionTreeClassifier().
In this ML classification task, we also look at building a user-defined function to plot our predictions. In classification, it becomes very important to understand whether your errors are False Positives or False Negatives, as each will carry very different real-world costs depending on your business use case.
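Below is a minimal sketch of what such a user-defined plotting helper could look like, assuming a fitted classifier clf and held-out arrays X_test and y_test; these names, and the function name plot_predictions, are placeholders rather than the project's actual code:

```python
# Hedged sketch: `clf`, `X_test`, `y_test` are assumed to exist already;
# plot_predictions is an illustrative name, not the project's own helper.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

def plot_predictions(clf, X_test, y_test):
    """Plot a confusion matrix so False Positives and False Negatives
    can be inspected (and costed) separately."""
    disp = ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
    disp.ax_.set_title("Confusion matrix: FP vs FN")
    plt.show()
```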
An unpruned decision tree has a propensity to overfit because it just keeps growing around every little nook and cranny in our dataset. As data scientists, we need to learn to control our model, because an overfit model won't just give bad predictions; it could give crazy predictions.
Controlling overfitting with hyperparameters like max_depth will do a lot to keep the model in check. However, it is a blunt instrument: both sides of the decision tree are truncated at the same depth, even though one side of the tree might benefit from more splits without overfitting.
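As a rough illustration, here is a hedged sketch of pre-pruning with max_depth; the built-in breast cancer dataset is used purely as a stand-in for the project's data, and the variable names and max_depth value are illustrative:

```python
# Hedged sketch of pre-pruning: the dataset and max_depth value are
# stand-ins, not the project's actual data or tuned settings.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Unpruned tree: keeps splitting until every leaf is pure.
unpruned = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Pre-pruned tree: growth is cut off at the same depth on every branch.
prepruned = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

print("unpruned  train/test:", unpruned.score(X_train, y_train), unpruned.score(X_test, y_test))
print("prepruned train/test:", prepruned.score(X_train, y_train), prepruned.score(X_test, y_test))
```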
In post-pruning, we grow a full tree first and then, after examining each branch, prune the ones that aren't needed. The post-pruning technique we will use is cost complexity pruning, where we look at the ccp_alpha hyperparameter and, using a for loop, examine the tree's accuracy as we slowly increase ccp_alpha to determine where to set the parameter so that each branch grows to its optimal depth.
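A sketch of that loop, assuming the X_train/X_test split from the previous snippet, could look like this; sklearn's cost_complexity_pruning_path() supplies the candidate ccp_alpha values:

```python
# Hedged sketch of the post-pruning loop; reuses X_train, X_test, y_train,
# y_test from the pre-pruning sketch above.
from sklearn.tree import DecisionTreeClassifier

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # the last alpha prunes the tree down to a single node

train_scores, test_scores = [], []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    train_scores.append(tree.score(X_train, y_train))
    test_scores.append(tree.score(X_test, y_test))
```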
We plot the train and test accuracy scores for the unpruned decision tree, the pre-pruned decision tree, and the post-pruned decision tree. We can see that we achieve a much-improved score with pre-pruning techniques like setting max_depth. ccp_alpha is much harder to set: we grow out our tree and study it to determine where to place ccp_alpha so that each branch grows its optimal amount without overfitting. We achieve a smaller increase in the accuracy scores of our test set with this technique, but it optimizes our model further.
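One way to produce that comparison, continuing the hedged sketch above (picking the "best" alpha by test accuracy is illustrative, not necessarily the project's exact selection rule):

```python
# Hedged sketch: plot train/test accuracy against ccp_alpha and refit a
# post-pruned tree at the alpha with the best test score.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier

fig, ax = plt.subplots()
ax.plot(ccp_alphas, train_scores, marker="o", label="train")
ax.plot(ccp_alphas, test_scores, marker="o", label="test")
ax.set_xlabel("ccp_alpha")
ax.set_ylabel("accuracy")
ax.legend()
plt.show()

best_alpha = ccp_alphas[np.argmax(test_scores)]
postpruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42).fit(X_train, y_train)
print("postpruned train/test:", postpruned.score(X_train, y_train), postpruned.score(X_test, y_test))
```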