The Best Kept Secret about Decision Trees

Kate Blizinski
edited July 2022 in Altair RapidMiner

BY: Natasha Mashanovich, Lead Data Scientist at Altair

“I am a future teller.”

“I tell a naked truth.”

“I can read others’ great minds.”

“Everyone loves me.”

“I am … a DECISION TREE!”

Synopsis
Decision trees are widely used for predictive modelling and machine learning; their other capabilities, however, are less acknowledged. Unlike most machine learning techniques, a decision tree is a fully transparent model and as such can be utilized as a surrogate for complex analytical models such as neural networks. Also, the ability to visualize a decision tree makes it an ideal insight tool, especially for capturing complex interactions.

Through the prism of several selected use cases, this blog demonstrates the multi-purpose application of decision trees. Firstly, we demonstrate their commonly used predictive abilities. Secondly, we disclose our methodology for explaining black-box models using a decision tree as a surrogate model. Thirdly, we exhibit their capabilities to derive insights from different business perspectives. Finally, we present the Altair Analytics Workbench Decision Tree Editor.

Did you know the following about a decision tree?

  • It’s multi-purpose.
  • It’s simple.
  • It’s one of a few modelling techniques that can be visualized.
  • As a picture, it tells a thousand words.
  • It can be a classifier – binary or multi-class – or a regressor.
  • It’s easy to deploy in any programming language.
  • It has simple and fast hyperparameter optimization.
  • Tree growing can be algorithmic, manual, or combined.
  • It’s a building block of other modelling techniques, such as Random Forest.
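To illustrate the deployment point in the list above, here is a minimal sketch of a fitted tree stored as nested rules, scored in Python and rendered as a SQL CASE expression. The tree structure, the feature names (loan_duration, age) and all numbers are hypothetical and chosen only for illustration; this is not the code generated by any particular tool.

```python
# A tiny hand-written tree. Internal nodes test "feature <= threshold";
# leaves carry a predicted probability. All names and numbers are hypothetical.
TREE = {
    "feature": "loan_duration", "threshold": 24,
    "left": {"leaf": 0.10},
    "right": {
        "feature": "age", "threshold": 30,
        "left": {"leaf": 0.45},
        "right": {"leaf": 0.25},
    },
}


def predict(node, row):
    """Walk the tree for one record (a dict of feature values)."""
    while "leaf" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


def to_sql(node, indent=0):
    """Render the same tree as a nested SQL CASE expression."""
    pad = "  " * indent
    if "leaf" in node:
        return f"{pad}{node['leaf']}"
    return (
        f"{pad}CASE WHEN {node['feature']} <= {node['threshold']} THEN\n"
        f"{to_sql(node['left'], indent + 1)}\n"
        f"{pad}ELSE\n"
        f"{to_sql(node['right'], indent + 1)}\n"
        f"{pad}END"
    )


print(predict(TREE, {"loan_duration": 36, "age": 27}))
print(to_sql(TREE))
```

Because a tree is nothing more than nested if/else rules, the same structure translates to any other language in much the same way.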

 

Predicting the future
Model prediction
A decision tree (DT) is a staple modelling technique. It may not be the most powerful, although you never know until you have tested it, and it is a great technique for rapid models, feasibility studies or benchmark models. Decision trees can be used as classifiers, for either binary or multi-class prediction, or as regressors for predicting numeric values. The most popular algorithms utilized for growing a decision tree are ID3, C4.5, CART, CHAID and MARS.
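For readers curious about what "growing" a tree means in practice, the sketch below searches for the single best binary split on one numeric input using Gini impurity, the criterion commonly associated with CART classification trees. It is a simplified illustration on hypothetical toy data (loan duration and a default flag), not the implementation used by any specific product; a full algorithm repeats this search over all inputs and recursively on each child node.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # start from the unsplit node
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: loan duration in months and a default flag.
duration = [6, 12, 12, 18, 24, 36, 48, 60]
default = [0, 0, 0, 0, 1, 0, 1, 1]
print(best_split(duration, default))
```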

Here is a use case example from the finance sector (Table 1). Three different decision tree techniques, C4.5, CART, and BRT (a proprietary analytics algorithm), are utilized to build model candidates predicting the probability of default.

image

 

image

The decision tree models are trained using different algorithms with similar node settings (Figure 1). The best performing model for this use case is the Binary Response Tree (BRT), a proprietary analytics algorithm for binary classifiers. In this example, the difference between the BRT and CART models is minor; hence, the choice of the model champion could be arbitrary or based on the decision variables included in each model.

Inside a black-box
Distilling a machine learning model
Model distillation, often referred to as knowledge distillation, is the process of transforming a complex model into a simpler proxy model that should be easier to explain, faster to run, and less expensive to implement.

Many analytical solutions offer special modules for model distillation; however, a simple methodology involving a decision tree can be as effective as more complex distillation techniques.

Model distillation methodology
A model distillation methodology using an analytics workflow is depicted in Figure 2. It involves several elementary steps. After building a modelling view, the first step is data partitioning: at minimum two, but ideally three, partitions (train, valid and test) should be used. The next step is to build a machine learning (ML) model. Once satisfied with the model performance, proceed with model distillation. The process itself is straightforward:

  1. Score the ML model on a separate dataset (that is the valid partition in Figure 2). Using a dataset scored on the ML model is the key to this methodology as the ML’s predicted values become the target variable in step 4 below.
  2. Select a regression decision tree (DT), such as CART, for model distillation.
  3. Select the same model inputs as for the ML model.
  4. Train the DT model by selecting the ML’s predicted values to be the target variable. Vary the tree node size either on a single tree or create several distilled candidates as in Figure 2 (for example: 25, 10 and 5%, respectively).
  5. Visually examine the structure of distilled DT candidates, their performance and model predictors (Figure 3).
  6. Train a benchmark model. This is an optional step and typically involves building a decision tree model, predicting the same target variable, and having the same inputs as the ML model. The purpose of the benchmark model is to assess the quality of distilled model candidates and the similarity of predictors.
  7. Score a dataset with all models: the ML, the distilled models, and the benchmark model (that is the test partition in Figure 2).
  8. Compare the models using a preferred performance metric (Figure 4).
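The steps above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: black_box is a made-up scoring formula standing in for a trained ML model, the data are random, the regression tree is a small hand-rolled CART-style learner rather than a production implementation, and R-squared against the ML scores is just one possible metric for step 8. The optional benchmark model (step 6) is omitted.

```python
import random


def black_box(row):
    """Stand-in for the trained ML model's score (hypothetical formula)."""
    x1, x2 = row
    return 0.8 if (x1 > 0.5 and x2 > 0.3) else 0.2 + 0.1 * x1


def mean(xs):
    return sum(xs) / len(xs)


def sse(ys):
    """Sum of squared deviations from the mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def grow(rows, targets, min_node):
    """Grow a regression tree by greedy squared-error reduction."""
    best = None
    for f in range(len(rows[0])):
        order = sorted(range(len(rows)), key=lambda i: rows[i][f])
        # Both children must keep at least min_node records.
        for k in range(min_node, len(rows) - min_node + 1):
            lo, hi = rows[order[k - 1]][f], rows[order[k]][f]
            if lo == hi:
                continue
            cost = (sse([targets[i] for i in order[:k]])
                    + sse([targets[i] for i in order[k:]]))
            if best is None or cost < best[0]:
                best = (cost, f, (lo + hi) / 2, order[:k], order[k:])
    if best is None or best[0] >= sse(targets):
        return {"leaf": mean(targets)}  # no allowed or useful split
    _, f, thr, li, ri = best
    return {
        "feature": f, "threshold": thr,
        "left": grow([rows[i] for i in li], [targets[i] for i in li], min_node),
        "right": grow([rows[i] for i in ri], [targets[i] for i in ri], min_node),
    }


def predict(node, row):
    while "leaf" not in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def r_squared(actual, predicted):
    m = mean(actual)
    ss_tot = sum((a - m) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_res / ss_tot


random.seed(0)
data = [(random.random(), random.random()) for _ in range(600)]
valid, test = data[:300], data[300:]

# Step 1: score the valid partition with the ML model.
valid_scores = [black_box(r) for r in valid]

# Steps 2-4: train regression trees on those scores at several node sizes.
surrogates = {
    pct: grow(valid, valid_scores, max(1, int(len(valid) * pct / 100)))
    for pct in (25, 10, 5)
}

# Steps 7-8: score the test partition and compare each surrogate with the ML model.
test_scores = [black_box(r) for r in test]
fidelity = {
    pct: r_squared(test_scores, [predict(tree, r) for r in test])
    for pct, tree in surrogates.items()
}
print(fidelity)
```

The key design choice is in step 1: the surrogate never sees the original target, only the ML model's scores, so whatever structure the tree finds is structure in the ML model's behaviour.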

image

image

Training a distilled model on the prediction scores of an ML model is an implicit way of extracting the ML’s “thinking process”, which is key to the approach. The model variables extracted by the DT would be the most predictive variables in the ML model. Therefore, the distilled model becomes a proxy or a “mental model” of the ML model. Establishing the balance between the complexity of the explainable model and its accuracy is subjective (Figure 3). Increased complexity generally means better model performance (Figure 4); however, be aware that the ability to explain distilled models diminishes with increasing complexity.

image

Although the benchmark model is optional, it is highly desirable as it provides a fair comparison between two DT models – one using the original target variable and another using the ML’s prediction as the target variable. The aim is to create a distilled model with ideally better predictive power than the benchmark model. Do not be surprised if one of the distilled models outperforms the ML model itself – not common but quite possible.

Let’s create a few distilled DT models and discuss key findings for use cases in Table 2.

image

Credit risk
We use the same use case presented in the previous section to examine whether a better machine learning credit risk model can be trained instead of the DT models. A multilayer perceptron (MLP) challenger model has been created with a better area under the curve (AUC) value on the test partition than the BRT DT model in Figure 1 – 0.84 vs. 0.80, respectively. Employing our distillation methodology, an MLP surrogate model with a minimal node size of 5% and an AUC of 0.765 on the test partition is presented in Figure 5. The surrogate model has revealed checking status, credit history, loan purpose, saving status, loan duration and applicant’s age as the main MLP model predictors. For comparison, a benchmark model with a structure similar to that of the surrogate model has not extracted age or loan duration for predicting default, although both happen to be important predictive variables in the MLP model.

image

Fraudulent transactions
The second use case is about predicting fraudulent credit card transactions. The dataset contains 29 input variables (amount and 28 anonymized variables V1-V28) and is highly unbalanced with 0.17% of target cases. Again, an MLP model is applied as an example of a black-box model. The MLP model has an AUC value of 0.985 on unseen data and has been analyzed using the distillation methodology outlined.

The distilled DT model has extracted seven predictive variables, of which V17 is the first split, followed by V12 and V14 (Figure 6). A benchmark DT model has extracted 12 predictors, of which six overlap with those of the distilled model. Our perpetual aspiration of doing more with less is clearly demonstrated here; the AUC on unseen data is better on the distilled model with seven predictors than on the benchmark model with 12 predictors – 0.919 vs. 0.883. Interestingly, re-training the benchmark model but limiting the input variables to the seven predictors from the surrogate model improves its AUC value to 0.894, which provides additional confidence in the distilled model being a good proxy for the MLP model.
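A short aside on the metric used in these comparisons. With only 0.17% of target cases, a model that never predicts fraud would be 99.83% accurate, which is why AUC is the more informative measure here. The sketch below computes AUC directly from its rank interpretation: the probability that a randomly chosen target case receives a higher score than a randomly chosen non-target case, with ties counted as half. The labels and scores are made-up values for illustration only.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical unbalanced sample: two target cases among ten records.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.20, 0.30, 0.35, 0.60, 0.15, 0.40, 0.90]
print(auc(labels, scores))
```

The pairwise form is quadratic in the number of records, so it suits small illustrations; on large datasets a rank-based computation is the usual choice.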

image

Speed dating
The final use case is about finding a perfect match by predicting the likelihood of participants from a speed dating event meeting again. A Random Forest (RF) model with 500 trees, seven potential variables at each split and a minimal split size of 0.5% has been trained with an AUC value of 0.84 on unseen data. Figure 7 compares the two almost identical segments – the RF distilled model (left) and the benchmark DT model (right) – showing the parameters that contribute the most to a second date. Both DT models extracted fondness, attractiveness and being funny. Additionally, the distilled model extracted shared interests as another important aspect of a successful second date.

image

Telling a naked truth
Insights
Insight analysis is typically delivered via dashboards full of numbers, labels, charts, and diagrams with striking colors and useful drill-down mechanisms to help understand what is happening and why. A dashboard is a handy tool for business executives focusing on business-critical success factors (CSFs). Identifying CSFs is challenging for analysts, so starting with something simple and effective, such as a decision tree, is a good way forward.

Whether you are designing an executive dashboard, identifying interactions between two or more variables, carrying out a root cause analysis, enforcing some business rules, building a predictive model, discovering some useful patterns in data, or preparing a presentation for management teams – a decision tree is a great place to start.

Decision trees can be utilized in exploratory analysis either to get new insights or confirm some hypotheses. Table 3 is a selection of use cases targeting different industries, each supplemented by a decision tree revealing some interesting insights.

image

@Retail marketers
Your customers are more likely to accept a new offer if they already accepted one in the past. The decision tree output helps you identify these segments by displaying them in the reddest-colored nodes (Figure 8). The best months for running campaigns are March, June, September, and December (obviously!); however, it is better not to run a campaign on the first day of a month or to spam your customers.

image

@Retailers fighting fake customer reviews
Focus on unconfirmed purchases only, especially if the reviewers have used fewer long words in their reviews or given extreme product and/or service ratings. They are more likely to be deceivers (Figure 9)!

image

@Telco marketers running retention campaigns
Focus on customers with a higher average number of daily calls and those with more than three customer service calls (or find a root cause for it!). The likelihood of losing them is greater than 55%! (Figure 10).

image

@HR managers and employers
Those likely to leave are employees with less than 18 months in service or those with some “challenging” managers. Long-service employees are generally safer; however, be aware that motivation is key for those on a lower salary (Figure 11).

image

@Anyone involved in Customer Relationship Management (CRM)
Customer segmentation is probably the most popular analytical technique for CRM. If you are using a clustering algorithm to identify customer segments, cluster labels such as 1, 2, 3… can be unintuitive. Re-running your dataset through a decision tree with the cluster labels as the target can help you understand the clusters and attach meaningful labels to them. Here is an example of car evaluation segments “run through” a decision tree. Analyzing the tree, cluster 1 consists mainly of 2-seater cars in the highest price range and with lower safety standards. Clearly, a better label for cluster 1 would be “Sports cars” (Figure 12).
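The idea of using cluster labels as the target can be sketched as follows. The car records and cluster assignments below are hypothetical, and instead of growing a full tree the sketch only ranks the inputs by information gain, the split criterion used by ID3, to show which attribute would make the first split when the cluster label is the target.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical car records with cluster labels from an earlier clustering run.
cars = [
    {"seats": "2", "price": "vhigh", "safety": "low"},
    {"seats": "2", "price": "vhigh", "safety": "med"},
    {"seats": "2", "price": "high", "safety": "low"},
    {"seats": "4", "price": "med", "safety": "high"},
    {"seats": "4", "price": "low", "safety": "med"},
    {"seats": "more", "price": "low", "safety": "high"},
    {"seats": "more", "price": "med", "safety": "high"},
    {"seats": "4", "price": "high", "safety": "med"},
]
clusters = [1, 1, 1, 2, 2, 2, 2, 2]

# Rank the inputs by how well they separate the clusters.
ranking = sorted(
    cars[0], key=lambda f: information_gain(cars, clusters, f), reverse=True
)
print(ranking)
```

In this toy sample the number of seats separates cluster 1 perfectly, which is the kind of evidence that would justify renaming it to something more meaningful.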

image

@Everyone
Fight COVID-19 – and many other diseases – by cutting out the consumption of alcoholic beverages! Also, cut down on your sugar intake and eat more pulses. Nothing surprising but hopefully a good reminder. @Meat-eaters: eating more offal meat, such as chicken liver, is good – as is seafood (Figure 13).

image

The Altair Analytics Workbench Decision Tree Editor
The latest and the greatest
Many commercial and open-source decision tree applications and packages are available nowadays. The majority of them are designed for prediction only, neglecting visualization as a key feature for insight or knowledge distillation. A remedy for this shortcoming is the Altair Analytics Workbench Decision Tree Editor, equipped with the following great features:

Prediction and insights capabilities

• Regression and classification (binary and multi-class)
• Setting the dependent variable, target category, weight variable and independent variables
• Assigning variable treatment: interval, nominal, ordinal
• Tree growth: automatic, manual, combined
• Auto-growth algorithms: CART, C4.5, proprietary BRT
• Auto-growth options: pruning, node size, tree depth, include or exclude missing values
• Manual growth: optimal, equal width, equal height, Winsorized and manual
• Manual growth settings: optimal binning measures, number of child nodes, node size, include or exclude missing values, force monotonicity
• Optimal binning measures: information value, Gini variance, entropy variance and chi-square
• SQL and SAS language deployment code
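
As an illustration of one of the binning measures listed above, the sketch below computes information value (IV) and the per-bin weight of evidence (WoE) for a candidate binning. It uses the textbook credit-scoring definition, IV = Σ (p_non − p_evt) × ln(p_non / p_evt); whether the editor uses exactly this formulation is an assumption, and the bin counts are made up.

```python
from math import log


def information_value(bins):
    """bins: list of (non_events, events) counts per bin.

    Returns (IV, list of per-bin WoE). Assumes every bin has at least
    one event and one non-event, otherwise the logarithm is undefined.
    """
    total_non = sum(n for n, _ in bins)
    total_evt = sum(e for _, e in bins)
    iv, woe = 0.0, []
    for non, evt in bins:
        p_non, p_evt = non / total_non, evt / total_evt
        w = log(p_non / p_evt)  # weight of evidence for this bin
        woe.append(w)
        iv += (p_non - p_evt) * w
    return iv, woe


# Hypothetical binning of one variable into three child nodes.
bins = [(400, 20), (300, 30), (100, 50)]
iv, woe = information_value(bins)
print(round(iv, 4), [round(w, 3) for w in woe])
```

A binning whose bins all share the same event rate has an IV of zero; the more the event rate differs between bins, the higher the IV, which is what makes it useful for choosing split boundaries.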

Visualization capabilities
• Fully-interactive editor
• Tree-growth control: split/join nodes, change boundary values
• Quick access buttons for key functions
• Node information and frequencies
• Node chart and table view
• Copy node information as table function
• Copy node information as image function
• Copy node records
• Heat map of the full tree structure
• Color coded nodes based on concentration of the target category
• Zoom features
• Export tree as image
• Output statistics: model info, confusion matrix
• Leaf nodes characteristics by target categories
• Full tree growth history for audit purposes

The decision tree editor is included as a standard capability with Altair Analytics Workbench. Contact us today to try it!

 

References
Table 1, Table 2 – [1] https://www.openml.org/d/260

Table 2 – [2] Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015 (https://www.openml.org/d/1597)

Table 2 – [3] https://www.openml.org/d/40536

Table 3 – [4] FAO.FBS. License: CC BY-NC-SA 3.0 IGO. Extracted from: http://www.fao.org/faostat/en/#data/FBS (for year 2017). Date of access: 15-10-2020.

Table 3 – [5] Author: Worldometer.info; Published 15 October 2020; Place of publication: Dover, Delaware, U.S.A. (https://www.worldometers.info/coronavirus/)