Weight of Evidence: A Universal Modeling Gem

Kate Blizinski, Altair Employee · July 2022 · Altair RapidMiner

BY: Natasha Mashanovich, Lead Data Scientist at Altair

Modeling myths

There are two common beliefs about predictive modeling we frequently come across:

Machine Learning models are superior to logistic regression.

What if a magical variable transformation could make logistic regression perform as well as a Machine Learning (ML) model? Would you still choose the ML model and risk the hassle of endless model training times, numerous parameter optimizations and an almost impossible model explanation?

Scorecard models can only be used in credit risk.

The traditional credit scorecard risk model is a special form of logistic regression. Its rank ordering is identical to the rank ordering of the logistic regression model. However, its scoring formula is simpler and more intuitive than the probability formula of its model equivalent. What if you were tasked to present your model to a non-technical audience? Would you rather talk about estimates, probabilities and log-odds or walk them through the scoring system and total sum of points?

For us, these are only modeling myths. Let’s prove it!

Prologue

Here is a ‘simple’ problem to solve: for the decision space shown in Figure 1, build a model to predict whether a coordinate (x, y) lies within the yellow circle. The problem could easily be solved with the Pythagorean Theorem; the objective of this exercise, however, is to test the predictive power of different modeling techniques on a complex decision problem with a quadratic relationship between the decision variable (in or out of the circle) and the (x, y) coordinates.

[Figure 1: The decision space with the yellow circle]

The decision space is simulated by creating three datasets (train, validation and test) using a uniform distribution (Table 1). The simulated data has three variables: two input variables representing the x and y coordinates, and a binary variable inCircle indicating whether an (x, y) coordinate lies within the circle (value 1) or outside it (value 0). The simulated datasets are balanced, with roughly a 50/50 split of coordinates in and out of the circle.

[Table 1: The simulated train, validation and test datasets]
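For concreteness, here is a minimal Python sketch of how such data could be simulated. The circle placement, radius and sample sizes are assumptions for illustration; the actual values are in Table 1.

    import numpy as np
    import pandas as pd

    def simulate_circle_data(n, seed):
        """Simulate n points uniformly on the unit square and flag whether
        each lies inside a circle centred at (0.5, 0.5). The radius is
        chosen so the circle covers half the square, giving the roughly
        50/50 class balance described above."""
        rng = np.random.default_rng(seed)
        x = rng.uniform(0.0, 1.0, n)
        y = rng.uniform(0.0, 1.0, n)
        radius = np.sqrt(0.5 / np.pi)               # pi * r^2 = 0.5
        in_circle = (x - 0.5) ** 2 + (y - 0.5) ** 2 <= radius ** 2
        return pd.DataFrame({"x": x, "y": y, "inCircle": in_circle.astype(int)})

    train = simulate_circle_data(10_000, seed=1)    # assumed sample sizes
    valid = simulate_circle_data(5_000, seed=2)
    test = simulate_circle_data(5_000, seed=3)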

Three modeling techniques are applied, and their predictive power compared (Table 2). All three techniques are binary classifiers with inCircle being the dependent variable.

The first modeling technique is logistic regression with x and y as inputs. The second is also logistic regression with the same settings; however, the input variables are the WoE-transformed x and y values (more on this in the next section). The third technique is a neural network, specifically a Multi-Layer Perceptron (MLP) with the structure shown in Table 3.

 

[Table 3: MLP network structure]

The model performance is assessed using two metrics: Area Under the Curve (AUC) and the misclassification (error) rate at a 0.5 probability threshold for the in/out class prediction. As expected, the MLP model has the highest AUC and the lowest error rate; what comes as a pleasant surprise, however, is the superior performance of the logistic regression model when combined with WoE-transformed independent variables. So what is the WoE transformation?

[Table 2: Performance comparison of the three models]
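As a sketch of this comparison, the two raw-input models could be fitted and scored as below, assuming the simulated datasets from the earlier snippet; the MLP architecture here is an assumption (the real one is in Table 3), and the WoE variant slots in the same way once x and y are replaced by their WoE values.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.neural_network import MLPClassifier

    X_train, y_train = train[["x", "y"]], train["inCircle"]
    X_test, y_test = test[["x", "y"]], test["inCircle"]

    models = {
        "logistic regression": LogisticRegression(),
        "MLP": MLPClassifier(hidden_layer_sizes=(25, 25), max_iter=2000,
                             random_state=0),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, prob)
        # misclassification rate at the 0.5 probability threshold
        error = ((prob >= 0.5).astype(int) != y_test).mean()
        print(f"{name}: AUC={auc:.3f}, error rate={error:.3f}")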

Instead of Wikipedia

Weight of Evidence (WoE) transformation is a special form of binning, where each bin is given a numeric value calculated using the formula below.

WoE(bin) = ln( % of non-target observations in the bin / % of target observations in the bin )    (Equation 1)

where each percentage is the bin’s share of all non-target (respectively, all target) observations.

Any binning option can be applied (manual, equal width, equal height, winsorized or optimal), however it is optimal binning which typically leads to better model performance. In theory, WoE values can be calculated for any number of bins. In practice, to prevent model overfitting, choose no more than 10 bins and avoid bins with a small number of observations.

WoE transformation requires two variables: a binary dependent variable (DV), which is generally the focus of a predictive model, and an independent variable (IV). The IV is binned and binning is guided by the DV categories, one of which must be selected as the target category, as per Equation 1.

[Figure 2]
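As a concrete illustration, here is a minimal pandas sketch of Equation 1, assuming the IV has already been binned; the function name and signature are just for illustration.

    import numpy as np
    import pandas as pd

    def woe_values(binned_iv, dv, target="Y"):
        """Compute one WoE value per bin of an already-binned IV.

        binned_iv : pandas Series of bin labels
        dv        : pandas Series holding the binary DV
        target    : the DV category chosen as the target

        Equation 1: WoE = ln(% non-target obs in bin / % target obs in bin),
        where each percentage is the bin's share of ALL non-target
        (respectively, all target) observations. Bins with a zero count
        in either class would need smoothing, omitted in this sketch."""
        counts = pd.crosstab(binned_iv, dv == target)
        dist_nontarget = counts[False] / counts[False].sum()
        dist_target = counts[True] / counts[True].sum()
        return np.log(dist_nontarget / dist_target)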

In Figure 3, the independent variable interest_rate is binned in relation to the dependent variable default and its target category ‘Y’, using optimal binning. The WoE values assigned to the bins become risk carriers indicating the level of risk associated with each bin; the lower the WoE value the riskier the bin category. In this example, the interest rates greater than 23% are associated with the highest risk and given a negative WoE value calculated as follows:

WoE(interest_rate > 23%) = ln( % of non-defaulters in the bin / % of defaulters in the bin ) < 0

[Figure 3: Optimal binning of interest_rate against default, with a WoE value per bin]

Although the binning process itself turns any variable into binned categories, the numeric values (that is, the WoE values) assigned to each bin actually transform the variable into a numeric one. Hence the WoE transformation converts all variables, including categorical ones, to numeric, a desirable property for any modeling algorithm. Smashing!

[Figure 4: WoE transformation of the categorical variable occupation]

Figure 4 is an example of the WoE transformation of a categorical variable, occupation. There are several points to emphasize here.

Firstly, since the target category is ‘Y’ – individuals with higher income – the negative WoE values are associated with higher income. In this case, occupations ‘Exec-managerial’ and ‘Prof-speciality’ have the highest concentration of individuals with higher income.

Secondly, missing values for occupation are associated with a positive WoE value of 1.01. This is the second-highest WoE value, meaning that individuals who did not state their occupation typically have lower income. This is a valuable finding: missing values can themselves be a useful piece of information. In situations like this, the WoE transformation is invaluable, as it not only handles missing values without explicit imputation but assigns them a meaningful value, so a missing occupation itself becomes an indication of lower income.

Thirdly, the WoE transformation has eliminated the need for dummy coding, which here would have required an additional 14 indicator variables, and has effectively performed a 1-to-1 mapping between categories and WoE values. Any categorical variable, regardless of its number of categories, is transformed into a single numeric variable, as the sketch below shows.
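In code, the transformation of occupation reduces to a simple lookup. A sketch follows; all the WoE values except the 1.01 for missing quoted above are made up for illustration.

    import pandas as pd

    # Hypothetical WoE values per occupation category (only the 1.01
    # for 'Missing' comes from the example above)
    occupation_woe = {
        "Exec-managerial": -1.20,
        "Prof-speciality": -0.95,
        "Craft-repair": 0.35,
        "Missing": 1.01,
    }

    def transform_occupation(s: pd.Series) -> pd.Series:
        """Map each category (and missing values) to its WoE value:
        a 1-to-1 numeric encoding, with no dummy variables needed."""
        return s.fillna("Missing").map(occupation_woe)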

Best friends forever

WoE transformations and logistic regression are a natural pairing. Because a bin’s WoE value is already, up to a constant, the log-odds of the outcome, a logistic regression fitted on a single WoE-transformed variable has a parameter estimate with an absolute value of 1 (Table 4).

[Table 4: Parameter estimate for a logistic regression with a single WoE-transformed variable]

Further calculations based on the WoE values, the logistic regression estimates and scaling parameters convert the logistic regression model into an additive model, with points assigned to every bin: the probability estimates given by the logistic regression are transformed into scores, calculated as the total sum of points across bins (Figure 5). It is the WoE transformation formula that provides the perfect match to the logistic regression model, so the model estimates can be translated into scoring points with adequate scaling.

[Figure 5: Scorecard points derived from the logistic regression model]
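One common way to do this scaling is the ‘points to double the odds’ (PDO) scheme; here is a sketch, where the scaling constants are conventional example values rather than the blog’s own.

    import numpy as np

    def bin_points(woe, beta, intercept, n_vars,
                   pdo=20, base_score=600, base_odds=50):
        """Scorecard points for one bin under PDO scaling.

        woe       : WoE value of the bin
        beta      : logistic-regression coefficient of the variable
        intercept : model intercept, shared evenly across n_vars variables
        pdo       : points that double the odds
        base_score/base_odds : anchor point, e.g. 600 points at 50:1 odds"""
        factor = pdo / np.log(2)
        offset = base_score - factor * np.log(base_odds)
        return offset / n_vars - factor * (beta * woe + intercept / n_vars)

The final score is simply the sum of the bin points across all variables, which is exactly the additive model of Figure 5.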

The miracles of WoE transformation

Transforms non-linear relationships into linear ones. With adequate binning, WoE can turn a non-linear relationship into a linear, or at least monotonic, relationship with the log-odds. Linearity in the log-odds is one of the main assumptions when training a logistic regression model, and it is often challenging to satisfy with other transformations, or requires several transformations to avoid violating it. Figure 6 (second graph) shows the result of transforming the variable age into a monotonic function of its WoE values.

[Figure 6: The variable age before and after WoE transformation]

Standardizes all model variables. Standardized inputs are a constraint for a number of predictive models. The WoE transformation ensures all variables can be interpreted on the same measurement scale, regardless of their original scales and ranges. Table 5 illustrates this quality, showing similar WoE values for bins from four variables with very different ranges: age, interest rate, salary, and home ownership.

[Table 5: Similar WoE values across bins of age, interest rate, salary and home ownership]

Prevents model overfitting. Overly specific ranges (that is, too many bins) can cause model overfitting. Good coverage across all categories leads to a more robust model, hence it is crucial to create bins with ‘enough’ cases; in practice, a minimum bin size of 5% of observations is a general rule of thumb (Figure 7).

[Figure 7: Rule-of-thumb minimum bin size]
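A trivial sketch of that check, flagging bins below an assumed 5% threshold (the merging logic itself is left out):

    import pandas as pd

    def undersized_bins(binned_iv: pd.Series, min_share: float = 0.05):
        """Return the bins whose share of observations falls below the
        rule-of-thumb minimum, as candidates for merging with a neighbour."""
        share = binned_iv.value_counts(normalize=True)
        return share[share < min_share].index.tolist()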

Transforms the logistic regression model into an additive model with scoring points. This is an ideal format for a non-technical audience, who can understand it, provide feedback, and state requirements that can easily be merged back into the model if needed (Figure 5).

Applicable to any data, numerical or text (Table 6).

Applicable to any data measurement scale – interval, ordinal, or nominal. (Table 6)

[Table 6: WoE transformation across data types and measurement scales]

 

Transforms all variables into numeric values. (Table 6)

One-to-one transformation regardless of the number of bins (Table 6). The alternative is indicator variables (that is, dummy variables). Many predictive models perform an internal transformation of categorical variables into dummy variables, which becomes a problem when a variable has many categories; specific models such as logistic regression also require the specification of a reference category.

Logistic regression with WoE-transformed variables is a fully transparent model. This is a major advantage over ML models (Figure 5 and Figure 8).

Model performance comparable to ML models. See the Prologue section, or read the rest of the blog!

Provides implicit imputation of missing values. Missing values can be binned separately, and a WoE value calculated (Figure 4). Alternatively, they can be merged with other bins with a similar WoE.

Provides implicit treatment of outliers. Like missing values, outliers plague many modeling techniques and need to be treated to prevent them from spoiling a model. The binning process involved in the WoE transformation implicitly solves this problem without any further intervention: any extreme salary, for example, is simply captured by the bin with the highest salary range and takes that bin’s WoE value.
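Both of the last two points follow directly from how the bins are built; here is a sketch with made-up salary cut points.

    import numpy as np
    import pandas as pd

    salary = pd.Series([15_000, 35_000, 95_000, np.nan, 10_000_000])

    # Open-ended edge bins absorb outliers: the 10-million salary simply
    # lands in the top bin and gets that bin's WoE value
    bins = pd.cut(salary, [-np.inf, 20_000, 40_000, 80_000, np.inf])

    # Missing values stay in their own bin and receive their own WoE value
    bins = bins.cat.add_categories("Missing").fillna("Missing")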

The fellowship of the scorecard model

Despite the growing popularity of machine learning models and their superior performance, there is a significant drawback: they are ‘black-box’ models, and the inability to explain what constitutes a machine learning model is a key obstacle to acceptance by business users. Hence, World Programming data scientists had an imperative quest: to come up with a simple, justifiable solution that is intuitive and easy to interpret whilst still being as predictive as a good black-box model. The modeling fellowship of WoE transformation, logistic regression and model scaling, often referred to as the scorecard model, has become our preferred modeling methodology, not only for credit risk models but for many other business models developed for our clients. Under champion-challenger scrutiny, our scorecard models consistently outperform other candidate models on performance criteria.

In 2019, we published a blog about our modeling solution for predicting the 2019 Rugby World Cup winner. Several data scientists from our solutions team were tasked to create a predictive model on the emerging topic of the Rugby World Cup.

We opted for a scorecard model: a quick and simple model that is easy to present to an audience but still powerful. Using the WoE transformation with optimal binning and monotonic behavior for selected input variables, we translated a logistic regression model into an additive one based on the scoring system presented in Figure 8.

Without compromising model performance, we were able to accurately reflect historical behavior with the data patterns extracted during model training and portray the exact logic used to predict future events. This provided an ideal setting for discussion with our attendees. An ML model might have given us the same or possibly better performance, but it would have created uncertainty and reservation within the audience. Moreover, our modeling approach enabled better synergy with expert opinion, by modifying bin definitions and weights, something we could hardly have controlled with ML models.

[Figure 8: The Rugby World Cup scoring system]

Another serious drawback of ML models is hyperparameter optimization. It is almost impossible to produce a good model without it, and the problem becomes even more significant for neural networks, with their many hyperparameters and, ultimately, a large search space. Extra care is required when selecting which parameters to optimize and over which values: the total number of combinations grows exponentially, so careless choices can make model training very long. WoE transformations and logistic regression, on the other hand, run very quickly and have no hyperparameters to optimize, which is an extra bonus.

As for model deployment, scorecard models are probably the easiest to deploy. Many commercial and free analytical tools have automatic deployment into different programming languages. But even without this capability or limited coverage of deployment languages, scorecard models can be effectively implemented in virtually ANY programming language either manually or programmatically with a simple script. All we need is a series of IF-THEN-ELSE statements in a target programming language (Table 7).

[Table 7: Scorecard deployment as IF-THEN-ELSE statements]
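As an illustration of how little is needed, here is a hypothetical hand-written scorer in the same spirit as Table 7; the bins and point values are invented for illustration.

    def score(interest_rate, occupation):
        """Score one applicant with plain IF-THEN-ELSE logic; the same
        pattern translates to SQL CASE WHEN, SAS, or any other language."""
        points = 0
        if interest_rate is None:
            points += 30        # 'Missing' bin
        elif interest_rate <= 0.10:
            points += 60
        elif interest_rate <= 0.23:
            points += 45
        else:
            points += 20        # the riskiest bin from Figure 3
        if occupation in ("Exec-managerial", "Prof-speciality"):
            points += 55
        else:
            points += 35
        return points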

Last but not least, ML models have a steep learning curve, and only experienced data scientists are able to use them well. Scorecards, on the other hand, are easy to build; with the right modeling suite and a visual programming interface, the entire spectrum of data modelers, including citizen data scientists, can use them.

Upgrading the fellowship

Use WoE transformations to increase the performance of ML models.

One of our ‘Rugby World Cup’ model candidates correctly predicted the top three teams, although sadly not in the right order. Nevertheless, the point here was to test whether a neural network model, given the same model inputs, would outperform the existing scorecard model. We created a couple of model challengers using MLP and tested all three models on the 2019 Rugby World Cup results (Table 8). Both MLP models, with and without WoE transformations, had the same network architecture: 10 inputs and 2 hidden layers with 25 neurons each. All three models had a similar AUC on the validation sample; interestingly, they all performed even better on the actual 2019 World Cup results. Overall, the MLP with WoE transformation had the best performance, followed by the logistic regression with the same WoE transformations, with the MLP without WoE the worst of the three.

WoE Transform Editor: a precious commodity

Applying the WoE transformation is a tough mission without a proper analytical tool. The first and obvious step is to search for a WoE package in one of the open-source languages, such as Python or R. Even though such packages do exist, the problems arise in application: performing optimal binning on categorical variables; applying a monotonic WoE transformation to numeric variables; controlling bin size; assigning a WoE value to missing values; or enforcing optimal binning, monotonicity, bin size and missing-value handling all at the same time. Additionally, we may want to choose between different binning measures, such as information value, entropy variance, Gini or chi-square. In an ideal world, we would like to visualize the binning and transformations and, as a premium feature, we would really want to do all of that interactively!

To the best of our knowledge there is no such commodity available as an open-source tool. Several commercial products offer WoE editors with more-or-less similar functionalities. Our WoE Transform Editor is one such precious commodity, offering the following functionality:

  • Setting the dependent variable, target category and frequency variable
  • Selecting variables to transform
  • Assigning variable treatment: interval, nominal, ordinal
  • Force monotonicity option
  • Binning types: optimal, equal-width, equal-height, winsorized and manual
  • Optimal binning measures: information value, Gini variance, entropy variance and chi-square
  • Ability to optimize all selected variables with a single click
  • Ability to fine-tune variable one-by-one
  • Settings for fine and coarse classing
  • Include or exclude missing values
  • Bin size settings for both count and percentage
  • Information value of transformed variables
  • Visual graphs for WoE values, bin sizes and characteristic analysis
  • WoE table with relevant column values
  • SQL and SAS language deployment code
  • Fully interactive editor

With this little gem, your modeling becomes a real pleasure.

Epilogue

Happily ever after…

Our WoE Transform Editor is included as a standard capability.

 

Comments

  • Eric Gao, August 2023

    Hi Kate,

    First, compliments on the article; it is very well articulated.

    I have a question on the WoE transformation in logistic regression and would like to hear your opinion.

    When we have a scorecard model based on WoE-transformed variables and logistic regression, we create the WoE binning and calculate the WoE values from the good/bad distribution of the training data. Of course, we check the p-values of the WoE-transformed variables in the training sample, and all final input variables should be significant.

    However, if we want to test the significance of the WoE-transformed variables on test data while adopting the same WoE binning, the re-calculated WoE values can differ from those of the training data, because the good/bad distribution is different in the test data. In this case, if one wants to test variable significance (p-values), should we use the WoE values calculated from the training data or from the test data?

    Thank you