Do you know how the Multicollinearity Analysis node works in Knowledge Studio?
A client recently asked me about the Multicollinearity Analysis node functionality in Knowledge Studio. Specifically, they wanted to know how the selection happens once a threshold is defined: if two variables are highly correlated, which of the two is kept? As other users might have the same question, I decided to write a blog post!
Let me start with a brief overview of the Multicollinearity Analysis node, and then I'll explain how the variable selection is applied.
The Multicollinearity Analysis node is in the Profile palette. It detects multicollinearity among numeric variables and automatically removes highly correlated variables based on a user-defined cut-off value of the correlation coefficient.
Unlike the Correlations tab in datasets, which calculates the correlation coefficients for all variables and shows the correlation matrix, the Multicollinearity Analysis node also helps you decide which of the correlated variables should be removed. It is especially useful when you have many variables and it is hard to make decisions just by viewing the correlation matrix.
How is the variable selection applied?
Once you click the Analyze button, the Maximum and Average absolute correlations are calculated for each variable and shown in the Analyzed variables frame.
Maximum Absolute Correlation is the maximum absolute value of the correlation coefficients of the given variable with all other variables.
Average Absolute Correlation is the average absolute value of the correlation coefficients of the given variable with all other variables.
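To make the two definitions above concrete, here is a minimal Python sketch that computes both values for a handful of made-up columns. It assumes the Pearson correlation coefficient (the post does not say which coefficient the node uses), and the names `pearson`, `correlation_summary`, and the sample columns `a`, `b`, `c` are hypothetical, chosen only for illustration. It is not Knowledge Studio's actual implementation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumption: the node's coefficient is not specified in the post;
    Pearson is used here only as a common default.
    Raises ZeroDivisionError if either list is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_summary(columns):
    """For each variable, return its maximum and average absolute
    correlation with all other variables, plus the variable that
    produced the maximum (the "partner")."""
    summary = {}
    for name, values in columns.items():
        # Absolute correlation with every other variable (self excluded).
        abs_corrs = {
            other: abs(pearson(values, other_values))
            for other, other_values in columns.items()
            if other != name
        }
        partner = max(abs_corrs, key=abs_corrs.get)
        summary[name] = {
            "max_abs": abs_corrs[partner],
            "partner": partner,
            "avg_abs": sum(abs_corrs.values()) / len(abs_corrs),
        }
    return summary


# Made-up sample data, for illustration only.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
}
summary = correlation_summary(columns)
for name, stats in summary.items():
    print(name, stats)
```

Here `a` and `b` are perfectly correlated, so each reports the other as its partner, while `c` is only moderately (and negatively) correlated with both; taking absolute values means the sign does not matter.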
All columns in the Analyzed variables frame are sortable.
To automatically select the variables whose Maximum Absolute Correlation does not exceed a certain threshold, enter the desired threshold value in the Max absolute correlation cut-off box and click Select.
Some variables with a Maximum Absolute Correlation exceeding the cut-off will also be selected, according to this rule: suppose variable V1 has a Maximum Absolute Correlation greater than the cut-off, and the corresponding highly correlated variable is V2. The Average Absolute Correlation values of V1 and V2 are then compared, and the variable with the lower Average Absolute Correlation is selected, while the other is excluded.
Example: Assume the user-defined cut-off value is 0.7. After clicking Select, the pair "Amount Lost" and "Profit" in the Analyzed variables table has Maximum Absolute Correlation = 0.925, which is above the cut-off value. Therefore, only one of these two variables is auto-selected, based on a comparison of their Average Absolute Correlation values. For "Profit" it is 0.114, which is lower than the 0.138 for "Amount Lost", so "Profit" is selected while "Amount Lost" is not.
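The selection rule and the example above can be sketched in a few lines of Python. This is one plausible reading of the rule, not the node's actual code: the function name `select_variables`, the dictionary layout, and the third variable "Units" (with invented values) are all hypothetical. The post also does not say what happens when two Average Absolute Correlation values are equal, so the alphabetical tie-break here is purely an assumption.

```python
def select_variables(stats, cutoff):
    """Return the names of the variables that would be auto-selected.

    stats maps each variable name to a dict with:
      "max_abs": its Maximum Absolute Correlation
      "partner": the variable it is most highly correlated with
      "avg_abs": its Average Absolute Correlation
    """
    selected = []
    for name, s in stats.items():
        if s["max_abs"] <= cutoff:
            # At or below the cut-off: always selected.
            selected.append(name)
            continue
        # Above the cut-off: compare with the highly correlated partner
        # and keep the one with the lower Average Absolute Correlation.
        partner_name = s["partner"]
        partner = stats[partner_name]
        # Assumption: ties are broken alphabetically by name, because
        # the post does not describe tie handling.
        if (s["avg_abs"], name) < (partner["avg_abs"], partner_name):
            selected.append(name)
    return selected


# "Amount Lost" and "Profit" use the values from the example above.
# "Units" is a hypothetical extra variable with invented values.
stats = {
    "Amount Lost": {"max_abs": 0.925, "partner": "Profit", "avg_abs": 0.138},
    "Profit": {"max_abs": 0.925, "partner": "Amount Lost", "avg_abs": 0.114},
    "Units": {"max_abs": 0.300, "partner": "Profit", "avg_abs": 0.200},
}

selected = select_variables(stats, cutoff=0.7)
print(selected)  # ['Profit', 'Units']
```

"Units" passes because its maximum is under the cut-off, and "Profit" wins the pairwise comparison against "Amount Lost" because 0.114 is lower than 0.138.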
All non-numeric variables are listed in the Not analyzed variables frame at the bottom-right. The variables in this list are not analyzed, but they are included in the output dataset. You can exclude them from the output by using the left and right arrow buttons or by double-clicking them.
Multicollinearity leads to unreliable and unstable estimates of regression coefficients, but a good analyst or data scientist can handle it with the right tools and knowledge. We hope this node helps you deal with the problem more effectively and easily.
Have you ever run into issues with multicollinearity? How did you solve the problem?
Reference: Altair Knowledge Studio Help (within the software)