Using the Execute Python Operator for Calculating Feature Importance in RapidMiner Studio


What is feature selection and why do we need to perform it?

In machine learning, feature selection is one of the main components of data preprocessing. Each column in the dataset that is fed into a machine-learning model is called a feature, also known as a variable or attribute. If we train a model on too many features, it can learn from unimportant patterns. Feature selection is the process of choosing the most important features for developing a predictive model: it reduces the number of input variables by removing redundant or irrelevant features, keeping only those that are most relevant to our modeling approach. Feature selection provides the following benefits:

  1. Simplifies the model by making its results easier to explain
  2. Reduces training time (and scoring time) and the required storage space
  3. Can improve model accuracy

We can classify feature selection methods for supervised approaches (labeled datasets) into three main groups; a short scikit-learn sketch after the list illustrates one method from each:

  1. Filter methods: A statistical metric is used to remove irrelevant attributes. Information gain, Fisher score, and the ANOVA F-value are some examples of Filter methods for feature selection.
  2. Wrapper methods: Different subsets of features are evaluated against each other, which can also detect possible interactions between features. Popular Wrapper methods include Forward Selection, Backward Elimination, and Recursive Feature Elimination.
  3. Embedded methods: A machine-learning algorithm itself is used to calculate the importance of each feature. Lasso Regression and tree-based measures such as Random Forest importance are among the most popular Embedded methods for feature selection.
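The following snippet is a minimal sketch, not part of the RapidMiner workflow built later in this article; it assumes scikit-learn and synthetic data, and picks one representative technique from each family:

```python
# One representative from each feature selection family, on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: rank features by a univariate statistic (ANOVA F-value) and keep the top k
filter_sel = SelectKBest(score_func=f_regression, k=5).fit(X, y)
print("Filter keeps:", filter_sel.get_support(indices=True))

# Wrapper: repeatedly refit a model and drop the weakest feature each round
wrapper_sel = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
print("Wrapper keeps:", wrapper_sel.get_support(indices=True))

# Embedded: importance falls out of the trained model itself
lasso = Lasso(alpha=1.0).fit(X, y)          # features with zero coefficients are dropped
forest = RandomForestRegressor(random_state=0).fit(X, y)
print("Lasso keeps:", (lasso.coef_ != 0).sum(), "features")
print("Max forest importance:", forest.feature_importances_.max().round(3))
```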

In this article, we will focus on the Embedded methods and implement feature selection based on an XGBoost model.

Hands-on Experience in RapidMiner

We use the latest version of RapidMiner Studio (V10.0) to implement an example.

EDA

We use the House Price dataset [1], which contains 79 explanatory variables (81 columns in total, including the Id column and the target) describing almost every aspect of 1,460 residential homes in Ames, Iowa. The dataset is mainly used to predict the final price of each home from those features. “SalePrice” is our dependent variable (DV).

A quick exploratory data analysis (EDA) of the dataset shows that some features are more relevant to “SalePrice” than others. For example, in the following plot we can see a nearly linear relationship between “OverallQual” and “SalePrice”. Moreover, we colorized the plot by “YearBuilt”, which shows that newer homes tend to have higher overall quality and sale prices.


Figure 1: Scatter plot of “SalePrice” and “OverallQual”. Color is based on “YearBuilt”.
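For readers who prefer scripting the EDA, here is a minimal pandas/matplotlib sketch along the same lines. It assumes the Kaggle training file train.csv has been downloaded locally; the column names come from the dataset itself:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")  # Kaggle House Prices training file (assumed local path)

# Scatter of SalePrice vs. OverallQual, colored by YearBuilt (compare Figure 1)
ax = df.plot.scatter(x="OverallQual", y="SalePrice",
                     c="YearBuilt", cmap="viridis", alpha=0.6)
ax.set_title("SalePrice vs. OverallQual, colored by YearBuilt")
plt.show()

# The correlation backs up the near-linear relationship seen in the plot
print(df["OverallQual"].corr(df["SalePrice"]))
```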

As mentioned above, this dataset has 81 columns, which complicates both the EDA process and the interpretation of the final model. As the first preprocessing step, we use the Turbo Prep tool in RapidMiner to remove features with low-quality data. The tool automatically detects and removes features with high ID-ness (nominal or integer), many missing values, high stability (near-constant values), or too many categories.

Figure 2: Removing low-quality features in Turbo Prep

After performing Turbo Prep, we managed to reduce the number of columns from 81 to 58.
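Turbo Prep's exact rules and thresholds are configured in its UI; the pandas sketch below only approximates the idea, and the cutoffs are illustrative assumptions rather than RapidMiner's defaults:

```python
import pandas as pd

def drop_low_quality(df, max_missing=0.4, max_stability=0.9, max_categories=30):
    """Rough analogue of Turbo Prep's quality filter (illustrative thresholds)."""
    keep = []
    for col in df.columns:
        s = df[col]
        if s.isna().mean() > max_missing:                    # too many missing values
            continue
        if s.nunique(dropna=True) == len(df):                # ID-like: unique in every row
            continue
        freq = s.value_counts(normalize=True)
        if len(freq) and freq.iloc[0] > max_stability:       # near-constant ("stable") column
            continue
        if s.dtype == object and s.nunique() > max_categories:  # too many categories
            continue
        keep.append(col)
    return df[keep]
```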

Then, we removed highly correlated features by using the Remove Correlated Attributes operator.
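Conceptually, the operator drops one attribute from each highly correlated pair. A common pandas equivalent looks like this; the 0.95 threshold is an illustrative assumption, not the setting used in our process:

```python
import numpy as np
import pandas as pd

def remove_correlated(df, threshold=0.95):
    """Drop one column from each pair of numeric columns with |corr| > threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Look only at the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```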

Since many of the remaining features are categorical, we use the Nominal to Numerical operator to convert them to numeric variables. Because the initial dataset contained many categorical variables, this encoding increased the number of features to 229.
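In its default dummy-coding mode, Nominal to Numerical behaves much like pandas' one-hot encoding, which is why the column count grows: each category becomes its own 0/1 indicator column. A tiny sketch using two columns from the dataset:

```python
import pandas as pd

# Two columns from the House Price dataset (values from its first rows)
df = pd.DataFrame({"HouseStyle": ["1Story", "2Story", "1.5Fin"],
                   "LotArea": [8450, 9600, 11250]})

encoded = pd.get_dummies(df, dtype=int)  # one indicator column per category
print(list(encoded.columns))
# ['LotArea', 'HouseStyle_1.5Fin', 'HouseStyle_1Story', 'HouseStyle_2Story']
```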

Feature Selection

The next step is feature selection. Our plan is to develop an XGBoost model in Python to calculate the importance of the input variables.

Feature importance assigns each variable a score that shows how valuable it is in the trained model. The more a variable is used to make key decisions within the decision trees, the higher its importance. This importance is calculated for each input variable, and the variables are then ranked and compared against each other. For a single decision tree, importance is computed as the amount by which each split point improves the performance measure, weighted by the number of observations in the node; the feature importances are then averaged across all of the decision trees within the model [3].
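XGBoost exposes these scores directly. The short sketch below (assuming the xgboost package is installed) shows the attribute we use later, plus the lower-level API that lets you inspect different importance definitions:

```python
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=300, n_features=10, random_state=0)
model = XGBRegressor(n_estimators=50, random_state=0).fit(X, y)

print(model.feature_importances_)  # one normalized score per input feature

booster = model.get_booster()
print(booster.get_score(importance_type="gain"))    # average improvement per split
print(booster.get_score(importance_type="weight"))  # how often the feature is split on
```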

To complete our workflow, we use the Execute Python operator in RapidMiner Studio. The Execute Python operator is part of the Python Scripting extension, which is available in the Marketplace, accessible via the Extensions menu. The Python Scripting extension allows you to execute Python scripts within a RapidMiner process.


Figure 3: Python Scripting extension available in RapidMiner Marketplace

The following picture shows the steps we take to build this process in RapidMiner.


Figure 4: RapidMiner process (workflow) for feature selection

How to use the Execute Python operator in RapidMiner

In the Execute Python operator, we define “SalePrice” as the DV, and the rest of the variables become the independent variables (IVs). We then develop an XGBoost model (XGBRegressor) and call feature_importances_ to extract the importance of each feature. Finally, we make a list of the top important features and pass it to the output of the operator; a sketch approximating this script follows Figure 5.

 


Figure 5: The feature selection script within Execute Python operator
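For reference, here is a minimal sketch approximating the script in Figure 5. The Python Scripting extension calls a function named rm_main, passing each input port as a pandas DataFrame and returning DataFrames to the output ports; the XGBoost hyperparameters and the sorting below are illustrative assumptions, not necessarily the exact settings shown in the figure:

```python
import pandas as pd
from xgboost import XGBRegressor

def rm_main(data):
    # "SalePrice" is the DV; everything else is treated as an IV
    y = data["SalePrice"]
    X = data.drop(columns=["SalePrice"])

    # Train the model and read the importance of each input feature
    model = XGBRegressor(random_state=0).fit(X, y)

    importance = pd.DataFrame({
        "feature": X.columns,
        "importance": model.feature_importances_,
    }).sort_values("importance", ascending=False).head(20)

    return importance  # delivered to the operator's output port
```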

We connect the output port to the result port and run the process. The table of important features opens in the Results view; the top 20 are shown in the following table:


Figure 6: The list of top 20 important features for predicting "SalePrice"

Having identified the most important features, we can use them to develop a model that predicts “SalePrice”, the final price of each home.

Final words

Feature selection is always an important step for simplifying the modeling or EDA of datasets with many columns. Although RapidMiner offers many powerful feature selection methods out of the box, we can always use the Execute Python operator to implement any other method of our choice.

 

References:

[1] Kaggle, “House Prices – Advanced Regression Techniques”: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

[2] RapidMiner Studio Help (within the software)

[3] J. Brownlee, “Feature Importance and Feature Selection With XGBoost in Python”, Machine Learning Mastery: https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/