Implement Automated Machine Learning with the AutoML Node in Knowledge Studio
A machine learning workflow includes iterative steps such as data preparation, modelling and validation to select the best predictive model. Automated Machine Learning (AutoML) aims to automate these iterative tasks, allowing data scientists to compare the performance of multiple models swiftly and make faster, better decisions. I would like to share the steps involved in implementing AutoML in Knowledge Studio.
The dataset employed to demonstrate AutoML contains the element composition (by percentage weight) of low-alloy steels, along with the temperature at which the mechanical properties (Tensile Strength, 0.2% Proof Stress, Elongation, % Reduction in Area) were observed. The methods currently available require elaborate physical testing procedures to measure these properties accurately. Machine learning will be used to develop a model that predicts the mechanical properties from element composition and temperature.
Figure 1 : AutoML Workflow
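To follow along outside Knowledge Studio, the dataset can be loaded with pandas. The sketch below is minimal: the file name and column labels are assumptions and should be adjusted to match the CSV downloaded from Kaggle.

```python
# Minimal loading sketch; file name and column labels are assumptions.
import pandas as pd

df = pd.read_csv("low_alloy_steels.csv")             # assumed file name
target = "Tensile Strength (MPa)"                    # assumed label for the dependent variable

# Outputs and identifiers that should not feed the model (assumed labels)
excluded = ["Alloy code", "Elongation (%)", "Proof Stress (MPa)",
            "Reduction in Area (%)", target]

X = df.drop(columns=[c for c in excluded if c in df.columns])  # element composition + temperature
y = df[target]
print(X.shape, y.shape)
```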
Once the dataset is imported, we can connect the AutoML node, which prompts the user to select the dependent variable, i.e. the output the ML model will predict. In this example we’ll focus on predicting tensile strength. The following options are available for data preparation and modelling:
- Variable Selection:
- Important input variables (element composition) can be selected based on their predictive power (variance explained, F-test, Pearson correlation) with respect to the dependent variable (tensile strength). The number of variables to select automatically is specified by the user (see the sketch after Figure 2).
- We can exclude input variables from the training dataset that are irrelevant to the ML model, such as Alloy code, Elongation, Proof Stress and Reduction in Area.
Figure 2 : Variable Selection
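The selection step can be approximated with scikit-learn: rank the inputs by F-test score and Pearson correlation with tensile strength and keep the top k. This is a sketch, not the Knowledge Studio implementation; the value of k is a user choice, and X and y come from the loading sketch above.

```python
# Rank inputs by predictive power w.r.t. tensile strength and keep the top k.
from sklearn.feature_selection import SelectKBest, f_regression, r_regression

Xc = X.dropna()                         # the F-test needs complete numeric data;
yc = y.loc[Xc.index]                    # missing values are handled in the next section

k = 10                                  # number of variables to keep (user choice)
selector = SelectKBest(score_func=f_regression, k=k).fit(Xc, yc)
pearson = r_regression(Xc, yc)          # Pearson correlation of each input with y

ranked = sorted(zip(Xc.columns, selector.scores_, pearson),
                key=lambda t: t[1], reverse=True)
for name, f_score, r in ranked[:k]:
    print(f"{name:30s} F={f_score:10.2f}  r={r:+.3f}")

X_selected = Xc.loc[:, selector.get_support()]   # reduced input matrix
```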
- Data Preprocessing:
- Most machine learning algorithms require missing values to be handled: if present in the input variables, they can be replaced with the mean/median of the distribution, or the affected rows can be removed from the dataset.
- Outliers are data points that do not lie within the general distribution of an input variable and can lead to misleading results if included in the predictive model. We can define how to flag an outlier using statistical measures of the distribution (mean, 1st and 3rd quartiles, standard deviation). Once an outlier is detected, we can either replace it with the mean/median or remove it from the dataset (see the sketch after Figure 3).
Figure 3 : Data Preprocessing
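A rough scikit-learn/pandas equivalent of these two steps is sketched below: impute missing values with the median, then flag outliers with the common 1.5 × IQR rule (based on the 1st and 3rd quartiles) and replace them with the median as well. The thresholds are illustrative defaults, not the exact Knowledge Studio settings.

```python
# Impute missing values, then flag and replace outliers per input variable.
import pandas as pd
from sklearn.impute import SimpleImputer

X_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(X),
    columns=X.columns, index=X.index,
)

q1, q3 = X_imputed.quantile(0.25), X_imputed.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # 1.5 * IQR outlier fences

medians = X_imputed.median()
X_clean = X_imputed.copy()
replaced = 0
for col in X_clean.columns:
    bad = (X_clean[col] < lower[col]) | (X_clean[col] > upper[col])
    X_clean.loc[bad, col] = medians[col]          # replace outliers with the median
    replaced += int(bad.sum())
print("Outlier cells replaced:", replaced)
```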
- Feature Engineering:
- New variables can be generated by transforming the input variables, which can potentially improve the performance of the machine learning model.
- Parameters for variable transformation can be set in the Advanced tab; a sketch of several of these transformations follows Figure 4.
- Transformations:
- Interaction Terms: variables generated as products of selected variables and/or their squares, which helps capture non-linear effects when training linear or logistic regression models.
- Logarithm Transform: log-transforms a variable using a user-defined base.
- Standardization: rescales variables to a common scale, which is important when variables have different ranges or measurement units.
- Optimal Binning: groups variable values into new categories based on predictive power with respect to the dependent variable.
- Power Transform: raises a variable to a specified power.
- PCA: a data reduction method that combines correlated variables into a smaller set of uncorrelated variables known as principal components.
Figure 4 : Feature Engineering
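Several of these transformations have direct scikit-learn equivalents, sketched below on the cleaned inputs from the previous step. These are stand-ins for the Knowledge Studio transforms, and the number of principal components is an arbitrary illustrative choice.

```python
# Interaction terms, log transform, standardization and PCA on the cleaned inputs.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA

# Interaction terms: products of selected variables and their squares (degree=2)
X_inter = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_clean)

# Logarithm transform (log1p keeps zero-valued compositions finite)
X_log = np.log1p(X_clean)

# Standardization: zero mean, unit variance for every variable
X_std = StandardScaler().fit_transform(X_clean)

# PCA: combine correlated inputs into a smaller set of uncorrelated components
pca = PCA(n_components=5)                          # component count is a user choice
X_pca = pca.fit_transform(X_std)
print("Variance explained:", pca.explained_variance_ratio_.round(3))
```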
- Modelling:
- This option allows the user to select the machine learning models (Regression, Deep Learning, Decision Trees, Ensemble models) used to predict the tensile strength.
- Model training and validation can follow one of two methods:
- Train-Test: the dataset is partitioned into training and test data. The training data is used as input to the machine learning model, which tries to learn a function relating element composition to tensile strength. Model performance is then evaluated on unseen data containing a different set of element compositions (the test dataset).
- Cross Validation: this method uses different combinations of training and test data to give a more generalized estimate of model performance.
- Model parameters can be set using the Advanced/Grid Search tab. Grid Search allows the user to experiment with different combinations of model parameters. Each combination of parameters is treated as a unique model and evaluated against the test dataset. This approach is commonly referred to as hyperparameter tuning and aims to select the parameters that give the best possible results. AutoML also provides an option to export/import grid search parameters.
- An example of grid search would be evaluating a deep learning model under different neural network configurations (number of layers and number of neurons per layer). The deep learning model with the lowest error (error metric: mean square error) for tensile strength prediction is selected; a sketch of this approach follows Figure 5.
Figure 5 : Modelling and Grid Search
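The grid search idea can be sketched with scikit-learn: hold out a test set, then try several neural network configurations for a deep learning (MLP) regressor and keep the one with the lowest cross-validated mean square error. The layer sizes, regularization values and split ratio below are illustrative assumptions, not Knowledge Studio defaults.

```python
# Train-test split plus grid search over neural network configurations.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Assumes the target y has no missing values.
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.3, random_state=42)

param_grid = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32), (128, 64, 32)],  # layers x neurons
    "alpha": [1e-4, 1e-3],                                          # L2 regularization
}
search = GridSearchCV(
    MLPRegressor(max_iter=2000, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",    # lower MSE = better
    cv=5,
)
search.fit(X_train, y_train)

print("Best configuration:", search.best_params_)
print("Test MSE:", mean_squared_error(y_test, search.predict(X_test)))
```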
Once the AutoML settings are completed by the user, the machine learning workflow is generated automatically, and the best machine learning model is selected by validating against the test dataset using the mean square error metric, since the dependent variable (tensile strength) is numeric. Visual representations of model performance (Bias, Accuracy, Scatter Plots, Error) are analyzed and compared in the model analyzer. The last step is the creation of an AutoML report summarizing the input settings, model validation results and the best model.
In the present workflow we compare 3 models: Deep Learning, Linear Regression and Random Forest. The best model is Linear Regression (Data_ML_Reg2) with an MSE of 5616.453; a minimal comparison sketch follows Figure 7.
Figure 6 : Model Training Parameters and Results
Figure 7 : AutoML Report
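To mirror the comparison the AutoML node reports, the same train/test split can be reused to fit the three model families and rank them by test MSE. This is a minimal sketch with illustrative parameters; the resulting scores will not match the 5616.453 quoted above, since the preprocessing and settings differ.

```python
# Fit three model families on the same split and pick the lowest-MSE model.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

models = {
    "Linear Regression":   LinearRegression(),
    "Random Forest":       RandomForestRegressor(n_estimators=200, random_state=42),
    "Deep Learning (MLP)": MLPRegressor(hidden_layer_sizes=(64, 32),
                                        max_iter=2000, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:20s} MSE = {scores[name]:.3f}")

print("Best model:", min(scores, key=scores.get))
```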
Download and access files associated with this blog post:
Dataset: https://www.kaggle.com/datasets/konghuanqing/matnavi-mechanical-properties-of-lowalloy-steels