Should We Ignore Missing Data or Treat It?
Incomplete datasets may compromise the accuracy of machine learning models.
Datasets with missing values are frequently encountered in statistical analysis. Ideally, we want to work with a complete dataset, since missing data can bias or distort results. Treating missing data is not always optional, either: most machine learning models cannot automatically handle empty rows or columns. Missing data can be handled in two ways: eliminating the affected rows or columns, or imputing the missing values. The former discards data, which is undesirable, especially when the dataset is small. The latter substitutes missing values using various statistical techniques and is the theme of this blog.
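As a minimal pandas sketch of the two options (the toy dataset and column names here are illustrative assumptions, not from any real study):

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values (np.nan), assumed for illustration.
df = pd.DataFrame({
    "length": [4.2, np.nan, 5.1, 4.8],
    "weight": [1.1, 0.9, np.nan, 1.3],
})

# Option 1: elimination -- drop any row containing a missing value.
# Here this discards half of the (already small) dataset.
dropped = df.dropna()
print(len(dropped))  # 2 rows survive out of 4

# Option 2: imputation -- substitute each missing value,
# e.g. with the mean of its column.
imputed = df.fillna(df.mean())
print(imputed.isna().sum().sum())  # 0 missing values remain
```

On a dataset this small, elimination throws away half of the rows, which illustrates why imputation is often preferred when data is scarce.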
Data imputation is a statistical procedure for replacing missing data and, when needed, is an important step in building machine learning models. Imputation can be achieved through many different techniques; some are as simple as replacing missing entries with the minimum, maximum, or mean value, while others are more advanced, such as regression-based substitution. Imputation is not challenge-free, and it may be counterproductive in some cases. For example, applying it incorrectly can bias your estimates.
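The regression-based idea can be sketched with plain NumPy and pandas: fit a model on the complete rows, then predict the missing entries. The data and the simple linear fit below are assumptions for illustration only; real regression-based imputers are typically iterative and multivariate.

```python
import numpy as np
import pandas as pd

# Toy data: y is roughly linear in x; some y values are missing.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, np.nan, 8.1, np.nan],
})

# Fit a linear regression using only the complete rows.
complete = df.dropna()
slope, intercept = np.polyfit(complete["x"], complete["y"], deg=1)

# Substitute each missing y with its regression prediction.
missing = df["y"].isna()
df.loc[missing, "y"] = slope * df.loc[missing, "x"] + intercept

print(df["y"].isna().sum())  # 0 missing values remain
```

Note the caveat from above: if the assumed relationship between the variables is wrong, this substitution injects that wrong assumption into every imputed value, which is one way imputation can bias estimates.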
Imputation methods are available through many public Python libraries. However, if you are using one of the Altair Data Analytics Tools, it may be more convenient to handle the data manipulation within the software, without an external process. HyperStudy, Knowledge Studio, and RapidMiner each offer their own methods for handling missing data.
Figure 1. HyperStudy.
Figure 2. Knowledge Studio.
Figure 3. RapidMiner.
If you would like to share your experience and learn more about data manipulation in Altair Tools, please leave your comment below.