Replace missing values with average in each cluster

painfulover
painfulover New Altair Community Member
edited November 5 in Community Q&A
Hello,
I'm new to Rapidminer and I would like to replace missing values based on clustering, which means I have used k-means on columns which have no missing values and divide the original exampleset into 5 clusters. Now I would like know how to replace each row's missing values by the averages of the cluster it belongs to instead of the averages of whole attributes. I can only find the way to do the latter by the operator [replace missing values].
Thank you very much.

Best Answers

  • BalazsBarany
    BalazsBarany New Altair Community Member
    Answer ✓
    Hi!

    This process is a bit involved. You get a "cluster model" from the clustering operator that you can apply to the data with missing values. However, you need to choose an operator that can work with missing values itself. Then you would aggregate the clustered original data (the non-missing data), grouping by the cluster to get the averages. You can join the result with the missing values and use e. g. Loop Attributes to fill in the missing values using Generate Attributes with a formula like if(missing(%{attr}), eval("average(" + %{attr} + ")"), %{attr})

    It is much easier to use the Impute Missing Values operator that automates the selection of missing values, building a model for predicting them (you can select the model type) and putting the predicted values into the missing cells. There is an example process in the operator help that shows you how to use it, with k-NN as the example learning algorithm. 

    Regards,
    Balázs
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Answer ✓
    Hi,
    I would just use Group into Collection and create a collection (list) of example sets, where each example set only contains one cluster. You can then use Loop Collection and in there use Replace Missing to replace it with the respective means. Afterwards you just Append the resulting collection again.

    Cheers,
    Martin

Answers

  • BalazsBarany
    BalazsBarany New Altair Community Member
    Answer ✓
    Hi!

    This process is a bit involved. You get a "cluster model" from the clustering operator that you can apply to the data with missing values. However, you need to choose an operator that can work with missing values itself. Then you would aggregate the clustered original data (the non-missing data), grouping by the cluster to get the averages. You can join the result with the missing values and use e. g. Loop Attributes to fill in the missing values using Generate Attributes with a formula like if(missing(%{attr}), eval("average(" + %{attr} + ")"), %{attr})

    It is much easier to use the Impute Missing Values operator that automates the selection of missing values, building a model for predicting them (you can select the model type) and putting the predicted values into the missing cells. There is an example process in the operator help that shows you how to use it, with k-NN as the example learning algorithm. 

    Regards,
    Balázs
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Answer ✓
    Hi,
    I would just use Group into Collection and create a collection (list) of example sets, where each example set only contains one cluster. You can then use Loop Collection and in there use Replace Missing to replace it with the respective means. Afterwards you just Append the resulting collection again.

    Cheers,
    Martin