Choosing the best approach to impute numerical missing values

Hello everybody,

As part of a scientific project for my university, I have to develop a data preprocessing model. Currently I am struggling with missing values.

I have a data set with exclusively numerical attributes, in which numerous values are missing. Now I would like to implement the following in RM:

- For each attribute, I would like to use 2-3 different methods (e.g. linear interpolation, quadratic interpolation, cubic interpolation, the kNN algorithm; other algorithms that can be used to impute missing values are also welcome) to replace the missing values with statistically calculated values.

- Then I want to calculate the performance of each method for each attribute and, at the end, select the best imputation method for each attribute.

It would be great if someone could help me.

Many thanks in advance

Moritz

MartinLiebig

Hi,

I think one interesting question is: how would you define the quality of an imputation? Depending on the problem, this can be vastly different.

BR,

Martin

MoWei

Hi Martin,

My idea was to first delete all examples where the value of the examined attribute is missing. Afterwards I used, e.g., the kNN algorithm to predict the values of the label attribute and then used the Performance (Regression) operator, which gives me the root mean squared error (RMSE). Isn't it possible to compare the RMSE of the different methods afterwards and choose the best one?

Thanks

Moritz

YYH

Hi @MoWei,

Sure, you can measure the performance of an imputation. Here is an example that validates the kNN imputation method on the missing Age values of the Titanic data. For a regression performance, we need two columns: a "ground truth" column and the estimate from kNN.

<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Titanic"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="34">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Age.is_not_missing."/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">use the data with non-missing age to validate the knn imputation methods</description>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.1.000" expanded="true" height="103" name="Split Data" width="90" x="380" y="85">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.8"/>
          <parameter key="ratio" value="0.2"/>
        </enumeration>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <description align="center" color="transparent" colored="false" width="126">80% for training set, 20% for testing</description>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="514" y="136">
        <list key="function_descriptions">
          <parameter key="new_age" value="0/0"/>
          <parameter key="class" value="&quot;test&quot;"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34">
        <list key="function_descriptions">
          <parameter key="new_age" value="Age"/>
          <parameter key="class" value="&quot;train&quot;"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="append" compatibility="9.1.000" expanded="true" height="103" name="Append" width="90" x="648" y="85">
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="merge_type" value="all"/>
      </operator>
      <operator activated="true" class="impute_missing_values" compatibility="9.1.000" expanded="true" height="68" name="Impute Missing Values" width="90" x="782" y="85">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="new_age"/>
        <parameter key="attributes" value="Age|Cabin|Life Boat"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="iterate" value="true"/>
        <parameter key="learn_on_complete_cases" value="true"/>
        <parameter key="order" value="chronological"/>
        <parameter key="sort" value="ascending"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <process expanded="true">
          <operator activated="true" class="k_nn" compatibility="9.1.000" expanded="true" height="82" name="k-NN" width="90" x="246" y="34">
            <parameter key="k" value="5"/>
            <parameter key="weighted_vote" value="true"/>
            <parameter key="measure_types" value="MixedMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
          </operator>
          <connect from_port="example set source" to_op="k-NN" to_port="training set"/>
          <connect from_op="k-NN" from_port="model" to_port="model sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_model sink" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">apply imputation on the missing values</description>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (2)" width="90" x="916" y="85">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="class.equals.test"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">check performance e.g. RMSE on testing</description>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Set Role" width="90" x="1050" y="85">
        <parameter key="attribute_name" value="Age"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="new_age" value="prediction"/>
        </list>
      </operator>
      <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance" width="90" x="1184" y="34">
        <parameter key="main_criterion" value="first"/>
        <parameter key="root_mean_squared_error" value="true"/>
        <parameter key="absolute_error" value="true"/>
        <parameter key="relative_error" value="true"/>
        <parameter key="relative_error_lenient" value="false"/>
        <parameter key="relative_error_strict" value="true"/>
        <parameter key="normalized_absolute_error" value="false"/>
        <parameter key="root_relative_squared_error" value="false"/>
        <parameter key="squared_error" value="true"/>
        <parameter key="correlation" value="true"/>
        <parameter key="squared_correlation" value="true"/>
        <parameter key="prediction_average" value="true"/>
        <parameter key="spearman_rho" value="true"/>
        <parameter key="kendall_tau" value="true"/>
        <parameter key="skip_undefined_labels" value="true"/>
        <parameter key="use_example_weights" value="true"/>
      </operator>
      <connect from_op="Retrieve Titanic" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Generate Attributes (2)" to_port="example set input"/>
      <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_op="Impute Missing Values" to_port="example set in"/>
      <connect from_op="Impute Missing Values" from_port="example set out" to_op="Filter Examples (2)" to_port="example set input"/>
      <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
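
For readers who want to follow the validation logic outside RapidMiner, here is a minimal Python sketch of the same idea: hide 20% of the known values, impute them with kNN, and compute the RMSE against the hidden ground truth. The toy data, the `knn_impute` and `rmse` helpers, and the unweighted neighbour average are assumptions made for illustration; this is not what the Impute Missing Values operator does internally.

```python
import math
import random


def knn_impute(rows, target, k=5):
    """Fill None in column `target` with the mean of the k nearest complete rows.

    Simplification: plain Euclidean distance over all other columns, which are
    assumed to be numeric and complete, and an unweighted average.
    """
    known = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(r[target])
            continue

        def dist(other, r=r):
            return math.sqrt(sum((a - b) ** 2
                                 for i, (a, b) in enumerate(zip(r, other))
                                 if i != target))

        nearest = sorted(known, key=dist)[:k]
        filled.append(sum(n[target] for n in nearest) / len(nearest))
    return filled


def rmse(truth, estimate):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


# Made-up data: columns are (feature_1, feature_2, age); age depends on both features.
rng = random.Random(2001)
rows = []
for _ in range(200):
    f1, f2 = rng.uniform(0, 10), rng.uniform(0, 10)
    rows.append([f1, f2, 3 * f1 + 2 * f2 + rng.gauss(0, 1)])

AGE = 2
test_idx = sorted(rng.sample(range(len(rows)), k=len(rows) // 5))  # hold out 20%

# Pretend the held-out ages are missing (the role of new_age in the process).
masked = [list(r) for r in rows]
for i in test_idx:
    masked[i][AGE] = None

imputed = knn_impute(masked, AGE, k=5)
score = rmse([rows[i][AGE] for i in test_idx], [imputed[i] for i in test_idx])
print(f"kNN imputation RMSE on held-out values: {score:.3f}")
```

The RMSE is computed only on the rows whose value was hidden, which mirrors the "class = test" filter in the process.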

Maerkli

Hello @yyhuang,

Could you please explain what exactly Generate Attributes does in the process?

Thanks,

Maerkli

YYH

Hi @Maerkli,

The "Generate Attributes" operator creates a new column, new_age, that is used for the missing-data imputation, and a second column, class, that marks each row as belonging to the train or test set. 80% of the non-missing ages are kept before imputation; you can change the split ratio in "Split Data". We pretend that the remaining 20% of the ages are missing and use kNN to impute them.

To impute the missing values in new_age, I left the ground truth "Age" out of the imputation and also skipped imputation for Life Boat and Cabin.

YY

Maerkli

Sorry, YY, I forgot to specify that I meant Generate Attributes (2).

YYH

Oh, Generate Attributes (2) does the same on the testing set. The only difference is that new_age will be missing, i.e. zero divided by zero. This is a trick to generate a missing value (?) in RapidMiner.

Image: https://us.v-cdn.net/6030995/uploads/editor/ku/honejthcklzi.jpg
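
As a side note for readers working outside RapidMiner: in IEEE 754 floating-point arithmetic, 0.0/0.0 is NaN ("not a number"), which is presumably why the expression ends up as a missing value. The following Python sketch mimics the two Generate Attributes operators on a tiny made-up list. The `ages` values and `is_test` flags are hypothetical, and plain Python has to create NaN explicitly because `0 / 0` raises an exception.

```python
import math

MISSING = float("nan")  # explicit NaN, used here as the missing-value marker

ages = [22.0, 38.0, 26.0, 35.0]       # hypothetical ground truth
is_test = [False, True, False, True]  # hypothetical train/test flags

# Train rows copy the age, test rows get a missing value instead.
new_age = [MISSING if test else age for age, test in zip(ages, is_test)]

# NaN never compares equal to anything, itself included, so use math.isnan.
missing_positions = [i for i, v in enumerate(new_age) if math.isnan(v)]
print(missing_positions)  # [1, 3]

# Unlike the expression in the process, plain Python refuses to evaluate 0/0.
try:
    0 / 0
except ZeroDivisionError:
    print("0/0 raises ZeroDivisionError in Python")
```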

Maerkli

Thanks again.

Maerkli

MoWei

Hey @yyhuang,

Many thanks for your answer.

Correct. With your XML code, I now have one way to impute the missing values, along with an evaluation (RMSE). But now I would like to use two more methods to see whether they are better than the kNN algorithm. At the end I want to select the best method to derive the missing values based on the lowest RMSE value. I don't currently know how to do this.

Can you help me?

BR

Moritz

YYH

Hi @MoWei,

You can easily replicate the imputation with other machine learning algorithms. Check out the process below, which adds Gradient Boosted Trees and a Generalized Linear Model next to k-NN.
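
Alongside the RapidMiner process that follows, the selection step itself (score every candidate on the same hidden values, then keep the one with the lowest RMSE per attribute) can be sketched in a few lines of Python. This is only an illustration under assumptions: the three imputers (mean, median, and positional linear interpolation), the `best_method` helper, and the two made-up attributes are not part of the process, and any other imputer with the same signature could be plugged in.

```python
import math
import random
import statistics


def impute_mean(series):
    m = statistics.fmean(v for v in series if v is not None)
    return [m if v is None else v for v in series]


def impute_median(series):
    m = statistics.median(v for v in series if v is not None)
    return [m if v is None else v for v in series]


def impute_linear(series):
    """Interpolate by position between the nearest known neighbours.

    Gaps at either end copy the nearest known value. Assumes at least one
    known value and a meaningful row order.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out


def rmse(truth, estimate):
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


def best_method(series, methods, holdout=0.2, seed=2001):
    """Hide a share of the known values, score each method, return (winner, scores)."""
    rng = random.Random(seed)
    known = [i for i, v in enumerate(series) if v is not None]
    hidden = rng.sample(known, k=max(1, int(len(known) * holdout)))
    hidden_set = set(hidden)
    masked = [None if i in hidden_set else v for i, v in enumerate(series)]
    truth = [series[i] for i in hidden]
    scores = {}
    for name, method in methods.items():
        filled = method(masked)
        scores[name] = rmse(truth, [filled[i] for i in hidden])
    return min(scores, key=scores.get), scores


METHODS = {"mean": impute_mean, "median": impute_median, "linear": impute_linear}

# Made-up attributes with genuine gaps: a smooth trend and an unordered one.
noise_rng = random.Random(7)
data = {
    "trend": [None if i % 17 == 0 else 0.5 * i for i in range(100)],
    "noise": [None if i % 13 == 0 else noise_rng.gauss(50, 10) for i in range(100)],
}

results = {name: best_method(series, METHODS) for name, series in data.items()}
for name, (winner, scores) in results.items():
    print(name, "->", winner, {m: round(s, 3) for m, s in scores.items()})
```

Once a winner is known for an attribute, that method would be applied to the attribute's real missing values; the hidden values are used for scoring only.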

<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Titanic"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="34">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Age.is_not_missing."/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">use the data with non-missing age to validate the knn imputation methods</description>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.1.000" expanded="true" height="103" name="Split Data" width="90" x="380" y="85">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.8"/>
          <parameter key="ratio" value="0.2"/>
        </enumeration>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <description align="center" color="transparent" colored="false" width="126">80% for training set, 20% for testing</description>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="514" y="136">
        <list key="function_descriptions">
          <parameter key="new_age" value="0/0"/>
          <parameter key="class" value="&quot;test&quot;"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34">
        <list key="function_descriptions">
          <parameter key="new_age" value="Age"/>
          <parameter key="class" value="&quot;train&quot;"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="append" compatibility="9.1.000" expanded="true" height="103" name="Append" width="90" x="648" y="85">
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="merge_type" value="all"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Set Role" width="90" x="782" y="85">
        <parameter key="attribute_name" value="Age"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="Name" value="NAME"/>
          <parameter key="Ticket Number" value="TICKET"/>
        </list>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.1.000" expanded="true" height="145" name="Multiply" width="90" x="916" y="85"/>
      <operator activated="true" class="impute_missing_values" compatibility="9.1.000" expanded="true" height="68" name="Impute Missing Values" width="90" x="1117" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="new_age"/>
        <parameter key="attributes" value="Age|Cabin|Life Boat"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="iterate" value="true"/>
        <parameter key="learn_on_complete_cases" value="true"/>
        <parameter key="order" value="chronological"/>
        <parameter key="sort" value="ascending"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <process expanded="true">
          <operator activated="true" class="k_nn" compatibility="9.1.000" expanded="true" height="82" name="k-NN" width="90" x="246" y="34">
            <parameter key="k" value="5"/>
            <parameter key="weighted_vote" value="true"/>
            <parameter key="measure_types" value="MixedMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
          </operator>
          <connect from_port="example set source" to_op="k-NN" to_port="training set"/>
          <connect from_op="k-NN" from_port="model" to_port="model sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_model sink" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">apply imputation on the missing values with KNN</description>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (2)" width="90" x="1251" y="34">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="class.equals.test"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Impute by KNN" width="90" x="1452" y="34">
        <parameter key="attribute_name" value="new_age"/>
        <parameter key="target_role" value="prediction"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance: KNN" width="90" x="1586" y="34">
        <parameter key="main_criterion" value="first"/>
        <parameter key="root_mean_squared_error" value="true"/>
        <parameter key="absolute_error" value="true"/>
        <parameter key="relative_error" value="true"/>
        <parameter key="relative_error_lenient" value="false"/>
        <parameter key="relative_error_strict" value="true"/>
        <parameter key="normalized_absolute_error" value="false"/>
        <parameter key="root_relative_squared_error" value="false"/>
        <parameter key="squared_error" value="true"/>
        <parameter key="correlation" value="true"/>
        <parameter key="squared_correlation" value="true"/>
        <parameter key="prediction_average" value="true"/>
        <parameter key="spearman_rho" value="true"/>
        <parameter key="kendall_tau" value="true"/>
        <parameter key="skip_undefined_labels" value="true"/>
        <parameter key="use_example_weights" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">check performance e.g. RMSE on testing</description>
      </operator>
      <operator activated="true" class="impute_missing_values" compatibility="9.1.000" expanded="true" height="68" name="Impute Missing Values (2)" width="90" x="1117" y="289">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="new_age"/>
        <parameter key="attributes" value="Age|Cabin|Life Boat"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="iterate" value="true"/>
        <parameter key="learn_on_complete_cases" value="true"/>
        <parameter key="order" value="chronological"/>
        <parameter key="sort" value="ascending"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <process expanded="true">
          <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="9.0.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="313" y="34">
            <parameter key="number_of_trees" value="20"/>
            <parameter key="reproducible" value="false"/>
            <parameter key="maximum_number_of_threads" value="4"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="maximal_depth" value="5"/>
            <parameter key="min_rows" value="10.0"/>
            <parameter key="min_split_improvement" value="0.0"/>
            <parameter key="number_of_bins" value="20"/>
            <parameter key="learning_rate" value="0.1"/>
            <parameter key="sample_rate" value="1.0"/>
            <parameter key="distribution" value="AUTO"/>
            <parameter key="early_stopping" value="false"/>
            <parameter key="stopping_rounds" value="1"/>
            <parameter key="stopping_metric" value="AUTO"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
          </operator>
          <operator activated="false" class="h2o:deep_learning" compatibility="9.0.000" expanded="true" height="82" name="Deep Learning" width="90" x="313" y="289">
            <parameter key="activation" value="Rectifier"/>
            <enumeration key="hidden_layer_sizes">
              <parameter key="hidden_layer_sizes" value="50"/>
              <parameter key="hidden_layer_sizes" value="50"/>
            </enumeration>
            <enumeration key="hidden_dropout_ratios"/>
            <parameter key="reproducible_(uses_1_thread)" value="false"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="epochs" value="10.0"/>
            <parameter key="compute_variable_importances" value="false"/>
            <parameter key="train_samples_per_iteration" value="-2"/>
            <parameter key="adaptive_rate" value="true"/>
            <parameter key="epsilon" value="1.0E-8"/>
            <parameter key="rho" value="0.99"/>
            <parameter key="learning_rate" value="0.005"/>
            <parameter key="learning_rate_annealing" value="1.0E-6"/>
            <parameter key="learning_rate_decay" value="1.0"/>
            <parameter key="momentum_start" value="0.0"/>
            <parameter key="momentum_ramp" value="1000000.0"/>
            <parameter key="momentum_stable" value="0.0"/>
            <parameter key="nesterov_accelerated_gradient" value="true"/>
            <parameter key="standardize" value="true"/>
            <parameter key="L1" value="1.0E-5"/>
            <parameter key="L2" value="0.0"/>
            <parameter key="max_w2" value="10.0"/>
            <parameter key="loss_function" value="Automatic"/>
            <parameter key="distribution_function" value="AUTO"/>
            <parameter key="early_stopping" value="false"/>
            <parameter key="stopping_rounds" value="1"/>
            <parameter key="stopping_metric" value="AUTO"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="missing_values_handling" value="MeanImputation"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
            <list key="expert_parameters_"/>
          </operator>
          <connect from_port="example set source" to_op="Gradient Boosted Trees" to_port="training set"/>
          <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_model sink" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">apply imputation on the missing values with GBT</description>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (3)" width="90" x="1251" y="289">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="class.equals.test"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Impute by GBT" width="90" x="1452" y="289">
        <parameter key="attribute_name" value="new_age"/>
        <parameter key="target_role" value="prediction"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance: GBT" width="90" x="1586" y="289">
        <parameter key="main_criterion" value="first"/>
        <parameter key="root_mean_squared_error" value="true"/>
        <parameter key="absolute_error" value="true"/>
        <parameter key="relative_error" value="true"/>
        <parameter key="relative_error_lenient" value="false"/>
        <parameter key="relative_error_strict" value="true"/>
        <parameter key="normalized_absolute_error" value="false"/>
        <parameter key="root_relative_squared_error" value="false"/>
        <parameter key="squared_error" value="true"/>
        <parameter key="correlation" value="true"/>
        <parameter key="squared_correlation" value="true"/>
        <parameter key="prediction_average" value="true"/>
        <parameter key="spearman_rho" value="true"/>
        <parameter key="kendall_tau" value="true"/>
        <parameter key="skip_undefined_labels" value="true"/>
        <parameter key="use_example_weights" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">check performance e.g. RMSE on testing</description>
      </operator>
      <operator activated="true" class="impute_missing_values" compatibility="9.1.000" expanded="true" height="68" name="Impute Missing Values (3)" width="90" x="1117" y="493">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="new_age"/>
        <parameter key="attributes" value="Age|Cabin|Life Boat"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="iterate" value="true"/>
        <parameter key="learn_on_complete_cases" value="true"/>
        <parameter key="order" value="chronological"/>
        <parameter key="sort" value="ascending"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <process expanded="true">
          <operator activated="true" class="h2o:generalized_linear_model" compatibility="9.0.000" expanded="true" height="124" name="Generalized Linear Model (2)" width="90" x="313" y="34">
            <parameter key="family" value="AUTO"/>
            <parameter key="link" value="family_default"/>
            <parameter key="solver" value="AUTO"/>
            <parameter key="reproducible" value="false"/>
            <parameter key="maximum_number_of_threads" value="4"/>
            <parameter key="use_regularization" value="true"/>
            <parameter key="lambda_search" value="false"/>
            <parameter key="number_of_lambdas" value="0"/>
            <parameter key="lambda_min_ratio" value="0.0"/>
            <parameter key="early_stopping" value="true"/>
            <parameter key="stopping_rounds" value="3"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="standardize" value="true"/>
            <parameter key="non-negative_coefficients" value="false"/>
            <parameter key="add_intercept" value="true"/>
            <parameter key="compute_p-values" value="false"/>
            <parameter key="remove_collinear_columns" value="false"/>
            <parameter key="missing_values_handling" value="MeanImputation"/>
            <parameter key="max_iterations" value="0"/>
            <parameter key="specify_beta_constraints" value="false"/>
            <list key="beta_constraints"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
          </operator>
          <connect from_port="example set source" to_op="Generalized Linear Model (2)" to_port="training set"/>
          <connect from_op="Generalized Linear Model (2)" from_port="model" to_port="model sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_model sink" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">apply imputation on the missing values with GLM</description>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (4)" width="90" x="1251" y="493">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="class.equals.test"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Impute by GLM" width="90" x="1452" y="493">
        <parameter key="attribute_name" value="new_age"/>
        <parameter key="target_role" value="prediction"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance: GLM" width="90" x="1586" y="493">
        <parameter key="main_criterion" value="first"/>
        <parameter key="root_mean_squared_error" value="true"/>
        <parameter key="absolute_error" value="true"/>
        <parameter key="relative_error" value="true"/>
        <parameter key="relative_error_lenient" value="false"/>
        <parameter key="relative_error_strict" value="true"/>
        <parameter key="normalized_absolute_error" value="false"/>
        <parameter key="root_relative_squared_error" value="false"/>
        <parameter key="squared_error" value="true"/>
        <parameter key="correlation" value="true"/>
        <parameter key="squared_correlation" value="true"/>
        <parameter key="prediction_average" value="true"/>
        <parameter key="spearman_rho" value="true"/>
        <parameter key="kendall_tau" value="true"/>
        <parameter key="skip_undefined_labels" value="true"/>
        <parameter key="use_example_weights" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">check performance e.g. RMSE on testing</description>
      </operator>
      <operator activated="true" class="impute_missing_values" compatibility="9.1.000" expanded="true" height="68" name="Impute Missing Values (4)" width="90" x="1117" y="697">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="new_age"/>
        <parameter key="attributes" value="Age|Cabin|Life Boat"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="iterate" value="true"/>
        <parameter key="learn_on_complete_cases" value="true"/>
        <parameter key="order" value="chronological"/>
        <parameter key="sort" value="ascending"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <process expanded="true">
          <operator activated="true" class="h2o:deep_learning" compatibility="9.0.000" expanded="true" height="82" name="Deep Learning (2)" width="90" x="447" y="34">
            <parameter key="activation" value="Rectifier"/>
            <enumeration key="hidden_layer_sizes">
              <parameter key="hidden_layer_sizes" value="50"/>
              <parameter key="hidden_layer_sizes" value="50"/>
            </enumeration>
            <enumeration key="hidden_dropout_ratios"/>
            <parameter key="reproducible_(uses_1_thread)" value="false"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="epochs" value="10.0"/>
            <parameter key="compute_variable_importances" value="false"/>
            <parameter key="train_samples_per_iteration" value="-2"/>
            <parameter key="adaptive_rate" value="true"/>
            <parameter key="epsilon" value="1.0E-8"/>
            <parameter key="rho" value="0.99"/>
            <parameter key="learning_rate" value="0.005"/>
            <parameter key="learning_rate_annealing" value="1.0E-6"/>
            <parameter key="learning_rate_decay" value="1.0"/>
            <parameter key="momentum_start" value="0.0"/>
            <parameter key="momentum_ramp" value="1000000.0"/>
            <parameter key="momentum_stable" value="0.0"/>
            <parameter key="nesterov_accelerated_gradient" value="true"/>
            <parameter key="standardize" value="true"/>
            <parameter key="L1" value="1.0E-5"/>
            <parameter key="L2" value="0.0"/>
            <parameter key="max_w2" value="10.0"/>
            <parameter key="loss_function" value="Automatic"/>
            <parameter key="distribution_function" value="AUTO"/>
            <parameter key="early_stopping" value="false"/>
            <parameter key="stopping_rounds" value="1"/>
            <parameter key="stopping_metric" value="AUTO"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="missing_values_handling" value="MeanImputation"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
            <list key="expert_parameters_"/>
          </operator>
          <connect from_port="example set source" to_op="Deep Learning (2)" to_port="training set"/>
          <connect from_op="Deep Learning (2)" from_port="model" to_port="model sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_model sink" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">apply imputation on the missing values with DL</description>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (5)" width="90" x="1251" y="697">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="class.equals.test"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Impute by DL" width="90" x="1452" y="697">
        <parameter key="attribute_name" value="new_age"/>
        <parameter key="target_role" value="prediction"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance: DL" width="90" x="1586" y="697">
        <parameter key="main_criterion" value="first"/>
        <parameter key="root_mean_squared_error" value="true"/>
        <parameter key="absolute_error" value="true"/>
        <parameter key="relative_error" value="true"/>
        <parameter key="relative_error_lenient" value="false"/>
        <parameter key="relative_error_strict" value="true"/>
        <parameter key="normalized_absolute_error" value="false"/>
        <parameter key="root_relative_squared_error" value="false"/>
        <parameter key="squared_error" value="true"/>
        <parameter key="correlation" value="true"/>
        <parameter key="squared_correlation" value="true"/>
        <parameter key="prediction_average" value="true"/>
        <parameter key="spearman_rho" value="true"/>
        <parameter key="kendall_tau" value="true"/>
        <parameter key="skip_undefined_labels" value="true"/>
        <parameter key="use_example_weights" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">check performance e.g. RMSE on testing</description>
      </operator>
      <connect from_op="Retrieve Titanic" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Generate Attributes (2)" to_port="example set input"/>
      <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Impute Missing Values" to_port="example set in"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Impute Missing Values (2)" to_port="example set in"/>
      <connect from_op="Multiply" from_port="output 3" to_op="Impute Missing Values (3)" to_port="example set in"/>
      <connect from_op="Multiply" from_port="output 4" to_op="Impute Missing Values (4)" to_port="example set in"/>
      <connect from_op="Impute Missing Values" from_port="example set out" to_op="Filter Examples (2)" to_port="example set input"/>
      <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Impute by KNN" to_port="example set input"/>
      <connect from_op="Impute by KNN" from_port="example set output" to_op="Performance: KNN" to_port="labelled data"/>
      <connect from_op="Performance: KNN" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance: KNN" from_port="example set" to_port="result 2"/>
      <connect from_op="Impute Missing Values (2)" from_port="example set out" to_op="Filter Examples (3)" to_port="example set input"/>
      <connect from_op="Filter Examples (3)" from_port="example set output" to_op="Impute by GBT" to_port="example set input"/>
      <connect from_op="Impute by GBT" from_port="example set output" to_op="Performance: GBT" to_port="labelled data"/>
      <connect from_op="Performance: GBT" from_port="performance" to_port="result 3"/>
      <connect from_op="Performance: GBT" from_port="example set" to_port="result 4"/>
      <connect from_op="Impute Missing Values (3)" from_port="example set out" to_op="Filter Examples (4)" to_port="example set input"/>
      <connect from_op="Filter Examples (4)" from_port="example set output" to_op="Impute by GLM" to_port="example set input"/>
      <connect from_op="Impute by GLM" from_port="example set output" to_op="Performance: GLM" to_port="labelled data"/>
      <connect from_op="Performance: GLM" from_port="performance" to_port="result 5"/>
      <connect from_op="Performance: GLM" from_port="example set" to_port="result 6"/>
      <connect from_op="Impute Missing Values (4)" from_port="example set out" to_op="Filter Examples (5)" to_port="example set input"/>
      <connect from_op="Filter Examples (5)" from_port="example set output" to_op="Impute by DL" to_port="example set input"/>
      <connect from_op="Impute by DL" from_port="example set output" to_op="Performance: DL" to_port="labelled data"/>
      <connect from_op="Performance: DL" from_port="performance" to_port="result 7"/>
      <connect from_op="Performance: DL" from_port="example set" to_port="result 8"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="210"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="168"/>
      <portSpacing port="sink_result 6" spacing="0"/>
      <portSpacing port="sink_result 7" spacing="147"/>
      <portSpacing port="sink_result 8" spacing="0"/>
      <portSpacing port="sink_result 9" spacing="0"/>
    </process>
  </operator>
</process>

HTH!

Telcontar120

This is an interesting problem to solve, but be careful about over-engineering in this context. Since these are missing values, my assumption is that you want to replace them so that you can later use these attributes as inputs to another predictive model that predicts a different label. However, sometimes the fact that a value is missing is itself useful information for the model. The proportion of missing values also makes a difference: is it just a handful of records, or do some attributes have a large percentage of missing values? You might ultimately want to test your methods by using ML algorithms that allow missing values (e.g., most tree-based methods) and/or by segmenting the data and creating separate scorecards for missing vs. non-missing examples. Both can then be compared to a model that includes all attributes after filling in the missing values with whichever of the methods above you selected.
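To make the point about missingness being informative more concrete, here is a minimal sketch in plain Python (not a RapidMiner operator). It keeps a 0/1 flag per attribute before any imputation and reports the share of missing values. The helper names `add_missing_indicator` and `missing_share`, the `_missing` suffix, and the sample rows are assumptions made up for illustration.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def add_missing_indicator(rows, attribute):
    """Return copies of the rows with an extra 0/1 flag for `attribute`."""
    flagged = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows stay untouched
        new_row[attribute + "_missing"] = 1 if is_missing(row.get(attribute)) else 0
        flagged.append(new_row)
    return flagged


def missing_share(rows, attribute):
    """Fraction of rows in which `attribute` is missing."""
    return sum(1 for row in rows if is_missing(row.get(attribute))) / len(rows)


# Hypothetical sample data, only for illustration.
rows = [{"age": 22.0}, {"age": None}, {"age": 38.0}, {"age": float("nan")}]
flagged = add_missing_indicator(rows, "age")
share = missing_share(rows, "age")
```

The flag can then be fed to the later model alongside the imputed attribute, so the model can still see which values were originally missing.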

MartinLiebig

Hi,

Please also keep in mind that you are transferring information from the example set into all examples. This means that your examples are not necessarily independent of each other anymore. This needs to be taken into account carefully so that you do not trick yourself in validation.
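One way to limit this effect, sketched here in plain Python rather than RapidMiner, is to learn the imputation only on the training part and then apply the learned value unchanged to the test part. Mean imputation is used purely as the simplest stand-in; the function names and the numbers are assumptions for illustration.

```python
def fit_mean(values):
    """Learn the fill value from the known entries of the training data only."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to learn from")
    return sum(known) / len(known)


def apply_fill(values, fill):
    """Replace missing entries (None) with a previously learned fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical split, only for illustration.
train = [10.0, None, 14.0, 12.0]
test = [None, 30.0]

train_mean = fit_mean(train)                # learned without looking at the test data
train_filled = apply_fill(train, train_mean)
test_filled = apply_fill(test, train_mean)  # the test value 30.0 never influences the fill
```

If the mean were computed on train and test together, the test value 30.0 would leak into the filled training examples, which is the kind of dependency that can make a validation look better than it is.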

Best,

Martin

MoWei

Hello again,

thank you very much for your answers.

@yyhuang: Thanks for your work. Even if your approach doesn't do everything I would like to do, it helps me a lot. With my dataset the GBT, GLM and DL algorithms do not work. I always get the error message shown in the screenshot. Can you tell me what the problem is? (To be honest, I don't know exactly how the algorithms work; I hope that's not mandatory.) Do you have any idea how I can integrate linear interpolation or quadratic interpolation into the system?

To everybody:

My actual plan for dealing with missing values was as follows:

Attribute 1

- is label

- delete all examples that have no value for attribute 1 (missing value)

- split remaining data into training data and test data

Method 1

- Predict values of attribute 1

- Evaluate by comparing existing and predicted values

Method 2

- Predict values of attribute 1

- Evaluate by comparing existing and predicted values

Method 3

- Predict values of attribute 1

- Evaluate by comparing existing and predicted values

- Remember the method with the best result (lowest RMSE or highest Accuracy or ???).

Attribute 2

- is label

- delete all examples that have no value for attribute 2 (missing value)

- split remaining data into training data and test data

Method 1

- Predict values of attribute 2

- Evaluate by comparing existing and predicted values

Method 2

- Predict values of attribute 2

- Evaluate by comparing existing and predicted values

Method 3

- Predict values of attribute 2

- Evaluate by comparing existing and predicted values

- Remember the method with the best result (lowest RMSE or highest Accuracy or ???).

Attribute 3

and so on

- the missing values are imputed at the very end with the help of the respective best methods. This means that the system first searches for the best method for each attribute and then fills the gaps.
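The plan above can be sketched roughly as follows in plain Python (outside RapidMiner, standard library only). For each attribute, some known values are hidden, each method fills them in, and the method with the lowest RMSE on the hidden values is remembered. Only mean imputation and linear interpolation are shown. The interpolation assumes the rows are ordered, e.g. by time, as is typical for sensor data. All function names, the `hide_fraction` parameter, and the sample columns are assumptions for illustration.

```python
import math
import random


def impute_mean(values):
    """Fill every gap (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def impute_linear(values):
    """Linear interpolation along row order; edge gaps copy the nearest known value."""
    known_idx = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = [k for k in known_idx if k < i]
        right = [k for k in known_idx if k > i]
        if left and right:
            a, b = left[-1], right[0]
            out[i] = values[a] + (values[b] - values[a]) * (i - a) / (b - a)
        else:
            out[i] = values[left[-1]] if left else values[right[0]]
    return out


def rmse_on_hidden(values, method, hidden_idx):
    """Hide the known values at hidden_idx, impute, and compare with the truth."""
    masked = [None if i in hidden_idx else v for i, v in enumerate(values)]
    filled = method(masked)
    errors = [(filled[i] - values[i]) ** 2 for i in hidden_idx]
    return math.sqrt(sum(errors) / len(errors))


def best_method(values, methods, hide_fraction=0.2, seed=1992):
    """Return (name of the method with the lowest RMSE, all scores) for one attribute."""
    known_idx = [i for i, v in enumerate(values) if v is not None]
    if len(known_idx) < 2:
        raise ValueError("need at least two known values to evaluate")
    n_hide = min(len(known_idx) - 1, max(1, int(len(known_idx) * hide_fraction)))
    hidden = set(random.Random(seed).sample(known_idx, n_hide))
    scores = {name: rmse_on_hidden(values, m, hidden) for name, m in methods.items()}
    return min(scores, key=scores.get), scores


methods = {"mean": impute_mean, "linear": impute_linear}

# Hypothetical sensor columns, only for illustration.
data = {
    "temperature": [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0, 9.0, 10.0],
    "vibration": [5.0, 5.1, None, 4.9, 5.0, 5.2, 4.8, None, 5.0, 5.1],
}

# First find the best method per attribute, then fill the real gaps with it.
choice = {name: best_method(column, methods) for name, column in data.items()}
imputed = {name: methods[choice[name][0]](column) for name, column in data.items()}
```

With a single hidden sample the ranking can be unstable, so repeating the hiding step with several seeds and averaging the RMSE would probably give a more reliable choice.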

What do you think of this plan?

Is there a general approach for pre-processing somewhere? How do I handle a data set from a production machine (almost everything is sensor data)? All existing data is in a final dataset, i.e. the data integration is finished. So the data cleaning (missing values and noise), transformation and data reduction (feature selection and example selection) are still missing. In which order should I do what? How do I proceed in general? Unfortunately I haven't found anything yet and would be very grateful for literature or other help.

Thank you very much.

BG

Moritz

YYH

Hi @MoWei,

Thanks for sharing the screenshot. If there is any nominal ID-like attribute in the input, the H2O model will throw errors. That's why I dropped some ID-like columns with "Set Role" before the imputation. You can use invert selection on some attributes to troubleshoot. A quick check of the input of the Deep Learning learner (right-click the input node and view the example set) could also be useful. You can also share your data and process here for us to investigate further.
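As a rough way to spot such columns before training, here is a small plain-Python sketch (not a RapidMiner or H2O function). It flags text attributes in which nearly every value is distinct. The function name `id_like_attributes`, the 0.95 threshold, and the sample rows are assumptions for illustration.

```python
def id_like_attributes(rows, threshold=0.95):
    """Return names of text attributes whose share of distinct values reaches `threshold`."""
    names = sorted({key for row in rows for key in row})
    suspects = []
    for name in names:
        values = [row.get(name) for row in rows if row.get(name) is not None]
        # Only nominal (string) attributes are candidates.
        if not values or not all(isinstance(v, str) for v in values):
            continue
        if len(set(values)) / len(values) >= threshold:
            suspects.append(name)
    return suspects


# Hypothetical rows, only for illustration.
rows = [
    {"Name": "Allen", "Sex": "female", "Age": 29.0},
    {"Name": "Allison", "Sex": "male", "Age": 0.9},
    {"Name": "Anderson", "Sex": "male", "Age": 48.0},
    {"Name": "Andrews", "Sex": "female", "Age": 63.0},
]
suspects = id_like_attributes(rows)
```

Attributes reported this way are candidates for exclusion or for an ID role before the learner sees them.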

YY