How to fill missing values, based on other records in my dataset?
sebasvog
New Altair Community Member
Hello everyone,
I am working on data preperation for training a ML algorithm on a classification problem.
In my dataset (~23000 records) some values (dates) are missing. The dataset looks like in the screenshot below:
Every row is a position on a particular customer order. Every order has a unique Number (column A). Shipping and Invoice Date are only displayed on the last position of the order.
How can i fill the missing values with the matching data (based on "Sales Doc")? The filled dataset should look like this:
I am new in Rapidminer and I would be very thankfull if someone can help me out with this.
Thank you.
I am working on data preperation for training a ML algorithm on a classification problem.
In my dataset (~23000 records) some values (dates) are missing. The dataset looks like in the screenshot below:
Every row is a position on a particular customer order. Every order has a unique Number (column A). Shipping and Invoice Date are only displayed on the last position of the order.
How can i fill the missing values with the matching data (based on "Sales Doc")? The filled dataset should look like this:
I am new in Rapidminer and I would be very thankfull if someone can help me out with this.
Thank you.
Tagged:
0
Best Answer
-
Hi @sebasvog you can do it by using the replace missing values (Series) operator since you are working with dates.Configure the replace type date time to the option next value as shown in my example.
You could also create a data set with the Order and the dates you are interested and then join them with the original data set- Select Attributes: Order and Dates
- Filter Examples: Not missing on the Dates you want
- Remove Duplicates: just in case you have some
- Join: Join by Order Id with the previous DataSet
<?xml version="1.0" encoding="UTF-8"?><process version="9.7.002"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.7.002" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="utility:create_exampleset" compatibility="9.7.002" expanded="true" height="68" name="Create ExampleSet" width="90" x="313" y="85"> <parameter key="generator_type" value="comma separated text"/> <parameter key="number_of_examples" value="100"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"/> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="input_csv_text" value="Order,Date A, A, A, A, A, A, A,30/09/2020 B, B, B, B,25/09/2020"/> <parameter key="column_separator" value=","/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <operator activated="true" class="nominal_to_date" compatibility="9.7.002" expanded="true" height="82" name="Nominal to Date" width="90" x="447" y="85"> <parameter key="attribute_name" value="Date"/> <parameter key="date_type" value="date"/> <parameter key="date_format" value="dd/MM/yyyy"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="locale" value="English (United States)"/> <parameter key="keep_old_attribute" value="false"/> </operator> <operator activated="true" class="time_series:replace_missing_values" compatibility="9.8.000-SNAPSHOT" expanded="true" height="68" name="Replace Missing Values (Series)" width="90" x="581" y="85"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="Date"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="has_indices" value="false"/> <parameter key="indices_attribute" value=""/> <parameter key="overwrite_attributes" value="true"/> <parameter key="new_attributes_postfix" value="_cleaned"/> <parameter key="replace_type_numerical" value="previous value"/> <parameter key="replace_type_nominal" value="previous value"/> <parameter key="replace_type_date_time" value="next value"/> <parameter key="replace_value_numerical" value="0.0"/> <parameter key="replace_value_nominal" value="unknown"/> <parameter key="skip_other_missings" value="true"/> <parameter key="replace_infinity" value="true"/> <parameter key="replace_empty_strings" value="true"/> <parameter key="ensure_finite_values" value="false"/> </operator> <connect from_op="Create ExampleSet" from_port="output" to_op="Nominal to Date" to_port="example set input"/> <connect from_op="Nominal to Date" from_port="example set output" to_op="Replace Missing Values (Series)" to_port="example set"/> <connect from_op="Replace Missing Values (Series)" from_port="example set" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
5
Answers
-
Hi @sebasvog you can do it by using the replace missing values (Series) operator since you are working with dates.Configure the replace type date time to the option next value as shown in my example.
You could also create a data set with the Order and the dates you are interested and then join them with the original data set- Select Attributes: Order and Dates
- Filter Examples: Not missing on the Dates you want
- Remove Duplicates: just in case you have some
- Join: Join by Order Id with the previous DataSet
<?xml version="1.0" encoding="UTF-8"?><process version="9.7.002"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.7.002" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="utility:create_exampleset" compatibility="9.7.002" expanded="true" height="68" name="Create ExampleSet" width="90" x="313" y="85"> <parameter key="generator_type" value="comma separated text"/> <parameter key="number_of_examples" value="100"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"/> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="input_csv_text" value="Order,Date A, A, A, A, A, A, A,30/09/2020 B, B, B, B,25/09/2020"/> <parameter key="column_separator" value=","/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <operator activated="true" class="nominal_to_date" compatibility="9.7.002" expanded="true" height="82" name="Nominal to Date" width="90" x="447" y="85"> <parameter key="attribute_name" value="Date"/> <parameter key="date_type" value="date"/> <parameter key="date_format" value="dd/MM/yyyy"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="locale" value="English (United States)"/> <parameter key="keep_old_attribute" value="false"/> </operator> <operator activated="true" class="time_series:replace_missing_values" compatibility="9.8.000-SNAPSHOT" expanded="true" height="68" name="Replace Missing Values (Series)" width="90" x="581" y="85"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="Date"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="has_indices" value="false"/> <parameter key="indices_attribute" value=""/> <parameter key="overwrite_attributes" value="true"/> <parameter key="new_attributes_postfix" value="_cleaned"/> <parameter key="replace_type_numerical" value="previous value"/> <parameter key="replace_type_nominal" value="previous value"/> <parameter key="replace_type_date_time" value="next value"/> <parameter key="replace_value_numerical" value="0.0"/> <parameter key="replace_value_nominal" value="unknown"/> <parameter key="skip_other_missings" value="true"/> <parameter key="replace_infinity" value="true"/> <parameter key="replace_empty_strings" value="true"/> <parameter key="ensure_finite_values" value="false"/> </operator> <connect from_op="Create ExampleSet" from_port="output" to_op="Nominal to Date" to_port="example set input"/> <connect from_op="Nominal to Date" from_port="example set output" to_op="Replace Missing Values (Series)" to_port="example set"/> <connect from_op="Replace Missing Values (Series)" from_port="example set" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
5