How long can the process be?
Hi, I have a dataset with 5000 rows and 9 columns. I am trying to run a process that fills in wrong/missing values with the average. The process has not finished: I have waited for at least an hour and it is still running. Is that normal? By the way, my computer is a Mac Pro produced in 2014.
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" breakpoints="after" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve adult" width="90" x="112" y="187">
<parameter key="repository_entry" value="//Local Repository/data/adult"/>
</operator>
<operator activated="true" class="subprocess" compatibility="9.1.000" expanded="true" height="103" name="Subprocess" width="90" x="380" y="136">
<process expanded="true">
<operator activated="true" class="replace_missing_values" compatibility="9.1.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="45" y="238">
<parameter key="return_preprocessing_model" value="false"/>
<parameter key="create_view" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="default" value="average"/>
<list key="columns"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="238">
<parameter key="parameter_expression" value=""/>
<parameter key="condition_class" value="no_missing_labels"/>
<parameter key="invert_filter" value="false"/>
<list key="filters_list"/>
<parameter key="filters_logic_and" value="true"/>
<parameter key="filters_check_metadata" value="true"/>
</operator>
<operator activated="true" class="discretize_by_bins" compatibility="9.1.000" expanded="true" height="103" name="Discretize" width="90" x="246" y="34">
<parameter key="return_preprocessing_model" value="false"/>
<parameter key="create_view" value="false"/>
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value="hours-per-week|education-num|age"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="number_of_bins" value="4"/>
<parameter key="define_boundaries" value="false"/>
<parameter key="range_name_type" value="long"/>
<parameter key="automatic_number_of_digits" value="true"/>
<parameter key="number_of_digits" value="3"/>
</operator>
<operator activated="true" class="detect_outlier_distances" compatibility="9.1.000" expanded="true" height="82" name="Detect Outlier (Distances)" width="90" x="447" y="391">
<parameter key="number_of_neighbors" value="1"/>
<parameter key="number_of_outliers" value="2"/>
<parameter key="distance_function" value="euclidian distance"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (2)" width="90" x="447" y="238">
<parameter key="parameter_expression" value=""/>
<parameter key="condition_class" value="custom_filters"/>
<parameter key="invert_filter" value="false"/>
<list key="filters_list">
<parameter key="filters_entry_key" value="outlier.does_not_equal.true"/>
</list>
<parameter key="filters_logic_and" value="true"/>
<parameter key="filters_check_metadata" value="true"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.1.000" expanded="true" height="82" name="Multiply" width="90" x="447" y="34"/>
<operator activated="true" class="select_attributes" compatibility="9.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="outlier"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="weight_by_information_gain" compatibility="9.1.000" expanded="true" height="82" name="Weight by Information Gain" width="90" x="581" y="238">
<parameter key="normalize_weights" value="true"/>
<parameter key="sort_weights" value="true"/>
<parameter key="sort_direction" value="descending"/>
</operator>
<operator activated="true" class="select_by_weights" compatibility="9.1.000" expanded="true" height="103" name="Select by Weights" width="90" x="715" y="238">
<parameter key="weight_relation" value="top k"/>
<parameter key="weight" value="1.0"/>
<parameter key="k" value="5"/>
<parameter key="p" value="0.5"/>
<parameter key="deselect_unknown" value="true"/>
<parameter key="use_absolute_weights" value="true"/>
</operator>
<connect from_port="in 1" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Discretize" to_port="example set input"/>
<connect from_op="Discretize" from_port="example set output" to_op="Detect Outlier (Distances)" to_port="example set input"/>
<connect from_op="Detect Outlier (Distances)" from_port="example set output" to_op="Filter Examples (2)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Weight by Information Gain" to_port="example set"/>
<connect from_op="Weight by Information Gain" from_port="weights" to_op="Select by Weights" to_port="weights"/>
<connect from_op="Weight by Information Gain" from_port="example set" to_op="Select by Weights" to_port="example set input"/>
<connect from_op="Select by Weights" from_port="example set output" to_port="out 1"/>
<connect from_op="Select by Weights" from_port="weights" to_port="out 2"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
<portSpacing port="sink_out 3" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve adult" from_port="output" to_op="Subprocess" to_port="in 1"/>
<connect from_op="Subprocess" from_port="out 1" to_port="result 1"/>
<connect from_op="Subprocess" from_port="out 2" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Hi @Newplayer, I looked at your XML, and from what I can see without your data set there is nothing wrong with it. Why not first try it with a reduced number of attributes and see how long that takes?
Scott
Hi,
I also tested your process with some similar test data. Your setup looks good. What takes so long is the outlier detection, as it has to compare every pair of points.
Take a look at the "Anomaly Detection" extension on the Marketplace. It offers several more performant algorithms. The only change you have to account for is that most of them return an outlier score ("how outlier-ish is that point") rather than a binary decision (outlier = yes/no).
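To make the cost argument concrete, here is a minimal Python sketch of a brute-force distance-based outlier search. It is only an illustration and not RapidMiner's actual implementation: the function names (`pair_count`, `knn_distance_scores`, `top_outliers`) are made up for this example, and the random 9-column data is an assumption standing in for the real data set. It shows that the number of point pairs grows quadratically with the number of rows, and that the distance to the k-th nearest neighbor can serve as an outlier score.

```python
import math
import random
import time


def pair_count(n):
    """Number of distinct point pairs a brute-force search has to consider."""
    return n * (n - 1) // 2


def knn_distance_scores(points, k=1):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, k=1, n_outliers=2):
    """Indices of the n_outliers points with the largest k-NN distance."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_outliers]


if __name__ == "__main__":
    random.seed(2001)
    # Hypothetical random data with 9 numeric columns; sizes kept small on purpose.
    for n in (100, 200, 400):
        data = [[random.random() for _ in range(9)] for _ in range(n)]
        start = time.perf_counter()
        top_outliers(data, k=1, n_outliers=2)
        elapsed = time.perf_counter() - start
        print(f"rows={n} pairs={pair_count(n)} seconds={elapsed:.3f}")
```

Doubling the number of rows roughly quadruples the number of pairs, so running the process on a small sample first gives a rough idea of how the runtime will scale on the full data.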
Hope this helps.