Optimize k in k-NN

I am using the k-NN algorithm to predict the categories of some products by text mining customer comments.

I have a question about parameter optimization for classification with k-NN.

I want to optimize k with the "Optimize Parameters" and "Log" operators to get the best accuracy.

However, my process below has two Performance operators, and I don't know where to put the Optimize Parameters and Log operators.

Any help would be appreciated.

-neginz

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve appended-data-eng" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../../data/Digikala-Data/appended-data-eng"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="8.2.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Weakness|Strengths|Content|Title"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Content|Strengths|Title|Weakness|Comment id|Category"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="8.2.001" expanded="true" height="82" name="Set Role" width="90" x="514" y="34">
        <parameter key="attribute_name" value="Category"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="8.2.001" expanded="true" height="103" name="Split Data" width="90" x="112" y="289">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="238">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_below_percent" value="0.0"/>
        <parameter key="prune_above_percent" value="100.0"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
          <operator activated="false" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="136">
            <parameter key="mode" value="linguistic sentences"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="246" y="34"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="380" y="34"/>
          <operator activated="false" class="text:filter_stopwords_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="514" y="136">
            <parameter key="file" value="E:\payan name\dataminer exention\amazon\New Text Document.txt"/>
          </operator>
          <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="648" y="34">
            <parameter key="min_chars" value="2"/>
            <parameter key="max_chars" value="9999"/>
          </operator>
          <operator activated="true" class="text:stem_porter" compatibility="8.1.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="782" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="916" y="34">
            <parameter key="max_length" value="3"/>
          </operator>
          <operator activated="false" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="648" y="340"/>
          <operator activated="false" class="text:stem_lovins" compatibility="8.1.000" expanded="true" height="68" name="Stem (Lovins)" width="90" x="246" y="595"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="514" y="187">
        <process expanded="true">
          <operator activated="true" class="k_nn" compatibility="8.2.001" expanded="true" height="82" name="k-NN" width="90" x="112" y="34">
            <parameter key="k" value="7"/>
          </operator>
          <connect from_port="training set" to_op="k-NN" to_port="training set"/>
          <connect from_op="k-NN" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="8.2.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="8.2.001" expanded="true" height="82" name="Performance" width="90" x="313" y="34">
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="free_memory" compatibility="8.2.001" expanded="true" height="82" name="Free Memory" width="90" x="648" y="289"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="380" y="442">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_below_percent" value="0.0"/>
        <parameter key="prune_above_percent" value="100.0"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (3)" width="90" x="45" y="34"/>
          <operator activated="false" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (4)" width="90" x="112" y="136">
            <parameter key="mode" value="linguistic sentences"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="179" y="34"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="380" y="34"/>
          <operator activated="false" class="text:filter_stopwords_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Filter Stopwords (3)" width="90" x="514" y="136">
            <parameter key="file" value="E:\payan name\dataminer exention\amazon\New Text Document.txt"/>
          </operator>
          <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="648" y="34">
            <parameter key="min_chars" value="2"/>
            <parameter key="max_chars" value="9999"/>
          </operator>
          <operator activated="true" class="text:stem_porter" compatibility="8.1.000" expanded="true" height="68" name="Stem (2)" width="90" x="782" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (2)" width="90" x="916" y="34">
            <parameter key="max_length" value="3"/>
          </operator>
          <operator activated="false" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (3)" width="90" x="648" y="340"/>
          <operator activated="false" class="text:stem_lovins" compatibility="8.1.000" expanded="true" height="68" name="Stem (4)" width="90" x="246" y="595"/>
          <operator activated="false" class="text:stem_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Stem (5)" width="90" x="380" y="289"/>
          <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
          <connect from_op="Tokenize (3)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
          <connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="8.2.001" expanded="true" height="82" name="Apply Model" width="90" x="849" y="442">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="8.2.001" expanded="true" height="82" name="Performance (3)" width="90" x="849" y="187">
        <list key="class_weights"/>
      </operator>
      <connect from_op="Retrieve appended-data-eng" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Cross Validation" from_port="model" to_op="Free Memory" to_port="through 1"/>
      <connect from_op="Free Memory" from_port="through 1" to_op="Apply Model" to_port="model"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 3"/>
      <connect from_op="Performance (3)" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance (3)" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

Accepted answers

rfuentealba

Hi @neginz,

There is a logic behind this.

What do you want to optimize? The answer is k.

Where is k? Somewhere inside the Cross Validation operator.

So you want to enclose the Cross Validation operator inside the Optimize Parameters operator and choose k from the Parameters panel when configuring the parameter optimization.

Please find attached.

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.0.000-BETA" expanded="true" height="68" name="Retrieve appended-data-eng" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../../data/Digikala-Data/appended-data-eng"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.0.000-BETA" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Weakness|Strengths|Content|Title"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.0.000-BETA" expanded="true" height="82" name="Select Attributes (2)" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Content|Strengths|Title|Weakness|Comment id|Category"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.0.000-BETA" expanded="true" height="82" name="Set Role" width="90" x="514" y="34">
        <parameter key="attribute_name" value="Category"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.0.000-BETA" expanded="true" height="103" name="Split Data" width="90" x="112" y="187">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <operator activated="true" class="dummy" compatibility="9.0.000-BETA" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="187"/>
      <operator activated="true" class="dummy" compatibility="9.0.000-BETA" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="380" y="442"/>
      <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="9.0.000-BETA" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="514" y="187">
        <list key="parameters">
          <parameter key="k-NN.k" value="[1.0;100.0;10;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="179" y="34">
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="9.0.000-BETA" expanded="true" height="82" name="k-NN" width="90" x="112" y="34">
                <parameter key="k" value="7"/>
              </operator>
              <connect from_port="training set" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.0.000-BETA" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.0.000-BETA" expanded="true" height="82" name="Performance" width="90" x="313" y="34">
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="model" to_port="model"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="free_memory" compatibility="9.0.000-BETA" expanded="true" height="82" name="Free Memory" width="90" x="648" y="187"/>
      <operator activated="true" class="apply_model" compatibility="9.0.000-BETA" expanded="true" height="82" name="Apply Model" width="90" x="849" y="442">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="9.0.000-BETA" expanded="true" height="82" name="Performance (3)" width="90" x="849" y="187">
        <list key="class_weights"/>
      </operator>
      <connect from_op="Retrieve appended-data-eng" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Process Documents from Data" to_port="in 1"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Process Documents from Data (2)" to_port="in 1"/>
      <connect from_op="Process Documents from Data" from_port="out 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Process Documents from Data (2)" from_port="out 1" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="model" to_op="Free Memory" to_port="through 1"/>
      <connect from_op="Free Memory" from_port="through 1" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 3"/>
      <connect from_op="Performance (3)" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance (3)" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
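
The grid in the process above, `[1.0;100.0;10;linear]`, reads as minimum; maximum; steps; scale, so it only tries a handful of k values spread between 1 and 100. A possible variant is sketched below. It assumes you would rather test every integer k from 1 to 30 (a range picked purely for illustration) and that n steps yield n + 1 grid points. Only the `parameters` list of Optimize Parameters (Grid) changes:

```xml
<!-- Fragment only: replaces the existing <list key="parameters"> inside
     the "Optimize Parameters (Grid)" operator. The 1..30 range is an
     illustrative assumption, not a recommendation from the thread. -->
<list key="parameters">
  <!-- min 1, max 30, 29 steps, linear scale: k = 1, 2, ..., 30 -->
  <parameter key="k-NN.k" value="[1.0;30.0;29;linear]"/>
</list>
```

The key has the form operator name, dot, parameter name, so it has to match the name of the k-NN operator inside Cross Validation exactly (`k-NN` here).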

Hope it helps.

All the best,

rfuentealba

@neginz,

I am sorry, I forgot about the Log operator.

Put it inside the Optimize Parameters operator, right after Cross Validation. What you want to log is the performance, so that is where the operator belongs.

One scatter plot is worth a thousand example sets:

[Image: Screen Shot 2018-07-23 at 01.15.30.png]

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.0.000-BETA" expanded="true" height="68" name="Retrieve appended-data-eng" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../../data/Digikala-Data/appended-data-eng"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.0.000-BETA" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Weakness|Strengths|Content|Title"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.0.000-BETA" expanded="true" height="82" name="Select Attributes (2)" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Content|Strengths|Title|Weakness|Comment id|Category"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.0.000-BETA" expanded="true" height="82" name="Set Role" width="90" x="514" y="34">
        <parameter key="attribute_name" value="Category"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.0.000-BETA" expanded="true" height="103" name="Split Data" width="90" x="112" y="187">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <operator activated="true" class="dummy" compatibility="9.0.000-BETA" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="187"/>
      <operator activated="true" class="dummy" compatibility="9.0.000-BETA" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="380" y="442"/>
      <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="9.0.000-BETA" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="514" y="187">
        <list key="parameters">
          <parameter key="k-NN.k" value="[1.0;100.0;10;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="179" y="34">
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="9.0.000-BETA" expanded="true" height="82" name="k-NN" width="90" x="112" y="34">
                <parameter key="k" value="7"/>
              </operator>
              <connect from_port="training set" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.0.000-BETA" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.0.000-BETA" expanded="true" height="82" name="Performance" width="90" x="313" y="34">
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log" compatibility="9.0.000-BETA" expanded="true" height="82" name="Log" width="90" x="380" y="136">
            <list key="log"/>
          </operator>
          <connect from_port="input 1" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="model" to_port="model"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="free_memory" compatibility="9.0.000-BETA" expanded="true" height="82" name="Free Memory" width="90" x="648" y="187"/>
      <operator activated="true" class="apply_model" compatibility="9.0.000-BETA" expanded="true" height="82" name="Apply Model" width="90" x="849" y="442">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="9.0.000-BETA" expanded="true" height="82" name="Performance (3)" width="90" x="849" y="187">
        <list key="class_weights"/>
      </operator>
      <connect from_op="Retrieve appended-data-eng" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Process Documents from Data" to_port="in 1"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Process Documents from Data (2)" to_port="in 1"/>
      <connect from_op="Process Documents from Data" from_port="out 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Process Documents from Data (2)" from_port="out 1" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="model" to_op="Free Memory" to_port="through 1"/>
      <connect from_op="Free Memory" from_port="through 1" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 3"/>
      <connect from_op="Performance (3)" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance (3)" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
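
As posted, the `log` list of the Log operator in the process above is empty, so the columns to record still have to be chosen. A minimal sketch of a filled-in list follows. The column names `k` and `accuracy` are arbitrary labels, and the value name `performance 1` is an assumption that may differ in your version, so it is safer to pick the actual entries from the drop-downs in the Log operator's Edit List dialog than to type them by hand:

```xml
<!-- Fragment only: a possible replacement for the empty Log operator
     inside "Optimize Parameters (Grid)". Column names are arbitrary;
     the value name "performance 1" is an assumption to verify in Studio. -->
<operator activated="true" class="log" compatibility="9.0.000-BETA" expanded="true" height="82" name="Log" width="90" x="380" y="136">
  <list key="log">
    <!-- current value of parameter k on the operator named k-NN -->
    <parameter key="k" value="operator.k-NN.parameter.k"/>
    <!-- averaged performance reported by the Cross Validation operator -->
    <parameter key="accuracy" value="operator.Cross Validation.value.performance 1"/>
  </list>
</operator>
```

Because Log sits after Cross Validation, it should write one row per tested k, which is what you need for plotting accuracy against k.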

Rule of thumb, according to CRISP-DM (which is a massive thing):

1. Understand the business (you).
2. Understand the data (you).
3. Prepare the data (RapidMiner Studio).
4. Build the model (RapidMiner Studio).
5. Evaluate and optimize the model (RapidMiner Studio).
6. Deploy the model (RapidMiner Server).

Now:

Model < Validation < Optimization

So the outermost one (optimization) wraps the validation, and the validation in turn contains the model. You don't want to log each validation fold but each optimization iteration, so the Log operator goes right after the Cross Validation operator, inside Optimize Parameters.
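As a rough illustration of that nesting outside RapidMiner, here is a minimal pure-Python sketch: a toy k-NN model sits inside a cross-validation loop, which sits inside a search over candidate values of k. The one-dimensional toy data, the function names, and the candidate list are all assumptions made up for this example; none of them come from the process above.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Model: majority vote among the k nearest training points (1-D feature for brevity)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(data, k, folds=5):
    """Validation: average accuracy of k-NN over `folds` train/test splits."""
    scores = []
    for i in range(folds):
        test = data[i::folds]  # every folds-th row, starting at row i
        train = [row for j, row in enumerate(data) if j % folds != i]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / folds

def optimize_k(data, candidates):
    """Optimization: run the whole validation once per candidate k, keep the best."""
    return max(candidates, key=lambda k: cross_val_accuracy(data, k))

# Hypothetical toy data: two overlapping classes on a line.
rng = random.Random(0)
data = [(rng.gauss(0, 1), "a") for _ in range(50)]
data += [(rng.gauss(3, 1), "b") for _ in range(50)]
rng.shuffle(data)

best_k = optimize_k(data, candidates=[1, 11, 21, 31])
print(best_k)
```

The point is only the shape: `optimize_k` calls `cross_val_accuracy`, which calls `knn_predict`, mirroring Optimization > Validation > Model.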

Hope it helps.


All comments

rfuentealba

Hi @neginz,

There is a logic behind this.

What do you want to optimize? The answer is k.

Where is Waldo... I mean, k? Somewhere inside the Cross Validation operator.

So you want to enclose the Cross Validation inside the Optimize Parameters operator and choose k from the Parameters panel when configuring the parameter optimization.

Please find attached.

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.0.000-BETA" expanded="true" height="68" name="Retrieve appended-data-eng" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../../data/Digikala-Data/appended-data-eng"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.0.000-BETA" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Weakness|Strengths|Content|Title"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.0.000-BETA" expanded="true" height="82" name="Select Attributes (2)" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Content|Strengths|Title|Weakness|Comment id|Category"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.0.000-BETA" expanded="true" height="82" name="Set Role" width="90" x="514" y="34">
        <parameter key="attribute_name" value="Category"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.0.000-BETA" expanded="true" height="103" name="Split Data" width="90" x="112" y="187">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <operator activated="true" class="dummy" compatibility="9.0.000-BETA" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="187"/>
      <operator activated="true" class="dummy" compatibility="9.0.000-BETA" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="380" y="442"/>
      <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="9.0.000-BETA" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="514" y="187">
        <list key="parameters">
          <parameter key="k-NN.k" value="[1.0;100.0;10;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="179" y="34">
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="9.0.000-BETA" expanded="true" height="82" name="k-NN" width="90" x="112" y="34">
                <parameter key="k" value="7"/>
              </operator>
              <connect from_port="training set" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.0.000-BETA" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.0.000-BETA" expanded="true" height="82" name="Performance" width="90" x="313" y="34">
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="model" to_port="model"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="free_memory" compatibility="9.0.000-BETA" expanded="true" height="82" name="Free Memory" width="90" x="648" y="187"/>
      <operator activated="true" class="apply_model" compatibility="9.0.000-BETA" expanded="true" height="82" name="Apply Model" width="90" x="849" y="442">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="9.0.000-BETA" expanded="true" height="82" name="Performance (3)" width="90" x="849" y="187">
        <list key="class_weights"/>
      </operator>
      <connect from_op="Retrieve appended-data-eng" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Process Documents from Data" to_port="in 1"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Process Documents from Data (2)" to_port="in 1"/>
      <connect from_op="Process Documents from Data" from_port="out 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Process Documents from Data (2)" from_port="out 1" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="model" to_op="Free Memory" to_port="through 1"/>
      <connect from_op="Free Memory" from_port="through 1" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 3"/>
      <connect from_op="Performance (3)" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance (3)" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

Hope it helps.

All the best,

neginz

Thanks for your reply, @rfuentealba.

I know about Optimize Parameters, but what about the Log operator: where should I put it? Also, because I have two Performance operators, I don't know which one should be compared with k: the one inside Cross Validation or the one outside it, in the main process. My problem is with the performance...

rfuentealba

@neginz,

I am sorry, I forgot about the Log operator.

Put it inside the Optimize Parameters operator, right after Cross Validation. What you want to log is the cross-validated performance for each value of k, so that is where the Log operator belongs.
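To make the role of the two performance values more concrete, here is a small Python sketch of the logging idea, not RapidMiner code. The function name and the scores are hypothetical placeholders invented so the example runs; they are not results from this process. The inner (cross-validated) score is what gets logged and compared across values of k, while the outer Performance (3) in the main process is computed once, afterwards, on the held-out partition with the selected model.

```python
def optimize_with_log(candidates, inner_score):
    """Try each candidate k, log one row per iteration, return the best k and the log."""
    log = []
    for k in candidates:
        # One log row per optimization iteration: the parameter and its inner score.
        log.append({"k": k, "cv_accuracy": inner_score(k)})
    best_row = max(log, key=lambda row: row["cv_accuracy"])
    return best_row["k"], log

# Placeholder scores, invented purely so the sketch runs; they are not measurements.
PLACEHOLDER_CV = {1: 0.80, 11: 0.86, 21: 0.84, 31: 0.79}

best_k, log = optimize_with_log([1, 11, 21, 31], PLACEHOLDER_CV.get)
for row in log:
    print(row["k"], row["cv_accuracy"])
print("selected k:", best_k)
```

Plotting the logged rows (k against the inner accuracy) gives the kind of scatter plot mentioned next.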

One scatter plot is worth a thousand example sets:

Screen Shot 2018-07-23 at 01.15.30.png

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.0.000-BETA" expanded="true" height="68" name="Retrieve appended-data-eng" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../../data/Digikala-Data/appended-data-eng"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.0.000-BETA" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Weakness|Strengths|Content|Title"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.0.000-BETA" expanded="true" height="82" name="Select Attributes (2)" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Content|Strengths|Title|Weakness|Comment id|Category"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.0.000-BETA" expanded="true" height="82" name="Set Role" width="90" x="514" y="34">
        <parameter key="attribute_name" value="Category"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.0.000-BETA" expanded="true" height="103" name="Split Data" width="90" x="112" y="187">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <operator activated="true" class="dummy" compatibility="9.0.000-BETA" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="187"/>
      <operator activated="true" class="dummy" compatibility="9.0.000-BETA" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="380" y="442"/>
      <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="9.0.000-BETA" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="514" y="187">
        <list key="parameters">
          <parameter key="k-NN.k" value="[1.0;100.0;10;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="179" y="34">
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="9.0.000-BETA" expanded="true" height="82" name="k-NN" width="90" x="112" y="34">
                <parameter key="k" value="7"/>
              </operator>
              <connect from_port="training set" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.0.000-BETA" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.0.000-BETA" expanded="true" height="82" name="Performance" width="90" x="313" y="34">
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log" compatibility="9.0.000-BETA" expanded="true" height="82" name="Log" width="90" x="380" y="136">
            <list key="log"/>
          </operator>
          <connect from_port="input 1" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="model" to_port="model"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="free_memory" compatibility="9.0.000-BETA" expanded="true" height="82" name="Free Memory" width="90" x="648" y="187"/>
      <operator activated="true" class="apply_model" compatibility="9.0.000-BETA" expanded="true" height="82" name="Apply Model" width="90" x="849" y="442">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="9.0.000-BETA" expanded="true" height="82" name="Performance (3)" width="90" x="849" y="187">
        <list key="class_weights"/>
      </operator>
      <connect from_op="Retrieve appended-data-eng" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Process Documents from Data" to_port="in 1"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Process Documents from Data (2)" to_port="in 1"/>
      <connect from_op="Process Documents from Data" from_port="out 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Process Documents from Data (2)" from_port="out 1" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="model" to_op="Free Memory" to_port="through 1"/>
      <connect from_op="Free Memory" from_port="through 1" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 3"/>
      <connect from_op="Performance (3)" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance (3)" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>


Maerkli

Hello Rodrigo,

I have opened your second XML file, but the Cross Validation and Log operators don't show up. Is that normal? Should they be in a subprocess of Optimize Parameters?

Maerkli.

rfuentealba

BTW, I don't know why the "Process Documents from Data" operator appears red in my example. I know it is red because there is no such operator with that name, but I do have that operator.

Perhaps @neginz can explain the part that I don't have so that we can compose the project properly, or someone else can point me to the solution? (Not that I'm worried about it, just trying to make it easier for others when they read this answer.)

Screen Shot 2018-07-25 at 13.23.47.png
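One thing that can be checked directly in the XML posted above: both "Process Documents from Data" operators are written with class="dummy" instead of a regular operator class. Why that happened is not settled in this thread, so treat any explanation as a guess. A minimal Python sketch for listing such operators in an exported process follows; the trimmed XML string and the function name are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for an exported process (illustrative only).
PROCESS_XML = """<process version="9.0.000-BETA">
  <operator class="process" name="Process">
    <process>
      <operator class="split_data" name="Split Data"/>
      <operator class="dummy" name="Process Documents from Data"/>
      <operator class="dummy" name="Process Documents from Data (2)"/>
    </process>
  </operator>
</process>"""

def placeholder_operators(xml_text):
    """Return the names of all operators whose class attribute is 'dummy'."""
    root = ET.fromstring(xml_text)
    return [op.get("name") for op in root.iter("operator")
            if op.get("class") == "dummy"]

print(placeholder_operators(PROCESS_XML))
```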

rfuentealba

Hi @Maerkli,

Following the principle of: "A visualization is worth a thousand example sets":

Here is what I see when I open the process.

Screen Shot 2018-07-25 at 13.09.06.png

You see the Optimize Parameters operator? It's selected here:

Screen Shot 2018-07-25 at 13.09.14.png

If I double click on it, it opens the following:

Screen Shot 2018-07-25 at 13.12.18.png

There you go, hope it helps.

HINT: If an operator has a double border, or looks like two operators stacked on top of each other (like the Cross Validation operator, the Optimize Parameters (Grid) operator, and a few others), it is a superoperator, so you can double-click on it and explore its content. In fact, the Subprocess operator is one of these famous superoperators.
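The same nesting is visible in the process XML itself: operators inside a superoperator are written inside that operator's own process element. As a minimal sketch, assuming a heavily trimmed copy of the process (the XML string and the function name are illustrative, not part of RapidMiner), the following Python prints the operator hierarchy as an indented outline:

```python
import xml.etree.ElementTree as ET

# Heavily trimmed stand-in for the process posted earlier (illustrative only).
PROCESS_XML = """<process>
  <operator class="process" name="Process">
    <process>
      <operator class="concurrency:optimize_parameters_grid" name="Optimize Parameters (Grid)">
        <process>
          <operator class="concurrency:cross_validation" name="Cross Validation">
            <process><operator class="k_nn" name="k-NN"/></process>
            <process>
              <operator class="apply_model" name="Apply Model (2)"/>
              <operator class="performance_classification" name="Performance"/>
            </process>
          </operator>
          <operator class="log" name="Log"/>
        </process>
      </operator>
      <operator class="apply_model" name="Apply Model"/>
    </process>
  </operator>
</process>"""

def operator_tree(element, depth=0):
    """Collect operator names, indented by how deeply they are nested."""
    lines = []
    for child in element:
        if child.tag == "operator":
            lines.append("  " * depth + child.get("name"))
            lines.extend(operator_tree(child, depth + 1))
        else:
            # Non-operator wrappers (e.g. <process>) do not add a nesting level.
            lines.extend(operator_tree(child, depth))
    return lines

tree_lines = operator_tree(ET.fromstring(PROCESS_XML))
print("\n".join(tree_lines))
```

In the outline, Cross Validation and Log appear indented under Optimize Parameters (Grid), which is why they are not visible on the top-level canvas.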

Cheers!

rfuentealba

(My last two messages were posted in the wrong order: read the one right before this one first, then the one before that... cc @sgenzer)

neginz

Hi @rfuentealba, sorry for the delay.

I don't know why they are red. Even when I run your XML code, they appear red for me too, even though that was my own process!

BTW, thanks for your help with the Log operator; it works well.

rfuentealba

Hello, @neginz! Glad it helped. I've marked the solutions as accepted, if you don't mind.

I think that the Process Data From Files operator has an error loading here. By any chance, do you use Windows? I tried your process on Mac. By Occam's Razor, I think the culprit is having different filesystem layouts, hence I'll probably try to reproduce it tomorrow and if I can find the problem, submit a bug report.

All the best!

neginz

@rfuentealba

It was the "Process Documents from Data" operator. And yes, I use Windows; maybe that's the problem...