Which learner to use with a particular data type

radone
radone New Altair Community Member
edited November 5 in Community Q&A
Greetings,
Is it possible to run some simple tests on the data and, based on the results, decide which learner will most likely give the best performance? Could anyone recommend some papers (or books) with information on this?

Thanks in advance.
Answers

  • fischer
    fischer New Altair Community Member
    Hi,

    Apart from the information you will find in the literature, let me just mention how you can easily choose the best learning scheme with RapidMiner. Wrap a cross-validation inside a ParameterOptimization, add an OperatorSelector containing various learning operators as the first child of the cross-validation, and let the ParameterOptimization optimize which operator the OperatorSelector selects.

    Cheers,
    Simon
  • jtan
    jtan New Altair Community Member
    Hi Simon,

    Would you be kind enough to explain how this can be done in RM 5.0 beta? Detailed steps would be nice.

    thanks.
  • land
    land New Altair Community Member
    Hi,
    The solution is much the same: instead of an OperatorSelector, use the Select Subprocess operator, which does essentially the same thing. Inside this operator, you can create new subprocesses by clicking the button with the green plus.
    To make this easier to follow, I have pasted an example process below.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="251" width="547">
          <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="target_function" value="sum classification"/>
          </operator>
          <operator activated="true" class="optimize_parameters_grid" expanded="true" height="94" name="Optimize Parameters (Grid)" width="90" x="229" y="29">
            <list key="parameters">
              <parameter key="Select Subprocess.select_which" value="[1.0;3.0;3;linear]"/>
            </list>
            <process expanded="true" height="613" width="932">
              <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="45" y="30">
                <description>A cross-validation evaluating the learner chosen by Select Subprocess.</description>
                <process expanded="true" height="613" width="441">
                  <operator activated="true" class="select_subprocess" expanded="true" height="76" name="Select Subprocess" width="90" x="112" y="30">
                    <parameter key="select_which" value="3"/>
                    <process expanded="true" height="613" width="441">
                      <operator activated="true" class="decision_tree" expanded="true" height="76" name="Decision Tree" width="90" x="128" y="42"/>
                      <connect from_port="input 1" to_op="Decision Tree" to_port="training set"/>
                      <connect from_op="Decision Tree" from_port="model" to_port="output 1"/>
                      <portSpacing port="source_input 1" spacing="0"/>
                      <portSpacing port="source_input 2" spacing="0"/>
                      <portSpacing port="sink_output 1" spacing="0"/>
                      <portSpacing port="sink_output 2" spacing="0"/>
                    </process>
                    <process expanded="true" height="613" width="277">
                      <operator activated="true" class="naive_bayes" expanded="true" height="76" name="Naive Bayes" width="90" x="53" y="37"/>
                      <connect from_port="input 1" to_op="Naive Bayes" to_port="training set"/>
                      <connect from_op="Naive Bayes" from_port="model" to_port="output 1"/>
                      <portSpacing port="source_input 1" spacing="0"/>
                      <portSpacing port="source_input 2" spacing="0"/>
                      <portSpacing port="sink_output 1" spacing="0"/>
                      <portSpacing port="sink_output 2" spacing="0"/>
                    </process>
                    <process expanded="true" height="613" width="441">
                      <operator activated="true" class="support_vector_machine" expanded="true" height="112" name="SVM" width="90" x="117" y="39"/>
                      <connect from_port="input 1" to_op="SVM" to_port="training set"/>
                      <connect from_op="SVM" from_port="model" to_port="output 1"/>
                      <portSpacing port="source_input 1" spacing="0"/>
                      <portSpacing port="source_input 2" spacing="0"/>
                      <portSpacing port="sink_output 1" spacing="0"/>
                      <portSpacing port="sink_output 2" spacing="0"/>
                    </process>
                  </operator>
                  <connect from_port="training" to_op="Select Subprocess" to_port="input 1"/>
                  <connect from_op="Select Subprocess" from_port="output 1" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true" height="613" width="441">
                  <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="180" y="30"/>
                  <connect from_port="model" to_op="Apply Model" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                  <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="input 1" to_op="Validation" to_port="training"/>
              <connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
          <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
          <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Greetings,
      Sebastian
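
For readers without RapidMiner at hand, the selection loop described in this thread (a grid over a learner index, with a cross-validation run for each choice) can be sketched in plain Python. This is only an illustration of the idea, not what RapidMiner does internally. The toy data and the three simple learners (majority class, nearest centroid, 1-nearest-neighbour) are hypothetical stand-ins chosen so that the example needs nothing beyond the standard library; they are not the Decision Tree, Naive Bayes, and SVM operators from the process above.

```python
import random
from collections import Counter


def make_data(n=200, seed=0):
    """Hypothetical toy data: the label depends on the sign of the attribute sum."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.uniform(-10, 10) for _ in range(3)]
        rows.append((x, "positive" if sum(x) > 0 else "negative"))
    return rows


def sq_dist(a, b):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def majority(train):
    """Baseline learner: always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def nearest_centroid(train):
    """Predict the label whose class mean is closest to the example."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    centroids = {
        y: [sum(col) / len(xs) for col in zip(*xs)] for y, xs in groups.items()
    }
    return lambda x: min(centroids, key=lambda y: sq_dist(x, centroids[y]))


def one_nn(train):
    """Predict the label of the single closest training example."""
    return lambda x: min(train, key=lambda row: sq_dist(x, row[0]))[1]


# Plays the role of the numbered subprocesses inside Select Subprocess.
LEARNERS = {
    1: ("majority class", majority),
    2: ("nearest centroid", nearest_centroid),
    3: ("1-NN", one_nn),
}


def cross_validate(learner, rows, folds=10):
    """Return the accuracy of `learner` over a k-fold cross-validation."""
    correct = 0
    for k in range(folds):
        test = rows[k::folds]  # every folds-th row, starting at row k
        train = [r for i, r in enumerate(rows) if i % folds != k]
        model = learner(train)
        correct += sum(model(x) == y for x, y in test)
    return correct / len(rows)


def select_best(rows, learners=LEARNERS):
    """Cross-validate every learner and return (best index, all scores)."""
    scores = {idx: cross_validate(fn, rows) for idx, (_, fn) in learners.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = select_best(make_data())
    for idx, acc in sorted(scores.items()):
        print(idx, LEARNERS[idx][0], round(acc, 3))
    print("selected:", best)
```

Keep in mind that the cross-validated score of the winning learner is an optimistic estimate, because the same folds were used to pick it. If an unbiased performance figure is needed, evaluate the selected learner on data that took no part in the selection.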