It's important to keep both labels as special roles (you can name them however you like), and before using a learner, just set the role of the label you want to predict to "label". After the Apply Model operator, set the prediction attribute to another special role, because otherwise it might be discarded by the second model. I hope the process helps you understand.
2) Both preprocessing steps make sense. Remember that normalization creates a preprocessing model that needs to be grouped with the learner model. I'm not sure if you have to remove correlated attributes, but there is an operator for that.
3) Both split validation and cross validation can be used. The data will be split into training and test sets accordingly, and the performance will be measured/validated over all splits. If you connect the model output port, it will generate the model over all examples.
6) Usually, models can be stored and used in another process, as long as the new data has the same format for the regular attributes as the data you trained on. This should not be different for the DL model (the sketch after this list shows the same idea in code).
7) Yes, you can skip setting the roles during the configuration and just set them with the corresponding operator. Remember that up until then, all attributes are considered regular and will be used as such, e.g. when using a learner.
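If it helps to see point 6) outside of RapidMiner, here is a rough sketch of the same idea in Python with scikit-learn and joblib: train with the normalization kept together with the learner, store the model, and then score new data that has the same regular attributes. This is only an analogy; the file names, column names and choice of learner are placeholders, not anything from your process.

```python
# Illustrative sketch only: store a trained model and reuse it on new data
# that has the same regular attributes as the training data.
import joblib
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

train = pd.read_csv("training_data.csv")      # hypothetical file
X_train = train.drop(columns=["target"])      # regular attributes
y_train = train["target"]                     # the attribute with the "label" role

# Keep normalization and learner together, like grouping the
# preprocessing model with the learner model.
model = Pipeline([("normalize", StandardScaler()),
                  ("learner", MLPRegressor(max_iter=500))])
model.fit(X_train, y_train)

joblib.dump(model, "model.joblib")            # "store" the model

# In another process: load the model and apply it to new data with the
# same columns and types as the training data.
new_data = pd.read_csv("new_data.csv")        # hypothetical file
loaded = joblib.load("model.joblib")
predictions = loaded.predict(new_data)
```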
Thank you for your quick reply. Regarding your points:
2) 'Remember that normalization creates a preprocessing model that needs to be grouped with the learner model' -> Yes, I noticed that certain operators generate a 'preprocessing model'. I am not sure what this is, or what it means that it needs to be grouped with the learner model. Do you mean grouping the Normalize operator followed by the learning model (e.g. Neural Net), both within the training process?
3) I assume Cross Validation would be the more recommended one?
2) Yes, that's what I meant. You should put the normalization in the training process and use the Group Models operator to combine the normalization model and the learner model. This way the test data will be normalized the same way as the training data (the sketch after this reply shows the analogous idea in code).
3) Cross Validation does several validations based on the number of folds you want to run. Split validation can be combined with a Loop operator around it to do several different validations if the sampling method uses a random element (i.e. shuffled or stratified sampling).
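In case a code analogy helps: in scikit-learn terms, the grouped "normalization model + learner model" behaves like a Pipeline, where the scaler is fitted on the training split only and those same statistics are reused on the test split. This is just an illustration with made-up data, not what RapidMiner does internally.

```python
# Analogy only: a Pipeline plays the role of the grouped
# "preprocessing model + learner model".
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grouped = Pipeline([("normalize", StandardScaler()),   # preprocessing "model"
                    ("learner", LinearRegression())])  # learner "model"
grouped.fit(X_train, y_train)         # scaler statistics come from training data only
print(grouped.score(X_test, y_test))  # test data is normalized with the training statistics
```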
1) Just wondering, why should I not add the Normalize operator just before the Cross Validation operator (i.e. not nested within the Training process)?
2) Also, after training the model within the Optimize Parameters (Grid) operator, how can I use the model for prediction on new data? i.e. what do I connect to the new data? I only see 'per', 'par', and 'res' outputs on the Optimize Parameters operator. Where and how do I connect the Apply Model operator?
3) One small question: for a regression task, do I also set the role of my target attribute (which is a numerical value) to 'label'? And I do not select the 'prediction' role?
1) When doing training and testing in a validation operator, you want to put your normalization, or any other operator that affects the training data, inside the validation operator. If you keep it outside, you can leak information (i.e. data snooping) into your test set, which can distort the measured accuracy of your model.
2) Connect the 'res' output port to an Apply Model operator. The model delivered by Optimize Parameters is the optimized model (the sketch below shows both points in code).
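Here is a small scikit-learn sketch of both points, purely as an analogy (the data, learner and parameter grid are made up): the normalization sits inside the validated pipeline so no information leaks from the test folds, and the object returned by the parameter optimization is the optimized model that you then apply to new data.

```python
# Illustrative sketch: leak-free validation plus applying the optimized model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)

# Normalization lives inside the pipeline, so each training fold is
# normalized without seeing its test fold (no data snooping).
pipe = Pipeline([("normalize", StandardScaler()),
                 ("learner", MLPRegressor(max_iter=1000, random_state=0))])

search = GridSearchCV(pipe,
                      param_grid={"learner__hidden_layer_sizes": [(10,), (50,)]},
                      cv=5)
search.fit(X, y)

# The model output of the optimization is the refitted, optimized model,
# which you apply to new data.
optimized_model = search.best_estimator_
new_data = np.random.default_rng(0).normal(size=(5, 8))  # placeholder new examples
print(optimized_model.predict(new_data))
```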
Thank you for the concise clarification, Thomas. I have a few queries about normalization, if you don't mind:
1) I would assume you mean something like the process shown below. If I put the normalization on the training data, would the test data be similarly normalized?
2) What is the purpose of the Model output that is delivered by the Normalize operator? In what situation is it actually used?
3) Assuming my purpose is an NN prediction via Deep Learning, where should I use the De-normalize operator? I would like my output (i.e. the prediction) not to be a normalized value. I noticed that De-normalize is typically connected to the 'pre' output of the Normalize operator, and I am confused by what this does: isn't it simply negating the effect of the Normalize operator? (I have sketched roughly what I mean in code after these questions.)
4) For the Cross Validation, does the 'mod' output from the Apply Model operator need to be connected to something (I am not sure what) in order for the Cross Validation to deliver its 'mod' output?
5) In the Deep Learning operator, there is the option to 'standardize' -> could this be an inbuilt normalization parameter?
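To make question 3) more concrete, this is roughly what I mean, expressed as a scikit-learn analogy rather than my actual RapidMiner process (the data, scaler and learner are placeholders): the label is normalized for training, and the prediction is transformed back to the original scale afterwards.

```python
# Analogy for question 3): train on a normalized label, but get the
# prediction back in the original ("de-normalized") units.
from sklearn.datasets import make_regression
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

net = Pipeline([("normalize", StandardScaler()),
                ("learner", MLPRegressor(max_iter=1000, random_state=0))])

# The target is scaled for training; predictions are automatically
# inverse-transformed back to the original scale.
model = TransformedTargetRegressor(regressor=net, transformer=StandardScaler())
model.fit(X, y)
print(model.predict(X[:3]))   # predictions are on the original scale
```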