How to use RapidMiner in production?

fridental
fridental New Altair Community Member
edited November 5 in Community Q&A
Hi there!

After training a classifier model, I want to set up an automated process that gets production data once a day, runs the classifier and stores the predictions. I would like to use CSV as the import/export format. But after getting the first batch of production data, I've stumbled upon the warning "The internal nominal mappings are not the same between training and application for attribute XXX", and the model cannot be applied.

I've found this topic somewhat explaining what is happening: http://rapid-i.com/rapidforum/index.php?topic=77.0

But first, I don't want to fix the *.aml files manually, because the model application process should be automated, and second, I haven't found any *.aml files in RapidMiner Studio 6.0.

What is the intended solution to use RapidMiner Studio in a Big Data production system? I cannot use the "Read Database" operator as my ETL logic cannot be expressed in terms of a single SQL query. Does the Server edition have the same problem?

Regards,
Maxim

Answers

  • fridental
    fridental New Altair Community Member
    I'm currently considering RapidMiner as the main tool for our machine learning activities, but unless it can be used in production, it doesn't stand a chance. I'm also a little confused: what's the point of training a machine learning model if it cannot be used on another dataset? Could it be that I have a very specific problem, or am I missing some obvious solution?

    Using Python for production is another option, but re-implementing models learned in RapidMiner in Python seems like a bad idea, because not all algorithms are available, and some algorithms may be implemented differently. So now I'm considering doing all of my machine learning work in Python, even though it doesn't have such a comfortable GUI...



  • JEdward
    JEdward New Altair Community Member
    It's not really clear what you want to do. 

    What model types have you implemented? 
    - Is there any XML you can share with us? It's possible that you've made a small error and someone on the forum could help spot it. 

    How exactly are you deploying the system? 
    - Are you taking a CSV file into RapidMiner, applying a model and then exporting out scored results?  That should be pretty straightforward.  Using RapidAnalytics you can do this by setting up a WebService so your system can do the entire process automatically. 
    - You mention Big Data, so do you mean Hadoop etc. rather than an SQL database?  In that case you need the Big Data version of RapidMiner, so you're better off talking directly with them for advice, particularly as you get paid support with that product.  (If you're paying for it, you may as well use it, right?  ;) )

    Any more information would be appreciated. 
    Cheers,
    JEdward.
  • fridental
    fridental New Altair Community Member
    Hello JEdward,

    thank you for your reply.

    I have trained a Naive Bayes (Kernel) model using this process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.003" expanded="true" name="Process">
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
            <parameter key="csv_file" value="C:\tmp\rapidminer\training_data.csv"/>
            <parameter key="decimal_character" value=","/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="a.true.integer.id"/>
              <parameter key="1" value="b.true.binominal.attribute"/>
              <parameter key="2" value="c.true.binominal.attribute"/>
              <parameter key="3" value="d.true.integer.attribute"/>
              <parameter key="4" value="e.true.integer.attribute"/>
              <parameter key="5" value="f.true.integer.attribute"/>
              <parameter key="6" value="g.true.binominal.attribute"/>
              <parameter key="7" value="h.true.binominal.attribute"/>
              <parameter key="8" value="i.true.binominal.attribute"/>
              <parameter key="9" value="j.true.binominal.attribute"/>
              <parameter key="10" value="k.true.real.attribute"/>
              <parameter key="11" value="l.true.real.attribute"/>
              <parameter key="12" value="m.true.real.attribute"/>
              <parameter key="13" value="n.true.integer.attribute"/>
              <parameter key="14" value="o.true.real.attribute"/>
              <parameter key="15" value="p.true.integer.attribute"/>
              <parameter key="16" value="q.true.integer.attribute"/>
              <parameter key="17" value="r.true.integer.attribute"/>
              <parameter key="18" value="class.true.binominal.label"/>
            </list>
          </operator>
          <operator activated="true" class="normalize" compatibility="6.0.003" expanded="true" height="94" name="Normalize" width="90" x="179" y="30"/>
          <operator activated="true" class="store" compatibility="6.0.003" expanded="true" height="60" name="Store" width="90" x="313" y="165">
            <parameter key="repository_entry" value="../data/Rapidminer normalization"/>
          </operator>
          <operator activated="true" class="x_validation" compatibility="6.0.003" expanded="true" height="112" name="Validation" width="90" x="447" y="30">
            <parameter key="number_of_validations" value="100"/>
            <process expanded="true">
              <operator activated="true" class="naive_bayes_kernel" compatibility="6.0.003" expanded="true" height="76" name="Naive Bayes (Kernel)" width="90" x="112" y="30">
                <parameter key="minimum_bandwidth" value="0.01"/>
              </operator>
              <connect from_port="training" to_op="Naive Bayes (Kernel)" to_port="training set"/>
              <connect from_op="Naive Bayes (Kernel)" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="6.0.003" expanded="true" height="76" name="Apply Model" width="90" x="112" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_binominal_classification" compatibility="6.0.003" expanded="true" height="76" name="Performance" width="90" x="313" y="30">
                <parameter key="main_criterion" value="recall"/>
                <parameter key="accuracy" value="false"/>
                <parameter key="precision" value="true"/>
                <parameter key="recall" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="store" compatibility="6.0.003" expanded="true" height="60" name="Store (2)" width="90" x="648" y="165">
            <parameter key="repository_entry" value="../data/Rapidminer Bayes model"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Normalize" from_port="example set output" to_op="Validation" to_port="training"/>
          <connect from_op="Normalize" from_port="preprocessing model" to_op="Store" to_port="input"/>
          <connect from_op="Validation" from_port="model" to_op="Store (2)" to_port="input"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
          <connect from_op="Store (2)" from_port="through" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    and I am now trying to apply the model to the live data using this process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.003" expanded="true" name="Process">
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="60" name="Live Data" width="90" x="45" y="300">
            <parameter key="csv_file" value="C:\tmp\rapidminer\live_data.csv"/>
            <parameter key="decimal_character" value=","/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="a.true.integer.id"/>
              <parameter key="1" value="b.true.binominal.attribute"/>
              <parameter key="2" value="c.true.binominal.attribute"/>
              <parameter key="3" value="d.true.integer.attribute"/>
              <parameter key="4" value="e.true.integer.attribute"/>
              <parameter key="5" value="f.true.integer.attribute"/>
              <parameter key="6" value="g.true.binominal.attribute"/>
              <parameter key="7" value="h.true.binominal.attribute"/>
              <parameter key="8" value="i.true.binominal.attribute"/>
              <parameter key="9" value="j.true.binominal.attribute"/>
              <parameter key="10" value="k.true.real.attribute"/>
              <parameter key="11" value="l.true.real.attribute"/>
              <parameter key="12" value="m.true.real.attribute"/>
              <parameter key="13" value="n.true.integer.attribute"/>
              <parameter key="14" value="o.true.real.attribute"/>
              <parameter key="15" value="p.true.integer.attribute"/>
              <parameter key="16" value="q.true.integer.attribute"/>
              <parameter key="17" value="r.true.integer.attribute"/>
              <parameter key="18" value="class.true.binominal.label"/>
            </list>
          </operator>
          <operator activated="true" class="retrieve" compatibility="6.0.003" expanded="true" height="60" name="Normalization model" width="90" x="45" y="165">
            <parameter key="repository_entry" value="//Local Repository/data/Rapidminer normalization"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="6.0.003" expanded="true" height="60" name="Prediction model" width="90" x="246" y="30">
            <parameter key="repository_entry" value="../data/Rapidminer Bayes model"/>
          </operator>
          <operator activated="false" class="read_csv" compatibility="6.0.003" expanded="true" height="60" name="Training Set" width="90" x="45" y="390">
            <parameter key="csv_file" value="C:\tmp\rapidminer\training_data.csv"/>
            <parameter key="decimal_character" value=","/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="a.true.integer.id"/>
              <parameter key="1" value="b.true.binominal.attribute"/>
              <parameter key="2" value="c.true.binominal.attribute"/>
              <parameter key="3" value="d.true.integer.attribute"/>
              <parameter key="4" value="e.true.integer.attribute"/>
              <parameter key="5" value="f.true.integer.attribute"/>
              <parameter key="6" value="g.true.binominal.attribute"/>
              <parameter key="7" value="h.true.binominal.attribute"/>
              <parameter key="8" value="i.true.binominal.attribute"/>
              <parameter key="9" value="j.true.binominal.attribute"/>
              <parameter key="10" value="k.true.real.attribute"/>
              <parameter key="11" value="l.true.real.attribute"/>
              <parameter key="12" value="m.true.real.attribute"/>
              <parameter key="13" value="n.true.integer.attribute"/>
              <parameter key="14" value="o.true.real.attribute"/>
              <parameter key="15" value="p.true.integer.attribute"/>
              <parameter key="16" value="q.true.integer.attribute"/>
              <parameter key="17" value="r.true.integer.attribute"/>
              <parameter key="18" value="class.true.binominal.label"/>
            </list>
          </operator>
          <operator activated="true" class="apply_model" compatibility="6.0.003" expanded="true" height="76" name="Normalization" width="90" x="246" y="210">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="6.0.003" expanded="true" height="76" name="Prediction" width="90" x="380" y="120">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="6.0.003" expanded="true" height="94" name="Multiply" width="90" x="514" y="120"/>
          <operator activated="true" class="performance_binominal_classification" compatibility="6.0.003" expanded="true" height="76" name="Performance" width="90" x="715" y="165">
            <parameter key="main_criterion" value="precision"/>
            <parameter key="accuracy" value="false"/>
            <parameter key="precision" value="true"/>
            <parameter key="recall" value="true"/>
          </operator>
          <connect from_op="Live Data" from_port="output" to_op="Normalization" to_port="unlabelled data"/>
          <connect from_op="Normalization model" from_port="output" to_op="Normalization" to_port="model"/>
          <connect from_op="Prediction model" from_port="output" to_op="Prediction" to_port="model"/>
          <connect from_op="Normalization" from_port="labelled data" to_op="Prediction" to_port="unlabelled data"/>
          <connect from_op="Prediction" from_port="labelled data" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_port="result 1"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Here is my training data: https://www.dropbox.com/s/1a5ylc09kxsz9te/training_data.csv
    The models I'm storing in the repository are here: https://www.dropbox.com/s/sooclcwkfjqves0/Rapidminer%20models.zip
    When applying this live data https://www.dropbox.com/s/xu7dku0gmktt74t/live_data.csv , I'm getting the following warnings:

    Jun 23, 2014 10:23:07 AM WARNING: KernelDistribution: The number of nominal values is not the same for training and application for attribute 'b', training: 2, application: 1
    Jun 23, 2014 10:23:07 AM WARNING: KernelDistribution: The internal nominal mappings are not the same between training and application for attribute 'c'. This will probably lead to wrong results during model application.
    Jun 23, 2014 10:23:07 AM WARNING: KernelDistribution: The number of nominal values is not the same for training and application for attribute 'h', training: 2, application: 1
    Jun 23, 2014 10:23:07 AM WARNING: KernelDistribution: The number of nominal values is not the same for training and application for attribute 'i', training: 2, application: 1
    and the model produces no prediction.
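
The two kinds of warning above can be reproduced with a small Python sketch. It is only an illustration, and it rests on an assumption that is not confirmed in this thread: that each nominal value receives its internal index in the order in which it is first read from the file. The helper names and the "yes"/"no" and "0"/"1" values are made up for the example.

```python
def build_mapping(values):
    """Give each distinct value an index in order of first appearance (assumed behaviour)."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def compare_mappings(attribute, training_map, live_map):
    """Report the same two kinds of mismatch that appear in the log."""
    if len(training_map) != len(live_map):
        return (f"{attribute}: number of nominal values differs, "
                f"training: {len(training_map)}, application: {len(live_map)}")
    if training_map != live_map:
        return f"{attribute}: internal nominal mappings differ"
    return f"{attribute}: ok"


# Hypothetical column values: the live batch of 'b' only ever contains one value,
# and the live batch of 'c' happens to start with the other value.
map_b_train = build_mapping(["0", "1", "0"])
map_b_live = build_mapping(["0", "0", "0"])
map_c_train = build_mapping(["no", "yes", "no"])
map_c_live = build_mapping(["yes", "no", "yes"])

print(compare_mappings("b", map_b_train, map_b_live))
print(compare_mappings("c", map_c_train, map_c_live))
```

Under that assumption, a daily batch can trigger either warning simply because of which values it contains and in what order, even when the column definitions are identical.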

    Thanks for reading :)

    Best,
    Maxim
  • fridental
    fridental New Altair Community Member
    Just to inform the community: I've now reimplemented the processes in Python and plan to use Python in production. I plan to start future projects directly in Python, because I will then be able to reuse the ETL and data cleansing parts.

    I'm still using RapidMiner Studio to learn about other ML algorithms, but currently I don't see any value in it beyond being a learning tool.
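
For a daily batch job in Python, the structure of the RapidMiner processes above carries over: normalization parameters are computed once from the training data, stored, and only applied to each live batch. The following is a minimal standard-library sketch of that idea, not the code actually used in this thread. It assumes z-score normalization with the population standard deviation, a ";" column separator and a decimal comma; the function names and the tiny in-memory CSV contents are hypothetical.

```python
import csv
import io
import json
import statistics


def parse_decimal(text):
    """Parse a number written with a decimal comma, e.g. '3,5'."""
    return float(text.replace(",", "."))


def read_rows(text):
    """Read ';'-separated CSV text whose first row holds the column names."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def fit_zscore(rows, columns):
    """Compute mean and standard deviation per column from the training rows."""
    params = {}
    for column in columns:
        values = [parse_decimal(row[column]) for row in rows]
        std = statistics.pstdev(values)
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        params[column] = {"mean": statistics.fmean(values), "std": std or 1.0}
    return params


def apply_zscore(rows, params):
    """Scale live rows with the stored training parameters (never refit on live data)."""
    scaled_rows = []
    for row in rows:
        scaled = dict(row)
        for column, p in params.items():
            scaled[column] = (parse_decimal(row[column]) - p["mean"]) / p["std"]
        scaled_rows.append(scaled)
    return scaled_rows


# Made-up stand-ins for training_data.csv and live_data.csv.
TRAINING_CSV = "a;k\n1;1,0\n2;3,0\n"
LIVE_CSV = "a;k\n3;2,0\n4;4,0\n"

params = fit_zscore(read_rows(TRAINING_CSV), ["k"])
stored = json.dumps(params)  # what a daily job could persist next to the model
live_scaled = apply_zscore(read_rows(LIVE_CSV), json.loads(stored))
```

Keeping the fitted parameters in a small JSON file next to the model means the scheduled job only needs the live CSV as input.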
  • Hello fridental,

    There's an operator called "Add" that is used to declare nominal values that occur in the training data but not in the test data. There's also the Remap Binominals operator, which can be used to change which values map to true and false. These will eliminate the errors you are seeing in the log file.

    Regards

    Andrew
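
The idea behind both suggestions, expressed outside RapidMiner, is to encode every live batch with the value-to-index mapping fixed at training time instead of deriving a new mapping from each batch. A minimal Python sketch of that idea follows; it does not reproduce what the RapidMiner operators do internally, and the mapping, values and function name are hypothetical.

```python
# Mapping fixed at training time (hypothetical values for a binominal attribute).
TRAINING_MAPPING = {"no": 0, "yes": 1}


def encode_with_training_mapping(values, training_mapping):
    """Encode live values with the training mapping; fail loudly on unseen values."""
    encoded = []
    for value in values:
        if value not in training_mapping:
            raise ValueError(f"value {value!r} was not seen during training")
        encoded.append(training_mapping[value])
    return encoded


# A live batch that contains only one of the two values, in any order,
# still gets the same indices as during training.
live_batch = ["yes", "yes", "yes"]
encoded_batch = encode_with_training_mapping(live_batch, TRAINING_MAPPING)
print(encoded_batch)
```

Raising an error on an unseen value is a deliberate choice in this sketch: in an unattended daily job, a hard failure is easier to notice than silently wrong predictions.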