A bug in the Normalize operator and in Bugzilla

marcin_blachnik New Altair Community Member
edited November 5 in Community Q&A
The first bug is in Bugzilla itself: when I tried to file a bug, I received the following error:
Software error:

Cannot determine local time zone
For help, please send mail to the webmaster (webmaster@bugs.rapid-i.com), giving this error message and the time and date of the error.
So I decided to report it here, as it looks like an important bug. It concerns the Normalize operator and the independence of its output ports.
The process below demonstrates the problem. It simply loads the data and normalizes it. Only the data received from res0 should be normalized, while the data received from res1 should be the original data, but in fact both outputs are normalized. The funny thing is that when I connect the "ori" output port to a process output, everything works fine.
The process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="53" y="202">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.0.001" expanded="true" height="103" name="Multiply" width="90" x="199" y="197"/>
      <operator activated="true" class="normalize" compatibility="7.0.001" expanded="true" height="103" name="Normalize" width="90" x="351" y="34">
        <parameter key="method" value="range transformation"/>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Normalize" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_port="result 2"/>
      <connect from_op="Normalize" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Best

Marcin

Answers

  • JEdward
    JEdward New Altair Community Member
    That's due to the way RapidMiner handles data to use less memory.
    Normally, when you build a RapidMiner process, each operator is applied as a set of rules on top of the data (for example 'Generate Attributes'). This makes running the process efficient, as only one copy of the data needs to be stored in memory. Multiply doesn't create a new copy of the underlying data; it just splits the stream.

    Some operators, such as Normalize and Obfuscate, change the underlying data in memory. To resolve this, use the Materialize Data operator to create a new exampleset in memory. See below for your example process with the Materialize Data operator added.
    This is pretty much what the "ori" output port does in the Normalize operator.

    You can find an in-depth explanation of how this works in the How to Extend RapidMiner guides, but I agree that RapidMiner itself could make this clearer.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="53" y="202">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="7.0.001" expanded="true" height="103" name="Multiply" width="90" x="199" y="197"/>
          <operator activated="true" class="materialize_data" compatibility="7.0.001" expanded="true" height="82" name="Materialize Data" width="90" x="246" y="34"/>
          <operator activated="true" class="normalize" compatibility="7.0.001" expanded="true" height="103" name="Normalize" width="90" x="380" y="34">
            <parameter key="method" value="range transformation"/>
          </operator>
          <connect from_op="Retrieve Iris" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Materialize Data" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_port="result 2"/>
          <connect from_op="Materialize Data" from_port="example set output" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Normalize" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="168"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
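The explanation above (Multiply shares the underlying data, Normalize changes it in memory, Materialize Data creates a fresh copy) can be illustrated with a toy model. This is only a sketch in Python, not RapidMiner code: the names `ExampleTable`, `multiply`, `materialize` and `normalize_in_place` are hypothetical stand-ins chosen for illustration, and the data values are made up.

```python
import copy


class ExampleTable:
    """Toy stand-in for an in-memory data table (hypothetical, not a RapidMiner class)."""

    def __init__(self, columns):
        self.columns = columns  # dict: attribute name -> list of values


def multiply(table, n=2):
    # Hands out n references to the SAME table; no data is copied.
    return [table] * n


def materialize(table):
    # Builds an independent deep copy of the data.
    return ExampleTable(copy.deepcopy(table.columns))


def normalize_in_place(table):
    # Range transformation to [0, 1], overwriting the stored values.
    for name, values in table.columns.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        table.columns[name] = [(v - lo) / span if span else 0.0 for v in values]
    return table


def demo():
    # Without materialization: the second branch sees the normalized values too.
    shared = ExampleTable({"a1": [4.0, 6.0, 8.0]})
    out1, out2 = multiply(shared)
    normalize_in_place(out1)
    aliased = out2.columns["a1"]

    # With materialization: the second branch keeps the original values.
    fresh = ExampleTable({"a1": [4.0, 6.0, 8.0]})
    out1, out2 = multiply(fresh)
    normalize_in_place(materialize(out1))
    independent = out2.columns["a1"]
    return aliased, independent
```

In this sketch, `demo()` returns the normalized values for the aliased branch and the untouched values for the branch that was normalized on a materialized copy, which mirrors the difference between the two processes in this thread.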
  • marcin_blachnik
    marcin_blachnik New Altair Community Member
    Thank you for your answer, but it is a bug.
    My process (independence of outputs) works perfectly in RapidMiner 5 and RapidMiner 6; it just doesn't work in RapidMiner 7.
    There is a rule that whenever an operator modifies data, it creates a new attribute in the exampleTable and switches the exampleSet so that it references the new, modified attribute (a view switch).
    My guess is that the developer didn't clone the exampleSet before applying the modifications, so all modifications propagate to every copy of the dataset.
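The rule described in this post (a modifying operator adds a new column to the table and switches a cloned view to it) can be sketched as follows. This is a toy Python model based only on the description above, not on RapidMiner's source; the class and function names are hypothetical and the values are made up.

```python
class ExampleTable:
    """Toy column store; columns are only appended, never overwritten (hypothetical)."""

    def __init__(self):
        self.columns = []

    def add_column(self, values):
        self.columns.append(list(values))
        return len(self.columns) - 1  # index of the new column


class ExampleSet:
    """A view that maps attribute names to column indices of a shared table."""

    def __init__(self, table, mapping):
        self.table = table
        self.mapping = dict(mapping)

    def clone(self):
        # The clone shares the table but owns its name -> column mapping.
        return ExampleSet(self.table, self.mapping)

    def values(self, name):
        return self.table.columns[self.mapping[name]]


def normalize_view(example_set, name):
    # Clone the view first, then write the result into a NEW column.
    result = example_set.clone()
    old = example_set.values(name)
    lo, hi = min(old), max(old)
    span = hi - lo
    new_index = result.table.add_column(
        [(v - lo) / span if span else 0.0 for v in old]
    )
    # Switch only the cloned view to the new column.
    result.mapping[name] = new_index
    return result


table = ExampleTable()
original = ExampleSet(table, {"a1": table.add_column([1.0, 3.0, 5.0])})
normalized = normalize_view(original, "a1")
```

Under this scheme `original.values("a1")` still returns the unmodified values while `normalized.values("a1")` returns the scaled ones, and the table holds both columns. Skipping the `clone()` step, or overwriting the column in place, would make the change visible through every view, which is the behaviour guessed at above.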
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hey,

    do you know about Materialize Data?

    ~Martin
  • marcin_blachnik
    marcin_blachnik New Altair Community Member
    Well,

    I know what Materialize Data is and how to work around this problem for my own purposes,
    but please explain to me:
    1) Why does this issue appear in RM 7.0, while in RM 6.5 everything works fine? Does it mean that RM 7 is not compatible with RM 6.5 and that processes have to be rechecked?
    2) Why is there no line in the help that mentions this change or the need for materialization?
    3) Why does the Normalize operator behave differently depending on how its output ports are connected? It works fine when the "ori" output is connected somewhere: the data on the input side is then not modified (in other words, the data on the output of the "Multiply" operator is not normalized either). But when the "ori" output is not connected, Normalize also modifies the exampleset on the input side.

    To show what I'm talking about in 3), take a look at this process:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="69" y="201">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="7.0.001" expanded="true" height="103" name="Multiply" width="90" x="216" y="202"/>
          <operator activated="true" class="normalize" compatibility="7.0.001" expanded="true" height="103" name="Normalize" width="90" x="383" y="135">
            <parameter key="method" value="range transformation"/>
          </operator>
          <operator activated="true" class="parallel_decision_tree" compatibility="7.0.001" expanded="true" height="82" name="Decision Tree" width="90" x="386" y="259"/>
          <connect from_op="Retrieve Iris" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Normalize" from_port="example set output" to_port="result 3"/>
          <connect from_op="Normalize" from_port="original" to_port="result 1"/>
          <connect from_op="Decision Tree" from_port="model" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
    Now everything works perfectly and the tree has the correct values on its edges, but when you remove the topmost process output (the connection between "ori" and the process output), you obtain a completely different tree. Even funnier, this issue doesn't appear with the Golf dataset; it is only a problem with Iris (both taken from the "Samples" repository). I guess this is because in Iris all regular features are numeric.
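To make the effect on the tree concrete, here is a small Python sketch of a range transformation and of how a numeric split value moves when the training data has been silently normalized. The sample values are made up (not taken from Iris), and the midpoint split is a simplifying assumption for illustration, not a description of how the Decision Tree operator chooses its thresholds.

```python
def range_transform(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no range; map everything to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def midpoint_split(values, left_count):
    """Midpoint between the last value on the left side and the first on the right."""
    ordered = sorted(values)
    return (ordered[left_count - 1] + ordered[left_count]) / 2


raw = [1.0, 2.0, 4.0, 5.0]  # made-up attribute values
scaled = range_transform(raw)

raw_split = midpoint_split(raw, 2)        # threshold on the original scale
scaled_split = midpoint_split(scaled, 2)  # threshold on the [0, 1] scale
```

Here the same separation of the data is expressed as a threshold of 3.0 on the original scale but 0.5 after range transformation, so a tree trained on data that was normalized behind the scenes shows, at the very least, different values on its edges.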