"Possible Bug: Missing Results"

tanto
tanto New Altair Community Member
edited November 5 in Community Q&A
I'm a bit new to RapidMiner, so I don't want to file an official bug report until I get some community feedback.  Using the Text extension, I've been using the Process Documents from Files operator with success.  However, when I combine it with the Data to Similarity operator, the results perspective stops working.  The log still reports that everything went fine, but no new results appear.

This issue continues even after I remove the similarity operator.  The only way to restore normal functioning is to close RapidMiner and delete the perspective XML files.

Am I doing something wrong, or is this a bug?

Answers

  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    can you provide a process (and if it depends on the data, that as well) so we can reproduce it?
    As a general rule of thumb, if you need to delete a file afterwards to get everything working again, something is not working as intended ;)

    Regards,
    Marco
  • IngoRM
    IngoRM New Altair Community Member
    Hi,

    Just jumping in: another thing that came to mind is a closed result history:

    http://rapid-i.com/rapidforum/index.php/topic,3598.msg13402.html

    Maybe it's simply this...

    Cheers,
    Ingo
  • tanto
    tanto New Altair Community Member
    Here's the process that's giving me trouble.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.017">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
       <process expanded="true" height="206" width="279">
         <operator activated="true" class="text:process_document_from_file" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="75">
           <list key="text_directories">
             <parameter key="bills" value="D:\Bills"/>
           </list>
           <parameter key="file_pattern" value="112~h1*"/>
           <parameter key="add_meta_information" value="false"/>
           <parameter key="keep_text" value="true"/>
           <parameter key="prune_method" value="absolute"/>
           <parameter key="prune_below_absolute" value="2"/>
           <parameter key="prune_above_absolute" value="9999"/>
           <process expanded="true" height="596" width="970">
             <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="112" y="120"/>
             <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="541" y="75"/>
             <connect from_port="document" to_op="Tokenize" to_port="document"/>
             <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
             <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="36"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="data_to_similarity" compatibility="5.1.017" expanded="true" height="76" name="Data to Similarity" width="90" x="263" y="212">
           <parameter key="measure_types" value="NumericalMeasures"/>
           <parameter key="numerical_measure" value="DiceSimilarity"/>
         </operator>
         <connect from_op="Process Documents from Files" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
         <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    It appears to be the design perspective XML file that needs to be deleted.  Here it is before (working):
    <?xml version="1.0"?>
    <VLDocking version="2.1">
    <DockingDesktop name="default">
    <DockingPanel>
    <Split orientation="1" location="0.7996044825313118">
    <Split orientation="1" location="0.24979321753515302">
    <Split orientation="0" location="0.19928400954653938">
    <Dockable>
    <Key dockName="overview"/>
    </Dockable>
    <TabbedDockable>
    <Dockable>
    <Key dockName="new_operator"/>
    </Dockable>
    <Dockable>
    <Key dockName="repository_browser"/>
    </Dockable>
    </TabbedDockable>
    </Split>
    <Split orientation="0" location="0.7995226730310262">
    <TabbedDockable>
    <Dockable>
    <Key dockName="process_panel"/>
    </Dockable>
    <Dockable>
    <Key dockName="xml_editor"/>
    </Dockable>
    </TabbedDockable>
    <TabbedDockable>
    <Dockable>
    <Key dockName="error_table"/>
    </Dockable>
    <Dockable>
    <Key dockName="log_viewer"/>
    </Dockable>
    </TabbedDockable>
    </Split>
    </Split>
    <Split orientation="0" location="0.6599045346062052">
    <Dockable>
    <Key dockName="property_editor"/>
    </Dockable>
    <TabbedDockable>
    <Dockable>
    <Key dockName="operator_help"/>
    </Dockable>
    <Dockable>
    <Key dockName="comment_editor"/>
    </Dockable>
    </TabbedDockable>
    </Split>
    </Split>
    </DockingPanel>
    <TabGroups>
    <TabGroup>
    <Dockable>
    <Key dockName="new_operator"/>
    </Dockable>
    <Dockable>
    <Key dockName="repository_browser"/>
    </Dockable>
    <Dockable>
    <Key dockName="repository_browser"/>
    </Dockable>
    </TabGroup>
    <TabGroup>
    <Dockable>
    <Key dockName="operator_help"/>
    </Dockable>
    <Dockable>
    <Key dockName="comment_editor"/>
    </Dockable>
    <Dockable>
    <Key dockName="comment_editor"/>
    </Dockable>
    </TabGroup>
    <TabGroup>
    <Dockable>
    <Key dockName="error_table"/>
    </Dockable>
    <Dockable>
    <Key dockName="log_viewer"/>
    </Dockable>
    <Dockable>
    <Key dockName="log_viewer"/>
    </Dockable>
    </TabGroup>
    <TabGroup>
    <Dockable>
    <Key dockName="process_panel"/>
    </Dockable>
    <Dockable>
    <Key dockName="xml_editor"/>
    </Dockable>
    <Dockable>
    <Key dockName="xml_editor"/>
    </Dockable>
    </TabGroup>
    </TabGroups>
    </DockingDesktop>
    </VLDocking>
    Here it is after (not working):
    <?xml version="1.0"?>
    <VLDocking version="2.1">
    <DockingDesktop name="default">
    <DockingPanel>
    <Split orientation="1" location="0.7996044825313118">
    <Split orientation="1" location="0.24979321753515302">
    <Split orientation="0" location="0.19928400954653938">
    <Dockable>
    <Key dockName="overview"/>
    </Dockable>
    <TabbedDockable>
    <Dockable>
    <Key dockName="new_operator"/>
    </Dockable>
    <Dockable>
    <Key dockName="repository_browser"/>
    </Dockable>
    </TabbedDockable>
    </Split>
    <Split orientation="0" location="0.7995226730310262">
    <TabbedDockable>
    <Dockable>
    <Key dockName="process_panel"/>
    </Dockable>
    <Dockable>
    <Key dockName="xml_editor"/>
    </Dockable>
    </TabbedDockable>
    <TabbedDockable>
    <Dockable>
    <Key dockName="error_table"/>
    </Dockable>
    <Dockable>
    <Key dockName="log_viewer"/>
    </Dockable>
    </TabbedDockable>
    </Split>
    </Split>
    <Split orientation="0" location="0.6599045346062052">
    <Dockable>
    <Key dockName="property_editor"/>
    </Dockable>
    <TabbedDockable>
    <Dockable>
    <Key dockName="operator_help"/>
    </Dockable>
    <Dockable>
    <Key dockName="comment_editor"/>
    </Dockable>
    </TabbedDockable>
    </Split>
    </Split>
    </DockingPanel>
    <TabGroups>
    <TabGroup>
    <Dockable>
    <Key dockName="new_operator"/>
    </Dockable>
    <Dockable>
    <Key dockName="repository_browser"/>
    </Dockable>
    <Dockable>
    <Key dockName="repository_browser"/>
    </Dockable>
    <Dockable>
    <Key dockName="repository_browser"/>
    </Dockable>
    </TabGroup>
    <TabGroup>
    <Dockable>
    <Key dockName="operator_help"/>
    </Dockable>
    <Dockable>
    <Key dockName="comment_editor"/>
    </Dockable>
    <Dockable>
    <Key dockName="comment_editor"/>
    </Dockable>
    <Dockable>
    <Key dockName="comment_editor"/>
    </Dockable>
    </TabGroup>
    <TabGroup>
    <Dockable>
    <Key dockName="error_table"/>
    </Dockable>
    <Dockable>
    <Key dockName="log_viewer"/>
    </Dockable>
    <Dockable>
    <Key dockName="log_viewer"/>
    </Dockable>
    <Dockable>
    <Key dockName="log_viewer"/>
    </Dockable>
    </TabGroup>
    <TabGroup>
    <Dockable>
    <Key dockName="process_panel"/>
    </Dockable>
    <Dockable>
    <Key dockName="xml_editor"/>
    </Dockable>
    <Dockable>
    <Key dockName="xml_editor"/>
    </Dockable>
    <Dockable>
    <Key dockName="xml_editor"/>
    </Dockable>
    </TabGroup>
    </TabGroups>
    </DockingDesktop>
    </VLDocking>
  • tanto
    tanto New Altair Community Member
    Here's a link to a tarball of the input text that I've been using for testing.

    http://www.mediafire.com/?6t86rwieaw5b12d

    Also, this problem was replicable on another computer (Amazon EC2 instance).
  • IngoRM
    IngoRM New Altair Community Member
    Hi tanto,

    Thanks for the process and data, this really helps to find the problem.

    I can, at least partially, replicate your problem. The cause, however, is not a result display broken by the similarity operator but simply a very long runtime for creating the display of the similarity. On my computer, it took about 40 minutes until the tab for the similarity object was created, and another 50 minutes until the message "Please standby while the display is created..." vanished and the results finally appeared.

    You can easily try this yourself:
    • Use your process and text data, but change the parameter "prune_below_absolute" to 200 and "prune_above_absolute" to 250: it will take about 10 seconds until the tab is created and another 10 seconds until the display creation has finished. The number of created terms is about 100.
    • Now change the parameter "prune_above_absolute" to 500: it will now take about 25 seconds until the tab is created and another 40 seconds until the display creation has finished. The number of created terms with these pruning settings is about 250.
    • You can repeat this while slightly increasing the setting each time: check the number of created terms and the increase in time. With your pruning settings, you ended up with more than 13000 terms, which causes the long display creation times mentioned above...
    So the result will actually be created, but it simply takes more than an hour. During this time, one of my computer's CPUs was at 100% load, so RapidMiner really had some calculations to do. This is not much of a problem if the similarity is used for additional calculations in the rest of an automated process, but it is certainly not much fun for an interactive exploration of the similarities  :D
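
    For reference, the first step of the small test above should amount to changing just two parameter values on the Process Documents from Files operator of the posted process. A sketch of the relevant fragment, assuming everything else stays as posted:

    ```xml
    <!-- Inside the "Process Documents from Files" operator of the posted process -->
    <parameter key="prune_method" value="absolute"/>
    <!-- was 2 in the posted process -->
    <parameter key="prune_below_absolute" value="200"/>
    <!-- was 9999 in the posted process -->
    <parameter key="prune_above_absolute" value="250"/>
    ```

    For the second step, only "prune_above_absolute" changes, to 500.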

    Interesting observation: the number of examples (about 1000) was a smaller problem than the number of attributes. I did not actually expect this, since the number of attributes should contribute only linearly to the necessary runtime for most of the similarity / distance measures. I will think about that and discuss it with the others.

    So this is not really a bug, but it may be a chance for a performance improvement in the creation of the similarity viewer (if you like, you can still file a report in our bugtracker at http://bugs.rapid-i.com as a feature request and add a link to this conversation). For now, you have several options, such as stronger pruning / filtering / stemming and other approaches that help to reduce the number of features. If you do not want to look at the similarities themselves but simply use them in the rest of the process, I would recommend filtering down the number of attributes during process design, as in the small test above, and removing the filter once the full process has been designed.
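
    As one illustration of the filtering / stemming option, the inner subprocess of Process Documents from Files could be extended with a stopword filter and a stemmer. This is only a sketch under assumptions: the two added operators (assumed here to have the class names text:filter_stopwords_english and text:stem_porter in the Text extension) are not part of the posted process, their compatibility version is simply copied from the other Text operators, and the layout attributes are omitted for brevity.

    ```xml
    <process expanded="true">
      <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize"/>
      <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases"/>
      <!-- Assumed additions, not in the posted process -->
      <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (English)"/>
      <operator activated="true" class="text:stem_porter" compatibility="5.1.004" expanded="true" name="Stem (Porter)"/>
      <connect from_port="document" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
      <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
      <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
      <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
    </process>
    ```

    Both steps should shrink the number of distinct terms before pruning is applied; how much they help depends on the texts.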

    Cheers,
    Ingo
  • tanto
    tanto New Altair Community Member
    Thank you very much!  That's solved a lot of my confusion and headaches.

    On a related note, is there a maximum limit to the size of an ExampleSet?  Using a larger data input with the Data to Similarity operator, I'm getting negative 637040551 examples in the result set.