RapidMiner gives different results on 32-bit and 64-bit machines

subhasisdasgupt New Altair Community Member
edited November 5 in Community Q&A
I was using RapidMiner V5.2.002 on a 32-bit machine and am now using V5.2.008 on a 64-bit machine. Running the same process in these two setups gives very different results, and I am not sure which one to trust. I am analyzing reviews of the Samsung Galaxy S3 through text mining, using the X-Means operator to cluster the documents. With everything else the same, the 32-bit machine gave two clusters with 141 and 60 documents in cluster 0 and cluster 1 respectively, while the 64-bit machine gave two clusters with 197 and 4 documents in cluster 0 and cluster 1 respectively. I don't know whether this is a bug, but I am very confused. Kindly help. The XML file (V5.2.002, 32-bit machine) is given below.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.002" expanded="true" name="Process">
    <process expanded="true" height="1016" width="413">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="Samsung" value="D:\Subhasis\text mining\Samsung G3 Review"/>
        </list>
        <parameter key="keep_text" value="true"/>
        <process expanded="true" height="418" width="480">
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="45" y="165"/>
          <operator activated="false" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="179" y="255"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="246" y="165">
            <parameter key="condition" value="equals"/>
            <parameter key="string" value="i"/>
            <parameter key="regular_expression" value="( i )"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" breakpoints="after" class="x_means" compatibility="5.2.002" expanded="true" height="76" name="X-Means" width="90" x="45" y="120">
        <parameter key="determine_good_start_values" value="true"/>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="X-Means" to_port="example set"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 1"/>
      <connect from_op="X-Means" from_port="cluster model" to_port="result 2"/>
      <connect from_op="X-Means" from_port="clustered set" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • MariusHelf New Altair Community Member
    This definitely sounds like a bug: the results *should* be the same across versions and architectures. Which version of the Text Processing extension are you using? Can you also check what happens if you update the 32-bit version to 5.2.008?

    Regards,
    Marius
  • subhasisdasgupt New Altair Community Member
    I also tried V5.2.008 on the 32-bit machine, and the output is the same as with V5.2.002 on that machine. It looks like the program is not working properly on my 64-bit machine. Is there a separate installation of RM for 64-bit machines?
  • MariusHelf New Altair Community Member
    Yes, from our website you can download both a 32-bit installer and a 64-bit installer. The 32-bit version also runs on a 64-bit machine, and both versions *should* deliver identical results. To make things clear: is it correct that you are currently using the 32-bit version of RapidMiner on both the 32-bit machine and the 64-bit machine, and that you observed the different results described above?

    And do you get the same results if you run the process twice in a row on either system?

    Our first guess is that the issue may be related to the Random Number Generator. We will definitely investigate the issue, as it's really quite important. It would be nice if you could also test the 64-bit installer and tell us about the results.
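
    To illustrate why the random number generator is a plausible suspect, here is a minimal Python sketch of a k-means-style clustering whose starting centroids are drawn from a seeded generator. It is not RapidMiner's X-Means implementation; the data, the `kmeans` function and the `cluster_sizes` helper are made up for illustration. With the same seed, repeated runs should agree exactly; with different seeds, the cluster sizes may or may not change.

```python
import random


def kmeans(points, k, seed, iterations=20):
    """Tiny k-means on 1-D points; start centroids come from a seeded RNG."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(p - centroids[c])) for p in points]
        # Move each centroid to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


def cluster_sizes(labels, k):
    """Number of points assigned to each cluster index."""
    return [labels.count(c) for c in range(k)]


# Made-up data, not the review documents from this thread.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.8, 9.9, 10.0]

# Same seed twice: the two runs can be compared label by label.
print(kmeans(points, 2, seed=1) == kmeans(points, 2, seed=1))

# Different seeds: inspect whether the split changes.
for seed in range(5):
    print(seed, cluster_sizes(kmeans(points, 2, seed), 2))
```

    If two runs with identical data and an identical seed still disagree between machines, the seed alone would not explain the difference, which is why running the process twice on each system is a useful first check.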

    Best regards,
    Marius
  • subhasisdasgupt New Altair Community Member
    Well, this is what I did on my 64-bit machine.

    I uninstalled the existing version, installed the 64-bit version (5.2.008) of RM, and tested the setup. The result was the same as before (197 documents in cluster 0 and 4 documents in cluster 1), so it still differed from the 32-bit machine. Then I uninstalled the 64-bit version, installed the 32-bit version of RM, and re-tested the setup. On the 64-bit machine, the 32-bit and 64-bit versions of RM gave the same results; the 32-bit version just took longer to produce the output. So the final outcome is this:

    On the 32-bit desktop machine, 32-bit RM V5.2.002 and 32-bit RM V5.2.008 produced identical results (141 documents in cluster 0 and 60 documents in cluster 1). On the 64-bit machine, both 32-bit and 64-bit RM V5.2.008 produced identical results (197 documents in cluster 0 and 4 documents in cluster 1). So I am thinking of reinstalling the 64-bit RM on my 64-bit machine.

    Is there any way to check which result is correct or more appropriate?
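
    One possible way to compare the two outcomes, sketched here in Python as an assumption rather than anything RapidMiner itself does, would be to score both cluster assignments on the same document vectors with a quality measure such as the within-cluster sum of squared distances. The vectors, the two labelings and the function name `within_cluster_sse` below are hypothetical. A lower score only means tighter clusters for the same number of clusters; it does not prove that one result is "correct".

```python
def within_cluster_sse(vectors, labels):
    """Sum of squared distances from each vector to its cluster centroid."""
    clusters = {}
    for vec, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vec)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid is the per-dimension mean of the cluster's members.
        centroid = [sum(v[d] for v in members) / len(members) for d in range(dim)]
        total += sum((v[d] - centroid[d]) ** 2 for v in members for d in range(dim))
    return total


# Made-up 2-D vectors standing in for document vectors.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Two hypothetical assignments of the same vectors into two clusters.
balanced = [0, 0, 0, 1, 1, 1]
lopsided = [0, 0, 0, 0, 0, 1]

print("balanced:", within_cluster_sse(vectors, balanced))
print("lopsided:", within_cluster_sse(vectors, lopsided))
```

    The comparison is only meaningful if both assignments refer to exactly the same vectors in the same order, so it would also be worth confirming that both machines build the same word list from the same 201 documents.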

    Regards
    Subhasis