"FP-Growth and Create Association Operators Hanging"

TyrellCorp
TyrellCorp New Altair Community Member
edited November 5 in Community Q&A
It's been a week working with RapidMiner Studio 5.3, and I have yet to successfully process text files with the FP-Growth and Create Association Rules operators. I even tried a one-page text document: the process runs continuously without finishing and then freezes. Sometimes it generates a "Process Failed" message. What am I doing wrong? I'm running RapidMiner on OS X 10.8.5 with a 2.3 GHz Intel Core i7 and 8 GB RAM. My process XML is below:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="loop_files" compatibility="5.3.015" expanded="true" height="76" name="Loop Files" width="90" x="45" y="75">
       <parameter key="directory" value="/Users/ddavis/Desktop/Emotion"/>
       <process expanded="true">
         <operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document" width="90" x="112" y="75"/>
         <connect from_port="file object" to_op="Read Document" to_port="file"/>
         <connect from_op="Read Document" from_port="output" to_port="out 1"/>
         <portSpacing port="source_file object" spacing="0"/>
         <portSpacing port="source_in 1" spacing="0"/>
         <portSpacing port="source_in 2" spacing="0"/>
         <portSpacing port="sink_out 1" spacing="0"/>
         <portSpacing port="sink_out 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="246" y="210">
       <parameter key="vector_creation" value="Binary Term Occurrences"/>
       <parameter key="prune_below_absolute" value="1"/>
       <parameter key="prune_above_absolute" value="999"/>
       <process expanded="true">
         <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="120"/>
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="179" y="210"/>
         <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="120"/>
         <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="447" y="210">
           <parameter key="min_chars" value="2"/>
           <parameter key="max_chars" value="9999"/>
         </operator>
         <operator activated="true" class="text:stem_snowball" compatibility="5.3.002" expanded="true" height="60" name="Stem (Snowball)" width="90" x="648" y="120"/>
         <connect from_port="document" to_op="Transform Cases" to_port="document"/>
         <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
         <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
         <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
         <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
         <connect from_op="Stem (Snowball)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="numerical_to_binominal" compatibility="5.3.015" expanded="true" height="76" name="Numerical to Binominal" width="90" x="447" y="120"/>
     <operator activated="true" class="fp_growth" compatibility="5.3.015" expanded="true" height="76" name="FP-Growth" width="90" x="581" y="255"/>
     <operator activated="true" class="create_association_rules" compatibility="5.3.015" expanded="true" height="76" name="Create Association Rules" width="90" x="782" y="120"/>
     <connect from_port="input 1" to_op="Loop Files" to_port="in 1"/>
     <connect from_op="Loop Files" from_port="out 1" to_op="Process Documents" to_port="documents 1"/>
     <connect from_op="Process Documents" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
     <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
     <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
     <connect from_op="Create Association Rules" from_port="rules" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="source_input 2" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

I found two tutorials on YouTube, both of which run a similar process successfully.
https://www.youtube.com/watch?v=HBrYuV8eWjc
https://www.youtube.com/watch?v=oXrUz5CWM4E

I have also read about similar issues (http://rapid-i.com/rapidforum/index.php/topic,7572.0.html, http://rapid-i.com/rapidforum/index.php/topic,7541.0.html).

Answers

  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    What does the error say when it fails? What does the log say (user home/.RapidMiner5/rm.log)?
    How do you start RapidMiner, and how much memory is available (see "View" -> "Show View" -> "System Monitor")?

    Regards,
    Marco
  • haddock
    haddock New Altair Community Member
    Hi there,

    This comes up rather often, and has been recognised as a weakness since April 2011. The problem is that RM 5.3 can only handle small numbers of attributes, because the association rule generator tries to generate the powerset of each frequent itemset found. You may find this link illuminating.

    https://rapid-i.com/rapidforum/index.php/topic,6837.0.html

    I'm not sure whether this got fixed in the new version, perhaps an RM staffer could inform?

    Best

    H
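
To make the powerset point above concrete, here is a minimal Python sketch. It is not RapidMiner's implementation; the function names `candidate_rule_count` and `enumerate_rules` are made up for illustration, and it assumes a rule is any split of one frequent itemset into a non-empty premise and a non-empty conclusion.

```python
from itertools import combinations


def candidate_rule_count(itemset_size: int) -> int:
    """Count the non-empty proper subsets of an itemset (possible rule premises)."""
    if itemset_size < 2:
        return 0
    return 2 ** itemset_size - 2


def enumerate_rules(itemset):
    """Yield (premise, conclusion) pairs for every way of splitting the itemset."""
    items = sorted(itemset)
    for k in range(1, len(items)):
        for premise in combinations(items, k):
            conclusion = tuple(i for i in items if i not in premise)
            yield premise, conclusion


if __name__ == "__main__":
    # The count doubles with every extra item in the itemset.
    for n in (3, 10, 20, 30):
        print(n, candidate_rule_count(n))
```

A single 20-item set already gives over a million candidate splits, which may be why rule generation appears to hang once long itemsets show up.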
  • TyrellCorp
    TyrellCorp New Altair Community Member
    Hi Marco and H

    I use RapidMiner on Mac and Windows. Below is some of the information requested.

    Mac, 8 GB
    System monitor:
    Max: 1.7 GB
    Total: 1.7 GB

    Windows, 16 GB
    System monitor:
    Max: 10 GB
    Total: 10 GB

    The Log has the message:
    Mar 25, 2014 10:17:48 AM INFO: No filename given for result file, using stdout for logging results!
    Mar 25, 2014 10:17:48 AM INFO: Process starts
    Mar 25, 2014 10:17:48 AM INFO: Loading initial data.
    Mar 25, 2014 10:18:52 AM INFO: Process stopped. Completing current operator.
    Mar 25, 2014 10:18:52 AM INFO: FP-Growth: Process stopped.
    Mar 25, 2014 10:18:52 AM INFO: Process stopped in FP-Growth

    I just downloaded RM 6 and will see if this is still an issue.
  • haddock
    haddock New Altair Community Member
    Hi there,

    Funnily enough I've just attended a webinar which showed a word associator, not in RM, but interesting nevertheless ( datamonkees.wpengine.com ), so I sympathise, as I too have trod this path. As to your work, the problem could be one or more of the following.

    Pre-pain checks...

    Memory - are you really sure that you've given Java enough space?
    Data - put a breakpoint after each of the pre-processing operators, especially the last (Numerical to Binominal). Is it always showing what you expect? No missing values, everything clean and hunky-dory?

    In situ hints...

    KISS - Keep It Simple Sometimes, at least to start. High frequency threshold, look for specific stems, only short sets, or specific items etc... Do whatever you can to produce a few short itemsets. So a break after FP-Growth, just to admire your handiwork.

    FART - Finally Analyse Real Things. Don't worry about making rules until your itemsets roll through. Then you'll see my point kicking in: unless the code has changed, long itemsets will choke the association rules generator because of the powerset approach. Only RM staffers can tell you about RM6.

    Good luck with your project, I spend a lot of time in this "meme machine" space, and find it very rewarding as an exploratory tool.

    Best

    H
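
The "keep it simple" advice above (a high support threshold and short itemsets) can be sketched in plain Python. This is a brute-force counter, not FP-Growth and not RapidMiner code; `frequent_itemsets`, `min_support`, `max_items`, and the toy stemmed documents are all assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_items):
    """Return {itemset: support} for itemsets of at most max_items items.

    Brute force: every candidate combination is checked against every
    transaction, so this is only suitable for tiny examples.
    """
    if not transactions:
        return {}
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_items + 1):
        for candidate in combinations(items, size):
            hits = sum(1 for t in transactions if set(candidate) <= t)
            support = hits / n
            if support >= min_support:
                result[candidate] = support
    return result


# Toy documents, each reduced to a set of (hypothetical) stemmed tokens.
docs = [
    {"happi", "joy", "smile"},
    {"happi", "joy"},
    {"sad", "cry"},
    {"happi", "smile"},
]

if __name__ == "__main__":
    strict = frequent_itemsets(docs, min_support=0.5, max_items=2)
    loose = frequent_itemsets(docs, min_support=0.25, max_items=3)
    print(len(strict), "itemsets with a strict threshold")
    print(len(loose), "itemsets with a loose threshold")
```

Even on four tiny documents, loosening the threshold and allowing longer sets doubles the number of itemsets, so starting strict and relaxing gradually keeps the output inspectable.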



  • TyrellCorp
    TyrellCorp New Altair Community Member
    Well I got it working but in a bit of a different context: Web Mining. I've provided the XML below:


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="6.0.002" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
            <parameter key="excel_file" value="/Users/ddavis/Desktop/WEB.xlsx"/>
            <parameter key="imported_cell_range" value="A1:A7"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Link.true.file_path.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="web:retrieve_webpages" compatibility="5.3.001" expanded="true" height="60" name="Get Pages" width="90" x="112" y="165">
            <parameter key="link_attribute" value="Link"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="5.3.002" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="255">
            <list key="specify_weights"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="313" y="120">
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="9999"/>
            <process expanded="true">
              <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.001" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30">
                <parameter key="minimum_text_block_length" value="3"/>
              </operator>
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="179" y="120"/>
              <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="246" y="255"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="120"/>
              <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="514" y="255">
                <parameter key="max_length" value="3"/>
              </operator>
              <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="648" y="120"/>
              <operator activated="false" class="text:filter_tokens_by_pos" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by POS Tags)" width="90" x="782" y="210">
                <parameter key="expression" value="VB.*"/>
              </operator>
              <connect from_port="document" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
              <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="data_to_similarity" compatibility="6.0.002" expanded="true" height="76" name="Data to Similarity" width="90" x="447" y="30">
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
          </operator>
          <operator activated="true" class="numerical_to_binominal" compatibility="6.0.002" expanded="true" height="76" name="Numerical to Binominal" width="90" x="514" y="165"/>
          <operator activated="true" class="fp_growth" compatibility="6.0.002" expanded="true" height="76" name="FP-Growth" width="90" x="581" y="300"/>
          <operator activated="true" class="create_association_rules" compatibility="6.0.002" expanded="true" height="76" name="Create Association Rules" width="90" x="715" y="165"/>
          <connect from_op="Read Excel" from_port="output" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
          <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
          <connect from_op="Data to Similarity" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
          <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
          <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
          <connect from_op="Create Association Rules" from_port="rules" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

    Before marking this SOLVED, I'd like to try this with "Process Documents From Files" operator.

    Tyrell Corporation