"Optimize Text-Mining-Process"

degu · New Altair Community Member
edited November 5 in Community Q&A
Hello everybody,
I want to analyze a large number of e-mails (> 90,000; 955 MB in total) to check word density and a few other things. My two processes work, but the performance isn't good (the second process is really bad...). Does anybody have an idea how to optimize them?

The first process reads all (text) documents in a loop, cuts each document with a Regular Region query (I only need the e-mail bodies) and saves the result as a text file plus a copy in the repository (a standalone sketch of the header cut follows the XML):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <parameter key="logverbosity" value="all"/>
    <process expanded="true">
      <operator activated="true" class="loop_files" compatibility="5.3.015" expanded="true" height="76" name="Loop Files" width="90" x="112" y="30">
        <parameter key="directory" value="C:\Daten\Test"/>
        <parameter key="filter" value=".*\.txt"/>
        <parameter key="file_name_macro" value="file_name_loop"/>
        <parameter key="file_path_macro" value="file_path_loop"/>
        <parameter key="parent_path_macro" value="parent_path_loop"/>
        <parameter key="recursive" value="true"/>
        <parameter key="iterate_over_subdirs" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document" width="90" x="45" y="30"/>
          <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="246" y="30">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="prune_below_absolute" value="10"/>
            <parameter key="prune_above_absolute" value="5000"/>
            <process expanded="true">
              <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document" width="90" x="45" y="30">
                <parameter key="query_type" value="Regular Region"/>
                <list key="string_machting_queries">
                  <parameter key="Header-raus" value="(?m)^\.*(Message-ID:)\.*\\n(\.*\\n)*\\s.(\.*\\n)*"/>
                </list>
                <list key="regular_expression_queries">
                  <parameter key="tet" value="(.*\n)*.*(Message-ID:).*\n(.*\n)*\n"/>
                </list>
                <list key="regular_region_queries">
                  <parameter key="Header-raus" value="(?ms)\\n\\n\.*$.\.*"/>
                </list>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <process expanded="true">
                  <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="112" name="Multiply" width="90" x="45" y="30"/>
                  <operator activated="true" class="execute_script" compatibility="5.3.015" expanded="true" height="76" name="Execute Script" width="90" x="246" y="30">
                    <parameter key="script" value="// MacroHandler-Import &#10;import com.rapidminer.MacroHandler;&#10;&#10;// MacroHandler-Get&#10;MacroHandler handler = operator.getProcess().getMacroHandler();&#10;&#10;// Hole Macro &quot;path&quot;&#10;String macro = handler.getMacro(&quot;parent_path_loop&quot;);&#10;&#10;// Unterteile &quot;path&quot; in Token (Tokenize)&#10;List tokens = macro.tokenize(&quot;\\\&quot;&quot;);&#10;&#10;// Anzahl Token in Variable schreiben&#10;int anzahl = tokens.size();&#10;&#10;// Token als String in Variable schreiben&#10;String TokenOut = &quot;&quot;;&#10;TokenOut = tokens.get(anzahl-2) + &quot;_&quot; + tokens.get(anzahl-1);&#10;&#10;// Neues Macro mit String-Inhalt&#10;handler.addMacro(&quot;LOOP_CUT_PATH&quot;,TokenOut);"/>
                  </operator>
                  <operator activated="true" class="store" compatibility="5.3.015" expanded="true" height="60" name="Store" width="90" x="179" y="210">
                    <parameter key="repository_entry" value="../data/%{LOOP_CUT_PATH}/%{file_name_loop}"/>
                  </operator>
                  <operator activated="true" class="text:write_document" compatibility="5.3.002" expanded="true" height="76" name="Write Document" width="90" x="246" y="120">
                    <parameter key="file" value="C:\Output\%{LOOP_CUT_PATH}\%{file_name_loop}"/>
                  </operator>
                  <connect from_port="segment" to_op="Multiply" to_port="input"/>
                  <connect from_op="Multiply" from_port="output 1" to_op="Execute Script" to_port="input 1"/>
                  <connect from_op="Multiply" from_port="output 2" to_op="Write Document" to_port="document"/>
                  <connect from_op="Multiply" from_port="output 3" to_op="Store" to_port="input"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                </process>
              </operator>
              <connect from_port="document" to_op="Cut Document" to_port="document"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_port="file object" to_op="Read Document" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="out 1"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Loop Files" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
The second process reads the documents from the repository in a loop and processes them with Extract Content, Tokenize, Filter Tokens (by Length), Transform Cases, Generate n-Grams (Terms), Filter Stopwords (English) and Stem (Porter). As output I want a word list and a word vector. Clustering was planned as an option, but is not needed at the moment (a plain-Java sketch of the token chain follows the XML):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <parameter key="logverbosity" value="status"/>
    <process expanded="true">
      <operator activated="true" class="loop_repository" compatibility="5.3.015" expanded="true" height="76" name="Loop-Repo-Einlesen" width="90" x="45" y="30">
        <parameter key="repository_folder" value="../data/"/>
        <parameter key="entry_type" value="IOObject"/>
        <parameter key="entry_name_macro" value="WORK-ON_entry_name"/>
        <parameter key="repository_path_macro" value="WORK-ON_repository_path"/>
        <parameter key="parent_folder_macro" value="WORK-ON_parent_folder"/>
        <process expanded="true">
          <operator activated="true" class="print_to_console" compatibility="5.3.015" expanded="true" height="76" name="Print to Console" width="90" x="179" y="75">
            <parameter key="log_value" value="Einlesen %{a}"/>
          </operator>
          <connect from_port="repository object" to_port="out 1"/>
          <connect from_port="in 1" to_op="Print to Console" to_port="through 1"/>
          <portSpacing port="source_repository object" spacing="0"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Dokumentenverarbeitung" width="90" x="112" y="165">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="20.0"/>
        <parameter key="prune_above_percent" value="90.0"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.001" expanded="true" height="60" name="HTML-Filter" width="90" x="45" y="30">
            <parameter key="minimum_text_block_length" value="1"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Textteilung-Tokenize" width="90" x="179" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Laengenfilter" width="90" x="313" y="30">
            <parameter key="min_chars" value="2"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Kleinschrift" width="90" x="45" y="120"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Multi-Wort" width="90" x="179" y="120">
            <parameter key="max_length" value="3"/>
          </operator>
          <operator activated="true" class="print_to_console" compatibility="5.3.015" expanded="true" height="60" name="Print to Console (2)" width="90" x="112" y="300">
            <parameter key="log_value" value="Verarbeiten %{a}"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Stop-Englisch" width="90" x="179" y="210"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem-Porter-Englisch" width="90" x="313" y="210"/>
          <connect from_port="document" to_op="HTML-Filter" to_port="document"/>
          <connect from_op="HTML-Filter" from_port="document" to_op="Textteilung-Tokenize" to_port="document"/>
          <connect from_op="Textteilung-Tokenize" from_port="document" to_op="Laengenfilter" to_port="document"/>
          <connect from_op="Laengenfilter" from_port="document" to_op="Kleinschrift" to_port="document"/>
          <connect from_op="Kleinschrift" from_port="document" to_op="Multi-Wort" to_port="document"/>
          <connect from_op="Multi-Wort" from_port="document" to_op="Stop-Englisch" to_port="document"/>
          <connect from_op="Stop-Englisch" from_port="document" to_op="Stem-Porter-Englisch" to_port="document"/>
          <connect from_op="Stem-Porter-Englisch" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="94" name="MP-Wortliste" width="90" x="246" y="255"/>
      <operator activated="true" class="text:wordlist_to_data" compatibility="5.3.002" expanded="true" height="76" name="Wortlisten-Konvertierung" width="90" x="380" y="300"/>
      <operator activated="true" class="write_excel" compatibility="5.3.015" expanded="true" height="76" name="Excel-Wortliste" width="90" x="447" y="390">
        <parameter key="excel_file" value="D:\Output\Erg_Wortliste.xlsx"/>
        <parameter key="file_format" value="xlsx"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="94" name="MP-Vector" width="90" x="246" y="30"/>
      <operator activated="true" class="write_excel" compatibility="5.3.015" expanded="true" height="76" name="Excel-Vector" width="90" x="313" y="165">
        <parameter key="excel_file" value="D:\Output\Erg_Vector.xlsx"/>
        <parameter key="file_format" value="xlsx"/>
      </operator>
      <connect from_op="Loop-Repo-Einlesen" from_port="out 1" to_op="Dokumentenverarbeitung" to_port="documents 1"/>
      <connect from_op="Dokumentenverarbeitung" from_port="example set" to_op="MP-Vector" to_port="input"/>
      <connect from_op="Dokumentenverarbeitung" from_port="word list" to_op="MP-Wortliste" to_port="input"/>
      <connect from_op="MP-Wortliste" from_port="output 1" to_port="result 2"/>
      <connect from_op="MP-Wortliste" from_port="output 2" to_op="Wortlisten-Konvertierung" to_port="word list"/>
      <connect from_op="Wortlisten-Konvertierung" from_port="example set" to_op="Excel-Wortliste" to_port="input"/>
      <connect from_op="MP-Vector" from_port="output 1" to_port="result 1"/>
      <connect from_op="MP-Vector" from_port="output 2" to_op="Excel-Vector" to_port="input"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
For this process I made test runs with smaller datasets and those were fine - but the run over the complete dataset (90,000 e-mails) has been going for over 9 hours at the moment and looks like it needs another 10 hours to finish...

As data I use, for example, the "2005 TREC Public Spam Corpus"* - but I have other datasets with over 140,000 e-mails to analyze.  :P
At the moment the hardware is a single Intel Xeon E5645 (6 cores, 12 threads, 2.4 GHz) with 24 GB RAM (RapidMiner usually takes 17 GB); the OS is Windows Server 2008 R2 Datacenter.


* TREC data: http://plg.uwaterloo.ca/~gvcormac/treccorpus/ - after downloading and extracting you have to add the .txt extension to the files (the loop filter expects *.txt) - warning: some of the e-mails contain trojans.

Answers

  • fras · New Altair Community Member
    I think "Loop Repository" is the bottleneck. Why not combine the processes and use the loop operator from your first process ?
    For me there is no reason to store it in the repository. But if you really have to store it I would recommend to built a small MySQL database for the job.
    Reading from a real database is much faster.
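    For illustration, the loading step of this suggestion could look roughly like the JDBC sketch below: it walks the output folder from the first process and batch-inserts each cut body into a MySQL table. The database URL, table, columns and credentials are placeholders.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.sql.*;
    import java.util.stream.Stream;

    // Sketch: load the cut e-mail bodies into MySQL once, then let the
    // analysis process read from the database instead of looping over the
    // repository. All names and credentials below are assumptions.
    public class LoadMailsIntoMySQL {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost:3306/maildb";   // placeholder connection
            try (Connection con = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO mails (file_name, body) VALUES (?, ?)")) {
                con.setAutoCommit(false);
                int count = 0;
                try (Stream<Path> files = Files.walk(Paths.get("C:/Output"))) {
                    for (Path p : (Iterable<Path>) files.filter(Files::isRegularFile)::iterator) {
                        ps.setString(1, p.getFileName().toString());
                        ps.setString(2, new String(Files.readAllBytes(p), StandardCharsets.ISO_8859_1));
                        ps.addBatch();
                        if (++count % 1000 == 0) {   // commit in batches to keep memory flat
                            ps.executeBatch();
                            con.commit();
                        }
                    }
                }
                ps.executeBatch();
                con.commit();
            }
        }
    }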