How to find sentences and to group results

neomzw New Altair Community Member
edited November 5 in Community Q&A
I want to measure how often certain words and sentences occur across a directory of files, and I would like to compare them all at once (word usage and sentence usage). I have already written regular expressions for the sentences I'm looking for in the text.
My questions are: 
(1) How do I search for sentences that match a specific pattern?
I've used Tokenize and Filter Tokens for the words, but I don't know what to use for the sentences.
(2) How do I group results per project (each project is a folder of subfolders and text files) and per group of projects (a directory of zipped folders)?
The results I'm getting so far are tables with one row per file instead of one row per folder or directory.

Thank you

Answers

  • MarlaBot
    MarlaBot New Altair Community Member
    Hi @neomzw - this is MarlaBot. I found these great videos on our RapidMiner Academy that you may find helpful:
    Instructional Video: Text Association Rules (Viewing time: ~10m)
    Instructional Video: Loading Text into RapidMiner (Viewing time: ~6m)
    Please LIKE my comment if it helps! 👇

    MarlaBot <3
  • sgenzer
    sgenzer
    Altair Employee
    Hi @neomzw, sorry no one has chimed in here. Is this still an issue?

    Scott
  • kayman
    kayman New Altair Community Member
    If you still need an answer: the file/folder problem can be solved by setting the 'enable macros' option in the parameters of the Loop Files operator and generating a new field that holds the needed value (such as the file name or folder name). From there you can use other loop operators (such as Loop Values) to aggregate on the newly created folder field.

    As in the attached example:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:loop_files" compatibility="9.3.001" expanded="true" height="82" name="Loop Files" width="90" x="246" y="34">
            <parameter key="filter_type" value="glob"/>
            <parameter key="recursive" value="false"/>
            <parameter key="enable_macros" value="true"/>
            <parameter key="macro_for_file_name" value="file_name"/>
            <parameter key="macro_for_file_type" value="file_type"/>
            <parameter key="macro_for_folder_name" value="folder_name"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="generate_attributes" compatibility="9.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="34">
                <list key="function_descriptions">
                  <parameter key="MyFolder" value="%{folder_name}"/>
                </list>
                <parameter key="keep_all" value="true"/>
              </operator>
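              <!-- Note: "file object" is a file, not an ExampleSet. In a working process, an operator that reads the file into an ExampleSet has to sit between these two ports. -->
              <connect from_port="file object" to_op="Generate Attributes" to_port="example set input"/>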
              <connect from_op="Generate Attributes" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
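          <connect from_op="Loop Files" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>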
        </process>
      </operator>
    </process>
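
    To get one row per folder instead of one row per file, the per-file results coming out of Loop Files could be appended and then grouped on the MyFolder field. The fragment below is only a rough sketch of that idea. It uses Append and Aggregate rather than the Loop Values approach mentioned above, and it assumes each per-file example set already carries a numeric attribute with the count you care about. The attribute name match_count is hypothetical, so replace it with whatever your own process produces. These elements would go in the top-level process, next to Loop Files.

    ```xml
    <!-- Sketch only: merge the per-file example sets returned by Loop Files -->
    <operator activated="true" class="append" compatibility="9.3.001" expanded="true" height="82" name="Append" width="90" x="380" y="34"/>
    <!-- Sketch only: one output row per distinct value of MyFolder -->
    <operator activated="true" class="aggregate" compatibility="9.3.001" expanded="true" height="82" name="Aggregate" width="90" x="514" y="34">
      <parameter key="use_default_aggregation" value="false"/>
      <list key="aggregation_attributes">
        <!-- "match_count" is a hypothetical per-file numeric attribute -->
        <parameter key="match_count" value="sum"/>
      </list>
      <parameter key="group_by_attributes" value="MyFolder"/>
    </operator>
    <connect from_op="Loop Files" from_port="output 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
    ```

    With this wiring, the Aggregate output would be what reaches result 1, not the raw Loop Files output. Grouping per group of projects would work the same way with a second field that holds the parent directory.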