"How to filter lines with regexp with RapidMiner?"

User: "pocakka"
New Altair Community Member
Updated by Jocelyn
Hello!
I have ten million txt files in one folder (about 100 KB per file), and I would like to extract certain lines from these files.
In UltraEdit I use this regexp:
<strong class="name".*-id-.*
My problem is the large number of files: UltraEdit fails on that many.

How can I filter them? Can RapidMiner do this?
This is the process I need:
1. Filter the lines matching this regexp from the ten million txt files:
<strong class="name".*-id-.*
2. Write the filtered lines to a new txt file.

Can you help me solve this problem?
Thanks,
Attila
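
For comparison, the two steps in the question can be sketched outside RapidMiner in a few lines of Python. This is only a sketch under assumptions: the function name `filter_lines`, the `.txt` suffix check, and UTF-8 decoding with replacement of bad bytes are illustrative choices, not something stated in the thread. The regexp is the one from the question.

```python
import os
import re

# Pattern copied from the question; search() finds it anywhere in a line.
PATTERN = re.compile(r'<strong class="name".*-id-.*')


def filter_lines(src_dir, out_path, pattern=PATTERN):
    """Write every matching line from the .txt files in src_dir to out_path.

    Returns the number of lines written. (Hypothetical helper for illustration.)
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # os.scandir yields directory entries lazily, so the listing of a
        # very large folder is not built up in memory as one big list.
        with os.scandir(src_dir) as entries:
            for entry in entries:
                if not (entry.is_file() and entry.name.endswith(".txt")):
                    continue
                # Assumption: the real encoding is unknown, so undecodable
                # bytes are replaced instead of raising an error.
                with open(entry.path, encoding="utf-8", errors="replace") as src:
                    # Iterating the file object reads one line at a time.
                    for line in src:
                        if pattern.search(line):
                            out.write(line.rstrip("\n") + "\n")
                            written += 1
    return written
```

The output file should live outside `src_dir`; otherwise a `.txt` output would be picked up as an input while it is being written.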


    User: "MariusHelf"
    New Altair Community Member
    Hi,

    You can use the Text Processing extension to filter the files. Please have a look at the attached process: inside the Process Documents operator, the Tokenize operator splits each document into separate lines, and the next operator, Filter Tokens (by Content), keeps only the lines containing the word "hallo".

    Best, Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="145" width="413">
          <operator activated="true" class="loop_files" compatibility="5.3.000" expanded="true" height="76" name="Loop Files" width="90" x="112" y="30">
            <parameter key="directory" value="C:\Users\mhelf\tmp\test"/>
            <process expanded="true" height="562" width="718">
              <operator activated="true" class="text:read_document" compatibility="5.2.005" expanded="true" height="60" name="Read Document" width="90" x="112" y="30"/>
              <operator activated="true" class="text:process_documents" compatibility="5.2.005" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
                <process expanded="true" height="562" width="718">
                  <operator activated="true" class="text:tokenize" compatibility="5.2.005" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30">
                    <parameter key="mode" value="regular expression"/>
                    <parameter key="expression" value="\n"/>
                  </operator>
                  <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.2.005" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="313" y="30">
                    <parameter key="string" value="hallo"/>
                  </operator>
                  <connect from_port="document" to_op="Tokenize" to_port="document"/>
                  <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
                  <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:wordlist_to_data" compatibility="5.2.005" expanded="true" height="76" name="WordList to Data" width="90" x="514" y="30"/>
              <connect from_port="file object" to_op="Read Document" to_port="file"/>
              <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
              <connect from_op="Process Documents" from_port="word list" to_op="WordList to Data" to_port="word list"/>
              <connect from_op="WordList to Data" from_port="example set" to_port="out 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="5.3.000" expanded="true" height="76" name="Append" width="90" x="246" y="30"/>
          <connect from_op="Loop Files" from_port="out 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
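
To make the logic of the inner process easier to follow, here is a rough Python analogue of the Tokenize and Filter Tokens (by Content) steps. It is a sketch, not the operators' actual implementation: the function names are hypothetical, dropping empty tokens is a simplification, and details such as case handling may differ in RapidMiner. The last function shows how the "contains" condition could be swapped for the regexp from the question.

```python
import re


def tokenize(document, expression=r"\n"):
    """Split a document on a regular expression; empty tokens are dropped
    (a simplification made for this sketch)."""
    return [tok for tok in re.split(expression, document) if tok]


def filter_tokens_containing(tokens, string):
    """Keep tokens that contain the given substring (case-sensitive here)."""
    return [tok for tok in tokens if string in tok]


def filter_tokens_matching(tokens, regex):
    """Keep tokens in which the regular expression is found somewhere."""
    pattern = re.compile(regex)
    return [tok for tok in tokens if pattern.search(tok)]


# Made-up sample document, one "line" per token after tokenizing.
doc = 'hallo welt\nfoo bar\n<strong class="name">a-id-1\nsag hallo'
lines = tokenize(doc)
print(filter_tokens_containing(lines, "hallo"))
print(filter_tokens_matching(lines, r'<strong class="name".*-id-.*'))
```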