"[SOLVED] Loop over files (extracting id from first line)"

earmijo
earmijo New Altair Community Member
edited November 5 in Community Q&A
Dear experts:

I have about 2,000 text files with the following structure:
First line: customer id followed by a colon
Next k lines: data about that customer's transactions (x1,x2,x3,x4)

-----file01.txt------------------
01:
x11,x12,x13,x14
x21,x22,x23,x24
.....
xk1,xk2,xk3,xk4
-----file02.txt-------------------
02:
x11,x12,x13,x14
x21,x22,x23,x24
.....
xk1,xk2,xk3,xk4
-----file03.txt-------------------
03:
x11,x12,x13,x14
x21,x22,x23,x24
.....
xk1,xk2,xk3,xk4
-----file04.txt--------------------
04:
x11,x12,x13,x14
x21,x22,x23,x24
.....
xk1,xk2,xk3,xk4


What I would like is to merge them into a single file with the following columns:

id,x1,x2,x3,x4

Is there an easy way to do it inside RapidMiner?

Thanks in advance,

\Ernesto

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
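    Of course you can do it with RapidMiner :D Please have a look at the attached process. It uses a Loop Files operator to iterate over all files. Inside the loop, Read CSV reads each file line by line, Extract Macro stores the id from the first line in a macro, Filter Example Range drops that first line, Split breaks the remaining lines at the commas, and Generate Attributes adds the id as a new column. An Append operator after the loop merges the per-file results into a single example set.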

    Best,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
        <process expanded="true" height="161" width="614">
          <operator activated="true" class="loop_files" compatibility="5.2.003" expanded="true" height="76" name="Loop Files" width="90" x="313" y="30">
            <parameter key="directory" value="C:\Users\mhelf\tmp\files"/>
            <parameter key="filter" value=".*\.txt"/>
            <process expanded="true" height="619" width="1128">
              <operator activated="true" class="read_csv" compatibility="5.2.003" expanded="true" height="60" name="Read CSV" width="90" x="112" y="30">
                <parameter key="csv_file" value="C:\Users\mhelf\tmp\files\file1.txt"/>
                <parameter key="column_separators" value=":"/>
                <parameter key="first_row_as_names" value="false"/>
                <list key="annotations"/>
                <parameter key="encoding" value="windows-1252"/>
                <list key="data_set_meta_data_information">
                  <parameter key="0" value="att1.true.polynominal.attribute"/>
                </list>
              </operator>
              <operator activated="true" class="extract_macro" compatibility="5.2.003" expanded="true" height="60" name="Extract Macro" width="90" x="246" y="30">
                <parameter key="macro" value="fileId"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="att1"/>
                <parameter key="example_index" value="1"/>
              </operator>
              <operator activated="true" class="extract_macro" compatibility="5.2.003" expanded="true" height="60" name="Extract Macro (2)" width="90" x="380" y="30">
                <parameter key="macro" value="numRows"/>
              </operator>
              <operator activated="true" class="filter_example_range" compatibility="5.2.003" expanded="true" height="76" name="Filter Example Range" width="90" x="514" y="30">
                <parameter key="first_example" value="2"/>
                <parameter key="last_example" value="%{numRows}"/>
              </operator>
              <operator activated="true" class="split" compatibility="5.2.003" expanded="true" height="76" name="Split" width="90" x="648" y="30"/>
              <operator activated="true" class="generate_attributes" compatibility="5.2.003" expanded="true" height="76" name="Generate Attributes" width="90" x="782" y="30">
                <list key="function_descriptions">
                  <parameter key="fileId" value="%{fileId}"/>
                </list>
              </operator>
              <connect from_port="file object" to_op="Read CSV" to_port="file"/>
              <connect from_op="Read CSV" from_port="output" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_op="Extract Macro (2)" to_port="example set"/>
              <connect from_op="Extract Macro (2)" from_port="example set" to_op="Filter Example Range" to_port="example set input"/>
              <connect from_op="Filter Example Range" from_port="example set output" to_op="Split" to_port="example set input"/>
              <connect from_op="Split" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="5.2.003" expanded="true" height="76" name="Append" width="90" x="447" y="30"/>
          <connect from_op="Loop Files" from_port="out 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • earmijo
    earmijo New Altair Community Member
    Thank you Marius. I think it is going to take me a couple of days to understand the process :-) but it works beautifully.
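
For anyone who wants to check the logic of the process outside RapidMiner, the same steps (loop over the files, take the id from the first line, drop that line, split the rest at the commas, append everything) can be sketched in Python. This is only an illustration and not part of the attached process: the function names `merge_files` and `write_merged` are made up for this example, the default encoding simply mirrors the Read CSV setting above, and the header assumes exactly four data columns as in the question.

```python
import csv
from pathlib import Path


def merge_files(directory, pattern="*.txt", encoding="windows-1252"):
    """Collect rows of the form [id, x1, x2, ...] from every matching file.

    Hypothetical helper: assumes the first line of each file is the id
    followed by a colon and every later line is comma-separated data.
    """
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        lines = path.read_text(encoding=encoding).splitlines()
        if not lines:
            continue  # skip empty files
        # First line looks like "01:"; strip whitespace and the trailing colon.
        file_id = lines[0].strip().rstrip(":")
        for line in lines[1:]:
            if line.strip():  # ignore blank lines
                values = [v.strip() for v in line.split(",")]
                rows.append([file_id] + values)
    return rows


def write_merged(rows, out_path, header=("id", "x1", "x2", "x3", "x4")):
    """Write the merged rows to a single CSV file with a header line."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Calling `write_merged(merge_files("path/to/files"), "merged.csv")` would produce one file with the columns id,x1,x2,x3,x4, where the directory path is a placeholder for wherever the text files live. The id is kept as text, so a leading zero such as in "01" is preserved.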