"ETL - extracting data from text-space delimited file"

sgenzer
sgenzer
Altair Employee
edited November 5 in Community Q&A
Hi all -

I am trying to extract data from a very old file that is organized by placement of character per line (and uses spaces to fill in gaps), instead of CSV.  For example:

1 GENZER          SCOTT          76        33      KERWIN HILL ROAD            NORWICH
2 JOHNSON      JIM                81        4        MAIN ST                                NEW YORK CITY
3 CLINTON        BILL              66        16      APPLEBY AVE                      LITTLE ROCK

So the 1st attribute (ID) is at the 1st character, the 2nd attribute (LAST NAME) is from the 3rd character to the 17th character, the 3rd attribute (FIRST NAME) is from the 18th character and so on.  Is there any elegant way to extract these data?  The only way I can think to do this is to multiply the input, rename the attribute, and then join (XML below).  Anything better?  Each line of this datafile is about 2000 characters long with about 50 attributes.  There is some pattern to the attributes but still...

Thanks all!

Scott


<operator activated="true" class="loop_files" compatibility="5.3.015" expanded="true" height="76" name="Loop Files" width="90" x="179" y="75">
        <parameter key="directory" value="/Users/foo/>
        <parameter key="filter" value="*.dat"/>
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="5.3.015" expanded="true" height="60" name="Read CSV (2)" width="90" x="45" y="30">
            <parameter key="csv_file" value="/Users/foo/03105302.DAT"/>
            <parameter key="trim_lines" value="true"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="94" name="Multiply (2)" width="90" x="179" y="30"/>
          <operator activated="true" class="extract_macro_from_annotation" compatibility="5.3.015" expanded="true" height="60" name="Extract Macro from Annotation (2)" width="90" x="179" y="165">
            <parameter key="macro" value="filename"/>
            <parameter key="annotation" value="Source"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="5.3.015" expanded="true" height="76" name="INFO EXTRACT (2)" width="90" x="313" y="30">
            <process expanded="true">
              <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="166" name="Multiply (3)" width="90" x="112" y="30"/>
              <operator activated="true" class="cut" compatibility="5.3.015" expanded="true" height="76" name="Cut (77)" width="90" x="380" y="30">
                <parameter key="first_character_index" value="125"/>
                <parameter key="last_character_index" value="125"/>
              </operator>
              <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID (77)" width="90" x="514" y="30"/>
              <operator activated="true" class="rename" compatibility="5.3.015" expanded="true" height="76" name="Rename (73)" width="90" x="648" y="30">
                <parameter key="old_name" value="att1"/>
                <parameter key="new_name" value="Gender"/>
                <list key="rename_additional_attributes"/>
              </operator>
              <operator activated="true" class="cut" compatibility="5.3.015" expanded="true" height="76" name="Cut (78)" width="90" x="380" y="120">
                <parameter key="first_character_index" value="115"/>
                <parameter key="last_character_index" value="115"/>
              </operator>
              <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID (78)" width="90" x="514" y="120"/>
              <operator activated="true" class="rename" compatibility="5.3.015" expanded="true" height="76" name="Rename (74)" width="90" x="648" y="120">
                <parameter key="old_name" value="att1"/>
                <parameter key="new_name" value="Middle Initial"/>
                <list key="rename_additional_attributes"/>
              </operator>
              <operator activated="true" class="cut" compatibility="5.3.015" expanded="true" height="76" name="Cut (79)" width="90" x="380" y="210">
                <parameter key="first_character_index" value="107"/>
                <parameter key="last_character_index" value="114"/>
              </operator>
              <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID (79)" width="90" x="514" y="210"/>
              <operator activated="true" class="rename" compatibility="5.3.015" expanded="true" height="76" name="Rename (75)" width="90" x="648" y="210">
                <parameter key="old_name" value="att1"/>
                <parameter key="new_name" value="First Name"/>
                <list key="rename_additional_attributes"/>
              </operator>
              <operator activated="true" class="cut" compatibility="5.3.015" expanded="true" height="76" name="Cut (80)" width="90" x="380" y="300">
                <parameter key="first_character_index" value="95"/>
                <parameter key="last_character_index" value="106"/>
              </operator>
              <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID (80)" width="90" x="514" y="300"/>
              <operator activated="true" class="rename" compatibility="5.3.015" expanded="true" height="76" name="Rename (76)" width="90" x="648" y="300">
                <parameter key="old_name" value="att1"/>
                <parameter key="new_name" value="Last Name"/>
                <list key="rename_additional_attributes"/>
              </operator>
              <operator activated="true" class="cut" compatibility="5.3.015" expanded="true" height="76" name="Cut (81)" width="90" x="380" y="390">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="att1"/>
                <parameter key="first_character_index" value="55"/>
                <parameter key="last_character_index" value="94"/>
              </operator>
              <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID (81)" width="90" x="514" y="390"/>
              <operator activated="true" class="rename" compatibility="5.3.015" expanded="true" height="76" name="Rename (77)" width="90" x="648" y="390">
                <parameter key="old_name" value="att1"/>
                <parameter key="new_name" value="school name"/>
                <list key="rename_additional_attributes"/>
              </operator>
              <operator activated="true" class="cut" compatibility="5.3.015" expanded="true" height="76" name="Cut (82)" width="90" x="380" y="480">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="att1"/>
                <parameter key="last_character_index" value="28"/>
              </operator>
              <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID (82)" width="90" x="514" y="480"/>
              <operator activated="true" class="rename" compatibility="5.3.015" expanded="true" height="76" name="Rename (78)" width="90" x="648" y="480">
                <parameter key="old_name" value="att1"/>
                <parameter key="new_name" value="initial numerical data"/>
                <list key="rename_additional_attributes"/>
              </operator>
              <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join (45)" width="90" x="782" y="120">
                <list key="key_attributes"/>
              </operator>
              <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join (44)" width="90" x="782" y="210">
                <list key="key_attributes"/>
              </operator>
              <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join (43)" width="90" x="782" y="300">
                <list key="key_attributes"/>
              </operator>
              <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join (42)" width="90" x="782" y="390">
                <list key="key_attributes"/>
              </operator>
              <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join (2)" width="90" x="782" y="480">
                <list key="key_attributes"/>
              </operator>
              <connect from_port="in 1" to_op="Multiply (3)" to_port="input"/>
              <connect from_op="Multiply (3)" from_port="output 1" to_op="Cut (77)" to_port="example set input"/>
              <connect from_op="Multiply (3)" from_port="output 2" to_op="Cut (78)" to_port="example set input"/>
              <connect from_op="Multiply (3)" from_port="output 3" to_op="Cut (79)" to_port="example set input"/>
              <connect from_op="Multiply (3)" from_port="output 4" to_op="Cut (80)" to_port="example set input"/>
              <connect from_op="Multiply (3)" from_port="output 5" to_op="Cut (81)" to_port="example set input"/>
              <connect from_op="Multiply (3)" from_port="output 6" to_op="Cut (82)" to_port="example set input"/>
              <connect from_op="Cut (77)" from_port="example set output" to_op="Generate ID (77)" to_port="example set input"/>
              <connect from_op="Generate ID (77)" from_port="example set output" to_op="Rename (73)" to_port="example set input"/>
              <connect from_op="Rename (73)" from_port="example set output" to_op="Join (45)" to_port="left"/>
              <connect from_op="Cut (78)" from_port="example set output" to_op="Generate ID (78)" to_port="example set input"/>
              <connect from_op="Generate ID (78)" from_port="example set output" to_op="Rename (74)" to_port="example set input"/>
              <connect from_op="Rename (74)" from_port="example set output" to_op="Join (45)" to_port="right"/>
              <connect from_op="Cut (79)" from_port="example set output" to_op="Generate ID (79)" to_port="example set input"/>
              <connect from_op="Generate ID (79)" from_port="example set output" to_op="Rename (75)" to_port="example set input"/>
              <connect from_op="Rename (75)" from_port="example set output" to_op="Join (44)" to_port="right"/>
              <connect from_op="Cut (80)" from_port="example set output" to_op="Generate ID (80)" to_port="example set input"/>
              <connect from_op="Generate ID (80)" from_port="example set output" to_op="Rename (76)" to_port="example set input"/>
              <connect from_op="Rename (76)" from_port="example set output" to_op="Join (43)" to_port="right"/>
              <connect from_op="Cut (81)" from_port="example set output" to_op="Generate ID (81)" to_port="example set input"/>
              <connect from_op="Generate ID (81)" from_port="example set output" to_op="Rename (77)" to_port="example set input"/>
              <connect from_op="Rename (77)" from_port="example set output" to_op="Join (42)" to_port="right"/>
              <connect from_op="Cut (82)" from_port="example set output" to_op="Generate ID (82)" to_port="example set input"/>
              <connect from_op="Generate ID (82)" from_port="example set output" to_op="Rename (78)" to_port="example set input"/>
              <connect from_op="Rename (78)" from_port="example set output" to_op="Join (2)" to_port="right"/>
              <connect from_op="Join (45)" from_port="join" to_op="Join (44)" to_port="left"/>
              <connect from_op="Join (44)" from_port="join" to_op="Join (43)" to_port="left"/>
              <connect from_op="Join (43)" from_port="join" to_op="Join (42)" to_port="left"/>
              <connect from_op="Join (42)" from_port="join" to_op="Join (2)" to_port="left"/>
              <connect from_op="Join (2)" from_port="join" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>

Answers

  • MartinLiebig
    MartinLiebig
    Altair Employee
    I think regex is your friend.

    I read in your file as csv with a regex as delimiter

    \s+
    if it is one or more white spaces or

    \s{2,}
    if it is 2 or more whitespaces (etc.).

    This works quite well. Alternativly you can read the file and use a regex later on
  • sgenzer
    sgenzer
    Altair Employee
    thanks so much.  I should have remembered that!

    Unfortunately that does not really solve much for me...the files often "max out" the space allocations such that one field abuts another:

    1 GENZER        SCOTT      76        33      KERWIN HILL ROAD            NORWICH
    2 JOHNSON      JIM        81        4        MAIN ST                      NEW YORK CITY
    3 CLINTON      BILL        66        16      APPLEBY AVE                  LITTLE ROCK
    ...
    44HERNANDEZ-JIMEJOSE-EDUARDO51        12      MAIN ST                      NEW YORK CITY

    I also have numeric fields that abut like this:

    128866673

    where it should be

    12,88,666,73 (four separate attributes)

    more ideas?

    Thanks again!

    Scott
  • MartinLiebig
    MartinLiebig
    Altair Employee
    if you have some kind of schema, you might be able to have clever regular expressions.

    The "Rename" operator in combination with the "Split" operator is your friend in this case.
  • sgenzer
    sgenzer
    Altair Employee
    Yes I tried to use split but could not figure out how to create a correct split pattern.  How do I do that in this case?
  • MartinLiebig
    MartinLiebig
    Altair Employee
    My first approach would be to add , at the places where i need to split. For example for

    44HERNANDEZJIMEJOSE-EDUARDO51
    i would use the regex

    ([0-9]+)([\w-]*)([0-9]+)
    and replace it with

    $1,$2,$3
    Afterwards it is possible to split on the ,