"split operator"

nidis
nidis New Altair Community Member
edited November 5 in Community Q&A
Hi all,
I'm having problems with the split transformation.
For example, my data set has these columns:
ID, File name
I'm trying to extract the extension from the file name and create a new attribute with it,
so that I end up with the following:
ID, File name, File extension

It just doesn't seem to do anything. When I write the output to a CSV, the file is exactly the same as the original; no new attributes are generated.

Can anybody advise?
Thanks

Answers

  • colo
    colo New Altair Community Member
    Hi,

    Which split pattern did you use? A simple dot? Be aware that the pattern is a regular expression, so if you want to split at the dot you have to escape it with a backslash (otherwise it is treated as a metacharacter that matches any character): \.

    Afterwards you have to rename the attributes, if you want to have the names you posted. Perhaps you should take the "Generate Extract" operator into consideration, which allows naming the new attributes directly and also takes something from an existing attribute via regular expression.

    Regards,
    Matthias
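
To illustrate the escaping point above, here is a minimal sketch using Python's `re` module as a stand-in. This is an assumption for illustration only: RapidMiner runs on Java's regex engine, and its split operator may treat empty pieces differently. The file name is made up.

```python
import re

filename = "my.long.file.name.txt"  # hypothetical file name

# Unescaped dot: a metacharacter that matches ANY character,
# so every character of the input is treated as a separator.
unescaped = re.split(".", "a.b")
print(unescaped)

# Escaped dot: matches only a literal ".", so the name is cut
# into one piece per dot-separated part.
parts = re.split(r"\.", filename)
print(parts)
```

With the escaped dot, a name containing several dots produces several pieces, which is why splitting on every dot can generate many more attributes than just "name" and "extension".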
  • nidis
    nidis New Altair Community Member
    Hi colo,
    Thanks for the reply.
    If I escape the dot I get 20 new attributes. Freedom when choosing file names is not always good. SNIF
    Using the following expression:
    \.[^.]*$
    Which should match the last dot followed by any characters other than a dot, up to the end of the string.
    But the result set is exactly the same.
  • colo
    colo New Altair Community Member
    Hi nidis,

    I suppose the whole pattern match is used for the split. For a file name like "some.filename.ext", your pattern matches everything from the last dot onward: ".ext". The file name is therefore split at ".ext", which is always the last part, so the piece after the split is always empty. If you still don't want to switch to the "Generate Extract" operator, you have to use a lookahead assertion, because text matched inside an assertion is not part of the overall match and is therefore not consumed by the split. To match only the last dot in a string you could use \.(?!.*\.)

    Regards,
    Matthias
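
The difference between the two patterns discussed above can be sketched in Python's `re` module. This is only an approximation of the behaviour, assuming the split operator splits on the whole match as described; RapidMiner itself uses Java regular expressions, and the way it handles a trailing empty piece is not confirmed here.

```python
import re

filename = "some.filename.ext"

# The whole match (".ext") is consumed as the separator,
# so nothing is left after the split point.
consuming = re.split(r"\.[^.]*$", filename)
print(consuming)

# The negative lookahead only checks that no further dot follows;
# it consumes nothing, so only the last dot itself is the separator.
lookahead = re.split(r"\.(?!.*\.)", filename)
print(lookahead)
```

In this sketch the first pattern leaves an empty second piece, while the lookahead version yields the base name and the extension as two separate pieces.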
  • nidis
    nidis New Altair Community Member
    Matthias,
    Thanks a lot, that did the trick.
    As for the generate extract operator, I can't find it under data transformation operators.
    Cheers
      miguel
  • colo
    colo New Altair Community Member
    Hi Miguel,

    you can find "Generate Extract" under Data Transformation, Attribute Set Reduction and Transformation, Generation. But it's much simpler to use the operator search feature ;)

    If you want to use it, here is a little example:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
       <process expanded="true" height="235" width="279">
         <operator activated="true" class="read_csv" compatibility="5.0.8" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
           <parameter key="file_name" value="file:/C:/Dokumente%20und%20Einstellungen/mraeder/Desktop/Test.csv"/>
           <list key="data_set_meta_data_information"/>
         </operator>
         <operator activated="true" class="text:generate_extract" compatibility="5.0.6" expanded="true" height="60" name="Generate Extract" width="90" x="179" y="30">
           <parameter key="source_attribute" value="File name"/>
           <parameter key="query_type" value="Regular Expression"/>
           <list key="string_machting_queries"/>
           <list key="regular_expression_queries">
             <parameter key="File Extension" value=".*\.(.*)"/>
           </list>
           <list key="regular_region_queries"/>
           <list key="xpath_queries"/>
           <list key="namespaces"/>
           <list key="index_queries"/>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Generate Extract" to_port="Example Set"/>
         <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    The first capturing group is used as value for the new attribute.

    Regards,
    Matthias
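
As a rough illustration of how the expression `.*\.(.*)` from the process above picks out the extension, here is a sketch in Python's `re` module. The helper name `file_extension` is hypothetical, and using a full match is an assumption; how "Generate Extract" applies the expression internally is not confirmed by this thread.

```python
import re
from typing import Optional


def file_extension(name: str) -> Optional[str]:
    """Return the text after the last dot, or None if there is no dot."""
    # The leading ".*" is greedy, so it runs up to the LAST dot;
    # the first capturing group then holds everything after that dot.
    m = re.fullmatch(r".*\.(.*)", name)
    return m.group(1) if m else None


print(file_extension("some.filename.ext"))
print(file_extension("README"))
```

Because the first `.*` is greedy, a name with several dots such as "archive.tar.gz" yields only the final part, and a name without any dot does not match at all.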
  • fischer
    fischer New Altair Community Member
    Hi,

    note that "Generate Extract" is in the Text Mining Extension.

    Best,
    Simon
  • colo
    colo New Altair Community Member
    I'm sorry, I wasn't aware of this. Thanks for mentioning, Simon.

    Perhaps a short note in the operator description saying which extension an operator belongs to would be a simple way to make this clearer?

    Regards,
    Matthias
  • fischer
    fischer New Altair Community Member
    Hi,

    if you post a process into the forum or to myExperiment, the XML code will contain the information from which extension an operator comes. RapidMiner will offer a quick fix to install missing extensions when opening such processes.

    In RapidMiner itself, extensions are mostly color-coded.

    Best,
    Simon