"split operator"
nidis
New Altair Community Member
Hi all,
I'm having problems with the split transformation.
For example, my data set is:
ID, File name
I'm trying to extract the extension from the file name and create a new attribute with it, to end up with the following:
ID, File name, File extension
It just doesn't seem to be doing anything. When I write the output to a CSV, the file is exactly the same as the original; no new attributes are generated.
Can anybody advise?
Thanks
Answers
-
Hi,
What did you use as the split pattern? A simple dot? Be aware that this pattern is a regular expression, so if you want to split at the dot you have to escape it with a backslash (otherwise it is treated as a metacharacter in the expression): \.
Afterwards you have to rename the attributes if you want the names you posted. You might also consider the "Generate Extract" operator, which lets you name the new attributes directly and also extracts part of an existing attribute via a regular expression.
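To illustrate the difference between an escaped and an unescaped dot, here is a minimal sketch. It uses Python's `re` module as a stand-in for the regular expressions of the Split operator, which is an assumption: the sample file name is made up, and the way RapidMiner handles empty pieces may differ.

```python
import re

# Hypothetical sample value for the "File name" attribute.
name = "some.filename.ext"

# Unescaped dot: "." matches any character, so every character is treated
# as a separator and only empty strings are left between them.
unescaped = re.split(r".", name)

# Escaped dot: only literal dots are treated as separators.
escaped = re.split(r"\.", name)

print(unescaped)
print(escaped)
```

With the escaped pattern, a file name containing several dots yields one piece per dot-separated part, so renaming the resulting attributes is still needed afterwards.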
Regards,
Matthias
Hi colo,
Thanks for the reply.
If I escape the dot I get 20 new attributes. Freedom when choosing file names is not always good SNIF
Using the following expression:
\.[^.]*$
Which should match the last dot followed by any characters up to the end of the string.
But the result set is exactly the same.
Hi nidis,
I suppose the whole pattern match is used as the separator for the split. If you have a filename "some.filename.ext", your pattern will match everything from the last dot: ".ext". So the filename is split at every occurrence of ".ext", which is always the last part, so the part after the split will always be empty. If you still don't want to switch to the "Generate Extract" operator, you must use assertions (because the text checked by an assertion is not included in the overall match). If you want to match only the last dot in a string you could use \.(?!.*\.)
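The two patterns can be compared side by side in a small sketch. Python's `re.split` is used here only as an approximation of the Split operator (an assumption, as is the sample file name); the point is which characters each pattern consumes as the separator.

```python
import re

# Hypothetical sample value for the "File name" attribute.
name = "some.filename.ext"

# The whole match ".ext" is consumed as the separator,
# so nothing is left after the split point.
whole_match = re.split(r"\.[^.]*$", name)

# The negative lookahead only checks that no further dot follows;
# the separator is just the last dot, so the extension survives.
last_dot_only = re.split(r"\.(?!.*\.)", name)

print(whole_match)    # ['some.filename', '']
print(last_dot_only)  # ['some.filename', 'ext']
```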
Regards,
Matthias
Matthias,
Thanks a lot, that did the trick.
As for the generate extract operator, I can't find it under data transformation operators.
Cheers
miguel
Hi Miguel,
you can find the "Generate Extract" operator under Data Transformation, Attribute Set Reduction and Transformation, Generation. But it's much simpler to use the operator search feature.
If you want to use it, here is a little example. The first capturing group is used as the value for the new attribute.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
<process expanded="true" height="235" width="279">
<operator activated="true" class="read_csv" compatibility="5.0.8" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="file_name" value="file:/C:/Dokumente%20und%20Einstellungen/mraeder/Desktop/Test.csv"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="text:generate_extract" compatibility="5.0.6" expanded="true" height="60" name="Generate Extract" width="90" x="179" y="30">
<parameter key="source_attribute" value="File name"/>
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="File Extension" value=".*\.(.*)"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Generate Extract" to_port="Example Set"/>
<connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
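To see what the expression `.*\.(.*)` from the process captures, here is a small sketch in Python. It assumes the expression has to match the whole value, which may not be exactly how the operator applies it; the helper name `file_extension` and the sample file names are made up for illustration.

```python
import re

# Same expression as in the "File Extension" query of the process above.
pattern = re.compile(r".*\.(.*)")

def file_extension(file_name):
    """Return the first capturing group, or None if there is no dot."""
    match = pattern.fullmatch(file_name)
    return match.group(1) if match else None

# The greedy ".*" runs up to the last dot, so group 1 is the last extension.
print(file_extension("some.filename.ext"))  # ext
print(file_extension("archive.tar.gz"))     # gz
print(file_extension("README"))             # None
```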
Regards,
Matthias
Hi,
note that "Generate Extract" is in the Text Mining Extension.
Best,
Simon
I'm sorry, I wasn't aware of this. Thanks for mentioning it, Simon.
Perhaps a note in the operator description saying which extension the operator belongs to would be a simple way to make this clearer?
Regards,
Matthias
Hi,
if you post a process into the forum or to myExperiment, the XML code will contain the information from which extension an operator comes. RapidMiner will offer a quick fix to install missing extensions when opening such processes.
In RapidMiner itself, extensions are mostly color-coded.
Best,
Simon