How to split an attribute based on a condition on the split pattern?
lionelderkrikor
New Altair Community Member
Hi,
I'm extracting the usernames from e-mail addresses, and I want to split each username at the
separator between the first name and the last name. (The separator differs from one username to the next.)
For example, here is the initial dataset:
Username
john.doe
John_Doe
I want to obtain the following dataset:
Username_1 Username_2
john doe
John Doe
For this I tried to use the Branch operator, but I encountered an error.
Here is my process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="utility:create_exampleset" compatibility="9.3.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85"> <parameter key="generator_type" value="comma separated text"/> <parameter key="number_of_examples" value="100"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"/> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="input_csv_text" value="Username&#10;john.doe&#10;John_Doe"/> <parameter key="column_separator" value=","/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply (2)" width="90" x="313" y="85"/> <operator activated="true" breakpoints="before" class="branch" compatibility="9.3.000" expanded="true" height="103" name="Branch" width="90" x="514" y="85"> <parameter key="condition_type" value="expression"/> <parameter key="condition_value" value="[Username]==john.doe"/> <parameter key="expression" value="contains([Username],&quot;.&quot;)==TRUE"/> <parameter key="io_object" value="ANOVAMatrix"/>
<parameter key="return_inner_output" value="true"/> <process expanded="true"> <operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply (3)" width="90" x="45" y="238"/> <operator activated="true" class="select_attributes" compatibility="9.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="238"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="Username"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" breakpoints="before" class="split" compatibility="9.3.000" expanded="true" height="82" name="Split (2)" width="90" x="179" y="136"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="Username"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="split_pattern" value="[.]"/> <parameter key="split_mode" value="ordered_split"/> </operator> <operator activated="true" class="union" 
compatibility="9.3.000" expanded="true" height="82" name="Union" width="90" x="380" y="136"/> <connect from_port="condition" to_port="input 1"/> <connect from_port="input 1" to_op="Multiply (3)" to_port="input"/> <connect from_op="Multiply (3)" from_port="output 1" to_op="Split (2)" to_port="example set input"/> <connect from_op="Multiply (3)" from_port="output 2" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_op="Union" to_port="example set 2"/> <connect from_op="Split (2)" from_port="example set output" to_op="Union" to_port="example set 1"/> <connect from_op="Union" from_port="union" to_port="input 2"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> <portSpacing port="sink_input 3" spacing="0"/> </process> <process expanded="true"> <connect from_port="condition" to_port="input 1"/> <connect from_port="input 1" to_port="input 2"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> <portSpacing port="sink_input 3" spacing="0"/> </process> </operator> <operator activated="true" class="branch" compatibility="9.3.000" expanded="true" height="103" name="Branch (2)" width="90" x="648" y="85"> <parameter key="condition_type" value="expression"/> <parameter key="condition_value" value="Username==John_doe"/> <parameter key="expression" value="contains([Username],&quot;_&quot;)==TRUE"/> <parameter key="io_object" value="ANOVAMatrix"/> <parameter key="return_inner_output" value="true"/> <process expanded="true"> <operator activated="true" breakpoints="after" class="split" compatibility="9.3.000" expanded="true" height="82" 
name="Split" width="90" x="179" y="136"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="Username"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="split_pattern" value="[_]"/> <parameter key="split_mode" value="ordered_split"/> </operator> <connect from_port="condition" to_port="input 1"/> <connect from_port="input 1" to_op="Split" to_port="example set input"/> <connect from_op="Split" from_port="example set output" to_port="input 2"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> <portSpacing port="sink_input 3" spacing="0"/> </process> <process expanded="true"> <connect from_port="condition" to_port="input 1"/> <connect from_port="input 1" to_port="input 2"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> <portSpacing port="sink_input 3" spacing="0"/> </process> </operator> <connect from_op="Create ExampleSet" from_port="output" to_op="Multiply (2)" to_port="input"/> <connect from_op="Multiply (2)" from_port="output 1" to_op="Branch" to_port="condition"/> <connect from_op="Multiply (2)" from_port="output 2" to_op="Branch" to_port="input 1"/> 
<connect from_op="Branch" from_port="input 1" to_op="Branch (2)" to_port="condition"/> <connect from_op="Branch" from_port="input 2" to_op="Branch (2)" to_port="input 1"/> <connect from_op="Branch (2)" from_port="input 2" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Can you help me?
Regards,
Lionel
Answers
-
This seems more like a bug with the Branch operator, as it should recognize the attribute to begin with.
As for your issue, why not replace all known separator symbols with an underscore using a regex? I'd assume that, apart from the dot, not many are commonly used in e-mail addresses. The split would then be on the underscore in every case.
-
Try this, which is a bit simpler: a sequence of two attribute generators based on a regular expression, matching the first and the second component respectively, i.e.
- replaceAll(name,"^([a-z0-9]+)[-_+]([a-z0-9]+)$","$1")
- replaceAll(name,"^([a-z0-9]+)[-_+]([a-z0-9]+)$","$2")
You can adjust the character class in the middle of the regular expression to cover any separators you need (as written it covers "-", "_" and "+", but not the dot).
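For anyone who wants to check the two-expression idea outside RapidMiner, here is a minimal Python sketch of the same logic. It is an illustration only, not the poster's process: the function name `split_username` is hypothetical, and the pattern is widened (an assumption) to accept upper-case letters and the dot so that it matches both sample usernames from the question.

```python
import re

# Same shape as the pattern in the post. Assumption: the character classes are
# widened to allow upper-case letters and "." as a separator, so that both
# "john.doe" and "John_Doe" match.
PATTERN = re.compile(r"^([A-Za-z0-9]+)[-_+.]([A-Za-z0-9]+)$")


def split_username(username):
    """Return (first, second) using two replacements, one per capture group.

    If the username does not match the pattern, both values are the
    unchanged input, because a regex replace leaves non-matching text alone.
    """
    first = PATTERN.sub(r"\1", username)   # keep only the first component
    second = PATTERN.sub(r"\2", username)  # keep only the second component
    return first, second


# Sample usernames taken from the question.
results = [split_username(u) for u in ["john.doe", "John_Doe"]]
for first, second in results:
    print(first, second)
```

Note that a username with no separator, or with more than two components, comes back unchanged in both positions, so such rows would need separate handling.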
-
Jacob, this would basically only work if there is exactly one first name and one last name. Granted, having several of these would be a problem anyway, but if you want to split on all of them you need a more flexible pattern.
-
I think the solution from @kayman is the easiest: since there are only a few common e-mail separators, such as ".", "-" and "_", they can easily be replaced by a single one, which is then used for the split.
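The replace-then-split idea can be sketched in a few lines of Python to check the logic outside RapidMiner. This is only an illustration under assumptions: the separator set (".", "-", "_"), the helper name `normalize_and_split` and the sample value "jane-roe" are not from the thread.

```python
import re

# Assumed set of separators to normalize: dot, hyphen and underscore.
SEPARATORS = re.compile(r"[.\-_]")


def normalize_and_split(username, target="_"):
    """Replace every known separator with `target`, then split on `target`."""
    normalized = SEPARATORS.sub(target, username)
    return normalized.split(target)


# The first two usernames come from the question; "jane-roe" is a made-up example.
rows = [normalize_and_split(u) for u in ["john.doe", "John_Doe", "jane-roe"]]
for parts in rows:
    print(parts)
```

Unlike a fixed two-group pattern, this yields one piece per component, so a username with three parts simply produces three pieces.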
-
Yes, I concur with @kayman and @Telcontar120; this is exactly how I would approach this problem: split using a regex.
Scott
-
Dear all,
Thank you for your contributions.
Indeed, @kayman's solution gives good results on my original dataset and solves this problem.
Once again, thank you for spending time on this problem.
Regards,
Lionel