How to split an attribute based on a condition on the split pattern?

lionelderkrikor
lionelderkrikor New Altair Community Member
edited November 5 in Community Q&A
Hi,

I'm extracting the usernames from e-mail addresses, and I want to split each username at the 
separator between the first name and the last name. (The separator is different for each username.)

For example, here is the initial dataset:

Username
john.doe
John_Doe

I want to obtain the following dataset:

Username_1    Username_2
john          doe
John          Doe


For this I tried to use the Branch operator, but I encountered an error.

Here is my process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.3.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="Username&#10;john.doe&#10;John_Doe"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply (2)" width="90" x="313" y="85"/>
      <operator activated="true" breakpoints="before" class="branch" compatibility="9.3.000" expanded="true" height="103" name="Branch" width="90" x="514" y="85">
        <parameter key="condition_type" value="expression"/>
        <parameter key="condition_value" value="[Username]==john.doe"/>
        <parameter key="expression" value="contains([Username],&quot;.&quot;)==TRUE"/>
        <parameter key="io_object" value="ANOVAMatrix"/>
        <parameter key="return_inner_output" value="true"/>
        <process expanded="true">
          <operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply (3)" width="90" x="45" y="238"/>
          <operator activated="true" class="select_attributes" compatibility="9.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="238">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Username"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" breakpoints="before" class="split" compatibility="9.3.000" expanded="true" height="82" name="Split (2)" width="90" x="179" y="136">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Username"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="split_pattern" value="[.]"/>
            <parameter key="split_mode" value="ordered_split"/>
          </operator>
          <operator activated="true" class="union" compatibility="9.3.000" expanded="true" height="82" name="Union" width="90" x="380" y="136"/>
          <connect from_port="condition" to_port="input 1"/>
          <connect from_port="input 1" to_op="Multiply (3)" to_port="input"/>
          <connect from_op="Multiply (3)" from_port="output 1" to_op="Split (2)" to_port="example set input"/>
          <connect from_op="Multiply (3)" from_port="output 2" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Union" to_port="example set 2"/>
          <connect from_op="Split (2)" from_port="example set output" to_op="Union" to_port="example set 1"/>
          <connect from_op="Union" from_port="union" to_port="input 2"/>
          <portSpacing port="source_condition" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_input 1" spacing="0"/>
          <portSpacing port="sink_input 2" spacing="0"/>
          <portSpacing port="sink_input 3" spacing="0"/>
        </process>
        <process expanded="true">
          <connect from_port="condition" to_port="input 1"/>
          <connect from_port="input 1" to_port="input 2"/>
          <portSpacing port="source_condition" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_input 1" spacing="0"/>
          <portSpacing port="sink_input 2" spacing="0"/>
          <portSpacing port="sink_input 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="branch" compatibility="9.3.000" expanded="true" height="103" name="Branch (2)" width="90" x="648" y="85">
        <parameter key="condition_type" value="expression"/>
        <parameter key="condition_value" value="Username==John_doe"/>
        <parameter key="expression" value="contains([Username],&quot;_&quot;)==TRUE"/>
        <parameter key="io_object" value="ANOVAMatrix"/>
        <parameter key="return_inner_output" value="true"/>
        <process expanded="true">
          <operator activated="true" breakpoints="after" class="split" compatibility="9.3.000" expanded="true" height="82" name="Split" width="90" x="179" y="136">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Username"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="split_pattern" value="[_]"/>
            <parameter key="split_mode" value="ordered_split"/>
          </operator>
          <connect from_port="condition" to_port="input 1"/>
          <connect from_port="input 1" to_op="Split" to_port="example set input"/>
          <connect from_op="Split" from_port="example set output" to_port="input 2"/>
          <portSpacing port="source_condition" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_input 1" spacing="0"/>
          <portSpacing port="sink_input 2" spacing="0"/>
          <portSpacing port="sink_input 3" spacing="0"/>
        </process>
        <process expanded="true">
          <connect from_port="condition" to_port="input 1"/>
          <connect from_port="input 1" to_port="input 2"/>
          <portSpacing port="source_condition" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_input 1" spacing="0"/>
          <portSpacing port="sink_input 2" spacing="0"/>
          <portSpacing port="sink_input 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Multiply (2)" to_port="input"/>
      <connect from_op="Multiply (2)" from_port="output 1" to_op="Branch" to_port="condition"/>
      <connect from_op="Multiply (2)" from_port="output 2" to_op="Branch" to_port="input 1"/>
      <connect from_op="Branch" from_port="input 1" to_op="Branch (2)" to_port="condition"/>
      <connect from_op="Branch" from_port="input 2" to_op="Branch (2)" to_port="input 1"/>
      <connect from_op="Branch (2)" from_port="input 2" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Can you help me?

Regards,

Lionel




Answers

  • kayman
    kayman New Altair Community Member
    Answer ✓
    Seems more like a bug with the branch operator, as it should recognize the attribute to start with.

    As for your issue, why don't you just replace all known separator symbols with an underscore using a regex? I'd assume there are not many besides the dot that are commonly used in email addresses. A single split on the underscore would then handle every username.
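
The normalize-then-split idea above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the logic, not the RapidMiner operators themselves; the separator set (".", "-") and the helper name normalize_and_split are assumptions made for the example.

```python
import re

# Assumed set of separators to normalize; extend the class if others occur.
SEPARATORS = re.compile(r"[.\-]")

def normalize_and_split(username):
    """Replace every known separator with "_" and then split on "_"."""
    normalized = SEPARATORS.sub("_", username)
    return normalized.split("_")

# The two sample usernames from the question.
usernames = ["john.doe", "John_Doe"]
rows = [normalize_and_split(u) for u in usernames]
for first, last in rows:
    print(first, last)
```

Because every separator is mapped to the same character first, the split step needs no per-row condition, which is what the Branch construction was trying to provide.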
  • jacobcybulski
    jacobcybulski New Altair Community Member
    edited June 2019
    Try this, which is a bit simpler: a sequence of two attribute generators based on a regular expression, matching the first and the second component, i.e.
    • replaceAll(name,"^([a-z0-9]+)[-_+]([a-z0-9]+)$","$1")
    • replaceAll(name,"^([a-z0-9]+)[-_+]([a-z0-9]+)$","$2")
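    You can adjust the regular expression to put any separators in the middle. As written, it only matches lowercase letters and digits around "-", "_" or "+", so for the example data you would add "." to the separator class and allow uppercase letters (e.g. [A-Za-z0-9]).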
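
The capture-group approach can be tried out in Python as a rough sketch. Two things differ from the expressions above and are assumptions of this example: the pattern is widened to accept uppercase letters and ".", and Python writes backreferences as \1 and \2 rather than $1 and $2. The helper name first_and_last is hypothetical.

```python
import re

# Widened variant of the pattern: upper/lowercase letters, digits,
# and one of "-", "_", ".", "+" as the separator (an assumption).
PATTERN = re.compile(r"^([A-Za-z0-9]+)[-_.+]([A-Za-z0-9]+)$")

def first_and_last(username):
    """Return (first, last) by replacing the whole match with one group."""
    first = PATTERN.sub(r"\1", username)
    last = PATTERN.sub(r"\2", username)
    return first, last

print(first_and_last("john.doe"))
print(first_and_last("John_Doe"))
```

Note that a username the pattern does not match is returned unchanged in both positions, so such rows may need separate handling.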


  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓
    I think the solution from @kayman is the easiest: since there are only a few common email separators, such as ".", "-" and "_", they can easily be replaced by a single one, which is then used for the split.
     
  • sgenzer
    sgenzer
    Altair Employee
    Yes, I concur with @kayman and @Telcontar120; this is exactly how I would approach this problem: split using a regex.

    Scott
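
Splitting with a regex can also be done in one step, by putting all separators into a single character class. The Python sketch below shows the idea; the class [._-] and the helper name split_username are assumptions for illustration. The split_pattern parameter in the process above already holds a regular expression ("[.]"), so a class like this is presumably usable there too, but that has not been verified in RapidMiner here.

```python
import re

def split_username(username):
    """Split at the first ".", "_" or "-" into at most two parts."""
    return re.split(r"[._-]", username, maxsplit=1)

print(split_username("john.doe"))
print(split_username("John_Doe"))
```

Limiting the split to one occurrence keeps the result at two columns even when a name contains a second separator.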
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Dear all,

    Thank you for your contributions.
    Indeed, @kayman's solution gives good results on my original dataset and solves the problem.
    Once again, thank you for spending time on this problem.

    Regards,

    Lionel