Handling duplicated columns, but with text!

Nada_Faisal1991 New Altair Community Member
edited November 5 in Community Q&A
Hello there, fellow miners,

I'm a RapidMiner beginner, and I am trying to detect and then delete duplicated columns in an example set that holds text rather than numbers.
With numbers it was easy; removing correlated attributes did the job perfectly.
But things got complicated with text. Is there a way to either a) do something similar to the correlation removal for numbers, or b) convert the text to numbers while keeping the columns intact, rather than splitting them by value as the "Nominal to Numerical" operator does?

Thank you. :)
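
To illustrate option b) outside RapidMiner, here is a minimal Python sketch of integer-encoding text columns while keeping one column per attribute. The sample data and the helper name `encode_columns` are hypothetical and only serve the illustration. Every distinct text value gets one integer code shared across all columns, so two identical text columns become two identical numeric columns.

```python
# Hypothetical sample data: three text columns, where "3" repeats "1".
columns = {
    "1": ["Tree", "Fruit", "Fruit", "Tree"],
    "2": ["Fruit", "Fruit", "True", "Tree"],
    "3": ["Tree", "Fruit", "Fruit", "Tree"],
}


def encode_columns(cols):
    """Map every distinct text value to one integer code shared by all columns."""
    codes = {}    # text value -> integer code
    encoded = {}  # column name -> list of integer codes
    for name, values in cols.items():
        # setdefault assigns the next free integer the first time a value is seen
        encoded[name] = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes


encoded, codes = encode_columns(columns)
```

The codes carry no natural order, so a correlation computed on them should be read with care; comparing the encoded columns for equality is the safer check.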

Answers

  • MartinLiebig
    Altair Employee
    If you want to delete exact duplicate texts, you can use the Remove Duplicates operator. If you want to remove similar texts such as
    "RapidMiner is great!" and "rapidminer is great", then it gets a bit trickier.
    Best,
    Martin
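
For the similar-text case, one possible approach is to normalize each text before comparing. The sketch below is plain Python rather than a RapidMiner process, and the helper names `normalize` and `remove_near_duplicates` are hypothetical. It assumes that "similar" means equal after lowercasing, stripping ASCII punctuation, and collapsing whitespace; fuzzier notions of similarity would need a different technique.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def remove_near_duplicates(texts):
    """Keep the first text for each normalized form, preserving input order."""
    seen = set()
    kept = []
    for t in texts:
        key = normalize(t)
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept


texts = ["RapidMiner is great!", "rapidminer is great", "Something else"]
unique_texts = remove_near_duplicates(texts)
```

With this input, the two "great" variants normalize to the same key, so only the first of them is kept.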
  • Nada_Faisal1991
    Nada_Faisal1991 New Altair Community Member
    edited September 2020
    mschmitz, thank you for the response, but Filter Examples did not work for me, or at least I did not know how to make it work.

    Suppose I have a table with the following text content:

    |   1   |   2   |   3   |
    -------------------------
    | Tree  | Fruit | Tree  |
    | Fruit | Fruit | Fruit |
    | Fruit | True  | Fruit |
    | Tree  | Tree  | Tree  |
    -------------------------

    I need to remove column 3, or at least detect that column 3 is an exact duplicate of column 1.

    Thank you all for your wisdom.
  • Vanlal
    Vanlal New Altair Community Member
    Hi,
      You can use the process below:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.6.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="136">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="1,2,3&#10;Tree,Fruit,Tree&#10;Fruit,Fruit,Fruit&#10;Fruit,True,Fruit&#10;Tree,Tree,Tree"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="transpose" compatibility="9.6.000" expanded="true" height="82" name="Transpose" width="90" x="179" y="136"/>
          <operator activated="true" class="remove_duplicates" compatibility="9.6.000" expanded="true" height="103" name="Remove Duplicates" width="90" x="313" y="136">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="treat_missing_values_as_duplicates" value="false"/>
          </operator>
          <operator activated="true" class="transpose" compatibility="9.6.000" expanded="true" height="82" name="Transpose (2)" width="90" x="447" y="136"/>
          <operator activated="true" class="select_attributes" compatibility="9.6.000" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="136">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id"/>
            <parameter key="attributes" value="id"/>
            <parameter key="regular_expression" value="id"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Transpose" to_port="example set input"/>
          <connect from_op="Transpose" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_op="Transpose (2)" to_port="example set input"/>
          <connect from_op="Transpose (2)" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    
    Hope this solves it.
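
As far as the XML shows, the process above transposes the table so that columns become rows, runs Remove Duplicates on those rows, transposes back, and then drops the `id` attribute with Select Attributes. The same idea can be sketched in plain Python by treating each column as one hashable value. The function name `drop_duplicate_columns` is hypothetical; the data is the example table from this thread. The sketch also reports which column each duplicate copies.

```python
def drop_duplicate_columns(columns):
    """Return (kept, duplicates): the kept columns and a map duplicate -> original."""
    first_seen = {}  # tuple of column values -> name of the first column holding them
    kept = {}
    duplicates = {}
    for name, values in columns.items():
        key = tuple(values)  # the whole column acts like one transposed "row"
        if key in first_seen:
            duplicates[name] = first_seen[key]
        else:
            first_seen[key] = name
            kept[name] = values
    return kept, duplicates


# Example table from the thread, stored column by column.
columns = {
    "1": ["Tree", "Fruit", "Fruit", "Tree"],
    "2": ["Fruit", "Fruit", "True", "Tree"],
    "3": ["Tree", "Fruit", "Fruit", "Tree"],
}
kept, duplicates = drop_duplicate_columns(columns)
```

Here `kept` holds columns 1 and 2, and `duplicates` records that column 3 repeats column 1.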
  • Nada_Faisal1991
    Nada_Faisal1991 New Altair Community Member
    edited September 2020
    Vanlal, thank you so much, that worked! :)