"Text Mining with Excel File"

joshua_gelhaar
joshua_gelhaar New Altair Community Member
edited November 5 in Community Q&A

Hi,

I have an Excel file with e-mail addresses in one column. Now I want to add a second column in which the addresses are grouped, for example @abc is group 1, @dfg is group 2, and so on. I thought about using text mining on the addresses, but I already failed at converting the Excel file into documents with Data to Documents.

Hoping for help.

Greetings,

Joshua

Best Answer

  • MartinLiebig
    MartinLiebig
    Altair Employee
    Answer ✓

    Hi,

     

    So you want to extract the domain of an email address? If so, you can do this with the Replace operator. Attached is an example process.

     

    Cheers,

    Martin

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="8.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="8.0.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="179" y="85">
            <list key="attribute_values">
              <parameter key="mail" value="&quot;name@domain.com&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_copy" compatibility="8.0.001" expanded="true" height="82" name="Generate Copy" width="90" x="313" y="85">
            <parameter key="attribute_name" value="mail"/>
            <parameter key="new_name" value="domain"/>
          </operator>
          <operator activated="true" class="replace" compatibility="8.0.001" expanded="true" height="82" name="Replace" width="90" x="447" y="85">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="domain"/>
            <parameter key="replace_what" value=".+@(.+)"/>
            <parameter key="replace_by" value="$1"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Copy" to_port="example set input"/>
          <connect from_op="Generate Copy" from_port="example set output" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
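
For readers who want to see what the `replace_what` / `replace_by` pair in the process above does to a single value, here is a minimal Python sketch of the same regular expression. It is only an illustration outside RapidMiner: the function name `extract_domain` is hypothetical, and it assumes the Replace operator applies the pattern the way a standard regex substitution does.

```python
import re

# Same pattern as the replace_what parameter in the process above.
DOMAIN_PATTERN = re.compile(r".+@(.+)")


def extract_domain(mail: str) -> str:
    """Return the part after the '@'; values without a match are returned unchanged."""
    # The process writes the back-reference as $1; Python's re module writes it as \1.
    return DOMAIN_PATTERN.sub(r"\1", mail)


print(extract_domain("name@domain.com"))  # domain.com
```

The capture group `(.+)` keeps everything after the `@`, and the substitution replaces the whole address with that captured part, which is why the copied `domain` attribute ends up holding only the domain.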

Answers

  • joshua_gelhaar
    joshua_gelhaar New Altair Community Member

    Hello Martin,

    thank you for your answer. I think the Replace operator is not the right one for my problem. In my Excel list I have a lot of e-mails from different companies. Now I want to add another column and group them, so that all e-mails with @company1 get the number 1 and all e-mails with @company2 get the number 2 in the new column.

  • MartinLiebig
    MartinLiebig
    Altair Employee

    Hi,

     

    have a look at the process I have posted. It will give you a new attribute called domain, with "company1.com" for the one and "company2.com" for the other.

     

    Best,

    Martin

  • joshua_gelhaar
    joshua_gelhaar New Altair Community Member

    Okay, thank you. How can I use/copy your process in my RapidMiner?

  • MartinLiebig
    MartinLiebig
    Altair Employee
  • joshua_gelhaar
    joshua_gelhaar New Altair Community Member

    Thank you! Looks like a great solution, I will try!
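
As discussed in the thread, the extracted `domain` attribute already works as a grouping key. If a numeric group id like the 1 and 2 from the question is really needed, the idea can be sketched in Python as follows. This is a hedged illustration, not a RapidMiner operator: the function name `number_groups` is hypothetical, and treating domains as case-insensitive is an assumption that the thread does not state.

```python
def number_groups(domains):
    """Assign 1, 2, 3, ... to domains in order of first appearance."""
    group_ids = {}  # maps each domain to its group number
    result = []
    for domain in domains:
        key = domain.lower()  # assumption: "Company1.com" and "company1.com" are the same group
        if key not in group_ids:
            group_ids[key] = len(group_ids) + 1
        result.append(group_ids[key])
    return result


print(number_groups(["company1.com", "company2.com", "company1.com"]))  # [1, 2, 1]
```

Numbering by first appearance means the ids depend on the row order of the Excel file, so the domain string itself remains the more stable group label.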