Arrange list of names by similarity?

User: "DAVID_EALES"
New Altair Community Member
Updated by Jocelyn

Hi All,

 

I am a complete novice with RapidMiner and, despite watching multiple videos and trawling the forum, I cannot work out how to solve what I think is a very simple problem!

 

I have a list of names (approx. 5k). All I want to achieve is to sort this list of names by similarity.

 

All I have process-wise so far is:

 

<?xml version="1.0" encoding="UTF-8"?>
<process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Local Repository/email test"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="8.1.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="name_recipients"/>
      </operator>
      <operator activated="true" class="data_to_similarity" compatibility="8.1.001" expanded="true" height="82" name="Data to Similarity" width="90" x="514" y="136"/>
      <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
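
For readers who want to see the goal outside RapidMiner, here is a minimal Python sketch of "sorting by similarity". It is only an illustration under stated assumptions: it uses the standard library's `difflib.SequenceMatcher` as the similarity measure (not whatever measure Data to Similarity applies), the helper names `similarity` and `order_by_similarity` are hypothetical, and the sample names are invented rather than taken from the real data set. It builds a greedy chain so that each name is followed by the most similar name not yet placed.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def order_by_similarity(names: list[str]) -> list[str]:
    """Greedy chain: start from the first name, then repeatedly append
    the remaining name that is most similar to the one just placed."""
    if not names:
        return []
    remaining = list(names)
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]
        best = max(remaining, key=lambda n: similarity(last, n))
        remaining.remove(best)
        ordered.append(best)
    return ordered


# Hypothetical sample names, not from the original data set.
names = ["John Smith", "Mary Jones", "Jon Smith", "Mary Jonas", "J. Smith"]
ordered = order_by_similarity(names)
print(ordered)
```

This compares every placed name against all remaining names, so the work grows quadratically with the list size. For roughly 5k names that is likely to be slow but still workable as a one-off.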

 

I would be most grateful for anyone's assistance.

 

Kind Regards

 

 

    User: "lionelderkrikor"
    New Altair Community Member
    Accepted Answer

    Hi again @DAVID_EALES,

     

    An interesting but difficult task.

    I found a resource in the community that seems relevant to your project.

     

    To sum up, you can use the Deduplicate Names operator of the Rosette Text Analytics extension.

    This extension must be installed from the Marketplace, and you must obtain an API key to use it.
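
    As a rough illustration of what grouping near-duplicate names means, here is a small Python sketch. It is not the Rosette operator and does not call its API: it assumes the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, the helper `group_names` and the 0.8 threshold are hypothetical choices, and the sample names are invented.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_names(names: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedy grouping: add each name to the first group whose first
    member is at least `threshold` similar, else start a new group."""
    groups: list[list[str]] = []
    for name in names:
        for group in groups:
            if similarity(group[0], name) >= threshold:
                group.append(name)
                break
        else:
            # No existing group was close enough.
            groups.append([name])
    return groups


# Hypothetical sample names, not from the original data set.
names = ["John Smith", "Mary Jones", "Jon Smith", "Mary Jonas", "Jhon Smith"]
groups = group_names(names)
for group in groups:
    print(group)
```

    The threshold is a guess and would need tuning on real data. A dedicated name-matching service is likely to handle nicknames, initials and transliterations far better than plain character overlap.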

     

    I tested it like this with your (very partial) example set:

    Cluster_names.png

     

    This process gives the following result:

    Cluster_names_2.png

    I hope it will be useful.

     

    Regards,

     

    Lionel