Arrange list of names by similarity?
Hi All,
I am a complete novice with RapidMiner and, despite watching multiple videos and trawling the forum, I am unable to get my head around how to solve what I think is a very simple problem!
I have a list of names (approx. 5k); all I want to achieve is to sort this list of names by similarity.
All that I have process-wise so far is:
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Local Repository/email test"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.1.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="name_recipients"/>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="8.1.001" expanded="true" height="82" name="Data to Similarity" width="90" x="514" y="136"/>
<connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
I would be most grateful for anyone's assistance.
Kind Regards
Best Answer
-
Hi again @DAVID_EALES,
Interesting but difficult task.....
I found a resource in the community which seems interesting for your project.
To sum up, you can use the Deduplicate Names operator of the Rosette Text Analytics extension.
This extension must be installed from the Marketplace. Moreover, you must obtain an API key to use it.
Tested like this with your (very partial) example set:
This process gives the following result:
I hope it will be useful.
Regards,
Lionel
0
Answers
-
Hi @DAVID_EALES,
Here is a process that computes and sorts the distance between the names in a list, using the Data to Similarity operator:
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.0.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
<parameter key="generator_type" value="comma_separated_text"/>
<list key="function_descriptions"/>
<list key="numeric_series_configuration"/>
<list key="date_series_configuration"/>
<list key="date_series_configuration (interval)"/>
<parameter key="input_csv_text" value="Att1 Michael, Lionel, John, Jordan, Bruce, Dan, Jordan, Michel"/>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="8.1.003" expanded="true" height="82" name="Data to Similarity" width="90" x="313" y="85">
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<operator activated="true" class="similarity_to_data" compatibility="8.1.003" expanded="true" height="82" name="Similarity to Data" width="90" x="447" y="85"/>
<operator activated="true" class="sort" compatibility="8.1.003" expanded="true" height="82" name="Sort" width="90" x="581" y="85">
<parameter key="attribute_name" value="DISTANCE"/>
</operator>
<connect from_op="Create ExampleSet" from_port="output" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
<connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
<connect from_op="Similarity to Data" from_port="exampleSet" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
I don't know your dataset or exactly what you want to do, but for nominal attributes (the names, in your case) the distance will always be 0 (when the two names match exactly, in other words when they are the same name) or 1 (in all other cases). So your table will be filled only with "1" and "0".
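To make the 0/1 behaviour concrete, here is a minimal Python sketch of an exact-match distance applied to the same eight names. It only illustrates the rule described above; the function name `nominal_distance` is made up for this example and is not RapidMiner's implementation.

```python
from itertools import combinations

def nominal_distance(a: str, b: str) -> int:
    """Exact-match distance: 0 if the two values are identical, 1 otherwise."""
    return 0 if a == b else 1

names = ["Michael", "Lionel", "John", "Jordan", "Bruce", "Dan", "Jordan", "Michel"]

# Distance for every unordered pair of rows, keyed by (row index, row index).
pairs = {
    (i, j): nominal_distance(names[i], names[j])
    for i, j in combinations(range(len(names)), 2)
}

# Pairs at distance 0 are exact duplicates; everything else is at distance 1.
zero_pairs = [(names[i], names[j]) for (i, j), d in pairs.items() if d == 0]
print(zero_pairs)  # [('Jordan', 'Jordan')]
```

Note that "Michael" and "Michel" come out at distance 1, exactly like "Michael" and "Bruce", which is why this measure cannot rank near-matches.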
Regards,
Lionel
1 -
In the free Operator Toolbox extension, there is a Generate Levenshtein Distance operator, which I think is more in line with what you want to do. But I am not sure exactly what you mean by sorting the list, because to do that you would first have to select one name as the reference name against which the similarity of all the other names would be calculated.
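As a rough illustration of that idea outside RapidMiner, the Python sketch below computes the Levenshtein (edit) distance with the textbook dynamic-programming recurrence and sorts a list by distance to one chosen reference name. The helper names and the choice of "Michael" as the reference are assumptions for the example; this is not the code behind the Operator Toolbox operator.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def sort_by_distance(names, reference):
    """Order names by edit distance to the reference, ignoring case."""
    return sorted(names, key=lambda name: levenshtein(name.lower(), reference.lower()))

names = ["Michael", "Lionel", "John", "Jordan", "Bruce", "Dan", "Jordan", "Michel"]
ranked = sort_by_distance(names, "Michael")
print(ranked[:3])  # ['Michael', 'Michel', 'Lionel']
```

Unlike the 0/1 nominal distance, "Michel" now lands right next to "Michael" (distance 1), but the ordering depends entirely on which reference name is picked.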
1 -
Thanks to all for your replies thus far
To explain further, I want to group/cluster? email addresses based on similarity rather than alphabetically so for example....
Alphabetical sort....
1joe.bloggs@domain.com
a.user@domain.com
another.person@domain.com
joe.bloggs@domain.com
k@domain.com
soe.blogs@domain.com
What I am trying to achieve....
a.user@domain.com
another.person@domain.com
1joe.bloggs@domain.com
joe.bloggs@domain.com
soe.blogs@domain.com
k@domain.com
I understand about the distance measurement, but how do I take that distance measurement and use it to rearrange the output?
Hope the above makes sense.
Kind Regards
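One possible way to turn a pairwise similarity into a rearranged list, sketched in Python rather than RapidMiner: greedily place each address into the first group that already contains a sufficiently similar member, then print the groups one after another. The assumptions here are loud ones: similarity is taken from Python's `difflib.SequenceMatcher` (not Levenshtein), only the part before the `@` is compared so the shared domain does not inflate the scores, and the `0.75` threshold is an arbitrary value chosen for this small example.

```python
from difflib import SequenceMatcher

def local_part(address: str) -> str:
    """Part of the address before the '@', lower-cased."""
    return address.split("@", 1)[0].lower()

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the local parts of two addresses."""
    return SequenceMatcher(None, local_part(a), local_part(b)).ratio()

def group_by_similarity(addresses, threshold=0.75):
    """Greedy grouping: join the first group holding a similar enough member."""
    groups = []
    for address in addresses:
        for group in groups:
            if any(similarity(address, member) >= threshold for member in group):
                group.append(address)
                break
        else:
            groups.append([address])  # nothing similar so far: start a new group
    return groups

addresses = [
    "1joe.bloggs@domain.com",
    "a.user@domain.com",
    "another.person@domain.com",
    "joe.bloggs@domain.com",
    "k@domain.com",
    "soe.blogs@domain.com",
]

groups = group_by_similarity(addresses)
ordered = [address for group in groups for address in group]
for address in ordered:
    print(address)
```

On this sample the three "joe.bloggs"-like addresses end up next to each other and the remaining addresses each form their own group. The order of the groups themselves follows the input order, and with around 5k entries the number of pairwise comparisons grows quickly, so treat this as a starting point rather than a finished solution.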
0 -
Many Thanks Lionel, your idea worked.
Kind Regards
0 -
OK, so the solution proposed by Lionel worked during testing, but I am unable to get it to run through the entire list, as I am getting Error 504.
I have split the data into batches of 1000 rows and each batch processes fine, but I need it to be able to process the entire list of 5k entries at once.
Is this some sort of timeout error? I have looked at the Rosette documentation and I can't find any mention of it.
Kind Regards
0 -
Hi @DAVID_EALES,
According to your last message, it works for datasets of up to 1k rows --> OK
But normally it should work free of charge with datasets of up to 10k rows (see the documentation (description) in RapidMiner).
I have contacted Rosette support to find out what is going on with this error (Error 504). Maybe the limit has been updated...
Regards,
Lionel
1 -
0
-
Thank You Lionel, much appreciated.
0