Export "Data To Similarity" results to a CSV

ClaraCaba
New Altair Community Member
Hi!
I am working on text mining in RapidMiner and the following problem has arisen:
I use the Data to Similarity operator from the Text Extension and the "sim" output port gives a table with three columns: one object, another object, and the similarity between them. However, I can't sort or export that result, which I'd love to do, in order to be able to work with that data as a CSV file.
Is there any way to export that table?
Thank you very much!
Answers
Hi,
you can use the Similarity to Data operator to turn the similarity object into an ExampleSet. Afterwards you can store it with any Write operator, for example Write CSV.
~Martin
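For readers who want to see the target shape of the export, here is a minimal Python sketch (outside RapidMiner, standard library only) that sorts a three-column similarity table and writes it as CSV text. The rows, the column names, and the `to_csv` helper are illustrative assumptions, not output from the actual process.

```python
import csv
import io

# Hypothetical (first_id, second_id, similarity) rows, shaped like the
# three-column table described in the question.
rows = [(1, 2, 0.5), (1, 3, 0.9), (2, 3, 0.1)]

def to_csv(rows):
    """Return the rows as CSV text, sorted by similarity (highest first)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["FIRST_ID", "SECOND_ID", "SIMILARITY"])
    for row in sorted(rows, key=lambda r: r[2], reverse=True):
        writer.writerow(row)
    return buffer.getvalue()

csv_text = to_csv(rows)
print(csv_text)
```

Writing to a real file instead would only require passing a file opened with `newline=""` to `csv.writer`.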
Thank you very much, that worked perfectly.
I am facing now another problem, though.
After using the Similarity to Data operator, I have a dataset with three columns: the first id used for comparison, the second id used for comparison, and the similarity percentage. Now, I would like to combine that information with my original database (which has many attributes). I don't know how to, for example, obtain the rows from my original database where the similarity percentage is greater than 50%. Any idea?
Thank you in advance.
Hi,
Use Filter Examples to remove the examples with a similarity below 0.5. Afterwards you can join the result with the original data. If you do not have an ID in the dataset, you can use Generate ID beforehand to add one.
~Martin
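The filter-then-join logic described above can be sketched in plain Python as follows. This is only an illustration of the idea, not the RapidMiner implementation: the `original` records, the `similarities` rows, and the `join_similar` helper are made-up names and data.

```python
# Hypothetical original data keyed by an ID (the role Generate ID plays).
original = {
    1: {"text": "first document"},
    2: {"text": "second document"},
    3: {"text": "third document"},
}

# Hypothetical (first_id, second_id, similarity) rows.
similarities = [(1, 2, 0.8), (1, 3, 0.2), (2, 3, 0.6)]

def join_similar(original, similarities, threshold=0.5):
    """Keep pairs strictly above the threshold and attach the original rows."""
    joined = []
    for first_id, second_id, sim in similarities:
        if sim > threshold:  # pairs at or below the threshold are dropped
            joined.append({
                "first_id": first_id,
                "second_id": second_id,
                "similarity": sim,
                "first": original[first_id],
                "second": original[second_id],
            })
    return joined

result = join_similar(original, similarities)
for row in result:
    print(row)
```

Looking up both IDs in the original data corresponds to joining the filtered table with the original dataset once per ID column.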
Hi,
Thank you very much!
However, I have a last question. I have applied Data to Similarity and then Similarity to Data right after, to be able to use the output dataset. But the dataset contains all results duplicated, since I have applied both operators. How could I prevent this from happening? Or how could I get rid of the duplicated results and just keep a row per similarity between two objects?
Thank you.
Hi,
As a general idea, you could use Cross Distances instead; it is a bit more flexible.
For your question:
Do I understand correctly that you have each similarity twice, like this?
ID1 ID2 SIM
2   1   0.5
1   2   0.5
My first idea would be to create a new ID from the two IDs you have, always putting them in the same order (the expression below puts the larger one first; which one comes first does not matter as long as it is consistent). So you always get a string like
BigNumber_SmallNumber
The expression for Generate Attributes looks like this:
if([FIRST_ID]>[SECOND_ID],
    concat(str([FIRST_ID]),"_",str([SECOND_ID])),
    concat(str([SECOND_ID]),"_",str([FIRST_ID]))
)
Afterwards you can use Remove Duplicates on this new attribute. See the attached process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="7.0.001" expanded="true" height="82" name="Data to Similarity" width="90" x="246" y="34"/>
<operator activated="true" class="similarity_to_data" compatibility="7.0.001" expanded="true" height="82" name="Similarity to Data" width="90" x="380" y="34"/>
<operator activated="true" class="generate_attributes" compatibility="7.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34">
<list key="function_descriptions">
<parameter key="IdToRemoveDuplicates" value="if([FIRST_ID]&gt;[SECOND_ID], concat(str([FIRST_ID]),&quot;_&quot;,str([SECOND_ID])), concat(str([SECOND_ID]),&quot;_&quot;,str([FIRST_ID])) )"/>
</list>
<description align="center" color="transparent" colored="false" width="126">Create an ID to remove the stuff</description>
</operator>
<operator activated="true" class="remove_duplicates" compatibility="7.0.001" expanded="true" height="82" name="Remove Duplicates" width="90" x="648" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="IdToRemoveDuplicates"/>
</operator>
<connect from_op="Retrieve Golf" from_port="output" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
<connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
<connect from_op="Similarity to Data" from_port="exampleSet" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
<connect from_op="Remove Duplicates" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
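To make the deduplication idea easier to follow, here is a small Python sketch of the same logic: build an order-independent key per pair, then keep only the first row for each key. The `pair_key` and `remove_mirrored` helpers and the sample rows are hypothetical; the key puts the larger ID first, mirroring the expression in the process.

```python
def pair_key(first_id, second_id):
    """Order-independent key for a pair of IDs, larger ID first."""
    if first_id > second_id:
        return f"{first_id}_{second_id}"
    return f"{second_id}_{first_id}"

def remove_mirrored(rows):
    """Keep the first row seen for each unordered pair of IDs."""
    seen = set()
    unique = []
    for first_id, second_id, sim in rows:
        key = pair_key(first_id, second_id)
        if key not in seen:
            seen.add(key)
            unique.append((first_id, second_id, sim))
    return unique

# Hypothetical rows where the pair (1, 2) appears in both directions.
rows = [(2, 1, 0.5), (1, 2, 0.5), (3, 1, 0.7)]
unique_rows = remove_mirrored(rows)
print(unique_rows)
```

Because both (2, 1) and (1, 2) map to the same key, only one of the two mirrored rows survives.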
Thank you very very very much!!!
That worked perfectly.
And how can I get a count of similar-looking sets (by text field)? For the set below I want counts like:
ABC is a good text -----3
XYZ is great -----------2
FIRST  SECOND  SIMILARITY  textfield
1      2       1           ABC is a good text
3      8       1           ABC is a good text
4      9       1           ABC is a good text
12     32      1           XYZ is great
31     77      1           XYZ is great
Can't you use an Aggregate operator for this?
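What such an aggregation computes (a count per distinct text value) can be sketched in Python like this. It is only an illustration using the sample rows from the question, not the Aggregate operator itself.

```python
from collections import Counter

# Sample rows from the question: (FIRST, SECOND, SIMILARITY, textfield).
rows = [
    (1, 2, 1, "ABC is a good text"),
    (3, 8, 1, "ABC is a good text"),
    (4, 9, 1, "ABC is a good text"),
    (12, 32, 1, "XYZ is great"),
    (31, 77, 1, "XYZ is great"),
]

# Group by the text field and count the rows in each group.
counts = Counter(text for _, _, _, text in rows)

for text, count in counts.items():
    print(text, count)
```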
Thanks Thomas. Results achieved.