Export "Data To Similarity" results to a CSV

User: "ClaraCaba"
New Altair Community Member
Updated by Jocelyn
Hi!

I am working on text mining in RapidMiner and the following problem has arisen:

I use the Data to Similarity operator from the Text Extension and the "sim" output port gives a table with three columns: one object, another object, and the similarity between them. However, I can't sort or export that result, which I'd love to do, in order to be able to work with that data as a CSV file.

Is there any way to export that table?

Thank you very much!

    Hi,

    You can use the Similarity to Data operator to turn it into an ExampleSet. Afterwards you can store it with any Write operator, for example Write CSV.
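For readers who want to see what the exported file could look like, here is a minimal sketch in plain Python (not RapidMiner) that sorts similarity rows and writes them as CSV text. The sample rows and the `SIMILARITY` column name are assumptions made for illustration; `to_csv` is a hypothetical helper, not part of any RapidMiner API.

```python
import csv
import io

# Hypothetical similarity rows: (first id, second id, similarity).
rows = [("1", "2", 0.5), ("1", "3", 0.9), ("2", "3", 0.1)]


def to_csv(rows):
    """Return the rows as CSV text, sorted by similarity (highest first)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header names are assumed; use whatever names your ExampleSet has.
    writer.writerow(["FIRST_ID", "SECOND_ID", "SIMILARITY"])
    for row in sorted(rows, key=lambda r: r[2], reverse=True):
        writer.writerow(row)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

Inside RapidMiner itself, a Sort operator followed by a Write operator would play the same role.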

    ~Martin
    User: "ClaraCaba"
    New Altair Community Member
    OP
    Thank you very much, that worked perfectly.

    I am now facing another problem, though.

    After using the Similarity to Data operator, I have a dataset with three columns: the first id used for comparison, the second id used for comparison, and the similarity percentage. Now, I would like to combine that information with my original database (which has many attributes). I don't know how to, for example, obtain the rows from my original database where the similarity percentage is greater than 50%. Any idea?

    Thank you in advance.
    Hi,

    Use Filter Examples to remove the examples with a similarity below 0.5. Afterwards you can join the result with the original data. If you do not have an ID in the dataset, you can add one beforehand with Generate ID.
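The filter-then-join logic can be sketched in plain Python (not RapidMiner) as follows. The sample data, the field names, and the `rows_above` helper are all hypothetical and only illustrate the idea: keep pairs above the threshold, then look up both IDs in the original data.

```python
# Hypothetical original data, keyed by id (stands in for the full example set).
original = {
    1: {"id": 1, "text": "first document"},
    2: {"id": 2, "text": "second document"},
    3: {"id": 3, "text": "third document"},
}

# Hypothetical output of Similarity to Data: (first id, second id, similarity).
similarities = [(1, 2, 0.8), (1, 3, 0.2), (2, 3, 0.5)]


def rows_above(similarities, original, threshold=0.5):
    """Keep pairs whose similarity exceeds the threshold and attach
    the original rows of both ids (the "join" step)."""
    joined = []
    for first, second, sim in similarities:
        if sim > threshold:
            joined.append(
                {
                    "similarity": sim,
                    "first": original[first],
                    "second": original[second],
                }
            )
    return joined


result = rows_above(similarities, original)
print(result)
```

Note that the comparison is strict, so a pair with a similarity of exactly 0.5 is dropped; use `>=` if "at least 50%" is what you need.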

    ~Martin
    User: "ClaraCaba"
    New Altair Community Member
    OP
    Hi,

    Thank you very much!

    However, I have one last question. I have applied Data to Similarity and then Similarity to Data right after, to be able to use the output dataset. But the dataset contains every result twice, since I have applied both operators. How could I prevent this from happening? Or how could I get rid of the duplicated results and keep just one row per similarity between two objects?

    Thank you.
    Hi,

    A general suggestion is to use Cross Distances; it is a bit more flexible.

    For your question:
    Do I understand correctly that each similarity appears twice, like this?

    ID1  ID2  SIM
    2      1      0.5
    1      2      0.5
    My first idea would be to create a new ID from the two IDs you have, always in the same order. The expression below puts the larger one first, so you always get a string like

    BigNumber_SmallNumber

    This results in this:

    if([FIRST_ID]>[SECOND_ID],
    concat(str([FIRST_ID]),"_",str([SECOND_ID])),
    concat(str([SECOND_ID]),"_",str([FIRST_ID]))
    )
    Afterwards you can use Remove Duplicates on this new attribute. See the attached process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
            <parameter key="repository_entry" value="//Samples/data/Golf"/>
          </operator>
          <operator activated="true" class="data_to_similarity" compatibility="7.0.001" expanded="true" height="82" name="Data to Similarity" width="90" x="246" y="34"/>
          <operator activated="true" class="similarity_to_data" compatibility="7.0.001" expanded="true" height="82" name="Similarity to Data" width="90" x="380" y="34"/>
          <operator activated="true" class="generate_attributes" compatibility="7.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34">
            <list key="function_descriptions">
              <parameter key="IdToRemoveDuplicates" value="if([FIRST_ID]&gt;[SECOND_ID],&#10;&#9;concat(str([FIRST_ID]),&quot;_&quot;,str([SECOND_ID])),&#10;&#9;concat(str([SECOND_ID]),&quot;_&quot;,str([FIRST_ID]))&#10;)"/>
            </list>
            <description align="center" color="transparent" colored="false" width="126">Create an ID to remove the stuff</description>
          </operator>
          <operator activated="true" class="remove_duplicates" compatibility="7.0.001" expanded="true" height="82" name="Remove Duplicates" width="90" x="648" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="IdToRemoveDuplicates"/>
          </operator>
          <connect from_op="Retrieve Golf" from_port="output" to_op="Data to Similarity" to_port="example set"/>
          <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
          <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
          <connect from_op="Similarity to Data" from_port="exampleSet" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
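To make the idea behind the process easier to follow outside RapidMiner, here is a small plain-Python sketch of the same logic. It mirrors the expression above (larger ID first) and then keeps only the first row per key, which is roughly what Remove Duplicates does on the generated attribute. The function names and sample rows are hypothetical.

```python
def pair_key(first_id, second_id):
    """Build an order-independent key, larger id first,
    following the Generate Attributes expression."""
    if first_id > second_id:
        return f"{first_id}_{second_id}"
    return f"{second_id}_{first_id}"


def remove_mirrored(rows):
    """Keep the first row seen for each unordered pair of ids."""
    seen = set()
    unique = []
    for first_id, second_id, sim in rows:
        key = pair_key(first_id, second_id)
        if key not in seen:
            seen.add(key)
            unique.append((first_id, second_id, sim))
    return unique


# Hypothetical Similarity to Data output with every pair listed twice.
rows = [(2, 1, 0.5), (1, 2, 0.5), (3, 1, 0.7), (1, 3, 0.7)]
unique = remove_mirrored(rows)
print(unique)
```

Whether the larger or the smaller ID comes first does not matter, as long as the order is the same for both mirrored rows.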
    User: "ClaraCaba"
    New Altair Community Member
    OP
    Thank you very very very much!!! :D

    That worked perfectly.
    User: "sangeet171188"
    New Altair Community Member

    And how can I get a count of rows with the same text field? For the set below I want counts like:

    ABC is a good text -----3

    XYZ is great -----------2

     

    FIRST  SECOND  SIMILARITY  textfield
    1      2       1           ABC is a good text
    3      8       1           ABC is a good text
    4      9       1           ABC is a good text
    12     32      1           XYZ is great
    31     77      1           XYZ is great

    User: "Thomas_Ott"
    New Altair Community Member

    Can't you use an Aggregate operator for this?
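As a rough illustration of what such an aggregation computes, here is a plain-Python sketch (not RapidMiner) that groups the rows from the question by their text field and counts them. In RapidMiner terms this would correspond to grouping by the text attribute and using a count aggregation; the exact parameter settings are not shown here.

```python
from collections import Counter

# Rows from the question: (FIRST, SECOND, SIMILARITY, textfield).
rows = [
    (1, 2, 1, "ABC is a good text"),
    (3, 8, 1, "ABC is a good text"),
    (4, 9, 1, "ABC is a good text"),
    (12, 32, 1, "XYZ is great"),
    (31, 77, 1, "XYZ is great"),
]

# Group by the text field and count the rows in each group.
counts = Counter(text for _, _, _, text in rows)

for text, count in counts.items():
    print(text, count)
```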

    User: "sangeet171188"
    New Altair Community Member

    Thanks Thomas. Results achieved.