Export "Data To Similarity" results to a CSV

ClaraCaba
ClaraCaba New Altair Community Member
edited November 2024 in Community Q&A
Hi!

I am working on text mining in RapidMiner and the following problem has arisen:

I use the Data to Similarity operator from the Text Extension and the "sim" output port gives a table with three columns: one object, another object, and the similarity between them. However, I can't sort or export that result, which I'd love to do, in order to be able to work with that data as a CSV file.

Is there any way to export that table?

Thank you very much!

Answers

  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi,

    You can use the Similarity to Data operator to get an ExampleSet out of it. Afterwards you can store it with any Write operator, e.g. Write CSV.

    ~Martin
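
For readers who want to see the sort-and-export step outside RapidMiner, here is a minimal Python sketch. It assumes the pair table has already been exported as rows with the columns FIRST_ID, SECOND_ID and SIMILARITY; the function name `pairs_to_csv` is made up for illustration and is not part of RapidMiner.

```python
import csv
import io


def pairs_to_csv(pairs):
    """Sort pair rows by similarity (highest first) and return them as CSV text."""
    ordered = sorted(pairs, key=lambda p: p["SIMILARITY"], reverse=True)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["FIRST_ID", "SECOND_ID", "SIMILARITY"])
    writer.writeheader()
    writer.writerows(ordered)
    return buffer.getvalue()


# Hypothetical example rows (assumed column names, made-up values)
example_pairs = [
    {"FIRST_ID": 1, "SECOND_ID": 2, "SIMILARITY": 0.3},
    {"FIRST_ID": 1, "SECOND_ID": 3, "SIMILARITY": 0.9},
]
csv_text = pairs_to_csv(example_pairs)
```

The text in `csv_text` can then be written to a file with an ordinary `open(..., "w", newline="")` call.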
  • ClaraCaba
    ClaraCaba New Altair Community Member
    Thank you very much, that worked perfectly.

    I am now facing another problem, though.

    After using the Similarity to Data operator, I have a dataset with three columns: the first id used for comparison, the second id used for comparison, and the similarity percentage. Now, I would like to combine that information with my original database (which has many attributes). I don't know how to, for example, obtain the rows from my original database where the similarity percentage is greater than 50%. Any idea?

    Thank you in advance.
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi,

    Use Filter Examples to remove the examples with a similarity below 0.5. Afterwards you can use Join to bring in the original data. If your dataset does not have an ID attribute, you can use Generate ID beforehand to add one.

    ~Martin
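
The filter-then-join idea can be sketched in plain Python as follows. This only illustrates the logic and is not how RapidMiner implements it; the column names, the sample rows and the helper `filter_and_join` are assumptions made for the example.

```python
# Hypothetical pair rows as they might come out of Similarity to Data
pairs = [
    {"FIRST_ID": 1, "SECOND_ID": 2, "SIMILARITY": 0.8},
    {"FIRST_ID": 1, "SECOND_ID": 3, "SIMILARITY": 0.3},
    {"FIRST_ID": 2, "SECOND_ID": 3, "SIMILARITY": 0.6},
]

# Hypothetical original data, keyed by the same id used for the comparison
original = {
    1: {"id": 1, "text": "first document"},
    2: {"id": 2, "text": "second document"},
    3: {"id": 3, "text": "third document"},
}


def filter_and_join(pairs, original, threshold=0.5):
    """Keep pairs with similarity >= threshold and attach the original rows of both ids."""
    joined = []
    for pair in pairs:
        if pair["SIMILARITY"] >= threshold:
            joined.append({
                **pair,
                "first": original[pair["FIRST_ID"]],
                "second": original[pair["SECOND_ID"]],
            })
    return joined


result = filter_and_join(pairs, original)
```

Each row in `result` carries the similarity together with the full original records of both objects, which mirrors joining the filtered pairs back onto the original data by ID.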
  • ClaraCaba
    ClaraCaba New Altair Community Member
    Hi,

    Thank you very much!

    However, I have one last question. I applied Data to Similarity and then Similarity to Data right after, to be able to use the output dataset. But the dataset contains all results duplicated, since I have applied both operators. How can I prevent this from happening? Or how can I get rid of the duplicated results and keep just one row per similarity between two objects?

    Thank you.
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi,

    A general suggestion is to use Cross Distances; it is a bit more flexible.

    For your question:
    Do I understand correctly that each pair appears twice, like this?

    ID1  ID2  SIM
    2    1    0.5
    1    2    0.5

    My first idea would be to create a new ID from the two IDs you have, always putting the smaller one first. That way you always get a string like

    SmallNumber_BigNumber

    This can be done with the following expression:

    if([FIRST_ID]<[SECOND_ID],
    concat(str([FIRST_ID]),"_",str([SECOND_ID])),
    concat(str([SECOND_ID]),"_",str([FIRST_ID]))
    )

    Afterwards you can use Remove Duplicates on this new attribute. See the attached process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
            <parameter key="repository_entry" value="//Samples/data/Golf"/>
          </operator>
          <operator activated="true" class="data_to_similarity" compatibility="7.0.001" expanded="true" height="82" name="Data to Similarity" width="90" x="246" y="34"/>
          <operator activated="true" class="similarity_to_data" compatibility="7.0.001" expanded="true" height="82" name="Similarity to Data" width="90" x="380" y="34"/>
          <operator activated="true" class="generate_attributes" compatibility="7.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34">
            <list key="function_descriptions">
              <parameter key="IdToRemoveDuplicates" value="if([FIRST_ID]&lt;[SECOND_ID],&#10;&#9;concat(str([FIRST_ID]),&quot;_&quot;,str([SECOND_ID])),&#10;&#9;concat(str([SECOND_ID]),&quot;_&quot;,str([FIRST_ID]))&#10;)"/>
            </list>
            <description align="center" color="transparent" colored="false" width="126">Create an ID to remove the stuff</description>
          </operator>
          <operator activated="true" class="remove_duplicates" compatibility="7.0.001" expanded="true" height="82" name="Remove Duplicates" width="90" x="648" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="IdToRemoveDuplicates"/>
          </operator>
          <connect from_op="Retrieve Golf" from_port="output" to_op="Data to Similarity" to_port="example set"/>
          <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
          <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
          <connect from_op="Similarity to Data" from_port="exampleSet" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
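
The same deduplication idea, sketched in plain Python for readers who want to check the logic outside RapidMiner. The helper names `pair_key` and `remove_duplicate_pairs` are made up for this example, and the rows are assumed to carry FIRST_ID and SECOND_ID columns as in the expression above.

```python
def pair_key(first_id, second_id):
    """Build an order-independent key: smaller id first, then the larger one."""
    low, high = sorted((first_id, second_id))
    return f"{low}_{high}"


def remove_duplicate_pairs(rows):
    """Keep only the first row seen for each unordered pair of ids."""
    seen = set()
    unique = []
    for row in rows:
        key = pair_key(row["FIRST_ID"], row["SECOND_ID"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# The two mirrored rows from the example collapse into one
rows = [
    {"FIRST_ID": 2, "SECOND_ID": 1, "SIMILARITY": 0.5},
    {"FIRST_ID": 1, "SECOND_ID": 2, "SIMILARITY": 0.5},
]
deduplicated = remove_duplicate_pairs(rows)
```

Because the key is the same whichever id comes first, it does not matter which of the two mirrored rows appears first in the data.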
  • ClaraCaba
    ClaraCaba New Altair Community Member
    Thank you very very very much!!! :D

    That worked perfectly.
  • sangeet171188
    sangeet171188 New Altair Community Member

    And how can I get a count of rows that share the same text (the text field)? For the data below I would like counts like this:

    ABC is a good text   3
    XYZ is great         2

    FIRST  SECOND  SIMILARITY  textfield
    1      2       1           ABC is a good text
    3      8       1           ABC is a good text
    4      9       1           ABC is a good text
    12     32      1           XYZ is great
    31     77      1           XYZ is great

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Can't you use an Aggregate operator for this?
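
As a rough illustration of what such an aggregation computes (a count grouped by the text attribute), here is a small Python sketch using the rows from the question. It only demonstrates the grouping logic and is not RapidMiner code; the dictionary keys are assumed to match the column names shown above.

```python
from collections import Counter

# Rows copied from the question
rows = [
    {"FIRST": 1, "SECOND": 2, "SIMILARITY": 1, "textfield": "ABC is a good text"},
    {"FIRST": 3, "SECOND": 8, "SIMILARITY": 1, "textfield": "ABC is a good text"},
    {"FIRST": 4, "SECOND": 9, "SIMILARITY": 1, "textfield": "ABC is a good text"},
    {"FIRST": 12, "SECOND": 32, "SIMILARITY": 1, "textfield": "XYZ is great"},
    {"FIRST": 31, "SECOND": 77, "SIMILARITY": 1, "textfield": "XYZ is great"},
]

# Group by the text attribute and count the rows in each group
counts = Counter(row["textfield"] for row in rows)
```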

  • sangeet171188
    sangeet171188 New Altair Community Member

    Thanks Thomas. Results achieved.
