[Solved] Comparing Examplesets by rows

Question

Hi, I have two datasets and I would like to calculate the similarity between row 1 of dataset 1 and row 1 of dataset 2, between row 2 and row 2, and so on. Both datasets contain text data.

Thank you very much for your help!

maxfax · Answer

Thank you very much for your help :) This is what I did, and it works. I thought I could save some processing time by not calculating all the distances that won't be needed afterwards. An ExampleSet with 50k rows takes some time ;-)

Anyway, this will work! Thank you!

MariusHelf · Answer

Concerning the comparison, you can use Cross Distances nevertheless and then filter only those rows from its result where request and document are equal, as described above.

The Cross Distances operator expects both example sets to contain the same attributes. For text processing, and the Process Documents operators in particular, that means you have to use the same word list to create both TF-IDF sets. You are probably using Process Documents for both the left and the right side of your comparison. To use the same word list for the right data set as for the left one, connect the word list output of the left Process Documents operator to the word list input of the right Process Documents operator.
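Outside of RapidMiner, the same idea can be sketched in a few lines of plain Python. This is only an illustration, not what the operators do internally: the function names (build_wordlist, tfidf_vector, rowwise_similarity), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example. It builds the word list from the left data only, reuses it for the right data, and compares row i with row i only, so no full distance matrix is computed.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_wordlist(docs):
    """Learn the vocabulary and IDF weights from one document collection."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    vocab = sorted(doc_freq)
    # Smoothed IDF (assumption); any consistent weighting would do here.
    idf = {w: math.log((1 + n) / (1 + doc_freq[w])) + 1.0 for w in vocab}
    return vocab, idf


def tfidf_vector(text, vocab, idf):
    """Project a text onto a fixed word list; unknown words are ignored."""
    counts = Counter(tokenize(text))
    return [counts[w] * idf[w] for w in vocab]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rowwise_similarity(left, right):
    """Cosine similarity of row i in `left` with row i in `right` only."""
    if len(left) != len(right):
        raise ValueError("both data sets must have the same number of rows")
    # The word list comes from the left side and is reused for the right side,
    # so both sets of vectors have exactly the same attributes.
    vocab, idf = build_wordlist(left)
    return [
        cosine(tfidf_vector(l, vocab, idf), tfidf_vector(r, vocab, idf))
        for l, r in zip(left, right)
    ]
```

Because only matching rows are compared, the cost grows linearly with the number of rows instead of quadratically, which is the saving asked about for the 50k-row case.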

If you have problems, please let me know.

Best regards,
Marius

MariusHelf · Answer

maxfax  wrote:
In addition, I would like to extract some data from the ExampleSet matching a certain criterion. For example, just like in an SQL statement, I would like to extract those examples where the column Request = the column Document.

Is this possible, and if so, how?

Thank you very much!

You can use the operators Generate Attributes and Filter Examples for that: in Generate Attributes, create a new attribute "indicator" with the formula "if(Request == Document, 1, 0)", and then configure the Filter Examples to use the expression filter with an expression like "indicator = 1".
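For readers who want to see the logic of those two steps spelled out, here is a minimal Python sketch. It is not RapidMiner code: the function name filter_matching and the dictionary-per-row layout are assumptions for illustration. Only the column names Request and Document and the attribute name indicator come from the thread.

```python
def filter_matching(rows, left="Request", right="Document"):
    """Keep only the rows whose `left` and `right` columns are equal."""
    kept = []
    for row in rows:
        # Step 1 (Generate Attributes): indicator = if(Request == Document, 1, 0)
        indicator = 1 if row[left] == row[right] else 0
        # Step 2 (Filter Examples): keep the rows where indicator = 1
        if indicator == 1:
            kept.append({**row, "indicator": indicator})
    return kept
```

Applied to the output of a full cross-distance computation, this leaves one row per matching Request/Document pair, which corresponds to the row-by-row comparison asked for in the question.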