Text Mining - Documents Similarity (words position)

Question

Hello,

I'm looking for a way to measure the similarity between documents where the position of the words is relevant.
I've already implemented the sample with the "Data to Similarity" operator (CosineSimilarity), as in:
https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/How-to-compare-similarity-of-large-number-of-documents/td-p/16002
But I need to take into account the order/position of the words, not only their frequency or occurrence.
For example:
Example 1: A B C D E F G
Example 2: A X B D Y F G
Example 3: G F E A B C D

Examples 1 and 2 should be more similar than Examples 1 and 3: although Example 3 has exactly the same words as Example 1 (CosineSimilarity=1), they are in different positions. Example 2 has only two different words (X, Y) and one other word in a different position, but near its original position...
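
To make the issue concrete, here is a minimal Python sketch (outside RapidMiner; the function name `cosine_similarity` and the whitespace tokenization are assumptions made only for this illustration). It computes a plain bag-of-words cosine similarity, which ignores word order and therefore cannot tell Example 1 and Example 3 apart:

```python
from collections import Counter
from math import sqrt


def cosine_similarity(doc_a, doc_b):
    """Bag-of-words cosine similarity; word order is ignored on purpose."""
    counts_a = Counter(doc_a.split())
    counts_b = Counter(doc_b.split())
    # Dot product over the words the two documents share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


example_1 = "A B C D E F G"
example_2 = "A X B D Y F G"
example_3 = "G F E A B C D"

print(round(cosine_similarity(example_1, example_2), 3))  # 0.714 (5 shared words out of 7)
print(round(cosine_similarity(example_1, example_3), 3))  # 1.0 (same words, different order)
```

So the measure ranks Example 3 as closer to Example 1 than Example 2, which is the opposite of what I need.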

I think this problem is difficult to explain, and I'm not sure whether RapidMiner can give me a solution.

Best regards,
Silvia

YYH · Answer

Hi @silviabastos

Thanks for the follow-up! Maybe you can try word2vec for documents with 900+ words?

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Trained on a single corpus, the word2vec algorithm generates one multidimensional vector for each word. These vectors are known to carry semantic meaning that helps you understand the position and context of each word.

You can install the word2vec extension from the Marketplace.

HTH!

YY

silviabastos · Answer

Hi!

I will try both options.

Regarding @yyhuang's solution: I only wrote a small example in the first post. The texts I'm working with are natural language, about 900 words each, so I'm not sure if I can use it.

Regarding @Telcontar120's solution: I made a first attempt, but I didn't get consistent results.

I will work a little more on this and post the problems I find.

Any other solutions are welcome.

Thank you.

YYH · Answer

Hi @silviabastos

This is a great question. To "remember" the location of the key words, you can use "Generate n-Grams" for phrase search with a max term length of 7 or more. Of course, it will need more time for text processing.

Suppose you do not have many words in each document, ideally just like the examples shown in your message. We have three documents as simple as:

A B C D E F G
A X B D Y F G
G F E A B C D

You can use the Levenshtein distance offered in Dr. Martin Schmitz's Operator Toolbox extension.

https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_operator_toolbox

The Levenshtein distance is calculated as the minimum number of changes (insertions, deletions, or substitutions) needed to convert one string into the other. A common use case for this distance is spell checking.

Here is the XML of my process.

HTH!

YY
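
For readers who want to check the idea outside RapidMiner, the Levenshtein approach can be sketched in plain Python. This is only an illustration, not the Toolbox operator's implementation: the function names `levenshtein` and `token_similarity` are made up for this example, and it assumes the distance is applied to whitespace-separated words rather than to characters, with every edit costing 1.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1 each)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def token_similarity(doc_a, doc_b):
    """Word-level similarity in [0, 1]: 1 minus the normalized edit distance."""
    tokens_a, tokens_b = doc_a.split(), doc_b.split()
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0  # two empty documents are treated as identical
    return 1.0 - levenshtein(tokens_a, tokens_b) / longest


example_1 = "A B C D E F G"
example_2 = "A X B D Y F G"
example_3 = "G F E A B C D"

print(levenshtein(example_1.split(), example_2.split()))  # 3
print(levenshtein(example_1.split(), example_3.split()))  # 6
print(round(token_similarity(example_1, example_2), 3))   # 0.571
print(round(token_similarity(example_1, example_3), 3))   # 0.143
```

Under these assumptions, Example 2 comes out closer to Example 1 than Example 3 does, which matches the ordering asked for in the question. For documents of about 900 words the table has roughly 900 x 900 cells per pair, so comparing many documents pairwise may become slow.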