Text Mining - Word Similarity

Hey,

I want to find the similarity between words used in a collection of articles, that is, which words are used together more often than others. Software such as Automap and WordStat can do that, but the first doesn't handle non-English letters (which is important in my case) and the latter is expensive!

I'm trying RM now and I noticed that it has a document similarity operator, but none at the word level. I gave association rules a shot, but the rules it found didn't make much sense for my articles, for example also-->able with probability 0.75.

So I've decided to construct my own similarity model as below:

Process Documents from Files ==> WordList to Data ==> Data to Similarity ==> Similarity to Data ==> Write Excel

The resulting table includes the similarities between words as I wanted, but there is double counting: every pair is listed twice, once in each order. For example, the similarity between words #1068 and #963 appears twice like this:

FIRST_ID  SECOND_ID  DISTANCE
963       1068       103
1068      963        103

This makes my results twice as large as they should be, and it complicates the visualisations.

I couldn't find a thread about this double counting in the forum, so I could use some help.

Thank you
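
The mirrored rows in the question arise because a symmetric similarity table lists each unordered pair in both orders. As a minimal sketch of one way to drop the duplicates outside RapidMiner, assuming the exported rows have been read into (FIRST_ID, SECOND_ID, DISTANCE) tuples, each pair can be normalised so the smaller ID comes first and then kept only once. The function name drop_mirrored_pairs is hypothetical and not part of any RapidMiner API.

```python
def drop_mirrored_pairs(rows):
    """Keep one row per unordered (first_id, second_id) pair."""
    seen = set()
    unique = []
    for first_id, second_id, distance in rows:
        # Normalise the pair so (963, 1068) and (1068, 963) share one key.
        key = (min(first_id, second_id), max(first_id, second_id))
        if key in seen:
            continue
        seen.add(key)
        unique.append((key[0], key[1], distance))
    return unique


# The two rows from the example table above.
rows = [(963, 1068, 103), (1068, 963, 103)]
deduped = drop_mirrored_pairs(rows)
print(deduped)
```

The same idea (keep only rows where FIRST_ID is smaller than SECOND_ID) should also be expressible as a row filter on the exported table, since both orders carry the same distance.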


jansudes

Hey,

Well, actually my intention is to find word co-occurrences within a collection of documents. Has anyone done such a project in RapidMiner?
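
To make the goal concrete, here is a rough sketch of document-level co-occurrence counting in plain Python rather than RapidMiner. It assumes whitespace tokenisation and treats two words as co-occurring when they appear in the same document; the function name cooccurrence_counts and the sample documents are made up for illustration. Each pair is stored in sorted order, so no pair is counted in both directions, and Python strings are Unicode, so non-English letters are kept as they are.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count in how many documents each unordered word pair appears together."""
    counts = Counter()
    for text in documents:
        # A set makes the count binary per document; sorting fixes the pair order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Hypothetical sample documents.
docs = [
    "text mining finds patterns",
    "text mining needs data",
    "data needs cleaning",
]
pair_counts = cooccurrence_counts(docs)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

Real articles would need proper tokenisation and stopword removal before counting; this only shows the counting step.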

MariusHelf

Hi,

In Process Documents, did you remove stopwords with the Filter Stopwords operator? That will most likely remove frequent words such as "also", "and", "I" etc. and thus clean up your association rules a bit.
Furthermore, to use FP-Growth and Association Rules you most probably want to use the "binary occurrences" mode for the word vector creation in Process Documents, so that each word is only marked as present or absent in a document.

Best regards,
Marius
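
As an illustration of the two suggestions above, the sketch below filters stopwords, reduces each document to a binary "word present or not" set, and computes the confidence of a rule such as premise --> conclusion. This is plain Python with a direct confidence calculation, not FP-Growth and not RapidMiner's operators; the stopword list, the sample documents, and the names binary_word_sets and rule_confidence are assumptions made up for this example.

```python
# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"also", "and", "i", "the", "is", "to"}


def binary_word_sets(documents, stopwords=STOPWORDS):
    """Turn each document into the set of non-stopword words it contains."""
    return [
        {w for w in text.lower().split() if w not in stopwords}
        for text in documents
    ]


def rule_confidence(word_sets, premise, conclusion):
    """Share of documents containing `premise` that also contain `conclusion`."""
    with_premise = [s for s in word_sets if premise in s]
    if not with_premise:
        return 0.0
    return sum(conclusion in s for s in with_premise) / len(with_premise)


# Hypothetical sample documents.
docs = [
    "I also like text mining",
    "text mining is also able to scale",
    "the model and the data",
]
word_sets = binary_word_sets(docs)
print(rule_confidence(word_sets, "text", "mining"))
print(rule_confidence(word_sets, "also", "able"))
```

With "also" on the stopword list, a rule like also-->able can no longer be formed, because the word never reaches the word sets.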