"Doubts on the text plug-in"
Lorenzo
New Altair Community Member
Good afternoon everyone.
I'm very new to text mining and to using RapidMiner. I have used the RapidMiner text plug-in and have a few questions for anyone kind enough to answer.
1) When I process a group of texts, I get a large matrix where the items (documents) are the rows and the features (the stems) are the columns. What is the metric that fills each cell (I want to be sure about the meaning of the number inside each cell)? Can I change it? How?
2) In your opinion, which of these metrics (if there is more than one) is most suitable for using the matrix in cluster analysis?
3) The stemmer and the tokenizer split my text into single words (if the text is "always happy or sad", I get the stems corresponding to always, happy, ...).
Is it possible in RapidMiner to work not on single words but on groups of words? In medical and scientific texts I very often have terms such as "acetic anhydride" that should be treated as a single token.
I apologize for always being too verbose.
Thanks for your kind attention, and I hope someone can help.
Lorenzo
Answers
Hello Lorenzo,
ad 1)
the values are usually the TF-IDF (term frequency - inverse document frequency) values for all terms (just google this). Which tokens are taken into account and how they are transformed is determined by the inner operators of the TextInput operator. You can also select plain term frequency (without normalizing by the inverse document frequency) or just binary occurrences, i.e. a flag indicating whether or not the word appears in the corresponding text. The corresponding parameter is called "vector_creation".
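To make the three options a bit more concrete, here is a minimal sketch in plain Python (the documents and terms are made up, and the exact normalization used inside the plug-in may differ):

```python
import math

# Toy corpus: each document is already tokenized and stemmed.
docs = [
    ["always", "happy"],
    ["always", "sad"],
    ["happy", "happy", "sad"],
]
vocab = sorted({t for d in docs for t in d})

def term_frequency(doc, term):
    # Relative frequency of the term within one document.
    return doc.count(term) / len(doc)

def inverse_document_frequency(term):
    # Down-weights terms that occur in many documents.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

for doc in docs:
    tf     = [term_frequency(doc, t) for t in vocab]
    tfidf  = [term_frequency(doc, t) * inverse_document_frequency(t) for t in vocab]
    binary = [1 if t in doc else 0 for t in vocab]
    print(tf, tfidf, binary)
```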
ad 2)
Actually, I always use TF-IDF, since it is the only measure that weights the terms according to how characteristic they are for their documents.
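Just as an illustration of how such a weighted document-term matrix feeds into clustering, here is a small sketch using scikit-learn (not the text plug-in; the example documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "always happy or sad",
    "acetic anhydride reacts with water",
    "the patient was happy",
    "acetic anhydride is an organic compound",
]

# Build the document-term matrix with TF-IDF weights,
# then cluster the documents on those vectors.
matrix = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
print(labels)  # e.g. [0 1 0 1] -- the chemistry texts end up in the same cluster
```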
ad 3)
Just add the operator "TermNGramGenerator" as an additional inner operator to create this type of word pairs / tuples.
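For anyone curious what such token n-grams look like outside of RapidMiner, here is a small sketch using scikit-learn's CountVectorizer (just an illustration of the idea, not the TermNGramGenerator itself):

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["acetic anhydride should be a single feature"]

# ngram_range=(1, 2) keeps single words and adds adjacent word pairs,
# so "acetic anhydride" becomes one feature alongside "acetic" and "anhydride".
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(text)
print(vectorizer.get_feature_names_out())
```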
Cheers,
Ingo