"Personalized Selection of Terms (Words)"

Question

Hi, Everybody

I don't know if this topic is in the correct place. Anyway...

Is it possible to make a filtering of the terms as follows in RapidMiner?

Suppose:

NA - Number of occurrences of a word in Class A
NB - Number of occurrences of a word in Class B
NC - Number of occurrences of a word in Class C
Total = NA + NB + NC

Only the terms that meet the following criterion are kept:

(NA / Total) * 100% > X% or (NB / Total) * 100% > Y% or (NC / Total) * 100% > Z%

Is it possible?
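
To make the criterion concrete, here is a minimal sketch in plain Python (outside RapidMiner, so it does not use any RapidMiner operator). The function name `keep_term`, the example terms, their counts and the 60% thresholds are all made up for illustration.

```python
def keep_term(counts, thresholds):
    """Return True if the term's share in at least one class exceeds
    that class's threshold.

    counts:     class name -> occurrences of the term in that class (NA, NB, NC)
    thresholds: class name -> minimum share in percent (X, Y, Z)
    """
    total = sum(counts.values())
    if total == 0:
        return False  # the term never occurs, so there is nothing to keep
    return any(
        counts[cls] / total * 100 > thresholds[cls]
        for cls in counts
    )


# Hypothetical per-class counts for two terms (illustration only)
term_counts = {
    "rally": {"A": 45, "B": 3, "C": 2},
    "market": {"A": 33, "B": 33, "C": 34},
}
# Assumed thresholds X, Y, Z in percent
thresholds = {"A": 60, "B": 60, "C": 60}

kept = [term for term, counts in term_counts.items()
        if keep_term(counts, thresholds)]
print(kept)  # ['rally']
```

Here "rally" is kept because 45 of its 50 occurrences (90%) fall in Class A, while "market" is dropped because no class holds more than 34% of its occurrences.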

gustavo_medeiro · Answer

The classes are split into three: "label1 = -1" (Dow Jones low), "label1 = 0" (neutral) and "label1 = 1" (Dow Jones high).

We can call them class -1, class 0 and class 1.

I have already removed all irrelevant symbols. Another post, which I had not seen before (this is the first time I have actually used the forum and I was not sure how it works), recommended using "Generate Attributes" and "Filter Examples".

I know the path is something like this, but my question is how to fill in the fields in these boxes. I have no idea how to make RapidMiner understand that I want to filter out the words whose frequencies of occurrence are similar across the classes.

For example, it does not help me if the same word occurs 33% in class 1, 33% in class 0 and 33% in class -1. If a word is balanced across the classes like this, it is irrelevant for all of them.
Do you understand what I mean?

JEdward · Answer

Yes it is.

How are you splitting up your words into Class A, Class B & Class C?  
Do you have dictionary lists of wordX = A, wordY = A, wordZ = C?  Or is it a model you use?

Within your document processing you can use Stem(Dictionary) or Stem(Wordnet) to do this: replace all occurrences of certain words with ClassA, ClassB & ClassC, and then remove all other tokens (as they are, I presume, irrelevant to your analysis).

Another option: after Process Documents has completed, you could use Generate Aggregate or Generate Attributes to get the sums.
Does that help?
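
As a rough illustration of what "getting the sums" means, here is a small plain-Python sketch (not RapidMiner operators) that adds up term counts per class from labelled documents. The `documents` list, its field names and the helper `per_class_totals` are hypothetical; in RapidMiner the equivalent data would be the word vectors coming out of Process Documents together with the label attribute.

```python
from collections import defaultdict

# Hypothetical word vectors: one entry per document, with its class label
# and the number of times each term occurs in it (illustration only).
documents = [
    {"label": -1, "counts": {"crash": 3, "market": 1}},
    {"label": 0, "counts": {"market": 1}},
    {"label": 1, "counts": {"rally": 4, "market": 1}},
    {"label": 1, "counts": {"rally": 2}},
]


def per_class_totals(docs):
    """Sum the occurrences of every term separately for each class label."""
    totals = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for term, n in doc["counts"].items():
            totals[term][doc["label"]] += n
    # Convert nested defaultdicts to plain dicts for easier inspection
    return {term: dict(by_class) for term, by_class in totals.items()}


totals = per_class_totals(documents)
print(totals["market"])  # {-1: 1, 0: 1, 1: 1}
print(totals["rally"])   # {1: 6}
```

With these sums per term, a word like "market" that is spread evenly over the three classes can be told apart from a word like "rally" that occurs in only one class.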