Text Classification
Hi there!
I searched this forum for something that would help me but couldn't find anything. Hopefully someone can answer so that I can solve the issue.
Let me first briefly describe the task. I have 2 datasets, each with 2 columns: sentence and label. There are 2 possible labels: true or false. I also have 3 dictionaries of phrases (they can be unigrams, bigrams, trigrams, ...).
What I want to do:
1) Train an SVM classifier on dataset1 and test it on the same dataset (I did this successfully with cross-validation).
2) Train an SVM classifier on dataset2 and apply the model to dataset1.
3) Use the dictionaries of phrases as features for dataset1.
My questions:
1) As far as I understand, if I want to train a model on one dataset and test it on another, I have to use the same set of features. So I am using the operator "Process documents from data" with the same steps inside (tokenizer, stemming, filtering out stopwords, ...). Then I take the wordlist of dataset2 and try to feed it into the next "Process documents from data" as its wordlist input.
But while running I get this error message:
In WikiTraining I have 10000 sentences, in debates 2000.
But I don't understand the problem. Can someone please explain it to me and tell me how I can avoid it?
2) How can I use separate CSV files with phrases (let's call them dictionaries) as features in a dataset? Let's say that my dictionary contains only triggers, which indicate that a sentence is of class TRUE. How can I do that?
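To make the "same set of features" idea concrete outside RapidMiner, here is a minimal plain-Python sketch. It is only an illustration under assumptions: the function names and the toy sentences are made up, and a simple regex split stands in for the tokenizer, stemming and stopword steps. The point is that the word list is built once from the training data (dataset2) and then reused unchanged to vectorize the test data (dataset1), so both end up with identical columns.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters (stand-in for the real preprocessing)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_wordlist(sentences):
    """Collect the sorted vocabulary of the training sentences only."""
    vocab = set()
    for sentence in sentences:
        vocab.update(tokenize(sentence))
    return sorted(vocab)

def vectorize(sentence, wordlist):
    """One count per word-list entry; words outside the word list are ignored."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in wordlist]

# Hypothetical toy data standing in for dataset2 (training) and dataset1 (test).
train_sentences = ["Some experts disagree", "The vote passed"]
test_sentences = ["Experts say the vote failed"]

wordlist = build_wordlist(train_sentences)            # built from training data only
train_vectors = [vectorize(s, wordlist) for s in train_sentences]
test_vectors = [vectorize(s, wordlist) for s in test_sentences]
```

Words that appear only in the test sentences ("say", "failed") get no column, which is what a model trained on the first word list expects.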
Thank you in advance!
Thank you for your reply.
One last question.
The WordNet dictionary is basically... a dictionary where 1 observation is 1 word.
What I need is a bit different: I want to see, let's say, "some experts", "in a new direction", "some challenges". So 2 or more words as one observation.
As a result, I want each feature of my SVM classifier to be one of the quoted phrases above.
Do you have any hint/idea on it as well?
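To illustrate what "one feature per phrase" could look like, here is a small plain-Python sketch. It is not a RapidMiner process, and the function names are hypothetical: each phrase becomes one 0/1 column that is set when the phrase occurs in the sentence as a contiguous run of tokens.

```python
import re

def tokenize(text):
    """Lowercase and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

def phrase_features(sentence, phrases):
    """Return one 0/1 feature per phrase (1 if it occurs as a contiguous token run)."""
    tokens = tokenize(sentence)
    features = []
    for phrase in phrases:
        target = tokenize(phrase)
        n = len(target)
        # Slide a window of n tokens over the sentence; empty phrases never match.
        hit = n > 0 and any(
            tokens[i:i + n] == target for i in range(len(tokens) - n + 1)
        )
        features.append(1 if hit else 0)
    return features

phrases = ["some experts", "in a new direction", "some challenges"]
row = phrase_features(
    "Some experts say the field is moving in a new direction.", phrases
)
```

Rows like this, one per sentence, could then be given to the SVM in place of (or next to) the single-word features.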
OK, let me understand this a bit better. Do you want to train a model on those sentences? So you would have a dataset with an attribute column containing "in a new direction" or "this is terrible", with the corresponding labels "positive" and "negative" respectively? If yes, you might want to change the parameter on the tokenizer from non-letters to linguistic sentences and try again.
If not, and you want it to be part of a dictionary, you should use the approach that Martin took here: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/How-to-Build-a-Dictionary-Based-Sentiment-Model-in-RapidMiner/ta-p/36067
What you would have to do is put the phrases into a CSV file, delimited by a comma or a similar separator.
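As a rough illustration of the CSV idea (a plain-Python sketch, not the process from the linked article), the snippet below reads comma-delimited phrases and counts how many of them occur in a sentence. The file content, the function names and the use of a single count feature per dictionary are all assumptions; an in-memory string stands in for the real CSV file.

```python
import csv
import io
import re

def load_phrases(csv_file):
    """Read every non-empty cell of a comma-delimited file as one phrase."""
    phrases = []
    for row in csv.reader(csv_file):
        for cell in row:
            cell = cell.strip().lower()
            if cell:
                phrases.append(cell)
    return phrases

def count_triggers(sentence, phrases):
    """Count the dictionary phrases that occur in the sentence as whole words."""
    text = sentence.lower()
    return sum(
        1 for phrase in phrases
        if re.search(r"\b" + re.escape(phrase) + r"\b", text)
    )

# Hypothetical dictionary content; a real run would open the CSV file instead.
dictionary_csv = io.StringIO("some experts,in a new direction\nsome challenges\n")
triggers = load_phrases(dictionary_csv)
hits = count_triggers("Some experts see some challenges ahead.", triggers)
```

With 3 dictionaries this would give 3 extra numeric columns per sentence, one count per dictionary.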