Text Classification
Hi there!
I searched this forum for something that would help but couldn't find anything. Hopefully someone can answer so that I can solve the issue.
Let me first describe the task briefly. I have 2 datasets, each with 2 columns: sentence and label. There are 2 possible labels: true or false. I also have 3 dictionaries of phrases (these can be unigrams, bigrams, 3-grams, ...).
What I want to do:
1) Train an SVM classifier on dataset1 and test it on the same dataset (I did this successfully with cross-validation).
2) Train an SVM classifier on dataset2 and apply the model to dataset1.
3) Use the dictionaries of phrases as features for dataset1.
My questions:
1) As far as I understand, if I want to train a model on one dataset and test it on another, both have to use the same set of features. So I am using the operator "Process Documents from Data" with the same steps inside (tokenizer, stemming, filtering out stopwords, ...). Then I take the wordlist of dataset2 and try to feed it into the next "Process Documents from Data" operator as its wordlist input.
But when I run the process, I get this error message:
In WikiTraining I have 10000 sentences, in debates 2000.
But I don't understand the problem. Can someone please explain it to me and tell me how I can avoid it?
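To show what I mean by "the same set of features", here is a minimal Python sketch outside RapidMiner. It is only an illustration under assumptions: the whitespace `tokenize` stands in for the real tokenizer, stemmer and stopword filter, the function names `build_wordlist` and `vectorize` are made up, and the example sentences are placeholders, not my data. The wordlist is built from the training set only and then reused to build the vectors for the other set, so both end up with identical columns.

```python
from collections import Counter

def tokenize(sentence):
    # Placeholder preprocessing: lowercase and split on whitespace.
    return sentence.lower().split()

def build_wordlist(sentences):
    # Distinct tokens of the training set, in a fixed (sorted) order.
    return sorted({tok for s in sentences for tok in tokenize(s)})

def vectorize(sentences, wordlist):
    # One count vector per sentence; tokens missing from the wordlist are ignored.
    index = {word: i for i, word in enumerate(wordlist)}
    rows = []
    for s in sentences:
        row = [0] * len(wordlist)
        for tok, n in Counter(tokenize(s)).items():
            if tok in index:
                row[index[tok]] = n
        rows.append(row)
    return rows

# Hypothetical stand-ins for the training set (dataset2) and the test set (dataset1).
train_sentences = ["the senate passed the bill", "the bill failed"]
test_sentences = ["the house debated the bill today"]

wordlist = build_wordlist(train_sentences)       # built from the training data only
X_train = vectorize(train_sentences, wordlist)
X_test = vectorize(test_sentences, wordlist)     # same columns as X_train
```

Words that occur only in the test sentences ("house", "debated", "today") get no column, which is what I would expect when a model trained on one dataset is applied to another.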
2) How can I use separate CSV files with phrases (let's call them dictionaries) as features in a dataset? Let's say one dictionary contains only triggers, which indicate that a sentence is of class TRUE. How can I do that?
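One way I picture the dictionary features, sketched in Python rather than RapidMiner: each dictionary becomes one numeric column that counts how many of its phrases occur in the sentence. Everything here is an assumption for illustration: the dictionary names, the phrases, and the helper functions are hypothetical, and in practice the phrases would be read from the CSV files instead of being written inline.

```python
def tokenize(sentence):
    # Placeholder preprocessing: lowercase and split on whitespace.
    return sentence.lower().split()

def count_phrase_hits(tokens, phrases):
    # Count positions where a dictionary phrase occurs as a contiguous token n-gram.
    phrase_set = {tuple(tokenize(p)) for p in phrases if p.strip()}
    hits = 0
    for n in {len(p) for p in phrase_set}:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) in phrase_set:
                hits += 1
    return hits

def dictionary_features(sentence, dictionaries):
    # One feature per dictionary, in the insertion order of the dict.
    tokens = tokenize(sentence)
    return [count_phrase_hits(tokens, phrases) for phrases in dictionaries.values()]

# Hypothetical dictionaries; real ones would be loaded from the CSV files.
dictionaries = {
    "triggers_true": ["according to", "studies show"],
    "hedges": ["maybe", "i think"],
}

features = dictionary_features("According to the report studies show growth", dictionaries)
```

These columns could then be appended to the word-vector columns of each sentence before training the SVM, so a sentence containing many TRUE triggers gets a high value in the corresponding column.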
Thank you in advance!