Text Classification

Hi there!

I tried to find something on this forum that would help me but couldn't. Hopefully someone can answer so that I can solve the issue.

Let me first describe the task briefly. I have 2 datasets, each containing 2 columns: sentence and label. There are 2 possible labels, true or false. I also have 3 dictionaries of phrases (they can be unigrams, bigrams, 3-grams, ...).

What I want to do:

1) Train an SVM classifier on dataset1 and test it on the same dataset (I did this successfully with cross-validation).

2) Train an SVM classifier on dataset2 and apply the model to dataset1.

3) Use the dictionaries of phrases as features for dataset1.

My questions:

1) As far as I understand, if I want to train a model on one dataset and test it on another, both have to use the same set of features. So I am using the operator "Process Documents from Data" with the same steps inside (tokenizing, stemming, filtering out stopwords, ...). Then I take the wordlist of dataset2 and try to feed it as the wordlist input to the next "Process Documents from Data".

[Screenshot 2017-04-29 at 14.55.01.png]

But while running I get this error message:

[Screenshot 2017-04-29 at 14.56.31.png]

In WikiTraining I have 10000 sentences, in debates 2000.

But I don't understand the problem. Can someone please explain it and tell me how I can avoid it?

2) How can I use separate CSV files with phrases (let's call them dictionaries) as features in a dataset? Let's say that my dictionary contains only triggers, each of which indicates that a sentence is of class TRUE. How can I do that?

Thank you in advance!
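The idea behind question 1 (one wordlist shared by the training and the test data) can be sketched outside RapidMiner in plain Python. This is only an illustration under assumptions: the `tokenize`, `build_wordlist`, and `vectorize` helpers and the example sentences are made up here, and stemming and stopword filtering are left out. It is not what "Process Documents from Data" does internally.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]


def build_wordlist(sentences):
    """Collect the sorted vocabulary of the training sentences."""
    vocab = set()
    for sentence in sentences:
        vocab.update(tokenize(sentence))
    return sorted(vocab)


def vectorize(sentences, wordlist):
    """Count each wordlist entry per sentence; other words are ignored."""
    rows = []
    for sentence in sentences:
        counts = Counter(tokenize(sentence))
        rows.append([counts[word] for word in wordlist])
    return rows


# Hypothetical stand-ins for dataset2 (training) and dataset1 (test).
train_sentences = ["Taxes will rise next year", "The debate was long"]
test_sentences = ["Taxes were the topic of the debate", "Unseen words vanish"]

# The wordlist is built once, from the training data only ...
wordlist = build_wordlist(train_sentences)

# ... and reused for both sets, so the columns line up exactly.
X_train = vectorize(train_sentences, wordlist)
X_test = vectorize(test_sentences, wordlist)
```

Words that occur only in the test data get no column at all, which is why a model trained on `X_train` can be applied to `X_test` without a feature mismatch.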

Accepted answers

Thomas_Ott

OK, let me understand this a bit better. Do you want to train a model on those sentences? So you would have a data set with an attribute column containing "in a new direction" or "this is terrible", with the corresponding labels "positive" and "negative" respectively? If yes, you might want to change the parameter on the tokenizer from non-letters to linguistic sentences, and try again.

If not, and you want it to be part of a dictionary, you should use the approach that Martin took here: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/How-to-Build-a-Dictionary-Based-Sentiment-Model-in-RapidMiner/ta-p/36067

What you would have to do is put them into a CSV file and delimit using a comma or something.

All comments

Thomas_Ott

This error means that whatever you called your label column ended up being a word that was tokenized in your sentences/documents. Rename your label column to something like "_label" and try again.
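A rough way to picture the clash, as a plain Python sketch rather than anything RapidMiner-specific: every token becomes a column, so a sentence containing the word "label" produces a column with the same name as the label column. The `to_row` helper and the error text below are hypothetical and are not the message from the screenshot.

```python
import re


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]


def to_row(sentence, label, label_column="label"):
    """Build one row: a count column per token plus the label column.

    Raises ValueError if a token has the same name as the label column.
    """
    row = {}
    for tok in tokenize(sentence):
        row[tok] = row.get(tok, 0) + 1
    if label_column in row:
        raise ValueError(f"duplicate attribute name: {label_column!r}")
    row[label_column] = label
    return row


sentence = "The label on the bottle was wrong"

# With the default column name, the token "label" collides with it.
try:
    to_row(sentence, "false")
    collided = False
except ValueError:
    collided = True

# "_label" cannot be produced by a letters-only tokenizer, so it is safe.
row = to_row(sentence, "false", label_column="_label")
```

The leading underscore works because the tokenizer in this sketch splits on non-letters, so no token can ever start with an underscore.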

ar4o

Thank you Thomas_Ott!

I didn't even consider that this could cause a problem, but sure! Thank you!

And can anyone give any advice regarding the second question?

Thomas_Ott

Take a look at the knowledge base article on creating your own sentiment dictionary. I'm currently AFK, so this is a short reply.

ar4o

I couldn't find anything helpful, only information about using existing dictionaries, and most of the advice is based on installing the extension for a specific dictionary.

ar4o

Does anyone else have some advice or links? I'm not asking for solutions.

Telcontar120

The Wordnet extension (free in the Marketplace) has an operator that allows you to use a custom sentiment dictionary in the SentiWordnet format. See that extension for more details.

ar4o

Thank you for your reply.

One last question.

The WordNet dictionary is basically a dictionary where 1 observation is 1 word.

What I need is a bit different: I want to use, let's say, "some experts", "in a new direction", "some challenges", so 2 or more words as one observation.

As a result, each feature of my SVM classifier would be one of the quoted phrases above.

Do you have any hint/idea on it as well?

Thomas_Ott

What you would have to do is put them into a CSV file and delimit using a comma or something.
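As a rough illustration of the CSV idea outside RapidMiner, here is a minimal Python sketch that reads multi-word phrases from a comma-delimited file and turns each phrase into one 0/1 feature per sentence. The helper names, the file layout, and the example sentences are assumptions made for this sketch; the phrases are the ones quoted earlier in the thread.

```python
import csv
import io
import re


def normalize(text):
    """Lowercase and collapse every run of non-letters into one space."""
    return " ".join(tok for tok in re.split(r"[^a-z]+", text.lower()) if tok)


def load_phrases(csv_file):
    """Read phrases from a CSV file object; each cell holds one phrase."""
    phrases = []
    for record in csv.reader(csv_file):
        for cell in record:
            phrase = normalize(cell)
            if phrase and phrase not in phrases:
                phrases.append(phrase)
    return phrases


def phrase_features(sentence, phrases):
    """Return 1 or 0 per phrase, matching on whole words only."""
    padded = " " + normalize(sentence) + " "
    return [1 if " " + phrase + " " in padded else 0 for phrase in phrases]


# In-memory stand-in for a dictionary CSV file (hypothetical layout).
dictionary_csv = io.StringIO("some experts,in a new direction\nsome challenges\n")
phrases = load_phrases(dictionary_csv)

sentences = [
    "Some experts say the party moves in a new direction.",
    "There are some challenging questions.",
]

# One row per sentence, one column per dictionary phrase.
X = [phrase_features(s, phrases) for s in sentences]
```

Each column of `X` corresponds to one phrase, so a classifier trained on it sees the phrases themselves as features instead of single words.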

ar4o

I haven't tried your proposal yet, but it sounds like what I have been looking for!

Thank you very much!