TF-IDF and aspect grouping with RapidMiner

User: "HeikoeWin786"
New Altair Community Member
Updated by Jocelyn
Dear all,
I am new to RapidMiner and have a few questions; I would really appreciate your support.
I have an airline dataset labelled with sentiment (pos, neg, and neutral).
I split the dataset 75/25 and performed the text processing (i.e. nominal to text, data to document, preprocess document with tokenization and stopword removal).
Q1: However, in the word list output of the preprocess document operator, the neg, pos and neutral columns are all zero. Is this normal, or am I missing something?
Q2: I want to perform aspect categorization, i.e. I have 5 topics as aspect groups (e.g. flight, service, ...). The TF-IDF output contains the highest-frequency words, and I want to group those words under the 5 topics. After that, I will run Naive Bayes classification to get the sentiment for each aspect. Is there an efficient way to do this in RapidMiner?

I am a real beginner in RapidMiner and I am sorry if these are very basic questions, but I hope you can help me learn this.

Thanks and regards,
Hikoe

    User: "kayman"
    New Altair Community Member
    Accepted Answer
    In the workflow, pre-process --> TF-IDF is what the "process document" operator does, right? The exa output provides the dataset that will be the input for NBC, and the word output from the operator is just for analysing the TF-IDF, correct? So, for NBC, we need to use the exa output of the "process document" operator.

    • Yes, but you can also use this word list as a filter for your unseen (new) data. This is not needed for NBC, but some other models require it. It's always good practice.
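
To illustrate the idea of using the training word list as a filter for new data, here is a rough sketch in plain Python rather than RapidMiner. The function names (`tokenize`, `build_word_list`, `vectorize`) are made up for this example, and the weighting (relative term frequency times `log(n/df) + 1`) is just one common TF-IDF variant; I am not claiming it matches what the RapidMiner operator computes.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]


def build_word_list(train_docs):
    """Learn {word: idf} from the training documents only (hypothetical helper)."""
    n = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    return {w: math.log(n / df) + 1.0 for w, df in doc_freq.items()}


def vectorize(doc, word_list):
    """TF-IDF vector for one document, restricted to the training word list."""
    # Words that never occurred in training are dropped here: this is the "filter".
    counts = Counter(t for t in tokenize(doc) if t in word_list)
    total = sum(counts.values())
    return {
        w: (counts[w] / total) * word_list[w] if total else 0.0
        for w in sorted(word_list)
    }


train_docs = ["the flight was late", "the service was great"]
word_list = build_word_list(train_docs)

# The unseen document gets exactly the same columns as the training data;
# "terrible" and "crew" were not in the training word list, so they are ignored.
new_vector = vectorize("late flight terrible crew", word_list)
print(sorted(new_vector))
```

The point is only that new data ends up with the same attributes as the training data, which is what models that expect a fixed set of columns need.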

    But is the performance matrix (confusion matrix) computed on the training data or the test data? I ask because it shows 83% for training but 0% for test.

    • You will have to ignore these figures, as they were created by a non-optimal workflow and are therefore not useful. I think in the example I provided the x-fold validation gives you both training and test accuracy, while the single validation gives only the test accuracy. Test accuracy is the more important figure anyway, but it's always good to compare training and test accuracy to see if there are major differences.
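
For anyone who wants to see what comparing training and test accuracy over several folds looks like in code, below is a minimal sketch in plain Python. It is not the RapidMiner implementation: the Naive Bayes here is a simple multinomial version with add-one smoothing on raw word counts, the folds are taken round-robin without stratification, and all names and the toy sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on whitespace tokens."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in doc.lower().split():
            if token in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs, labels):
    hits = sum(predict_nb(model, d) == y for d, y in zip(docs, labels))
    return hits / len(docs)


def cross_validate(docs, labels, k=3):
    """Return (mean training accuracy, mean test accuracy) over k folds.

    Assumes len(docs) >= k so that every fold has at least one test example.
    """
    pairs = list(zip(docs, labels))
    train_scores, test_scores = [], []
    for fold in range(k):
        test_idx = set(range(fold, len(pairs), k))  # round-robin fold assignment
        train = [p for i, p in enumerate(pairs) if i not in test_idx]
        test = [p for i, p in enumerate(pairs) if i in test_idx]
        model = train_nb(*zip(*train))
        train_scores.append(accuracy(model, *zip(*train)))
        test_scores.append(accuracy(model, *zip(*test)))
    return sum(train_scores) / k, sum(test_scores) / k


docs = [
    "great flight friendly crew", "terrible delay rude staff",
    "great service friendly staff", "terrible flight long delay",
    "friendly crew great seats", "rude crew terrible food",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

train_acc, test_acc = cross_validate(docs, labels, k=3)
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large gap between the two numbers (training much higher than test) usually points to overfitting or to a problem in how the test data was prepared.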