"Text mining from Excel file and Split validation"
hi.
Thanks to my teacher I've entered the fantastic world of RapidMiner. I love it, even though I'm still a newbie.
I'm trying to build a text classification model starting from an Excel file with two columns:
Column 1: attribute (text)
Column 2: label (binominal: simply 0 for a negative review and 1 for a positive review)
Up till now we have only worked with positive reviews stored as .txt files in one folder and negative reviews as .txt files in another folder, and we defined the two of them as the positive class and the negative class.
I've tried to proceed like this: Read Excel - Process Documents (Tokenize, remove stopwords, transform cases) - Validation (training with SVM + Apply Model and Performance).
I've used Nominal to Numerical to avoid SVM capacity problems, but as a result I get only the root mean squared error in the Performance vector.
I was looking for the accuracy of my model instead... sorry for the bad question, I hope somebody can help.
Can I use a .txt file as an alternative? See attached file.
thanks a lot in advance
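For reference, the flow described above (read a two-column table, tokenize/filter stopwords/lowercase, train an SVM on a split, and measure classification performance) can be sketched outside RapidMiner in scikit-learn. This is only an illustrative sketch: the column names "text" and "label" and the inline toy data are assumptions, and the commented-out `read_excel` line stands in for the real input file. Note that the label is kept nominal (classification); converting the label to numeric is likely what makes a learner report a regression metric such as RMSE instead of accuracy.

```python
# Sketch of the described flow in scikit-learn (not RapidMiner itself).
# Toy data stands in for the Excel file; "text"/"label" column names are assumed.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

df = pd.DataFrame({
    "text": ["great movie, loved it", "terrible plot, boring",
             "wonderful acting", "awful and dull"] * 10,
    "label": [1, 0, 1, 0] * 10,
})
# df = pd.read_excel("reviews.xlsx")  # the real input in the post

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.3, random_state=42, stratify=df["label"])

# ~ Process Documents: tokenize, remove English stopwords, lowercase
vec = TfidfVectorizer(stop_words="english", lowercase=True)
clf = LinearSVC().fit(vec.fit_transform(X_train), y_train)

# Classification performance: accuracy, not RMSE
acc = accuracy_score(y_test, clf.predict(vec.transform(X_test)))
print(f"accuracy: {acc:.2%}")
```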
Thanks a lot, it works.
I have a question regarding the degree of accuracy. I got 56% here. Is it possible to raise it by adding more reviews to my corpus?
Hopefully the other reviews won't make it worse. Do you think it makes sense to work with 2,000 more reviews from 2 different platforms, or would that make things worse?
Thanks a lot again
PerformanceVector:
accuracy: 56.00%
ConfusionMatrix:
True: 0 1
0: 41 29
1: 59 71
AUC: 0.614 (positive class: 1)
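The 56% figure follows directly from the confusion matrix above: accuracy is the diagonal (correct predictions) divided by the total. A quick check:

```python
# Verify the posted accuracy from the confusion matrix.
# RapidMiner layout: rows are predictions, columns are true classes.
#            true 0   true 1
# pred 0:      41       29
# pred 1:      59       71
correct = 41 + 71          # diagonal: correctly classified examples
total = 41 + 29 + 59 + 71  # all 200 examples
accuracy = correct / total
print(f"accuracy: {accuracy:.2%}")  # 56.00%
```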
There are lots of ways to possibly improve your classification results. Some that could help right off the bat are pruning, n-grams, and filtering out short words. You might want to review how you tokenize the words too: if you have lots of numbers in the corpus, the default tokenization parameter of 'non letters' will wipe those out.
Next you can use another algorithm, like Linear SVM or Deep Learning. I would use them in conjunction with a Cross Validation, not a Split Validation.
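The cross-validation advice can be sketched in scikit-learn terms (a hedged sketch with toy data, not the RapidMiner operator): instead of one train/test split, the score is averaged over several folds, which gives a more stable estimate on a small corpus.

```python
# Cross-validation: average accuracy over 10 folds instead of one split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["good fun great", "bad boring awful",
         "lovely charming", "dreadful weak"] * 25
labels = [1, 0, 1, 0] * 25

# A pipeline re-fits the vectorizer inside each fold, avoiding leakage
# of test-fold vocabulary into training.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(model, texts, labels, cv=10, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```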
Text processing is almost as much an art form as it is analytics; it requires some thinking from the domain expert. I don't know what corpus you're trying to classify, but sometimes 3%/30% pruning is right, other times 5%/80% is good. The short answer is that it depends.
Of course, if you used an Optimize Parameters operator, you could tune the actual pruning percentages to find the optimal values for the best performance measure.
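For those more comfortable in code, the pruning percentages above map roughly onto `min_df`/`max_df` in scikit-learn's `TfidfVectorizer` (drop terms rarer than X% or more common than Y% of documents), and `GridSearchCV` plays the role of the Optimize Parameters operator. A sketch with toy data, not the RapidMiner operator itself:

```python
# Tune pruning thresholds (document-frequency cutoffs) by grid search,
# analogous to Optimize Parameters over the pruning percentages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["good fun great", "bad boring awful",
         "lovely charming", "dreadful weak"] * 25
labels = [1, 0, 1, 0] * 25

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
grid = GridSearchCV(
    pipe,
    {"tfidf__min_df": [0.01, 0.03, 0.05],   # prune terms rarer than this
     "tfidf__max_df": [0.3, 0.8, 1.0]},     # prune terms more common than this
    cv=5, scoring="accuracy",
)
grid.fit(texts, labels)
print(grid.best_params_, f"best CV accuracy: {grid.best_score_:.2%}")
```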
With respect to tokenization, I talk about that in my video here: https://www.youtube.com/watch?v=ia2iV5Ws3zo. I do a lot of Twitter mining, so a hashtag like #datascience would be obliterated using the 'non letters' parameter, whereas with 'specify character' I could just split on ".,![]".
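The difference between the two tokenization modes can be shown with plain regular expressions (a sketch approximating the two RapidMiner settings, not the operator's exact implementation):

```python
# 'non letters' vs. 'specify character' tokenization, approximated with regex.
import re

tweet = "Loving #datascience in 2021! Try it, really."

# ~ 'non letters': every non-letter character is a split point,
# so '#' and digits destroy hashtags and years.
non_letters = [t for t in re.split(r"[^a-zA-Z]+", tweet) if t]

# ~ 'specify character': split only on ".,![]" plus whitespace,
# so '#datascience' and '2021' survive intact.
specified = [t for t in re.split(r"[.,!\[\]\s]+", tweet) if t]

print(non_letters)
print(specified)
```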
hi,
It represents a confidence threshold. ROC is calculated like this:
Take a confidence threshold of 0.99 and calculate TPR/FPR for this -> data point
Take a confidence threshold of 0.98 and calculate TPR/FPR for this -> data point
The red curve is the TPR/FPR values. The blue curve shows the corresponding thresholds used to get these values.
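The threshold sweep Martin describes can be written out in a few lines (a small illustrative sketch with made-up confidences, not RapidMiner's internal code): at each threshold, everything with confidence at or above it is predicted positive, and TPR/FPR give one point on the red curve.

```python
# Build ROC points by sweeping a confidence threshold.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]                          # true labels
conf   = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]  # P(class=1)

points = []
for thr in [0.99, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10, 0.0]:
    pred = [1 if c >= thr else 0 for c in conf]
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, y_true))
    tpr = tp / sum(y_true)                   # true positive rate
    fpr = fp / (len(y_true) - sum(y_true))   # false positive rate
    points.append((thr, fpr, tpr))
    print(f"thr={thr:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point from (0, 0) toward (1, 1): both TPR and FPR can only rise as more examples are predicted positive.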
Best,
Martin
Hi Martin, and thanks for the answer.
I can understand what the ROC curve is, but it's the threshold curve (blue) that confuses me.
I watched several videos about it and thought the same: when the threshold is high, I have a higher TPR (because I "accept" only high predictive probabilities, so it's easier to get it predicted right), whereas when the threshold is low (for instance <0.5 predictive probability) I see a higher FPR.
Also, I tried TF-IDF without pruning, and my accuracy skyrocketed!
Hi @federico_schiro,
Have you tried using the Performance (Classification) or Performance (Binominal Classification) operators
instead of the plain Performance operator?
Regards,
Lionel