"Some help for training a regression algorithm [SOLVED]"

manwann New Altair Community Member
edited November 5 in Community Q&A
Hi dear rapid-i community,

I am testing RapidMiner's modeling capabilities to build a content-based recommender system. To do that I downloaded the MovieLens 100K dataset, which contains information about movies and the ratings users gave them (http://www.grouplens.org/node/73). The ratings range from 0 to 5 and the movies carry genre information (action, comedy, etc.). I am training a classifier on the user with the most ratings (uid = 405; number of reviews = 737). To do that I discretize the rating label (good >= 3.5; bad < 3.5; see the sketch after the table below), but because this user has many more reviews labeled bad, the classifier (LibSVM) predicts every label as bad:

                     true bad     true good     class precision
pred. bad            621          116           84.26%
pred. good           0            0             0.00%
class recall         100.00%      0.00%
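The labeling step above, as a rough Python/pandas sketch (the actual process is built with RapidMiner operators; the file name and column layout of MovieLens 100K's u.data are assumptions):

    import pandas as pd

    # u.data from MovieLens 100K: tab-separated user id, item id, rating, timestamp
    ratings = pd.read_csv("u.data", sep="\t",
                          names=["user_id", "item_id", "rating", "timestamp"])

    # keep only the most active user (uid = 405)
    user = ratings[ratings["user_id"] == 405].copy()

    # discretize the numeric rating into a binary label: good >= 3.5, bad < 3.5
    user["label"] = (user["rating"] >= 3.5).map({True: "good", False: "bad"})

    # the class counts show the imbalance that makes the SVM predict only "bad"
    print(user["label"].value_counts())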

So I used another strategy: I applied stratified sampling (http://rapid-i.com/rapidforum/index.php/topic,2190.0.html) to balance the good and bad labels (sketched in code further below). I got the following results:

                     true bad     true good     class precision
pred. bad            58           80            42.03%
pred. good           57           35            38.04%
class recall         50.43%       30.43%


But as you can see, the performance is still not good. I would really appreciate any suggestions.
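The balanced down-sampling can be sketched roughly as follows (again assuming Python/pandas rather than the RapidMiner Sample operator used in the linked thread):

    import pandas as pd

    ratings = pd.read_csv("u.data", sep="\t",
                          names=["user_id", "item_id", "rating", "timestamp"])
    user = ratings[ratings["user_id"] == 405].copy()
    user["label"] = (user["rating"] >= 3.5).map({True: "good", False: "bad"})

    # down-sample the majority class to the size of the minority class
    n = user["label"].value_counts().min()
    balanced = (user.groupby("label", group_keys=False)
                    .apply(lambda g: g.sample(n, random_state=42)))

    # both labels now occur equally often
    print(balanced["label"].value_counts())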

Thanks.

Eduardo

Edit: Sorry for the duplicate post.

Answers

  • MariusHelf New Altair Community Member
    Stratified sampling is usually a good idea in cases like this. But now you only have relatively few training examples left, which is of course bad for the performance. Next, the performance of the SVM depends heavily on good choices for the parameters (especially C, and in case of the RBF kernel, gamma) and on the kernel you use (good choices are often linear and rbf/radial).
    To optimize them, use an Optimize Parameters (Grid) operator. Good ranges for both C and gamma are something like 10^-5 to 10^5 on a logarithmic scale; a rough sketch in code follows.
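    In code the idea looks roughly like this (a minimal scikit-learn sketch for illustration only, with synthetic data standing in for the genre features and the good/bad label; in RapidMiner you would nest the LibSVM learner inside Optimize Parameters (Grid) and a cross-validation instead):

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import GridSearchCV
        from sklearn.svm import SVC

        # synthetic stand-in for the balanced example set
        X, y = make_classification(n_samples=200, n_features=19, random_state=0)

        # grid over kernel, C and gamma: 10^-5 .. 10^5 on a logarithmic scale
        param_grid = [
            {"kernel": ["linear"], "C": np.logspace(-5, 5, 11)},
            {"kernel": ["rbf"], "C": np.logspace(-5, 5, 11),
             "gamma": np.logspace(-5, 5, 11)},
        ]

        # cross-validated grid search over the parameter combinations
        search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
        search.fit(X, y)
        print(search.best_params_, search.best_score_)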

    Best, Marius
  • manwann New Altair Community Member
    Marius thanks for your answer!

    At least now it is better to follow the classifier's prediction :) (instead of doing the opposite). The results were:

    accuracy: 59.13% +/- 7.33%

                         true bad     true good     class precision
    pred. bad            86           65            56.95%
    pred. good           29           50            63.29%
    class recall         74.78%       43.48%

    Maybe I should try the MovieLens 1M dataset.

    Thanks again.