"Some help for training a regression algorithm [SOLVED]"
manwann
New Altair Community Member
Hi dear rapid-i community,
I am testing RapidMiner's modeling tools to build a content-based recommender system. To do that I downloaded the MovieLens 100K dataset, which contains movies and the ratings users gave them (http://www.grouplens.org/node/73). The ratings range from 0 to 5, and the movies have genre information (action, comedy, etc.). I am training a classifier on the user with the most ratings (uid = 405; number of reviews = 737). To do that I discretize the rating label (good >= 3.5; bad < 3.5), but because this user has many more reviews labeled bad, the classifier (libSVM) predicts all labels as bad:
              true bad   true good   class precision
pred.bad      621        116         84.26%
pred.good     0          0           0%
class recall  100%       0%
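For reference, the discretization step described above can be sketched in plain Python (a hedged sketch, not the actual RapidMiner operator chain; the sample ratings are made up, only the 3.5 threshold comes from the post):

```python
# Binarize numeric ratings at the 3.5 threshold: good >= 3.5, bad < 3.5.
ratings = [4.0, 2.5, 3.5, 1.0, 5.0]
labels = ["good" if r >= 3.5 else "bad" for r in ratings]
print(labels)  # ['good', 'bad', 'good', 'bad', 'good']
```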
So I tried another strategy: stratified sampling (http://rapid-i.com/rapidforum/index.php/topic,2190.0.html) to balance the good and bad labels. I got the following results:
              true bad   true good   class precision
pred.bad      58         80          42.03%
pred.good     57         35          38.04%
class recall  50.43%     30.43%
But as you can see, the performance is still not good. I would really appreciate any suggestions.
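The balancing step from the linked thread can be sketched in plain Python (a hedged sketch of what the sampling achieves, not RapidMiner's own implementation; the `(example, label)` pair format is an assumption):

```python
import random

def balance_classes(rows, seed=0):
    """Downsample the majority class so both labels occur equally often.
    `rows` is a list of (example, label) pairs (assumed format)."""
    rng = random.Random(seed)
    by_label = {}
    for example, label in rows:
        by_label.setdefault(label, []).append(example)
    # Keep as many examples per class as the rarest class has.
    n = min(len(xs) for xs in by_label.values())
    balanced = []
    for label, xs in by_label.items():
        balanced.extend((x, label) for x in rng.sample(xs, n))
    return balanced

# With 621 "bad" and 116 "good" examples, this keeps 116 of each.
rows = [(i, "bad") for i in range(621)] + [(i, "good") for i in range(116)]
sample = balance_classes(rows)
print(len(sample))  # 232
```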
Thanks.
Eduardo
Edit: Sorry for the replicated message
Answers
Stratified sampling is usually a good idea in cases like this. But now you have only a few training examples left, which of course hurts performance. Next, the performance of the SVM depends heavily on good choices for the parameters (especially C, and in the case of the rbf kernel, gamma) and on the kernel you use (linear and rbf/radial are often good choices).
To optimize them, use an Optimize Parameters (Grid) operator. Good ranges for both C and gamma are something like 10^-5 to 10^5 on a logarithmic scale.
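The same grid search can be sketched with scikit-learn instead of RapidMiner's Optimize Parameters (Grid) operator (a hedged sketch; the synthetic data merely stands in for the MovieLens feature table):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data in place of the real genre features and good/bad labels.
X, y = make_classification(n_samples=200, n_features=19, random_state=0)

param_grid = {
    "kernel": ["rbf"],
    "C": np.logspace(-5, 5, 6),      # 10^-5 .. 10^5 on a log scale
    "gamma": np.logspace(-5, 5, 6),  # only relevant for the rbf kernel
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Cross-validated grid search like this avoids picking C and gamma that merely overfit one train/test split.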
Best, Marius
Marius, thanks for your answer!
At least now it is better to follow the classifier's prediction (instead of doing the opposite). The results were:
accuracy 59.13% +/- 7.33%
              true bad   true good   class precision
pred.bad      86         65          56.95%
pred.good     29         50          63.29%
class recall  74.78%     43.48%
Maybe I have to try the movielens1m dataset.
Thanks again.