"Some help for training a regression algorithm [SOLVED]"

manwann New Altair Community Member
edited November 5 in Community Q&A
Hi dear rapid-i community,

I am testing RapidMiner's modeling capabilities to build a content-based recommender system. To do that I downloaded the MovieLens 100K dataset, which contains information about movies and the ratings users gave them (http://www.grouplens.org/node/73). The ratings range from 0 to 5 and the movies carry genre information (action, comedy, etc.). I am training a classifier on the user with the most ratings (uid = 405; number of reviews = 737). To do that I discretize the rating label (good >= 3.5; bad < 3.5; see the sketch after the table below), but because this user has many more reviews labeled bad, the classifier (LibSVM) predicts every label as bad:

                     true bad     true good     class precision
pred. bad            621          116           84.26%
pred. good           0            0             0.00%
class recall         100.00%      0.00%
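The labeling step above, as a rough Python/pandas sketch (the actual process is built with RapidMiner operators; the file name and column layout of MovieLens 100K's u.data are assumptions):

    import pandas as pd

    # u.data from MovieLens 100K: tab-separated user id, item id, rating, timestamp
    ratings = pd.read_csv("u.data", sep="\t",
                          names=["user_id", "item_id", "rating", "timestamp"])

    # keep only the most active user (uid = 405)
    user = ratings[ratings["user_id"] == 405].copy()

    # discretize the numeric rating into a binary label: good >= 3.5, bad < 3.5
    user["label"] = (user["rating"] >= 3.5).map({True: "good", False: "bad"})

    # the class counts show the imbalance that makes the SVM predict only "bad"
    print(user["label"].value_counts())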

So I used another strategy: I applied stratified sampling (http://rapid-i.com/rapidforum/index.php/topic,2190.0.html) to balance the good and bad labels (sketched in code further below). I got the following results:

                     true bad     true good     class precision
pred. bad            58           80            42.03%
pred. good           57           35            38.04%
class recall         50.43%       30.43%


But as you can see, the performance is still not good. I would really appreciate any suggestions.
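The balanced down-sampling can be sketched roughly as follows (again assuming Python/pandas rather than the RapidMiner Sample operator used in the linked thread):

    import pandas as pd

    ratings = pd.read_csv("u.data", sep="\t",
                          names=["user_id", "item_id", "rating", "timestamp"])
    user = ratings[ratings["user_id"] == 405].copy()
    user["label"] = (user["rating"] >= 3.5).map({True: "good", False: "bad"})

    # down-sample the majority class to the size of the minority class
    n = user["label"].value_counts().min()
    balanced = (user.groupby("label", group_keys=False)
                    .apply(lambda g: g.sample(n, random_state=42)))

    # both labels now occur equally often
    print(balanced["label"].value_counts())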

Thanks.

Eduardo

Edit: Sorry for the duplicate post.

Answers

  • MariusHelf New Altair Community Member
    Stratified sampling is usually a good idea in cases like this. But now you only have relatively few training examples left, which is of course bad for the performance. Next, the performance of the SVM depends heavily on good choices for the parameters (especially C, and in case of the RBF kernel, gamma) and on the kernel you use (good choices are often linear and rbf/radial).
    To optimize them, use an Optimize Parameters (Grid) operator. Good ranges for both C and gamma are something like 10^-5 to 10^5 on a logarithmic scale; a rough sketch in code follows.
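    In code the idea looks roughly like this (a minimal scikit-learn sketch for illustration only, with synthetic data standing in for the genre features and the good/bad label; in RapidMiner you would nest the LibSVM learner inside Optimize Parameters (Grid) and a cross-validation instead):

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import GridSearchCV
        from sklearn.svm import SVC

        # synthetic stand-in for the balanced example set
        X, y = make_classification(n_samples=200, n_features=19, random_state=0)

        # grid over kernel, C and gamma: 10^-5 .. 10^5 on a logarithmic scale
        param_grid = [
            {"kernel": ["linear"], "C": np.logspace(-5, 5, 11)},
            {"kernel": ["rbf"], "C": np.logspace(-5, 5, 11),
             "gamma": np.logspace(-5, 5, 11)},
        ]

        # cross-validated grid search over the parameter combinations
        search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
        search.fit(X, y)
        print(search.best_params_, search.best_score_)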

    Best, Marius
  • manwann New Altair Community Member
    Marius thanks for your answer!

    At least now it is better to follow the classifier's prediction :) (instead of doing the opposite). The results were:

    accuracy: 59.13% +/- 7.33%

                         true bad     true good     class precision
    pred. bad            86           65            56.95%
    pred. good           29           50            63.29%
    class recall         74.78%       43.48%

    Maybe I should try the MovieLens 1M dataset.

    Thanks again.