"SVM Regression returning same values for all test records ?!?!"

noah977
noah977 New Altair Community Member
edited November 5 in Community Q&A
I set up an SVM using nu-SVR in RapidMiner (RM).
As a test, I trained it on a sparse data set containing 1000 records.

Then, I tested it against a new data set of about 14 records.

Every record in the test set received exactly the same prediction. This seems highly unlikely, since the SVM has over 140 dimensions and there is a significant amount of variation in the data.

One guess is that maybe I'm not loading in the sparse data correctly for testing.

I can't seem to discover where my error is.  Maybe someone here can offer some help/suggestions.

Here is the training XML
<?xml version="1.0" encoding="MacRoman"?>
<process version="4.3">

  <operator name="Root" class="Process" expanded="yes">
      <operator name="SparseFormatExampleSource" class="SparseFormatExampleSource">
          <parameter key="data_file" value="/Users/noah/train_sparse.txt"/>
          <parameter key="dimension" value="140"/>
          <parameter key="format" value="yx"/>
      </operator>
      <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
          <parameter key="attribute_name_regex" value="label"/>
          <parameter key="condition_class" value="is_nominal"/>
          <parameter key="process_special_attributes" value="true"/>
          <operator name="NominalNumbers2Numerical" class="NominalNumbers2Numerical">
          </operator>
      </operator>
      <operator name="LibSVMLearner" class="LibSVMLearner">
          <parameter key="C" value="100.0"/>
          <parameter key="gamma" value="0.1"/>
          <parameter key="keep_example_set" value="true"/>
          <parameter key="svm_type" value="nu-SVR"/>
      </operator>
      <operator name="ModelWriter" class="ModelWriter">
          <parameter key="model_file" value="/Users/noah/sparse_small.mod"/>
      </operator>
      <operator name="ModelApplier" class="ModelApplier">
          <list key="application_parameters">
          </list>
          <parameter key="create_view" value="true"/>
          <parameter key="keep_model" value="true"/>
      </operator>
      <operator name="RegressionPerformance" class="RegressionPerformance">
          <parameter key="absolute_error" value="true"/>
          <parameter key="keep_example_set" value="true"/>
          <parameter key="prediction_average" value="true"/>
          <parameter key="relative_error" value="true"/>
          <parameter key="relative_error_lenient" value="true"/>
          <parameter key="root_mean_squared_error" value="true"/>
      </operator>
  </operator>

</process>
Here are 2 rows of training data

0.99307958477511 1:2 2:12 3:0.982609455619486 4:0 5:14 6:5 7:0.8 8:0.0348258706467662 9:201 10:0.0496977837474815 11:1489 12:1 13:1 14:0.00477630731561417 15:133 16:10.81 17:5.5 101:1 116:1 117:1 119:1 125:1
0.989655172413817 1:3 2:2 3:0.973641810178274 4:0 5:63 6:3 7:1 8:0.0631443298969072 9:776 10:0.0769704433497537 11:1624 12:1 13:0.5 14:0.0049596226732805 15:123 16:-0.09 17:6 101:1 116:1 117:1 119:1 125:1
Here is the test XML
<?xml version="1.0" encoding="MacRoman"?>
<process version="4.3">

  <operator name="Root" class="Process" expanded="yes">
      <operator name="SparseFormatExampleSource" class="SparseFormatExampleSource">
          <parameter key="data_file" value="/Users/noah/test.txt"/>
          <parameter key="dimension" value="141"/>
          <parameter key="format" value="yx"/>
      </operator>
      <operator name="ModelLoader" class="ModelLoader">
          <parameter key="model_file" value="/Users/noah/sparse_c4_1000.mod"/>
      </operator>
      <operator name="ModelApplier" class="ModelApplier">
          <list key="application_parameters">
          </list>
          <parameter key="create_view" value="true"/>
          <parameter key="keep_model" value="true"/>
      </operator>
  </operator>

</process>
Here are 2 rows of test data
1:0 2:14 3:0.979392741314451 4:0.0909090909090909 5:28 6:22 7:0.227272727272727 8:0.0436046511627907 9:1376 10:0.0735090152565881 11:1442 12:0 13:2 14:0.0104266852405951 15:133 16:9.64 17:8.09 103:1 116:1 117:1 119:1 125:1
1:0 2:1 3:0.980626115895827 4:0.0357142857142857 5:20 6:28 7:0.178571428571429 8:0.0338541666666667 9:768 10:0.0653008962868118 11:781 12:0.321428571428571 13:0.2 14:0.0067155135256289 15:130 16:6.64 17:8.32 102:1 111:1 117:1 119:1 125:1
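
One way to test the guess about loading the sparse data is to inspect the files outside RapidMiner. The posted training rows start with a label value, while the posted test rows start directly with `1:0`, even though both processes declare `format` as `yx`. The thread does not confirm how RapidMiner handles this. The following is only a minimal sketch, assuming each line is an optional leading label followed by `index:value` pairs; the helper names `parse_sparse_line` and `summarize` are hypothetical, and the sample rows are shortened versions of the rows above.

```python
def parse_sparse_line(line):
    """Split one sparse-format line into (label, {index: value}).

    Assumption: an optional leading label token is followed by
    index:value pairs. The label is None when the first token is
    already an index:value pair.
    """
    tokens = line.split()
    label = None
    if tokens and ":" not in tokens[0]:
        label = float(tokens[0])
        tokens = tokens[1:]
    features = {}
    for token in tokens:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def summarize(lines):
    """Return (rows, rows_without_label, highest_index) for sparse rows."""
    rows = 0
    missing_label = 0
    highest = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label, features = parse_sparse_line(line)
        rows += 1
        if label is None:
            missing_label += 1
        if features:
            highest = max(highest, max(features))
    return rows, missing_label, highest


# Shortened, illustrative versions of the posted rows.
train_rows = ["0.99307958477511 1:2 2:12 17:5.5 125:1"]
test_rows = ["1:0 2:14 17:8.09 125:1"]

print("train:", summarize(train_rows))
print("test: ", summarize(test_rows))
```

If the count of rows without a label is non-zero for a file that is read with `format` set to `yx`, that mismatch is worth ruling out before tuning the SVM. The highest index can also be compared against the `dimension` values, which are 140 in the training process and 141 in the test process.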

Answers

  • land
    land New Altair Community Member
    Hi,
    this process does not contain any obvious errors (to cite my favorite error message).
    Perhaps you only need to tune the SVM parameters?
    As a second hint: it is much more comfortable to use the built-in validation operators instead of splitting the data manually and using two processes. You could use XValidation, which is explained in the 04_Validation/03_XValidation_Numerical.xml sample in the sample directory.
    To tune your SVM parameters, you could take a look at the 07_Meta/01_ParameterOptimization sample.

    Greetings,
      Sebastian
  • noah977
    noah977 New Altair Community Member
    Sebastian,

    I HAVE performed the parameter optimization and cross-validation (XV) to build a good model.

    What you are seeing in my earlier post is using the model on "real-world" data.  This was an actual application of the SVM to learn something about unlabeled data.

    My concern is this: if the XV during training showed decent results, why would the SVM predict the exact same output for the REAL data? It is possible, but highly unlikely.

    -N
  • land
    land New Altair Community Member
    Hi,
    this is strange indeed. Is the attribute header exactly the same as in the training set?
    Without the data I can't imagine any other possible error, since I cannot reproduce the behavior.

    Greetings,
      Sebastian
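
The question about matching attribute headers can be checked roughly on the sparse files themselves. This is a minimal sketch, assuming the `index:value` layout of the posted rows; `feature_indices` and `compare_indices` are hypothetical helper names, and the sample rows are shortened versions of the posted data.

```python
def feature_indices(lines):
    """Collect every attribute index used in a list of sparse-format rows."""
    indices = set()
    for line in lines:
        for token in line.split():
            if ":" in token:  # a leading label token has no colon and is skipped
                indices.add(int(token.split(":", 1)[0]))
    return indices


def compare_indices(train_lines, test_lines):
    """Return (indices only in training, indices only in test), both sorted."""
    train = feature_indices(train_lines)
    test = feature_indices(test_lines)
    return sorted(train - test), sorted(test - train)


# Shortened, illustrative versions of the posted rows.
train_rows = ["0.99307958477511 1:2 101:1 116:1 125:1"]
test_rows = ["1:0 103:1 116:1 125:1"]

only_train, only_test = compare_indices(train_rows, test_rows)
print("only in training:", only_train)
print("only in test:", only_test)
```

Indices that appear only in the test file are values the model never saw during training. On the full files, a large one-sided difference would suggest that the two data sets are not aligned.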
  • noah977
    noah977 New Altair Community Member
    Sebastian,

    I think I found the problem. My data is a 2-class problem.
    14% is class 1
    86% is class 0

    From what I've recently read, having "unbalanced" training sets can cause the SVM to develop a model that heavily favors the larger class.  This would explain the results I've been seeing.

    My question is:  Is there a way to have RapidMiner weight the classes or account for the unbalanced training data?

    Thanks,

    -N
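
The effect described above can be illustrated with plain arithmetic. This is a minimal sketch on made-up data: 1000 hypothetical records with labels assumed to be coded 0 and 1 in the 86%/14% split mentioned above, not the actual data set. It shows that a model returning the same value for every record can still report a low average error.

```python
import math


def constant_prediction_errors(labels, constant):
    """Mean absolute error and RMSE when every record gets the same prediction."""
    n = len(labels)
    mae = sum(abs(y - constant) for y in labels) / n
    rmse = math.sqrt(sum((y - constant) ** 2 for y in labels) / n)
    return mae, rmse


# Hypothetical labels: 14% class 1, 86% class 0 (assumed 0/1 coding).
labels = [1.0] * 140 + [0.0] * 860

mae, rmse = constant_prediction_errors(labels, 0.0)
print("always predict 0: MAE =", mae, "RMSE =", rmse)
```

Under these assumptions, a constant prediction is wrong on only 14% of the records, so averaged error measures alone do not reveal that the model ignores its inputs. Looking at the spread of the predictions themselves is a more direct check.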
  • land
    land New Altair Community Member
    Hi,
    there is an operator called EqualLabelWeighting which distributes an equal total weight across all classes. Hence, examples of a dominating class will be down-weighted.
    However, you will then need a learner capable of using example weights. You can check this in the operator info of the learning operator.

    Greetings,
      Sebastian
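
The idea of giving every class the same total weight can be sketched outside RapidMiner. This is not the implementation of EqualLabelWeighting, only an illustration of the weighting scheme described above; the function name `equal_label_weights` and its `total_weight` argument are hypothetical, and the labels are made up to mirror the 14%/86% split.

```python
from collections import Counter


def equal_label_weights(labels, total_weight=1.0):
    """Per-example weights such that each class sums to the same total.

    total_weight is split evenly across the classes, then evenly across
    the examples within each class.
    """
    counts = Counter(labels)
    per_class = total_weight / len(counts)
    return [per_class / counts[y] for y in labels]


# Hypothetical labels: 14 examples of class 1, 86 examples of class 0.
labels = [1] * 14 + [0] * 86
weights = equal_label_weights(labels)

weight_class_1 = sum(w for w, y in zip(weights, labels) if y == 1)
weight_class_0 = sum(w for w, y in zip(weights, labels) if y == 0)
print("total weight class 1:", weight_class_1)
print("total weight class 0:", weight_class_0)
print("single example weights:", weights[0], "(class 1)", weights[-1], "(class 0)")
```

Each example of the dominating class ends up with a smaller weight than each example of the rare class, which is the down-weighting described above. The weights only have an effect if the learner actually uses example weights.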