Need help on removing classifier model skew

ram_nit05
ram_nit05 New Altair Community Member
edited November 5 in Community Q&A
Hi,

To give a brief background on my exercise:
My objective is to build an SVM classifier that assigns each piece of customer feedback (the attribute) to one of several categories (the label). To do this, I am generating features from the feedback verbatims and passing them to the model as attributes.

The issue I am facing is that, judging from the model's classification errors, the model is heavily skewed towards the categories with the most occurrences (for the highest-frequency category, class precision is low but class recall is high). In other words, examples from the lower-frequency categories are also being predicted as the highest-frequency one. I have tried weighting the lower-frequency categories to compensate for the difference in occurrences, but that only magnifies the errors. Please let me know if there is any other way to control this.

Many thanks in advance,
Ram
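
The symptom described above (low precision but high recall for the dominant category) can be read directly off a confusion matrix. The following is a minimal sketch with made-up counts; the category names, the numbers, and the helper `per_class_precision_recall` are all hypothetical and only illustrate the pattern.

```python
def per_class_precision_recall(confusion, labels):
    """confusion[i][j] = number of examples of true class i predicted as class j."""
    result = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)  # column total
        actually_k = sum(confusion[k])                      # row total
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        result[label] = (precision, recall)
    return result


# Hypothetical counts: "billing" is the frequent category and absorbs
# most of the examples that really belong to the two rarer categories.
labels = ["billing", "delivery", "quality"]
confusion = [
    [90, 5, 5],   # true billing
    [30, 8, 2],   # true delivery
    [25, 1, 4],   # true quality
]

metrics = per_class_precision_recall(confusion, labels)
for label in labels:
    precision, recall = metrics[label]
    print(f"{label:9s} precision={precision:.2f} recall={recall:.2f}")
```

In this toy matrix the frequent category has recall well above its precision, while the rare categories have very low recall, which matches the skew described in the question.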

Answers

  • keith
    keith New Altair Community Member
    My first thought would be to either oversample the categories that occur less frequently, or to take only a portion of the very frequent categories so that the training set has approximately equal proportions of each category.  This might do better than just giving a higher weight to the examples from rare categories.
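
The two resampling options mentioned above can be sketched roughly as follows. This is plain Python rather than a RapidMiner process; the function `rebalance`, its `mode` parameter, and the example data are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def rebalance(examples, labels, mode="oversample", seed=0):
    """Return a shuffled list of (example, label) pairs with equal class counts."""
    if mode not in ("oversample", "undersample"):
        raise ValueError("mode must be 'oversample' or 'undersample'")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)

    sizes = [len(items) for items in by_class.values()]
    # Oversampling grows every class to the largest one,
    # undersampling shrinks every class to the smallest one.
    target = max(sizes) if mode == "oversample" else min(sizes)

    balanced = []
    for label, items in by_class.items():
        if len(items) >= target:
            # Enough examples: draw `target` of them without replacement.
            chosen = rng.sample(items, target)
        else:
            # Too few: keep every original and add random duplicates.
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra
        balanced.extend((example, label) for example in chosen)
    rng.shuffle(balanced)
    return balanced


# Hypothetical feedback ids with a skewed label distribution (80 / 15 / 5).
examples = list(range(100))
labels = ["billing"] * 80 + ["delivery"] * 15 + ["quality"] * 5

oversampled = rebalance(examples, labels, mode="oversample")
undersampled = rebalance(examples, labels, mode="undersample")
print(Counter(label for _, label in oversampled))
print(Counter(label for _, label in undersampled))
```

If this is combined with a validation step, the resampling would normally be applied to the training portion only, so that the performance estimate still reflects the original class proportions.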

    You might also look at the MetaCost operator to increase the penalty for misclassifying the rarer instances.
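
To make the idea of a misclassification penalty concrete, here is a small sketch of a minimum-expected-cost decision. It is not the MetaCost operator or its algorithm, only the cost-weighted choice that cost-sensitive methods build on. The function name is hypothetical, and the convention that `cost[i][j]` is the cost of predicting class `j` when the true class is `i` is an assumption of this sketch; it may not match the row/column order of RapidMiner's `cost_matrix` parameter.

```python
def min_expected_cost_class(probabilities, cost):
    """Pick the class index with the lowest expected misclassification cost.

    probabilities[i] is the estimated probability that the true class is i.
    cost[i][j] is the cost of predicting j when the truth is i (assumed convention).
    """
    n = len(probabilities)
    expected = [
        sum(probabilities[i] * cost[i][j] for i in range(n))
        for j in range(n)
    ]
    best = min(range(n), key=expected.__getitem__)
    return best, expected


# Class 0 = frequent category, class 1 = rare category (hypothetical).
# Missing a rare example (true 1, predicted 0) is five times as costly
# as a false alarm (true 0, predicted 1).
cost = [[0.0, 1.0],
        [5.0, 0.0]]
uniform = [[0.0, 1.0],
           [1.0, 0.0]]

probabilities = [0.8, 0.2]  # the model thinks the frequent class is more likely
choice, expected = min_expected_cost_class(probabilities, cost)
plain_choice, _ = min_expected_cost_class(probabilities, uniform)
print(choice, expected, plain_choice)
```

With uniform costs the most probable class wins; with the asymmetric costs the rare class is chosen even at 20% probability, because the expected cost of predicting the frequent class (0.2 × 5) exceeds that of predicting the rare one (0.8 × 1).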

    Hope that helps.  I'm sure other people smarter than me will chime in as well.  :-)

    Keith


  • ram_nit05
    ram_nit05 New Altair Community Member
    Many thanks Keith for the help.

  • ram_nit05
    ram_nit05 New Altair Community Member
    Hi Keith,

    I tried using the MetaCost operator in my modeling flow today, but I got an error saying that it cannot take numerical attributes. I am also unsure whether it should go before or after the XValidation operator in the flow. Could you please point me to a link where I can find information on this?

    Thanks,
    Ram
  • land
    land New Altair Community Member
    Hi,
    there is probably an error in your process setup. It seems to me that you have used a learner inside the MetaCost operator that does not support numerical attributes. You should check that.

    Greetings,
      Sebastian
  • ram_nit05
    ram_nit05 New Altair Community Member
    Many thanks for your help Sebastian
  • brianbaker
    brianbaker New Altair Community Member
    I got the same numeric error with a learner that does support numerical attributes and throws no error outside of MetaCost.
  • land
    land New Altair Community Member
    Hi,
    please be a little more specific. It would help a lot if you posted the process, for example, and described what you are trying to do.

    Greetings,
      Sebastian
  • brianbaker
    brianbaker New Altair Community Member
    This works:

        <operator name="SimpleValidation (2)" class="SimpleValidation" breakpoints="after" expanded="yes">
            <parameter key="local_random_seed" value="10"/>
            <operator name="JMySVMLearner" class="JMySVMLearner">
                <parameter key="keep_example_set" value="true"/>
                <parameter key="max_iterations" value="100"/>
                <parameter key="calculate_weights" value="true"/>
                <parameter key="return_optimization_performance" value="true"/>
                <parameter key="estimate_performance" value="true"/>
                <parameter key="balance_cost" value="true"/>
            </operator>
            <operator name="ApplierChain (3)" class="OperatorChain" expanded="yes">
                <operator name="Applier (3)" class="ModelApplier">
                    <parameter key="keep_model" value="true"/>
                    <list key="application_parameters">
                    </list>
                    <parameter key="create_view" value="true"/>
                </operator>
                <operator name="BinominalClassificationPerformance (2)" class="BinominalClassificationPerformance">
                    <parameter key="keep_example_set" value="true"/>
                    <parameter key="main_criterion" value="AUC"/>
                    <parameter key="AUC" value="true"/>
                    <parameter key="lift" value="true"/>
                    <parameter key="false_positive" value="true"/>
                    <parameter key="false_negative" value="true"/>
                    <parameter key="true_positive" value="true"/>
                    <parameter key="true_negative" value="true"/>
                </operator>
            </operator>
        </operator>
    But this doesn't. It fails with the error "Error in: MetaCost (MetaCost) This learning scheme does not have sufficient capabilities for the given data set: numerical attributes not supported":

        <operator name="MetaCost" class="MetaCost" expanded="yes">
            <parameter key="keep_example_set" value="true"/>
            <parameter key="cost_matrix" value="[0.0 1.0;5.0 0.0]"/>
            <operator name="SimpleValidation (2)" class="SimpleValidation" breakpoints="after" expanded="yes">
                <parameter key="local_random_seed" value="10"/>
                <operator name="JMySVMLearner" class="JMySVMLearner">
                    <parameter key="keep_example_set" value="true"/>
                    <parameter key="max_iterations" value="100"/>
                    <parameter key="calculate_weights" value="true"/>
                    <parameter key="return_optimization_performance" value="true"/>
                    <parameter key="estimate_performance" value="true"/>
                    <parameter key="balance_cost" value="true"/>
                </operator>
                <operator name="ApplierChain (3)" class="OperatorChain" expanded="yes">
                    <operator name="Applier (3)" class="ModelApplier">
                        <parameter key="keep_model" value="true"/>
                        <list key="application_parameters">
                        </list>
                        <parameter key="create_view" value="true"/>
                    </operator>
                    <operator name="BinominalClassificationPerformance (2)" class="BinominalClassificationPerformance">
                        <parameter key="keep_example_set" value="true"/>
                        <parameter key="main_criterion" value="AUC"/>
                        <parameter key="AUC" value="true"/>
                        <parameter key="lift" value="true"/>
                        <parameter key="false_positive" value="true"/>
                        <parameter key="false_negative" value="true"/>
                        <parameter key="true_positive" value="true"/>
                        <parameter key="true_negative" value="true"/>
                    </operator>
                </operator>
            </operator>
        </operator>
    Thank you for your help.  I have a small positive rate, < 10%, and a small data set.  So, I'd like to modify the cost for the learner and use cross-validation rather than oversampling (so I don't have to split into train, test, validate).
  • land
    land New Altair Community Member
    Hi,
    although you only posted a small part of the process, I can say definitively that this will not work. The MetaCost operator needs an inner learner to operate on, hence the "Meta" in its name. It works like this:
    To perform a cross-validation you need an inner learner. You want to adapt the SVM to the imbalanced classes by using the MetaCost operator. Then put the SVM directly into the MetaCost operator and then put the MetaCost operator as learner inside the SVM.

    Greetings,
      Sebastian
  • brianbaker
    brianbaker New Altair Community Member
    This confuses me:
    "put the SVM directly into the MetaCost operator and then put the MetaCost operator as learner inside the SVM"
    Did you mean this: "put the SVM directly into the MetaCost operator and then put the MetaCost operator as learner inside the XValidation"?

    I tried that and it works.  So, I think I am using it correctly.  I'm getting the balancing I'm after. :)

        <operator name="confidence estimate" class="XValidation" breakpoints="after" expanded="yes">
            <parameter key="keep_example_set" value="true"/>
            <parameter key="create_complete_model" value="true"/>
            <operator name="MetaCost (2)" class="MetaCost" expanded="yes">
                <parameter key="keep_example_set" value="true"/>
                <parameter key="cost_matrix" value="[0.0 3.0;1.0 0.0]"/>
                <operator name="KernelNaiveBayes (5)" class="KernelNaiveBayes">
                    <parameter key="keep_example_set" value="true"/>
                    <parameter key="estimation_mode" value="full"/>
                    <parameter key="number_of_kernels" value="35"/>
                </operator>
            </operator>
            <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier (2)" class="ModelApplier">
                    <parameter key="keep_model" value="true"/>
                    <list key="application_parameters">
                    </list>
                    <parameter key="create_view" value="true"/>
                </operator>
                <operator name="Performance (2)" class="Performance">
                    <parameter key="keep_example_set" value="true"/>
                </operator>
            </operator>
        </operator>
    Thanks for your help!!
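
The nesting that worked here (validation on the outside, the cost-sensitive wrapper as its learner, the base learner inside the wrapper) can be sketched in plain Python. Everything below is hypothetical: `PriorLearner` is a deliberately trivial stand-in for a real learner, `CostWrapper` applies costs at prediction time and is a simpler stand-in rather than the MetaCost algorithm, and the `cost[i][j]` convention (true class `i`, predicted class `j`) is an assumption of the sketch.

```python
from collections import Counter


class PriorLearner:
    """Toy base learner: ignores the features and returns training class frequencies."""

    def fit(self, rows, labels):
        counts = Counter(labels)
        self.classes = sorted(counts)
        self.probs = [counts[c] / len(labels) for c in self.classes]
        return self

    def predict_proba(self, row):
        return self.probs


class CostWrapper:
    """Meta learner: wraps a base learner and predicts the cheapest class."""

    def __init__(self, base, cost):
        self.base = base
        self.cost = cost  # cost[i][j]: true class i, predicted class j (assumed)

    def fit(self, rows, labels):
        self.base.fit(rows, labels)  # the inner learner does the actual training
        return self

    def predict(self, row):
        p = self.base.predict_proba(row)
        n = len(p)
        expected = [sum(p[i] * self.cost[i][j] for i in range(n)) for j in range(n)]
        return self.base.classes[expected.index(min(expected))]


def cross_validate(make_learner, rows, labels, k):
    """Outer loop: train a fresh learner on k-1 folds, predict the held-out fold."""
    predictions = [None] * len(rows)
    for fold in range(k):
        # Simple interleaved folds: example i belongs to fold i % k.
        train = [i for i in range(len(rows)) if i % k != fold]
        test = [i for i in range(len(rows)) if i % k == fold]
        learner = make_learner()
        learner.fit([rows[i] for i in train], [labels[i] for i in train])
        for i in test:
            predictions[i] = learner.predict(rows[i])
    return predictions


# Hypothetical data: 5 positives out of 20 examples.
rows = list(range(20))
labels = ["pos" if i % 4 == 0 else "neg" for i in rows]

# Classes are sorted, so index 0 = "neg", index 1 = "pos".
cost = [[0.0, 1.0], [5.0, 0.0]]
uniform = [[0.0, 1.0], [1.0, 0.0]]

with_cost = cross_validate(lambda: CostWrapper(PriorLearner(), cost), rows, labels, 5)
without_cost = cross_validate(lambda: CostWrapper(PriorLearner(), uniform), rows, labels, 5)
print(Counter(with_cost), Counter(without_cost))
```

The point of the sketch is the order of nesting: the validation loop builds and trains the wrapper on each training fold, and the wrapper in turn trains the base learner. Putting the validation inside the wrapper, as in the failing process earlier in the thread, would hand the wrapper something that is not a learner.
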
  • land
    land New Altair Community Member
    Hi,
    what confused you was my own confusion. Of course I meant it the way you actually did it :)

    Greetings,
      Sebastian