Strange result from Naive Bayes classifier

pupu
pupu New Altair Community Member
edited November 5 in Community Q&A
Hello,

First of all, thank you so much for contributing this great DM tool ... you guys are great.
I'm new to DM and am trying out RM. I'm trying to use Naive Bayes to predict whether a new customer with a particular profile will or will not buy the product. I have set up the model like this:
<operator name="Root" class="Process" expanded="yes">
    <operator name="TrainingSet" class="DatabaseExampleSource">
        <parameter key="database_url"  value="jdbc:mysql://localhost:3306/insurance"/>
        <parameter key="username"  value="xxx"/>
        <parameter key="password"  value="xxx"/>
        <parameter key="query"  value="select * from customer;"/>
        <parameter key="label_attribute"  value="CARAVAN"/>
        <parameter key="classes"  value="buy not_buy"/>
    </operator>
    <operator name="NaiveBayes" class="NaiveBayes">
    </operator>
    <operator name="TestSet" class="DatabaseExampleSource">
        <parameter key="database_url"  value="jdbc:mysql://localhost:3306/insurance"/>
        <parameter key="username"  value="xxx"/>
        <parameter key="password"  value="xxx"/>
        <parameter key="query"  value="select * from customer_eval;"/>
        <parameter key="label_attribute"  value="CARAVAN"/>
        <parameter key="classes"  value="buy not_buy"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
</operator>
It runs without error, but in the data view the confidence(buy) and confidence(not_buy) fields show '?' for every data record.

Can anybody give me any clues to my error?

Thank you so much
Pupu.

And here is haddock's reply:
Hi there,

Firstly, welcome to the dataminers' asylum! On your problem: what happens if you apply the model to the training set? Do you still get a row of ?'s in the prediction columns? Just disable your second database call to check it out. Make sure to tick "keep example set" in the learner.

? usually represents a missing value, so I'm pondering what got learnt. The setup looks fine, so something murky is going on. I take it you've checked the training set and such.

Answers

  • pupu
    To haddock,

    I applied the model to the training set: all values in the prediction field are 'not_buy', and confidence(buy)/confidence(not_buy) are '?' for all records.
    I have checked the training set. There are no missing values, but it is unbalanced: about 6% is buy and 94% is not_buy. Is the imbalance relevant to my problem?

    Thank you very much
    Pupu.
  • haddock
    Hi there,

    I think that is probably the cause of your problem; try balancing it so it is more even. Why not get 50 buy records and 50 not_buy records and do a merge? Hope you get better results; get back if that doesn't do the trick.

    Onward, full ahead through the fog...
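
The balancing idea above can be sketched outside RapidMiner as plain undersampling of the majority class. This is only an illustration, not a RapidMiner operator: the function name `undersample`, the row layout (a list of dicts), and the toy 6/94 split are assumptions made for the example.

```python
import random

def undersample(rows, label_key, seed=0):
    """Return a class-balanced sample by shrinking every class to the
    size of the smallest one (hypothetical helper, not a RapidMiner API)."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    # The smallest class decides how many rows each class may keep.
    n = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced

# Toy data mirroring the 6% / 94% split mentioned in the thread.
rows = [{"id": i, "CARAVAN": "buy"} for i in range(6)]
rows += [{"id": i, "CARAVAN": "not_buy"} for i in range(6, 100)]

balanced = undersample(rows, "CARAVAN")
counts = {}
for row in balanced:
    counts[row["CARAVAN"]] = counts.get(row["CARAVAN"], 0) + 1
print(counts)
```

Undersampling throws away most of the majority class, so it is a quick diagnostic rather than a final answer to class imbalance.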

  • land
    Hi,
    NaiveBayes does indeed have problems with unbalanced example sets, but this should not result in unknown confidence values. A more critical question on that issue: how many attributes does your example set contain?

    Greetings,
      Sebastian
  • TobiasMalbrecht
    Hi,

    well, Sebastian's question is indeed essential here. Unfortunately, Naive Bayes did produce unknown confidence values for data sets with a high number of attributes. We have made Naive Bayes more robust regarding that issue, but only after the release of version 4.4 of the Community Edition. The recent automatically delivered RapidMiner Enterprise Edition update already contains that bugfix. It will of course also be part of the next Community Edition release, which will probably come in a couple of weeks.

    If you like (and there are no privacy issues), you can send us a data sample and we can check whether it works on the most recent developer version. If there are privacy issues and you need a solution very urgently, we could also build you a custom version as a one-off. Just drop us a note.

    Kind regards,
    Tobias
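
The thread does not say what the actual bug in version 4.4 was. One plausible mechanism (an assumption, not confirmed by the developers) is floating-point underflow: multiplying many small per-attribute likelihoods can reach exactly 0.0 for every class, and normalising then gives 0/0, an undefined confidence. The sketch below uses made-up likelihood values and hypothetical function names to show the effect, alongside the common log-space remedy.

```python
import math

def direct_confidences(priors, likelihoods):
    """Multiply prior and per-attribute likelihoods directly, then normalise."""
    scores = []
    for prior, attr_likelihoods in zip(priors, likelihoods):
        score = prior
        for p in attr_likelihoods:
            score *= p  # many small factors can underflow to exactly 0.0
        scores.append(score)
    total = sum(scores)
    if total == 0.0:
        # Java evaluates 0.0 / 0.0 to NaN; Python would raise instead,
        # so NaN is returned explicitly to mimic an unknown '?' value.
        return [float("nan")] * len(scores)
    return [s / total for s in scores]

def log_space_confidences(priors, likelihoods):
    """Sum logarithms instead of multiplying, then normalise stably."""
    log_scores = [
        math.log(prior) + sum(math.log(p) for p in attr_likelihoods)
        for prior, attr_likelihoods in zip(priors, likelihoods)
    ]
    top = max(log_scores)
    # Subtracting the maximum keeps the largest exponent at exp(0) = 1.
    shifted = [math.exp(s - top) for s in log_scores]
    total = sum(shifted)
    return [s / total for s in shifted]

# Made-up numbers: 85 attributes, classes ordered as [buy, not_buy].
priors = [0.06, 0.94]
likelihoods = [[1e-5] * 85, [2e-5] * 85]

direct = direct_confidences(priors, likelihoods)
stable = log_space_confidences(priors, likelihoods)
print(direct)
print(stable)
```

With these values the direct product underflows for both classes, so every confidence is NaN, while the log-space version returns confidences that sum to 1 with almost all of the mass on not_buy.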

  • pupu
    Hi all  :),

    Thank you so much for your replies.

    To haddock,
    I tried what you suggested: I split the data set into 50 'buy' and 50 'not_buy' records. Naive Bayes still produces '?' for the confidence values, and the prediction result is 50% correct.

    To Land,
    The data set has 85 attributes. Should I try feature/attribute subset selection before applying Naive Bayes?

    To Tobias Malbrecht,
    There are no privacy issues here; actually it is a data set from KDD Cup '98. How can I send you the dataset?

    Best regards,
    Pupu.
  • pupu
    Hi again,

    I just forgot to say that your examples are very useful to me ..  :D

    Best regards,
    Pupu.
  • pupu
    Hello,

    As you mentioned the number of attributes, I now do "select <some fields> from table"
    and the confidence values are shown now ..  :)
    I'm looking for a way to deal with the unbalanced data ..  >:( (Cheers myself)

    Thank you so much everyone.

    Pupu.
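
The workaround above, querying only some fields instead of `select *`, can be sketched as a small helper that builds the query string for the `query` parameter. The helper name `build_subset_query` and the attribute names `AGE` and `INCOME` are hypothetical; only the table `customer` and the label `CARAVAN` come from the process shown in the question.

```python
def build_subset_query(table, attributes, label):
    """Build a SELECT over a chosen attribute subset plus the label column."""
    # dict.fromkeys removes duplicates while keeping the given order.
    columns = list(dict.fromkeys(list(attributes) + [label]))
    for name in columns + [table]:
        # Reject anything that is not a plain identifier, so that no
        # arbitrary SQL can slip into the generated query.
        if not name.isidentifier():
            raise ValueError("not a plain identifier: " + repr(name))
    return "select " + ", ".join(columns) + " from " + table + ";"

# AGE and INCOME are placeholder attribute names for illustration only.
query = build_subset_query("customer", ["AGE", "INCOME"], "CARAVAN")
print(query)
```

The label column has to stay in the selection, since `label_attribute` in the process still points at `CARAVAN`.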