Classification results for each instance of a confusion matrix (HOW?)
Yanaira
New Altair Community Member
Hello!
I have a question concerning a certain output of an experiment I ran. Here is the experiment:
-ExampleSource
-SimpleValidation
-NaiveBayes
-OperatorChain
-ModelApplier
-BinomialClassificationPerformance
-PerformanceWriter
-PerformanceLoader
The classification results I get are the confidences and the values for false-positive and true-positive classifications.
However, I additionally need the forecast results for each object of the dataset. That is: for each fp-/tp-rate and confusion matrix I need the classification results for the corresponding data objects (e.g. 0, 1).
What kind of operator do I need in order to figure out how the classifier classified each instance of the data set (plus confidence, fp- and tp-rate)? And how do I use it?
Hopefully my question makes sense to you. Thank you very much for your help!!
Answers
-
Hi,
if I got you right, you simply want to see the predictions of the Naive Bayes classifier, i.e. the class the classifier assigns to each instance according to the built model. Is that right?
You can accomplish this by simply using a [tt]ModelApplier[/tt] after the model has been learned. This leads to the following (very simple) process setup:
-[tt]ExampleSource[/tt]
-[tt]NaiveBayes[/tt]
-[tt]ModelApplier[/tt]
However, you have to set the parameter [tt]keep_example_set[/tt] of the [tt]NaiveBayes[/tt] operator to true, so that the example set is not consumed by the learner but passed on to the [tt]ModelApplier[/tt].
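For reference, a minimal process XML along those lines might look as follows. This is only a sketch: the attributes file name is a placeholder for your own data source, and apart from [tt]keep_example_set[/tt] the parameter keys are my assumption, so double-check them in the operator info.
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <!-- placeholder: point this at your own attribute description file -->
        <parameter key="attributes" value="mydata.aml"/>
    </operator>
    <operator name="NaiveBayes" class="NaiveBayes">
        <!-- keep the example set so it can be passed on to the ModelApplier -->
        <parameter key="keep_example_set" value="true"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
</operator>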
Hope this was helpful,
Tobias
-
Hi,
I would like to add that this is of course not "fair" since the test data would have been used for training. But you could achieve the desired goal in a fair way by using the following setup:
<operator name="Root" class="Process" expanded="yes">
<operator name="ExampleSetGenerator" class="ExampleSetGenerator">
<parameter key="target_function" value="sum classification"/>
</operator>
<operator name="IdTagging" class="IdTagging">
</operator>
<operator name="XValidation" class="XValidation" expanded="yes">
<operator name="NaiveBayes" class="NaiveBayes">
</operator>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="append" value="true"/>
<parameter key="example_set_file" value="single_pred.dat"/>
<parameter key="format" value="special_format"/>
<parameter key="special_format" value="$i $p $d"/>
</operator>
<operator name="Performance" class="Performance">
</operator>
</operator>
</operator>
</operator>
So you simply add an ExampleSetWriter after the model application inside the validation and write out the desired results (append mode!). Please also note that I have used the special format parameters here.
Cheers,
Ingo
-
Thank you very much for your suggestions
However, I still have some questions concerning the second idea proposed by Ingo:
- The ExampleSetWriter displays the predicted class (e.g. $p) as well as the actual class for each item of the data set (e.g. $l). Since the algorithm produces probabilities, what threshold is used to produce these binary classifications, and how can this threshold be set? (For further analysis I need to pick a certain threshold in order to receive the corresponding predicted classes and actual classes for each item.)
- What exactly does the expression "confidence" in RapidMiner stand for? (Are confidences the same as thresholds?)
Thanks a lot!!!
-
Hello Yanaira,
- The ExampleSetWriter displays the predicted class (e.g. $p) as well as the actual class for each item of the data set (e.g. $l). Since the algorithm produces probabilities, what threshold is used to produce these binary classifications, and how can this threshold be set?
I guess the default threshold is 0.5; in the case of more than two possible classes, the class with the maximum confidence is used. You can create and apply your own threshold using the operators "ThresholdCreator" and "ThresholdApplier". Note that such a threshold can only be applied to a binary classification. Friendly tip: in the operator tab you will find a small search field where you can type in terms to find operators.
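A rough sketch of how those two operators might be wired into Ingo's process, directly after the ModelApplier inside the inner OperatorChain. The operator names are taken from above, but the "threshold" parameter key and its value are my assumption, so please check the exact keys in the operator info.
<operator name="ModelApplier" class="ModelApplier">
    <list key="application_parameters">
    </list>
</operator>
<!-- assumption: a single "threshold" parameter; check the operator info for the exact keys -->
<operator name="ThresholdCreator" class="ThresholdCreator">
    <parameter key="threshold" value="0.7"/>
</operator>
<operator name="ThresholdApplier" class="ThresholdApplier">
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
    <parameter key="append" value="true"/>
    <parameter key="example_set_file" value="single_pred.dat"/>
    <parameter key="format" value="special_format"/>
    <parameter key="special_format" value="$i $p $d"/>
</operator>
The predictions written to single_pred.dat should then be the ones obtained after applying your chosen threshold.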
"confidence" is equivalent to your mentioned "probabilities". I guess the term "confidences" has been choosing to avoid confusing, since many classification algorithms produce only rough estimations of the true probabilities. However, in the case of NaiveBayes it has been shown, that although the probabilities tend to be extreme, their order (as in terms of scoring/ranking) is not affected and since reliable results are produced.
- What does the expression "confidence" in the rapidminer exactly stand for? (confidences=threshold)?
Hope this was helpful
Steffen