Questions on the "Apply Model" operator and the predicted label
I use the "Apply Model" operator to predict labels for the test data set. The generated results normally include three types of information: confidence (positive class), confidence (negative class), and the predicted label.
Naturally, when confidence (positive class) is larger than confidence (negative class), the predicted label is positive.
But I found a lot of cases (using LibSVM for text classification) where, even when confidence (positive class) is smaller than confidence (negative class), the predicted label is still positive. I would like to know why.
Answers
Actually, I have never seen such a case with a plain create model/apply model cycle. Anyway, you can define manual thresholds e.g. with Create Threshold and Apply Threshold, or shift the thresholds in a more sophisticated way with e.g. Choose Recall or other cost-sensitive learning schemes.
Best regards,
Marius
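To make the idea of a manual threshold concrete, here is a minimal conceptual sketch in Python. It is not RapidMiner's implementation of Create Threshold or Apply Threshold; the function name `apply_threshold`, the labels "pos" and "neg", and the confidence values are all hypothetical and chosen only for illustration.

```python
def apply_threshold(conf_positive, threshold=0.5, positive="pos", negative="neg"):
    """Return the positive label when its confidence exceeds the threshold.

    Hypothetical helper: it only illustrates the general idea of a
    confidence threshold, not the behaviour of any RapidMiner operator.
    """
    return positive if conf_positive > threshold else negative


# Illustrative confidences for the positive class (made-up values).
confidences = [0.53, 0.49, 0.40]

# Default cut-off at 0.5.
default_labels = [apply_threshold(c) for c in confidences]

# Shifting the cut-off to 0.45 moves the borderline example to the positive class.
shifted_labels = [apply_threshold(c, threshold=0.45) for c in confidences]

print(default_labels)
print(shifted_labels)
```

Lowering the threshold trades precision for recall on the positive class, which is the kind of shift the cost-sensitive schemes mentioned above aim for.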
Hi, thanks for the reply.
The following is the result of running the "Apply Model" operator. The model was trained using the LibSVM operator. I have posted only the part of the result that shows the observation I mentioned in the original post, i.e., even when confidence(R) is smaller than confidence(NR), the prediction is still R.
confidence(R) confidence(NR) Prediction(Label)
0.528462399 0.471537601 R
0.524106922 0.475893078 R
0.516740761 0.483259239 R
0.509868083 0.490131917 R
0.505252829 0.494747171 R
0.493653526 0.506346474 R
0.485416242 0.514583758 R
0.475031465 0.524968535 R
0.466340913 0.533659087 R
0.459370807 0.540629193 R
0.458747466 0.541252534 R
0.4577908 0.5422092 R
0.435570459 0.564429541 R
0.432716957 0.567283043 R
0.42963305 0.57036695 R
0.422826691 0.577173309 R
0.412345117 0.587654883 R
0.404687872 0.595312128 R
0.40221958 0.59778042 R
0.39865042 0.60134958 R
0.398228918 0.601771082 R
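Rows like these can be counted automatically. The following is a small sketch in Python that flags every row whose predicted label disagrees with the larger of the two confidences. It assumes whitespace-separated columns in the order shown above (confidence(R), confidence(NR), prediction); the function name `find_mismatches` is hypothetical, and the three sample rows are copied from the listing.

```python
# Three rows copied from the listing above: confidence(R), confidence(NR), prediction.
rows = """\
0.505252829 0.494747171 R
0.493653526 0.506346474 R
0.398228918 0.601771082 R
"""


def find_mismatches(text):
    """Return rows whose label differs from the class with the higher confidence."""
    mismatches = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        conf_r, conf_nr, label = line.split()
        conf_r, conf_nr = float(conf_r), float(conf_nr)
        expected = "R" if conf_r > conf_nr else "NR"
        if label != expected:
            mismatches.append((conf_r, conf_nr, label, expected))
    return mismatches


mismatches = find_mismatches(rows)
for conf_r, conf_nr, label, expected in mismatches:
    print(f"confidence(R)={conf_r} confidence(NR)={conf_nr} "
          f"predicted={label} expected={expected}")
```

The check only reports the inconsistency; it says nothing about its cause.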
Hm, interesting. Can you please post your process xml as described in my signature?
Best regards,
Marius
The following is the process that I have been using for scoring.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="386" width="711">
<operator activated="true" class="retrieve" compatibility="5.1.011" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
<parameter key="repository_entry" value="SVM_Train_F_words_unigram_tf"/>
</operator>
<operator activated="true" class="text:process_document_from_file" compatibility="5.1.002" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="179" y="75">
<list key="text_directories">
<parameter key="R" value="E:\R_Validation"/>
<parameter key="NR" value="E:\NR_Validation"/>
</list>
<parameter key="extract_text_only" value="false"/>
<parameter key="vector_creation" value="Term Frequency"/>
<parameter key="prune_below_absolute" value="5"/>
<parameter key="prune_above_absolute" value="5000000"/>
<parameter key="parallelize_vector_creation" value="true"/>
<process expanded="true" height="362" width="674">
<operator activated="true" class="text:tokenize" compatibility="5.1.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="180" y="30"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="315" y="73"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="retrieve" compatibility="5.1.011" expanded="true" height="60" name="Retrieve (2)" width="90" x="179" y="300">
<parameter key="repository_entry" value="SVM_Train_F_model_unigram_tf"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.1.011" expanded="true" height="76" name="Apply Model" width="90" x="313" y="300">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.1.011" expanded="true" height="76" name="Performance" width="90" x="447" y="75">
<list key="class_weights"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="5.1.011" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="210">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="|confidence(non_res)|confidence(res)|label|prediction(label)"/>
</operator>
<operator activated="true" class="write_csv" compatibility="5.1.011" expanded="true" height="60" name="Write CSV" width="90" x="581" y="165">
<parameter key="csv_file" value="E:\Project\svmscore.csv"/>
<parameter key="column_separator" value=","/>
<parameter key="quote_nominal_values" value="false"/>
<parameter key="format_date_attributes" value="false"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Process Documents from Files (2)" to_port="word list"/>
<connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="result 2"/>
<connect from_op="Performance" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Write CSV" to_port="input"/>
<connect from_op="Write CSV" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Whoo, you are using RapidMiner 5.1. In a few days RapidMiner 5.3 will be released - I strongly encourage you to update to the latest version (5.2.8) and try again. Please leave a note in this thread if your problem persists or if everything is working fine now.
Best regards,
Marius
Thanks, Marius. I will give it another try after updating RapidMiner.
By the way, do you know how to output the distance between a given test data point and the hyperplane constructed from the training data set? I am again referring to the LibSVM operator in RapidMiner.
huaiyanggongzi wrote: By the way, do you know how to output the distance between a given test data point and the hyperplane constructed by training data set? I am also referring to the LiBSVM operator in Rapidminer.
Unfortunately, that's not possible. The confidence is an indicator for that, but the exact distance cannot be output.
Best regards,
Marius
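For reference, the distance asked about has a simple closed form when the SVM uses a linear kernel: for a hyperplane w·x + b = 0, the signed distance of a point x is (w·x + b) / ||w||. The sketch below computes it in Python under the assumption that the weight vector and bias are available from somewhere. As stated above, the LibSVM operator does not output this distance, so the values of `weights`, `bias`, and `point` here are purely hypothetical.

```python
import math


def signed_distance(weights, bias, x):
    """Signed distance of point x from the hyperplane w.x + b = 0 (linear kernel only)."""
    decision_value = sum(w * xi for w, xi in zip(weights, x)) + bias
    norm = math.sqrt(sum(w * w for w in weights))
    return decision_value / norm


# Hypothetical weight vector and bias, chosen so the arithmetic is easy to follow.
weights = [3.0, 4.0]
bias = -5.0

# A point on the positive side of the hyperplane: (3*3 + 4*4 - 5) / 5
point = [3.0, 4.0]
distance = signed_distance(weights, bias, point)

# A point lying exactly on the hyperplane: 3*1 + 4*0.5 - 5 = 0
on_plane = signed_distance(weights, bias, [1.0, 0.5])

print(distance, on_plane)
```

The sign tells which side of the hyperplane the point is on, and the magnitude is the geometric distance. With a non-linear kernel there is no explicit weight vector in input space, so this formula does not apply directly.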