Positive class assigning logic
Hi rapidminers,
Similar topics have already been discussed in many threads here (yes, I've read them all...), but this is still something that ends up pretty unclear.
Say, we have a binary classifier and two classes, 1/0 or Yes/No or true/false, whatever.
We optimize the performance towards recall, which is the true positive rate.
There's an internal mapping of labels to the positive and negative class, but it is not known beforehand.
So we end up with a recall value where we see that the positive class has been assigned to label 0.
In my case 0 is 'good', so we optimized for the best prediction of good cases (in a real-life problem this might mean, for example, correctly detecting the maximum number of good transactions at the cost of maybe also letting some bad transactions through). But in fact, this is known only after the performance evaluation, when we see exactly which label is assigned to the positive class.
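To make the concern concrete, here is a small Python sketch (not RapidMiner code; the labels, the predictions and the `recall` helper are all made up for illustration). It shows that the very same predictions give a different recall depending on which label is treated as the positive class.

```python
def recall(actual, predicted, positive):
    """Recall (true positive rate) for the label treated as the positive class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    return true_pos / (true_pos + false_neg)


# Invented labels and predictions, purely for illustration.
actual = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

recall_if_0_positive = recall(actual, predicted, positive=0)
recall_if_1_positive = recall(actual, predicted, positive=1)

print("recall with 0 as positive class:", recall_if_0_positive)
print("recall with 1 as positive class:", recall_if_1_positive)
```

With these invented numbers the recall is 0.75 when 0 is the positive class and 0.5 when 1 is, so "optimizing recall" means two different things until the positive class is pinned down.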
How can we know BEFOREHAND what the logic of assigning the classes to positive and negative is, regardless of the actual labels?
I know about 'Remap Binominals', which is a slightly tricky operator: it changes the INTERNAL mapping but has no effect on the visible results. Hence, however I apply it here, I still ALWAYS get 'positive class: 0' regardless of the remapping.
This is starting to drive me nuts, as I feel I don't understand a priori EXACTLY WHICH CLASS'S RECALL I am optimizing for.
Can anyone explain the exact logic of assigning classes once again, and whether there's any way of intentionally changing it? Or does RapidMiner always choose the positive class in some fixed way for each dataset / process, so that we just have to live with it once we have found out what the actual positive class is?
Thanks a lot.
Best Answer
-
Hi,
Well, that is indeed exactly the use case for the operator "Remap Binominals". There you define which one is the positive class and which one is the negative class. I have attached a sample below.
The internal logic is pretty simple: the first nominal value in the internal mapping becomes the negative one, and the second becomes the positive one. And how do you become the first value? By being loaded into RapidMiner before other values are loaded. So you could change the order in the original data source so that the first example has the class you want to be negative, and the desired positive class only shows up after it. Although that would work, it seems too cumbersome. Or you use "Remap Binominals" like in the sample process below. This is what I would do if I am more interested in one of the classes (vs. let's say the general accuracy or a cost-based optimization).
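As a rough illustration of why the load order matters, here is a minimal Python sketch (not RapidMiner's actual implementation; `infer_mapping` and `remap` are hypothetical helpers). It deliberately leaves open which slot ends up as the positive class. It only shows that a mapping built from the order of appearance changes with the data order, while an explicit remap, in the spirit of "Remap Binominals", gives the same roles either way.

```python
def infer_mapping(labels):
    """Return the distinct label values in order of first appearance."""
    mapping = []
    for value in labels:
        if value not in mapping:
            mapping.append(value)
    return mapping


def remap(mapping, positive_value, negative_value):
    """Pin the two roles explicitly, independent of the order in `mapping`."""
    if set(mapping) != {positive_value, negative_value}:
        raise ValueError("values must match the two classes in the data")
    return {"positive": positive_value, "negative": negative_value}


# Same examples, different order (invented data).
rock_first = ["Rock", "Rock", "Mine", "Rock", "Mine"]
mine_first = ["Mine", "Rock", "Rock", "Rock", "Mine"]

# The order-based mapping depends on which value is seen first.
print(infer_mapping(rock_first))
print(infer_mapping(mine_first))

# After an explicit remap the roles no longer depend on the data order.
roles_a = remap(infer_mapping(rock_first), positive_value="Rock", negative_value="Mine")
roles_b = remap(infer_mapping(mine_first), positive_value="Rock", negative_value="Mine")
print(roles_a == roles_b)
```

The point of the sketch is only the design choice: relying on the order of the data is fragile, while naming the positive and negative value explicitly is stable.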
So, the operator you mention is the right one to make this change. Which leads me to the second part: How do you know that you need to make a change in the first place? Of course you can run the validation first and then see if your desired class is mentioned as the positive one at the top of, for example, the performance viewer of "recall". Or you can use "Remap Binominals" in all cases just to be sure. But there is also a (very subtle) hint in the example set itself, namely in the "Chart" view. If you have two classes and visualize points in a, let's say, scatter plot, the positive class is the second, red one. As I said, this is subtle, but at least a way to tell what is the positive one without the need of running a validation first.
Hope this helps, the process is below.
Cheers,
Ingo
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="nominal_to_binominal" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Binominal" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="class"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Validation" width="90" x="313" y="34">
<parameter key="sampling_type" value="stratified sampling"/>
<process expanded="true">
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="45" y="34"/>
<connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<description align="left" color="green" colored="true" height="80" resized="true" width="248" x="37" y="137">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_binominal_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
<parameter key="recall" value="true"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
<description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).<br/>The performance is evaluated and sent to the operator results.</description>
</process>
<description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Sonar (2)" width="90" x="45" y="340">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="nominal_to_binominal" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Binominal (2)" width="90" x="179" y="340">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="class"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="remap_binominals" compatibility="7.6.001" expanded="true" height="82" name="Remap Binominals" width="90" x="313" y="340">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="class"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="negative_value" value="Mine"/>
<parameter key="positive_value" value="Rock"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Validation (2)" width="90" x="447" y="340">
<parameter key="sampling_type" value="stratified sampling"/>
<process expanded="true">
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree (2)" width="90" x="45" y="34"/>
<connect from_port="training set" to_op="Decision Tree (2)" to_port="training set"/>
<connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<description align="left" color="green" colored="true" height="80" resized="false" width="248" x="37" y="137">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="45" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_binominal_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (2)" width="90" x="179" y="34">
<parameter key="recall" value="true"/>
</operator>
<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
<description align="left" color="blue" colored="true" height="103" resized="false" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).<br/>The performance is evaluated and sent to the operator results.</description>
</process>
<description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
</operator>
<connect from_op="Retrieve Sonar" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
<connect from_op="Nominal to Binominal" from_port="example set output" to_op="Validation" to_port="example set"/>
<connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
<connect from_op="Retrieve Sonar (2)" from_port="output" to_op="Nominal to Binominal (2)" to_port="example set input"/>
<connect from_op="Nominal to Binominal (2)" from_port="example set output" to_op="Remap Binominals" to_port="example set input"/>
<connect from_op="Remap Binominals" from_port="example set output" to_op="Validation (2)" to_port="example set"/>
<connect from_op="Validation (2)" from_port="performance 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Answers
-
This is a great question. I have also wondered whether there is a way to force RapidMiner to treat a given class as positive for its performance metrics.
-
Thanks @IngoRM, it's always great to learn a new RapidMiner trick!
-
Thanks @IngoRM!
That all seems logical and correct, except for the fact that (I swear!) for some reason 'Remap Binominals' didn't have any effect on the class mapping when I tried it earlier today, or maybe it's just the end of a hard week... I don't know. I just tried it once again and magically it works and remaps the classes. So yes, this is what I wanted to achieve.
Have a good weekend, everyone!
-
Suuuuuure :smileytongue:
Maybe you have accidentally swapped the two classes in the parameter settings or something like that. Anyway, I am glad it works now.
Enjoy your weekend :smileywink: