Optimize Parameters fails on F-measure

HeikoPaulheim
New Altair Community Member
Hi,
I try to optimize parameters towards F-measure. There may be cases where the F-measure is undefined (if there are no true positives), but I know that some configurations exist where F-measure is at least defined (i.e., at least one true positive).
The optimize (grid) operator, however, always returns a configuration where F-measure is undefined.
Is there any way to circumvent that behavior?
Best,
Heiko
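For illustration, the undefined case follows directly from the F-measure definition: with zero true positives, both precision and recall are 0, so F1 = 2PR/(P+R) becomes 0/0. A minimal plain-Java sketch of that arithmetic (this is just the textbook formula, not RapidMiner internals):

```java
public class FMeasureDemo {

    // F1 = 2 * precision * recall / (precision + recall)
    static double f1(int tp, int fp, int fn) {
        double precision = (double) tp / (tp + fp);
        double recall    = (double) tp / (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        System.out.println(f1(10, 5, 5)); // defined: prints 0.6666666666666666
        System.out.println(f1(0, 5, 5));  // tp = 0 -> 0.0 / 0.0 -> prints NaN
    }
}
```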
Answers
Are you really optimizing your model with respect to F-measure? Please post your process here so we can check:
There it is. It yields an F-measure of 0. If I change the main criterion to AUC, it yields an F-measure of ~37%, so it is technically possible to get a higher value here.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="5.3.015" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="C:\Users\Heiko\Documents\Forschung\DBpediaDebugging\redirects\training_features.csv"/>
<parameter key="column_separators" value="	"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Original.true.polynominal.id"/>
<parameter key="1" value="Replaced.true.polynominal.batch"/>
<parameter key="2" value="Correct.true.binominal.label"/>
<parameter key="3" value="Plausible.true.integer.attribute"/>
<parameter key="4" value="Distribution.true.real.attribute"/>
<parameter key="5" value="Levenstein.true.integer.attribute"/>
<parameter key="6" value="Levenstein (relative).true.real.attribute"/>
<parameter key="7" value="Jaccard.true.real.attribute"/>
<parameter key="8" value="Jaro.true.real.attribute"/>
<parameter key="9" value="JaroWinkler.true.real.attribute"/>
<parameter key="10" value="Prefix.true.real.attribute"/>
<parameter key="11" value="Prefix2.true.real.attribute"/>
<parameter key="12" value="Substring1.true.real.attribute"/>
<parameter key="13" value="Substring2.true.real.attribute"/>
<parameter key="14" value="Redirects.true.integer.attribute"/>
<parameter key="15" value="Disambiguations.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="optimize_parameters_grid" compatibility="5.3.015" expanded="true" height="94" name="Optimize Parameters (2)" width="90" x="246" y="30">
<list key="parameters">
<parameter key="SVM (4).gamma" value="[0.0000001;1000000;13;logarithmic]"/>
<parameter key="SVM (4).C" value="[0.0000001;1000000;13;logarithmic]"/>
</list>
<process expanded="true">
<operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="112" name="Validation (3)" width="90" x="246" y="30">
<description>A cross-validation evaluating a decision tree model.</description>
<process expanded="true">
<operator activated="true" class="support_vector_machine_libsvm" compatibility="5.3.015" expanded="true" height="76" name="SVM (4)" width="90" x="90" y="30">
<parameter key="gamma" value="1.0000000000000003E-4"/>
<parameter key="C" value="1000000.0"/>
<list key="class_weights">
<parameter key="0" value="20.0"/>
<parameter key="1" value="1.0"/>
</list>
</operator>
<connect from_port="training" to_op="SVM (4)" to_port="training set"/>
<connect from_op="SVM (4)" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="Apply Model (5)" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_binominal_classification" compatibility="5.3.015" expanded="true" height="76" name="Performance (5)" width="90" x="179" y="30">
<parameter key="f_measure" value="true"/>
</operator>
<connect from_port="model" to_op="Apply Model (5)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (5)" to_port="unlabelled data"/>
<connect from_op="Apply Model (5)" from_port="labelled data" to_op="Performance (5)" to_port="labelled data"/>
<connect from_op="Performance (5)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Validation (3)" to_port="training"/>
<connect from_op="Validation (3)" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Optimize Parameters (2)" to_port="input 1"/>
<connect from_op="Optimize Parameters (2)" from_port="parameter" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
If I may make a guess at the cause here: I think RapidMiner internally computes the F-measure without checking for tp = 0. In Java, dividing a double by 0.0 does not throw an exception; the result is larger than any other double:

double d1 = 1.0;
double d2 = 1.0 / 0.0;       // Infinity, no exception
System.out.println(d1 > d2); // false
System.out.println(d2 > d1); // true

Thus, if not handled separately, a configuration that produces zero true positives (i.e., both precision and recall are 0) will always be favored over any other configuration, since the F-measure is then a term with 0 in its denominator. Usually, F1 is defined as 0 if tp = 0, even though the formula itself is undefined in that case.
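A small self-contained sketch of both the comparison behaviour and the conventional guard (plain Java; the tp == 0 check is the standard convention, not necessarily what RapidMiner does internally):

```java
public class DivideByZeroDemo {
    public static void main(String[] args) {
        double inf = 1.0 / 0.0;  // positive infinity, not an exception
        double nan = 0.0 / 0.0;  // 0/0 is indeterminate -> NaN

        System.out.println(inf > Double.MAX_VALUE); // true: Infinity beats any finite double
        System.out.println(nan > 1.0);              // false: every ordered comparison with NaN is false
        System.out.println(Double.isNaN(nan));      // true

        // Conventional guard: define F1 as 0 when there are no true positives.
        // Uses the equivalent formulation F1 = 2*tp / (2*tp + fp + fn).
        int tp = 0, fp = 5, fn = 5;
        double f1 = (tp == 0) ? 0.0
                              : 2.0 * tp / (2.0 * tp + fp + fn);
        System.out.println(f1); // 0.0
    }
}
```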