Log Loss/Cross Entropy

earmijo
earmijo New Altair Community Member
edited November 5 in Community Q&A

I'm trying to understand how RapidMiner computes cross-entropy. I set up a basic process (attached below) and asked RM to compute it. But when I compute cross-entropy by hand (Excel) or in other programs (R/Python), I get a different number from the one RM reports.

 

Cross Entropy = -{ y ln(p) + (1 - y) ln(1 - p) }
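A minimal sketch of the formula above, assuming natural logarithms and a plain average over all examples (the usual convention when computing log loss by hand). The function name and the sample labels and probabilities are hypothetical; they are not the Golf data.

```python
import math

def binary_log_loss(y_true, p_pred):
    """Mean binary cross-entropy using natural logs.

    y_true: 0/1 labels; p_pred: predicted probability of class 1.
    """
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, p_pred)
    ]
    return sum(losses) / len(losses)

# Hypothetical labels and predicted probabilities, for illustration only.
y = [1, 0, 1]
p = [0.9, 0.2, 0.6]
print(binary_log_loss(y, p))
```

Each example contributes the negative log of the probability assigned to its true class, so confident wrong predictions are penalised most heavily.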

 

RM Cross Entropy = 0.422

Excel/R/Python Cross Entropy = 0.3135

 

Google sheet:

https://docs.google.com/spreadsheets/d/1o1r3VgsrJxe4R27SV23WmUy7JVGDFIEMAK3riU-TxV0/edit?usp=sharing

 

Am I missing something basic?

 

Thanks in advance for any help,

 

\E

 

<?xml version="1.0" encoding="UTF-8"?>
<process version="9.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.0.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="179" y="136">
        <parameter key="repository_entry" value="//Samples/data/Golf"/>
      </operator>
      <operator activated="true" class="h2o:logistic_regression" compatibility="9.0.000" expanded="true" height="124" name="Logistic Regression" width="90" x="380" y="136"/>
      <operator activated="true" class="apply_model" compatibility="9.0.001" expanded="true" height="82" name="Apply Model" width="90" x="581" y="136">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="9.0.001" expanded="true" height="82" name="Performance" width="90" x="715" y="34">
        <parameter key="cross-entropy" value="true"/>
        <list key="class_weights"/>
      </operator>
      <connect from_op="Retrieve Golf" from_port="output" to_op="Logistic Regression" to_port="training set"/>
      <connect from_op="Logistic Regression" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Logistic Regression" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

 

 

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Answer ✓

    Hi @earmijo,

     

    I'm not a Java specialist, but I checked the RapidMiner source code and may have the beginning of an answer:

    I think RapidMiner calculates cross-entropy like this (and not with the formula you mentioned):

    Cross_entropy.png

     

    The link to this Wikipedia article.

     

    Note: however, I have not been able to reproduce RapidMiner's result with this formula...

     

    Regards,

     

    Lionel

  • earmijo
    earmijo New Altair Community Member

    Thanks, Lionel. You gave me an idea: perhaps RM is using base 2 for the log calculations. This gets me closer to the 0.422; I now get 0.45. Good enough for me :-)
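    The change of base described here can be checked with a short sketch. It assumes the 0.3135 quoted earlier in the thread is a mean cross-entropy in natural-log units; the variable names are illustrative.

```python
import math

# Mean cross-entropy reported from Excel/R/Python in the thread (natural log).
ce_nats = 0.3135

# log2(x) = ln(x) / ln(2), so the same quantity in base 2 is:
ce_bits = ce_nats / math.log(2)

print(round(ce_bits, 2))  # about 0.45
```

    Dividing by ln(2) only rescales the value, which is why the result moves closer to RapidMiner's figure without matching it exactly.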

  • sgenzer
    sgenzer
    Altair Employee

    @lionelderkrikor if you can give me the GitHub link I can poke around internally and see if I can get you a better answer... :)

     

    Scott

     

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @sgenzer,

     

    Thank you, Scott.

    After trying many combinations of calculations, I still cannot reproduce RapidMiner's results.

    Although I'm not one to give up easily, I must admit that I am currently a little discouraged...

     

    Here is the link to the GitHub repository.

     

    Regards,

     

    Lionel

     

     

  • sgenzer
    sgenzer
    Altair Employee

    Got it. I've pinged folks internally and will let you know (or they will post here).

     

    Scott

     

  • IngoRM
    IngoRM New Altair Community Member
    Answer ✓

    If you already got to the 0.45 you are almost there :-)

    Indeed, we use the estimate described in the linked article, i.e. we sum up the log dualis (base-2 logarithm) of the confidences of the true labels. But then we divide it by (N+1), which is 15 in this case, and when you do this you get 0.42 instead of 0.45.

    I do not remember exactly, but I think we added the plus 1 to avoid division by zero in extreme cases. For larger data sets this does not really matter and the numbers are pretty much the same. For smaller data sets, like the Golf data in this case, the difference can be a bit bigger.
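    The calculation described above can be sketched as follows. This is an illustration based only on the description in this post, not RapidMiner's actual source code; the function name is hypothetical, and N = 14 is inferred from the statement that N+1 is 15.

```python
import math

def rm_style_cross_entropy(true_label_confidences):
    """Negated sum of log2(confidence of the true label), divided by N + 1.

    Sketch of the estimate as described in the thread, not RapidMiner's code.
    """
    n = len(true_label_confidences)
    total = -sum(math.log2(c) for c in true_label_confidences)
    return total / (n + 1)

# Check against the numbers quoted in the thread.
n = 14                               # N + 1 = 15 per the post above
mean_bits = 0.3135 / math.log(2)     # hand-computed value converted to base 2
rm_value = mean_bits * n / (n + 1)   # sum / (N + 1) equals mean * N / (N + 1)

print(round(mean_bits, 2), round(rm_value, 3))
```

    The factor N/(N+1) approaches 1 as N grows, which is why the two values differ noticeably only on small data sets such as Golf.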

    I have attached the Excel sheet (embedded in a PPT) showing the full calculation.

    Hope this helps,

    Ingo

  • earmijo
    earmijo New Altair Community Member

    Mystery solved. Thanks to all of you. It feels good to understand how things work.