"Minimal use-case: YAGGA2, YAGGA"
dromiceiomimus
New Altair Community Member
Hi all,
First let me thank the developers for this wonderful tool. I've already had great success with some models.
Now, I'm trying to get YAGGA2 to work. My actual application is more complex than what's presented here, but I'd like to figure out a minimal setup that results in YAGGA2 functioning correctly before trying to apply it there.
So, here's some example data:
a and b will be our attributes, c will be our label. c is log10(max(abs(b-a),50)*a) -- presumably a good candidate for YAGGA2.
a,b,c
1, 1, 1.698970004
2, 13, 2
4, 26, 2.301029996
8, 40, 2.602059991
16, 55, 2.903089987
32, 71, 3.204119983
64, 88, 3.505149978
128,106,3.806179974
256,125,4.525511261
512,235,5.15174973
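For reference, the rows above can be regenerated directly from that formula; here is a small Python sketch (this script is only for checking the data, it is not part of the RapidMiner process):

import math

a_values = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
b_values = [1, 13, 26, 40, 55, 71, 88, 106, 125, 235]

print("a,b,c")
for a, b in zip(a_values, b_values):
    # c = log10(max(abs(b - a), 50) * a)
    c = math.log10(max(abs(b - a), 50) * a)
    print(f"{a},{b},{c:.9f}")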
Here's my process:
[CSV] -> [YAGGA2 (NN -> Apply Model -> Performance)]
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
<process expanded="true" height="467" width="815">
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="C:\data\simpletest.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="a.true.integer.attribute"/>
<parameter key="1" value="b.true.integer.attribute"/>
<parameter key="2" value="c.true.real.label"/>
</list>
</operator>
<operator activated="true" class="optimize_by_generation_yagga2" compatibility="5.1.017" expanded="true" height="94" name="Generate" width="90" x="246" y="30">
<process expanded="true" height="647" width="950">
<operator activated="true" class="neural_net" compatibility="5.1.017" expanded="true" height="76" name="Neural Net" width="90" x="112" y="30">
<list key="hidden_layers"/>
<parameter key="training_cycles" value="5"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model" width="90" x="246" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.1.017" expanded="true" height="76" name="Performance" width="90" x="380" y="30"/>
<connect from_port="example set source" to_op="Neural Net" to_port="training set"/>
<connect from_op="Neural Net" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Neural Net" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance sink"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Generate" to_port="example set in"/>
<connect from_op="Generate" from_port="example set out" to_port="result 3"/>
<connect from_op="Generate" from_port="attribute weights out" to_port="result 2"/>
<connect from_op="Generate" from_port="performance out" to_port="result 1"/>
</process>
</operator>
</process>

This consistently errors with "Process failed: Generation exception: 'java.lang.IllegalArgumentException: Duplicate attribute name: prediction(c)'". Attempting to remove this attribute anywhere in the above chain does no good.
Using YAGGA (not 2) this process will run, but no new attributes will be generated.
What am I doing wrong?
Answers
Hello
The "apply model" operator is adding new attributes to the example set and these are being passed to the upper level of the YAGGA operator. The second time round, the attributes are added again but duplicates happen.
One way to fix it is to use a cross validation operator inside the YAGGA operator. This leaves the example set alone and produces an averaged estimate of what the performance could be on unseen data.
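In the notation used above, the revised process would then look roughly like this (the exact wiring is an assumption on my part, based on the usual X-Validation layout with the learner on the training side and Apply Model plus Performance on the testing side):

[CSV] -> [YAGGA2 (X-Validation (training: NN | testing: Apply Model -> Performance))]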
regards
Andrew
Andrew, thank you so much. I have it working now.
Though, I must admit, I don't quite understand why.
> The "apply model" operator is adding new attributes to the example set and these are being passed to the upper level of the YAGGA operator. The second time round, the attributes are added again but duplicates happen.

Makes enough sense to me. I wasn't thinking about the YAGGA operator's internal state and that being where the duplicates needed to not occur.
> One way to fix it is to use a cross validation operator inside the YAGGA operator. This leaves the example set alone and produces an averaged estimate of what the performance could be on unseen data.

Has me so confused.
Why does cross validation work but not cross validation (parallel)? Are there other operators I could use aside from normal cross validation there? Is there some other (I suppose, better) way to try to employ the YAGGA operator, am I going about that wrong from the beginning?
Any hints on those?
Cheers.
Hi,
> I wasn't thinking about the YAGGA operator's internal state and that being where the duplicates needed to not occur.

It has actually not so much to do with YAGGA's internal state. The duplicate attributes are not those created by YAGGA (it should be able to handle those; in fact, one of the advantages of YAGGA2 is that it handles them better than the original). The duplicates Andrew was referring to are the duplicate predictions: the Apply Model operator creates a prediction which is stored in the attribute / column "prediction(c)". In the next round of YAGGA's internal evaluation, this duplicate attribute causes the problem.
By using an internal cross validation (or split validation) you will get a better and more robust performance estimation anyway, and you don't have to clean up yourself; this is done automatically by the validation operator. So I also highly recommend using either a cross validation or a single split validation inside the YAGGA operators. The same is true for basically all wrapper approaches for feature selection, generation, weighting, and so on.
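As a loose analogy outside of RapidMiner (plain Python with pandas and scikit-learn, not anything YAGGA actually executes; the function names and data are made up for illustration), the difference between an evaluation that writes its predictions back into the example set and one that uses cross validation could be sketched like this:

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

# Synthetic data following the same label formula as the example above
rng = np.random.default_rng(0)
data = pd.DataFrame({"a": rng.uniform(1, 512, 200), "b": rng.uniform(1, 235, 200)})
data["c"] = np.log10(np.maximum(np.abs(data["b"] - data["a"]), 50) * data["a"])

def evaluate_mutating(examples):
    # Bad: stores the predictions in the example set it was handed,
    # so the column "prediction(c)" is still there in the next round.
    if "prediction(c)" in examples.columns:
        raise ValueError("Duplicate attribute name: prediction(c)")
    model = MLPRegressor(max_iter=1000).fit(examples[["a", "b"]], examples["c"])
    examples["prediction(c)"] = model.predict(examples[["a", "b"]])
    return -np.mean((examples["prediction(c)"] - examples["c"]) ** 2)

def evaluate_with_cv(examples):
    # Better: cross validation estimates the performance without ever
    # modifying the example set, so nothing has to be cleaned up.
    scores = cross_val_score(MLPRegressor(max_iter=1000),
                             examples[["a", "b"]], examples["c"],
                             cv=5, scoring="neg_mean_squared_error")
    return scores.mean()

evaluate_mutating(data)        # first "generation": works
# evaluate_mutating(data)      # second "generation": duplicate attribute error
print(evaluate_with_cv(data))  # safe to call in every generation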
Hope that clarifies things a bit.
> Why does cross validation work but not cross validation (parallel)?

In principle this should also be possible. You should, however, not nest different parallel algorithms, i.e. you should not nest a parallel cross validation inside a parallel feature selection / generation, for example.
> Are there other operators I could use aside from normal cross validation there?

Yes, you could use X-Validation, Split Validation, Bootstrapping Validation, or Batch-X-Validation. If you know what you are doing, you could also create specialized subprocesses, but in that case you have to ensure that you clean up the predictions yourself.
> Is there some other (I suppose, better) way to try to employ the YAGGA operator, am I going about that wrong from the beginning?

No, in principle you should be fine. The rest is more about parameter tuning. One tip though: I would try YAGGA2 on slightly bigger data sets, since otherwise probably either no new and interesting attributes will be created or it will directly result in overfitting. In your case, log(a) is already highly correlated with the label c, so any additional attribute does not really help...
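For instance, a quick check on the ten example rows above (a rough NumPy sketch, using the values copied from the data set) shows how closely log10(a) already tracks the label c:

import numpy as np

a = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512], dtype=float)
c = np.array([1.698970004, 2.0, 2.301029996, 2.602059991, 2.903089987,
              3.204119983, 3.505149978, 3.806179974, 4.525511261, 5.15174973])

print(np.corrcoef(np.log10(a), c)[0, 1])  # prints a correlation close to 1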
By the way, there is also a sample process for YAGGA in the Sample repository delivered with RapidMiner: Sample/processes/04_attributes/19_YAGGA, in case you have not seen this one yet...
Cheers,
Ingo