"PaREn Extension"
dragonedison
New Altair Community Member
Dear everyone,
I found that the new update for RapidMiner includes the PaREn Extension, which claims it can suggest the most suitable classification method for a dataset. I would very much like to know how to use this extension.
Regards,
Gary
Answers
Hi,
Try this
http://madm.dfki.de/rapidminer/wizard
However, some fixing may still be needed; I tried to follow the guidelines in a simple test and could not get it to run through to the end.
Regards
Dan
0 -
Hello all,
I found the LandMarking operator doesn't work out of the box, but by deselecting the "Linear Discriminant" check box I got a successful run.
Here's an example that predicts the KNN operator will do best on the Sonar data set and, lo and behold, it seems to - so that's quite cool.
Andrew
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.10" expanded="true" name="Process">
<process expanded="true" height="557" width="614">
<operator activated="true" class="retrieve" compatibility="5.0.10" expanded="true" height="60" name="Sonar data set" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.0.10" expanded="true" height="130" name="Multiply" width="90" x="45" y="210"/>
<operator activated="true" class="x_validation" compatibility="5.0.10" expanded="true" height="112" name="Decision Tree (2)" width="90" x="179" y="390">
<description>A cross-validation evaluating a decision tree model.</description>
<process expanded="true" height="549" width="310">
<operator activated="true" class="decision_tree" compatibility="5.0.10" expanded="true" height="76" name="Decision Tree" width="90" x="112" y="30"/>
<connect from_port="training" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="549" width="310">
<operator activated="true" class="apply_model" compatibility="5.0.10" expanded="true" height="76" name="Apply Model (3)" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.0.10" expanded="true" height="76" name="Performance (Decision Tree)" width="90" x="179" y="30"/>
<connect from_port="model" to_op="Apply Model (3)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/>
<connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (Decision Tree)" to_port="labelled data"/>
<connect from_op="Performance (Decision Tree)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="x_validation" compatibility="5.0.10" expanded="true" height="112" name="Naive Bayes" width="90" x="179" y="255">
<description>A cross-validation evaluating a kernel naive Bayes model.</description>
<process expanded="true" height="396" width="301">
<operator activated="true" class="naive_bayes_kernel" compatibility="5.0.10" expanded="true" height="76" name="Naive Bayes (Kernel)" width="90" x="110" y="30"/>
<connect from_port="training" to_op="Naive Bayes (Kernel)" to_port="training set"/>
<connect from_op="Naive Bayes (Kernel)" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="396" width="301">
<operator activated="true" class="apply_model" compatibility="5.0.10" expanded="true" height="76" name="Apply Model (2)" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.0.10" expanded="true" height="76" name="Performance (Naive Bayes)" width="90" x="179" y="30"/>
<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (Naive Bayes)" to_port="labelled data"/>
<connect from_op="Performance (Naive Bayes)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="KNN" width="90" x="179" y="120">
<description>A cross-validation evaluating a k-NN model.</description>
<process expanded="true" height="654" width="466">
<operator activated="true" class="k_nn" compatibility="5.0.10" expanded="true" height="76" name="k-NN" width="90" x="179" y="30"/>
<connect from_port="training" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="654" width="466">
<operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.0.0" expanded="true" height="76" name="Performance (KNN)" width="90" x="179" y="30"/>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance (KNN)" to_port="labelled data"/>
<connect from_op="Performance (KNN)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="paren:landmarking" compatibility="5.0.0" expanded="true" height="60" name="LandMarking" width="90" x="179" y="30">
<parameter key="Linear Discriminant" value="false"/>
<parameter key="Cross-validation" value="true"/>
<parameter key="Normalize Dataset" value="false"/>
</operator>
<connect from_op="Sonar data set" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="LandMarking" to_port="exampleset"/>
<connect from_op="Multiply" from_port="output 2" to_op="KNN" to_port="training"/>
<connect from_op="Multiply" from_port="output 3" to_op="Naive Bayes" to_port="training"/>
<connect from_op="Multiply" from_port="output 4" to_op="Decision Tree (2)" to_port="training"/>
<connect from_op="Decision Tree (2)" from_port="averagable 1" to_port="result 4"/>
<connect from_op="Naive Bayes" from_port="averagable 1" to_port="result 3"/>
<connect from_op="KNN" from_port="averagable 1" to_port="result 2"/>
<connect from_op="LandMarking" from_port="exampleset" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
Dear Dan,
Thank you! The link is exactly what I need.
Regards,
Gary
Hi,
we are in contact with the guys from DFKI who contribute this extension. They found out that it runs fine under Linux but fails on Windows machines. We will publish a new version as soon as possible.
Greetings,
Sebastian
Hi all,
the fix is on the update server.
Best,
Simon
Ah so I wasn't the only one crashing this plugin on a Windows machine. Thanks for the quick fix guys.
Thanks,
Tom
Hi,
It is a great and very useful initiative to provide an extension such as PaREn. This kind of feature is included in other major DM software, so it was about time. Many thanks to the PaREn team!
I have tested this feature again now that it is operational on Windows machines, and would like to make some constructive comments that, together with those from other users, will hopefully be useful feedback to the developers for future improvements.
Using a dataset of 1000 rows with a binominal label, the accuracy of a PaREn-optimised classifier based on decision trees was 0.692, actually below the 0.726 accuracy of the elementary zeroR model (which simply predicts the mode in all cases). Separately, I quickly built a decision tree that gave an accuracy of 0.737 - only a very small improvement; that model was tested via cross-validation.
I am not sure whether the ordering of these figures is statistically significant, but one would normally expect the PaREn-optimised classifier to outperform both the hand-built decision tree and the trivial model that blindly predicts the most frequent class.
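As an aside, the zeroR baseline is nothing more than always predicting the most frequent class; here is a rough sketch of the comparison in Python/scikit-learn rather than as a RapidMiner process (the CSV file name and the "label" column are placeholders for my actual data, which I cannot share):
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: numeric attributes plus a binominal "label" column.
df = pd.read_csv("my_dataset.csv")
X, y = df.drop(columns=["label"]), df["label"]

# zeroR: always predict the mode of the label.
zero_r = DummyClassifier(strategy="most_frequent")
print("zeroR accuracy:", cross_val_score(zero_r, X, y, cv=10).mean())

# An untuned decision tree, scored the same way (10-fold cross-validation).
tree = DecisionTreeClassifier(random_state=0)
print("decision tree accuracy:", cross_val_score(tree, X, y, cv=10).mean())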
Any other guys with comments on their results?
BTW, most probably the answer is yes - but could the PaREn team tell us whether they made use of the ROC analysis implemented in RM, among others, to optimise accuracy? Thanks.
Regards,
Dan
Hi Dan,
> It is a great and very useful initiative to provide an extension such as PaREn. [...] Many thanks to the PaREn team!
Thanks for your encouraging remarks. Can you please point to some DM software that has similar functionality?
> I am not sure whether the ordering of these figures is statistically significant, but one would normally expect the PaREn-optimised classifier to outperform both the hand-built decision tree and the trivial model that blindly predicts the most frequent class.
You are right. Generally, optimized classifiers should perform better than a manually built one. However, we are currently doing a coarse grid search over a few parameters while using default values for the others. In the case of decision trees, the search is limited to the 'confidence' parameter. Any suggestions about which parameters to optimize are welcome.
> BTW, most probably the answer is yes - but could the PaREn team tell us whether they made use of the ROC analysis implemented in RM, among others, to optimise accuracy?
No, we are simply using classification accuracy for the optimization.
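To make the coarse grid search mentioned above concrete, here is roughly the idea as a Python/scikit-learn sketch, not our actual RapidMiner-based implementation (scikit-learn's tree has no 'confidence' parameter, so the ccp_alpha pruning parameter stands in for it, and the bundled breast-cancer data stands in for a real dataset):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in data set

# Coarse grid over a single pruning parameter, defaults everywhere else.
param_grid = {"ccp_alpha": [0.0, 0.001, 0.005, 0.01, 0.05]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print("best pruning setting:", search.best_params_,
      "cross-validated accuracy:", round(search.best_score_, 3))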
Cheers,
Faisal
Faisal,
Thanks so much for providing this plugin! It really helps me in my data discovery tasks.
Regards,
Tom
www.neuralmarkettrends.com
Hi Faisal,
> Thanks for your encouraging remarks. Can you please point to some DM software that has similar functionality?
A similar (though not identical) feature, very effective indeed, is offered for instance by IBM SPSS Modeler as an automatic modelling operator, which builds several models automatically and proposes the best of them to the user. Moreover, the models may be combined into a kind of voting model, which on some occasions may perform better than the individual models. See a demo here:
http://www.spss.com/media/demos/modeler/demo-modeler-overview/index.htm
Since you asked for suggestions: perhaps you could offer an option expressing how much the models are to be optimised, so that results can be produced in shorter or longer times as the user chooses. For practical reasons one could offer, say, three levels of optimisation - low, medium, high - with processing time increasing accordingly. This would give a balance between processing time and model performance (one of my tests on a dataset of 1000 rows took quite long to run, and sometimes we may want to reduce this time).
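A purely hypothetical sketch of what such an effort setting could mean in practice (Python/scikit-learn, with invented level names and numbers; the level simply controls how dense the parameter grid and the cross-validation are):
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

EFFORT = {"low": (3, 3), "medium": (7, 5), "high": (15, 10)}   # (grid points, CV folds)

def tune(level, X, y):
    points, folds = EFFORT[level]
    grid = {"min_samples_leaf": np.linspace(1, 50, points, dtype=int).tolist()}
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          grid, cv=folds, scoring="accuracy")
    return search.fit(X, y)

X, y = load_breast_cancer(return_X_y=True)   # stand-in data set
print("low-effort accuracy:", round(tune("low", X, y).best_score_, 3))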
Also, you may wish to automatically select the best two or three models and offer their respective RM processes, or alternatively build a process in which these models vote, etc. Potentially your add-in can bring a lot of help to data miners. Thanks again and good luck!
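A rough sketch of the voting idea, again in Python/scikit-learn rather than as a RapidMiner process (the three learners and the bundled sample data are only stand-ins; if I remember correctly RapidMiner also has a Vote operator for this kind of combination):
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in data set

# Majority vote over a few candidate learners, cross-validated as one model.
vote = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb",   GaussianNB()),
    ("knn",  KNeighborsClassifier()),
], voting="hard")
print("voting ensemble accuracy:", cross_val_score(vote, X, y, cv=10).mean())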
Best,
Dan
Hi,
just as an aside: trying different models on a data set is easily possible using a combination of parameter optimization and a subprocess selector. Maybe we should have a sample or building block for that :-)
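For the idea itself, a minimal sketch in Python/scikit-learn rather than the RapidMiner building block (the candidate learners and the sample data are just stand-ins): cross-validate several candidates and keep the best one, which is what the parameter optimization + subprocess selector combination would automate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in data set

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive bayes":   GaussianNB(),
    "k-NN":          KNeighborsClassifier(),
}
# Cross-validate each candidate and report the winner.
scores = {name: cross_val_score(model, X, y, cv=10).mean()
          for name, model in candidates.items()}
print(scores, "-> best:", max(scores, key=scores.get))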
Best,
Simon
Hi,
thanks for the feedback.
Concerning the run-time of the evaluation (which includes optimization):
We are actually working on predicting the run-time as well. For each of the classifiers listed in the wizard you will then see not only the predicted accuracy but also the expected run-time for training on the given data. This should help a lot when certain constraints have to be met, e.g. on embedded systems (where computational power is limited) or if you want to choose a classifier with reasonable performance but also low energy consumption. Maybe we should try to trademark "Green Data Mining" before releasing the next version of the PaREn Automatic System Construction Wizard.
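This is not how our predictor works, but as an illustration of the simplest possible run-time estimate one could make by hand: time training on growing subsamples and extrapolate (a rough Python/scikit-learn sketch with stand-in data):
import time
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in data set
sizes = [len(X) // 8, len(X) // 4, len(X) // 2]

# Measure training time on increasing subsamples.
times = []
for n in sizes:
    start = time.perf_counter()
    DecisionTreeClassifier(random_state=0).fit(X[:n], y[:n])
    times.append(time.perf_counter() - start)

# Crude linear extrapolation to the full data-set size.
slope, intercept = np.polyfit(sizes, times, 1)
print("estimated full-data training time: %.4f s" % (slope * len(X) + intercept))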
Hm, this discussion doesn't have much to do with "Problems and Support" - and I am really happy about that!
Anyway, if you experience any issues, please let us know.
Cheers
Christian
Just wanna say THANKS to everyone who built this wonderful tool. For me as a newbie in RapidMiner, the automatic 'pre-'prediction and processing saves plenty of time that I would otherwise have spent handling all the settings in the normal GUI.
Since I have no improvements to add: best wishes!
Hi all,
Reading this thread, I feel honored that all this discussion takes place in the Problems and Support forum moderated by me, but I wonder whether it would be a good idea to add a new forum explicitly for the PaREn extension. What do you think?
@Christian
If you are going to estimate the runtime of an operator, it might be useful to contact us. We have been working on the same issue for a while and can probably provide you with some help on that. Maybe it would be a good idea to join our new Special Interest Group for the Development of RapidMiner. I think you left RCOMM before we established the groups on the last day - is that right?
Greetings,
Sebastian
Hi Sebastian,
yes, we left on Wednesday evening and did not attend the final training day. The Special Interest Group sounds interesting; please send me more info or point me to it if it is already available.
Well, I don't think that a dedicated forum for the PaREn Wizard is needed. Maybe one for third-party contributions? Kind of a pre-roll for the envisioned marketplace.
Regarding the timing predictions, we would be happy to join forces and exchange insights. Looks like you should do a Rapid-I group excursion to Kaiserslautern.
Cheers
Christian
Hi Christian,
the mailing lists of the Special Interest Groups are listed on the SourceForge page of RapidMiner. We are still working on a page explaining the topics and aims of each group in more detail, but they are nevertheless online. You can join there directly or write me an email and I will put you on the list.
The idea of a third-party forum is good. I will add one.
I've never been to Kaiserslautern. Seems like a good idea to change that now. I will contact you by mail.
Greetings,
Sebastian
Hi,
> A similar (though not identical) feature, very effective indeed, is offered for instance by IBM SPSS Modeler as an automatic modelling operator, which builds several models automatically and proposes the best of them to the user. Moreover, the models may be combined into a kind of voting model, which on some occasions may perform better than the individual models.
Whoa, but that's quite a difference: in SPSS all models are actually tested (which can also be done with the PaREn extension during the evaluation step, but is also possible with a simple process in core RapidMiner, as Simon has pointed out).
The cool thing about the PaREn extension is that it predicts which model is probably the best without any testing at all. This is the first time I have actually seen this meta-learning approach really working, and that is probably the reason why we at Rapid-I and many others love it. Kudos to Christian and the team at DFKI for this great extension!
I also have a suggestion: it would be great if a k-fold cross-validation or even a single split were selectable instead of the rather time-consuming LOO evaluation.
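For readers wondering how a recommendation without testing can work at all, here is a conceptual Python/scikit-learn sketch of the landmarking idea - this is not PaREn's actual code, and the landmarkers and sample data are just stand-ins:
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in data set

# Very cheap "landmarker" learners: their accuracies characterise the data set.
landmarkers = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "naive bayes":    GaussianNB(),
    "1-NN":           KNeighborsClassifier(n_neighbors=1),
    "decision stump": DecisionTreeClassifier(max_depth=1, random_state=0),
}
meta_features = {name: round(cross_val_score(clf, X, y, cv=3).mean(), 3)
                 for name, clf in landmarkers.items()}
# A meta-model trained on many data sets maps such meta-features to the full
# learner expected to perform best - without ever testing that learner here.
print(meta_features)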
Cheers,
Ingo
I am getting an error when I perform Step 3 of the Automatic System Construction with the Iris dataset. It states, "No parameters were specified which should be optimized". The wizard closes after I click the OK button. I have followed the instructions exactly on how to use this extension. Has anyone else had this same problem?
Edit (10-27-10): At the time of my post I thought I had the latest update for PaREn. Once I installed the latest version, I no longer experienced this problem.
Hi,
as Simon pointed out, there should be a new version on our update server. Are you really using the latest version available? If so, what's your OS?
Cheers,
Ingo
> Whoa, but that's quite a difference: in SPSS all models are actually tested (which can also be done with the PaREn extension during the evaluation step, but is also possible with a simple process in core RapidMiner, as Simon has pointed out).
@Ingo: Note that SPSS Modeler has fewer, but very carefully chosen and highly optimised algorithms (where possible - for example C5.0, as opposed to the C4.5 implemented in open-source software). Therefore one can afford to create models for most classification algorithms available in SPSS and to retain the best ones in a reasonable amount of time.
> The cool thing about the PaREn extension is that it predicts which model is probably the best without any testing at all.
Factually speaking (as a fan of both packages, RM and SPSS Modeler), there are obviously similarities and differences between the features we are discussing, and I am afraid that for now the differences show SPSS Modeler to be far ahead: the time needed to build the best models, the reliability and performance of the models (see my earlier posting above about unexpectedly suboptimal optimised PaREn models), the combination of the best models into an overall model to use, etc. In addition, the estimated accuracies in PaREn were quite far from the actual accuracies in most of my experiments - but the idea is interesting.
@Christian et al.: I have an additional suggestion, which I had in mind when posting my earlier questions in this topic. ROC analysis could be added to the search for the model giving the best accuracy when the label attribute is binominal. More precisely, after finding the best parameters for a learner on a given dataset, one can also take the optimised threshold from a ROC curve (as opposed to using the default threshold of 0.5), which gives the best achievable accuracy.
However, this suggestion may only be worth considering once the ROC analysis implemented in RapidMiner has been revised, as it is still unreliable in the current package: the AUC calculation needs corrections, as I have shown on the forum at http://rapid-i.com/rapidforum/index.php?topic=2237.0 , and the Find Threshold operator does not find the best threshold as expected but provides suboptimal solutions (I emailed a complete report to the RM development team, with relevant processes illustrating this).
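As a sketch of the threshold idea itself (Python/scikit-learn, not RapidMiner's Find Threshold; the data set is a stand-in, and in a proper setup the threshold would be chosen on a validation split rather than the final test data):
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)                 # stand-in binominal data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Score the held-out examples, then scan all candidate cut-offs for accuracy.
scores = GaussianNB().fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
thresholds = np.unique(scores)
accuracies = [accuracy_score(y_val, (scores >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(accuracies))]
print("best threshold: %.3f, accuracy: %.3f" % (best, max(accuracies)))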
PaREn is an excellent initiative towards enriching RM. However, the extension needs to become more practical and more accurate: it requires a relatively long processing time, and the models are not as optimised as expected - see the postings earlier in this thread, where both an ad hoc model created with no particular settings and a trivial model that blindly predicts the most frequent class turned out to be more accurate than the optimised, time-consuming-to-build PaREn model. Improvement here would be very beneficial for the extension. Other users may wish to generate ad hoc models in addition to the PaREn models and compare their accuracies - this would be useful feedback for the development team.
I hope the feedback and suggestions in this thread are useful to PaREn, as part of the community's contribution to improving the open-source software. Good luck!
Regards,
Dan
Greetings to all,
The producers of C5.0 compare it against C4.5 here:
http://www.rulequest.com/see5-comparison.html
So we already have much of that difference; moreover, C5.0 is closed source and not free.
> the AUC calculation needs corrections, as I have shown on the forum
Really? Check out my recent post: http://rapid-i.com/rapidforum/index.php/topic,2237.msg10540.html#msg10540
But what has any of this to do with the PaREn extension? Not much! As Ingo says:
> The cool thing about the PaREn extension is that it predicts which model is probably the best without any testing at all.
So it simply misses the point to state that "models are not as optimised as expected".
Toodle Pip!
Does the PaREn extension have anything to do with the Amine platform? The formalism it uses is said to be compatible with parallel processing (a mix of lambda calculus and ontologies):
http://amine-platform.sourceforge.net/