Collection of some ideas

Hi,

I have been experimenting with RapidMiner for a couple of weeks now and am very impressed by the great possibilities it offers and by the very helpful team. Over time, some ideas came to my mind. I'll post them in a short list and let you decide if one or two of them are any good:

More information when viewing a decision tree model: In addition to the graphic representation of the label distribution in each node or leaf, it would be nice if one could hover over a node/leaf and see the distribution in absolute numbers (how many cases of each class of the training set are in the current node/leaf).
When doing parameter optimization, so far only the performance of the best combination is returned. It would be nice if one could also see how other combinations performed (e.g. the top n combinations, where n would be a user-defined value). Maybe there are combinations very close to the best one that have other advantages which make them more desirable than the best one.
I would always like to see the final model in the end. Currently, this is not possible with all operators. E.g. the optimize selection operator trains a model, but does not allow you to see the final model in the end without adding another model training step using the selected attributes.
Stacking using probabilities instead of / in addition to final labels. See http://rapid-i.com/rapidforum/index.php/topic,2744.0.html
Stop subprocess button, allowing you to exit an "infinite loop" without canceling the entire process. See end of first post in http://rapid-i.com/rapidforum/index.php/topic,2745.0.html
Difficult to implement and not so important: Graphical representations of more models, e.g. a 2D representation of an SVM, displaying how the boundary separates the data. Something like here: http://kernelsvm.tripod.com/

Thank you very much for considering this and best regards
Hanspeter

IngoRM

Hi,

thanks for sending this list in. Please find some comments below:

More information when viewing a decision tree model: In addition to the graphic representation of the label distribution in each node or leaf, it would be nice if one could hover over a node/leaf and see the distribution in absolute numbers (how many cases of each class of the training set are in the current node/leaf).

This is actually already implemented - at least for the leaf nodes. If you keep the mouse over a leaf node, a tooltip window will pop up showing more information. The inner nodes only show the total number of examples in their subtree, not the class distribution at that point. We could try to add the distribution numbers there as well.
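To illustrate what distribution numbers for inner nodes would mean: the class counts at an inner node are simply the sum of the counts in the leaves below it. The following is a minimal Python sketch of that idea, not RapidMiner code. The nested-dictionary tree layout, the `class_counts` helper, and the example numbers are all hypothetical.

```python
from collections import Counter

def class_counts(node):
    """Return the class counts for a node of a toy decision tree.

    Assumed layout (hypothetical): a leaf is {"counts": {...}},
    an inner node is {"children": [...]}.
    """
    if "counts" in node:
        # Leaf: the counts are stored directly.
        return Counter(node["counts"])
    # Inner node: add up the counts of all children.
    total = Counter()
    for child in node["children"]:
        total += class_counts(child)
    return total

# Made-up example tree with three leaves.
tree = {"children": [
    {"counts": {"yes": 8, "no": 1}},
    {"children": [
        {"counts": {"yes": 2, "no": 5}},
        {"counts": {"no": 4}},
    ]},
]}

root_counts = class_counts(tree)
inner_counts = class_counts(tree["children"][1])
print(dict(root_counts))
print(dict(inner_counts))
```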

When doing parameter optimization, so far only the performance of the best combination is returned. It would be nice if one could also see how other combinations performed (e.g. the top n combinations, where n would be a user-defined value). Maybe there are combinations very close to the best one that have other advantages which make them more desirable than the best one.

This is already possible. Just use a log operator inside of the parameter optimization and log the parameter values together with the performance. In the log operator, you can also specify a sorting type like Top K. Simply define the number of interesting values there as well as the sorting dimension (probably the performance) and the direction.
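For readers who want to see the idea outside of the GUI, here is a minimal Python sketch of the same pattern: log every parameter combination together with its performance, then keep only the top K entries sorted by performance. It does not use RapidMiner at all. The `evaluate` function is a made-up stand-in for training and validating a model, and the parameter names `depth` and `min_leaf` are hypothetical.

```python
import heapq
from itertools import product

def evaluate(params):
    # Hypothetical stand-in for a real train/validate step.
    # Returns a made-up score that peaks at depth=4, min_leaf=2.
    return 1.0 - 0.05 * abs(params["depth"] - 4) - 0.01 * abs(params["min_leaf"] - 2)

def top_k_combinations(grid, k):
    """Evaluate every combination in the grid and return the k best.

    Each returned entry is a (performance, parameters) pair,
    sorted from best to worst performance.
    """
    names = sorted(grid)
    log = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        log.append((evaluate(params), params))
    # Sorting dimension: performance; direction: descending.
    return heapq.nlargest(k, log, key=lambda row: row[0])

grid = {"depth": [2, 4, 6], "min_leaf": [1, 2, 5]}
top3 = top_k_combinations(grid, 3)
for score, params in top3:
    print(f"{score:.2f}  {params}")
```

Keeping the full log and selecting the top K afterwards makes it easy to spot combinations that perform almost as well as the best one.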

I would always like to see the final model in the end. Currently, this is not possible with all operators. E.g. the optimize selection operator trains a model, but does not allow you to see the final model in the end without adding another model training step using the selected attributes.

The problem here is that a model is not always the result of a parameter optimization. Sometimes the result consists of several models (e.g. a model, some preprocessing models, and a word list for text processing). Sometimes other results are generated, and sometimes no results are generated at all. It will be difficult to handle this in general while keeping compatibility, but I am open to suggestions here.

Stacking using probabilities instead of / in addition to final labels. See http://rapid-i.com/rapidforum/index.php/topic,2744.0.html

Good point. See my comments in this thread.

Stop subprocess button, allowing you to exit an "infinite loop" without canceling the entire process. See end of first post in http://rapid-i.com/rapidforum/index.php/topic,2745.0.html

Also a very useful point but a bit difficult to handle in general. Please see my comments in the other thread.

Difficult to implement and not so important: Graphical representations of more models, e.g. a 2D representation of an SVM, displaying how the boundary separates the data. Something like here: http://kernelsvm.tripod.com/

I also would really like this and I am sure that in the future we will add additional model visualizations like the ones described there.

Thanks again for sending this in! I hope that my suggestions for the first two points already help you right now.

Cheers,
Ingo

spitfire_ch

Hi,

This is actually already implemented - at least for the leaf nodes. If you keep the mouse over a leaf node, a tooltip window will pop up showing more information. The inner nodes only show the total number of examples in their subtree, not the class distribution at that point. We could try to add the distribution numbers there as well.

Oh, sorry, this is embarrassing. I've always noticed the tooltip in neural nets, but somehow not in decision trees. It seems I wasn't patient enough. I tried again and of course you're right, the information is right there! Sorry for that and thank you for correcting me on this - I might never have realized this useful feature is already there.

The problem here is that a model is not always the result of a parameter optimization. Sometimes the result consists of several models (e.g. a model, some preprocessing models, and a word list for text processing). Sometimes other results are generated, and sometimes no results are generated at all. It will be difficult to handle this in general while keeping compatibility, but I am open to suggestions here.

I am a bit confused here. Most optimization operators allow you to see the performance in the end. What model is this performance based on? Isn't it the model with the most optimized parameters / selection?

Thanks a ton for taking the time to answer my (sometimes rather stupid) questions / suggestions. This is really highly appreciated. Your support is exemplary - as is RapidMiner!

Kind regards
Hanspeter