Suggestions/feature requests prompted by RM training course

I had the pleasure of taking the RapidMiner training course with Ralf in New York last week. It was very much worthwhile, and I learned a lot about how to use RapidMiner more effectively. I also came across several places where I thought RM could be improved, or where it might not be working as intended. Some of these might already be in the plans for RM 4.3 or 5.0, but I'm going to list everything I noticed that we would definitely be interested in seeing.

UI/General improvements
=======================

1) Option to keep each input to a node -- right now some nodes have a "keep" option which allows you to preserve one of the inputs, such as an example set. However, this seems to be inconsistently offered across different nodes, and in some cases only some inputs can be preserved, but not others. While this can be worked around with the IOMultiplier, it would be nice if you had the option to keep any of the otherwise-consumed inputs on each node.

2) Improve representation of constants in mathematical expressions. Having to use "const[1]()" instead of "1" is a pain to remember.

3) In a process chain, have an indicator that shows whether an operator has a comment or not. Right now there is no way to tell from the GUI which nodes have comments without inspecting each one.

4) I would like to see more formatting options for graphs, such as being able to select the font, point size, number format, etc.

5) Rearrangeable/dockable UI components. This is admittedly a bigger project, but I would like to be able to rearrange the components of the GUI more. Particularly with multiple monitor setups, the flexibility to move seldom referenced components (like the memory monitor) off the main working area, but still be visible with a glance, would be nice. Also, I'd like to have the option to see both the process chain and the results at the same time, and right now I have to choose one view or the other.

Modeling
========

6) I've mentioned this in other threads on the forum, but I use Evolutionary Weighting a lot, and I'd like to be able to:

(a) start Evolutionary Weighting from an initial weight vector (possibly from a previous run, or from a simpler model to get a good starting point),
(b) manually pause an Evolutionary Weighting node and get the current values of the weights (or write them to a file),
(c) extract weights from Evolutionary Weighting periodically during a run, so I have "best known" values if I have to terminate the model run or if the system crashes.

These ideas work together. In the case of (b) or (c), the saved values of the weights can be passed to the model as initial values for the next model run if (a) is implemented.
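
To illustrate how (a) and (c) fit together, here is a minimal Python sketch. It is not the Evolutionary Weighting operator: the mutate-and-keep-the-best loop, the function names, and the JSON checkpoint format are all assumptions made for illustration.

```python
import json
import random


def evolve_weights(fitness, n_weights, initial=None, generations=50,
                   population_size=20, checkpoint_path=None,
                   checkpoint_every=10, seed=0):
    """Toy evolutionary search over a weight vector with entries in [0, 1].

    `initial` warm-starts the search (idea a); `n_weights` is only used when
    no initial vector is given. The best vector found so far is written to
    `checkpoint_path` every `checkpoint_every` generations (idea c).
    """
    rng = random.Random(seed)
    if initial is not None:
        best = list(initial)
    else:
        best = [rng.random() for _ in range(n_weights)]
    best_score = fitness(best)
    for generation in range(1, generations + 1):
        for _ in range(population_size):
            # Gaussian mutation of the current best, clipped to [0, 1].
            child = [min(1.0, max(0.0, w + rng.gauss(0.0, 0.1))) for w in best]
            score = fitness(child)
            if score > best_score:
                best, best_score = child, score
        if checkpoint_path and generation % checkpoint_every == 0:
            with open(checkpoint_path, "w") as handle:
                json.dump({"generation": generation,
                           "score": best_score,
                           "weights": best}, handle)
    return best, best_score


def load_checkpoint(path):
    """Read back the weight vector saved by evolve_weights."""
    with open(path) as handle:
        return json.load(handle)["weights"]
```

A later run can then be started with `initial=load_checkpoint(path)`, so a terminated or crashed run costs at most `checkpoint_every` generations of progress.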

7) Have a random selection learner, which for classification problems guesses one of the label values, in the proportion they were present in the training set. This is a "base case" simple learner, so you can compare other learners to see how much better they are than random guessing.
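
A rough Python sketch of what I mean by such a baseline (the class and method names are made up for illustration; this is not an existing RapidMiner operator):

```python
import random
from collections import Counter


class RandomGuessLearner:
    """Baseline that guesses labels in proportion to their training frequency."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._labels = []
        self._weights = []

    def fit(self, labels):
        # Remember how often each label value occurred in the training set.
        counts = Counter(labels)
        self._labels = sorted(counts)
        self._weights = [counts[label] for label in self._labels]
        return self

    def predict(self, n):
        # Draw n guesses; each label is picked with its training-set proportion.
        return self._rng.choices(self._labels, weights=self._weights, k=n)
```

Any real learner should beat the accuracy of this one on the same test set; if it doesn't, it hasn't learned anything beyond the label distribution.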

Non-modeling functionality
==========================

8) Add the ability to do 'stratified' sampling across continuous numerical labels. What I'm looking for is something that will guarantee that I have representatives in the sample from the rare portions of the label's distribution. In my case, I'm most interested in predicting values that are at the high end of the distribution, which occur infrequently, but are of the most value. I want to make sure that they are included when I take a sample of the data to do training on.

9) Allow objects to be named and referenced by name in IOMultiplier, IOSelector, etc., rather than just by index number (which is hard to keep track of).

10) Model Merge -- be able to combine two models, such as a preprocessing model and a learner model, into a single model object that can be applied later on, rather than having to maintain two model objects.

11) FeatureIterator does not appear to deliver its results (i.e., if you build a separate model for each feature inside FeatureIterator, those models aren't returned by the iterator).

12) In ParameterIteration, allow sets of scenarios to be specified to iterate over, where each scenario contains a list of parameter values to set for that scenario,
e.g. a=1, stat=hr; a=2, stat=ubb, etc. This helps with meta control (such as output file names).

e.g. I have three labels I want to predict, using MultipleLabelIterator. They are renamed label_1, label_2, label_3, but I want to write the output files with the original attribute names.

Scenario 1: att=1, attname="revenue"
Scenario 2: att=2, attname="cost"
Scenario 3: att=3, attname="profit"

Or I have a process chain that is going to test a KNN model and an SVM model, each with 3 different parameter values. Each model has its own parameters, which don't make sense for the other model. Thus, I want to test (with model #1 = KNN, model #2 = SVM):

Scenario 1: modelnum=1, k=1
Scenario 2: modelnum=1, k=5
Scenario 3: modelnum=1, k=10
Scenario 4: modelnum=2, C=1.0
Scenario 5: modelnum=2, C=2.0
Scenario 6: modelnum=2, C=5.0

This approach could also be used in grid parameter optimization. Right now it does all combinations of the parameters, even if they don't make logical sense (e.g. testing both "modelnum=1, k=1, C=1.0" and "modelnum=1, k=1, C=2.0" in the example above, which differ only in a parameter that KNN ignores).
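
To make the difference between a scenario list and a full grid concrete, here is a small Python sketch (not RapidMiner code). The parameter names modelnum, k and C come from the example above; the helper functions and the `run` callback are hypothetical.

```python
from itertools import product

# Explicit scenarios: each entry lists only the parameters that belong together.
SCENARIOS = [
    {"modelnum": 1, "k": 1},
    {"modelnum": 1, "k": 5},
    {"modelnum": 1, "k": 10},
    {"modelnum": 2, "C": 1.0},
    {"modelnum": 2, "C": 2.0},
    {"modelnum": 2, "C": 5.0},
]

# Full grid: every value of every parameter, including meaningless mixes.
GRID = {"modelnum": [1, 2], "k": [1, 5, 10], "C": [1.0, 2.0, 5.0]}


def full_grid(grid):
    """Expand a dict of value lists into every combination of values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def run_scenarios(scenarios, run):
    """Call the hypothetical `run` function once per scenario."""
    return [run(**scenario) for scenario in scenarios]


if __name__ == "__main__":
    print(len(SCENARIOS), "scenarios vs", len(full_grid(GRID)), "grid combinations")
```

With these values the scenario list has 6 entries, while the full grid has 2 x 3 x 3 = 18 combinations, two thirds of which only repeat a run with a parameter the chosen model ignores.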

Save/output functionality
=========================

13) Have an option to explicitly list values of default options in XML file, rather than only printing values that differ from defaults. This makes it easier to modify later, as well as making it more self-documenting. The current approach could still be offered as an option to minimize the size of the XML files.

14) It would be nice to have a way to capture the underlying data from the ROC plots as a CSV. Right now you can only view the plots; you can't actually get the computed data used to build them, other than in RM's XML format if you save the object.

15) The correlation matrix should allow you to create both the matrix and the attribute weights (not either/or, which is the current behavior).

16) In ProcessLog, show what performance measure is actually being tracked, rather than the generic "Performance" title. It can be hard to remember which measure was selected, and which direction is "good" without inspecting the process.

17) I think there's a bug in ExampleSetWriter -- I have a dataset (read from a database) where if I write using ExampleSetWriter, it fails when I try to read it in using ExampleSource. I think the problem is that there are nominal values in my data that are just whitespace, and from inspecting the output file, it looks like such values don't get quoted when written. I can work around this by using CSVExampleWriter instead, but it seems like ExampleSource should be able to read whatever ExampleSetWriter saves.

18) In much the same way that you allow the export of graphics to GNUPlot, I'd like to be able to write them in a way that could be manipulated with R, which is the other main open-source statistics package I use. I could then ensure that all the charts that I produce have a consistent look and feel to them.

RapidMiner is a great product, and I feel like I have a much better grasp on its capabilities having taken the course. Hopefully, some of these suggestions will be useful in helping to guide the future development and make it even better.

Thanks,
Keith

TobiasMalbrecht

Hi Keith,

thanks for your many suggestions and your kind words about RM. Although I (and probably my colleagues as well) do not have the time to comment on each suggestion in detail at the moment, I can say that we already have many of your suggestions on our roadmap for RapidMiner 5.0. Additionally, some points will hopefully no longer be relevant in RM 5.0 because of the redesign we are planning. Nevertheless, we will definitely go through your post again when we actually start development, as it surely contains some interesting points.

Apart from this general answer, I would like to comment on two aspects specifically:

keith wrote:

2) Improve representation of constants in mathematical expressions. Having to use "const[1]()" instead of "1" is a pain to remember.

We recently re-implemented the whole attribute construction framework. With that re-implementation there is indeed no need to use the cryptic expressions any more. Instead, mathematical expressions can be given in "normal" infix notation. Additionally, many functions are available that greatly extend the attribute construction options (e.g. mathematical functions such as log and exp, if-then-else statements, averages, etc.). The new functionality is already part of the newest developer version (and will of course also be part of our next release, 4.3, which will probably come in a couple of weeks).

keith wrote:

10) Model Merge -- be able to combine two models, such as a preprocessing model and a learner model, into a single model object that can be applied later on, rather than having to maintain two model objects.

Grouping/ungrouping a preprocessing and a learner model should already be possible with RM 4.2 by using the operators ModelGrouper and ModelUngrouper.

Regards,
Tobias

jdouet

@keith

8) Add the ability to do 'stratified' sampling across continuous numerical labels. What I'm looking for is something that will guarantee that I have representatives in the sample from the rare portions of the label's distribution. In my case, I'm most interested in predicting values that are at the high end of the distribution, which occur infrequently, but are of the most value. I want to make sure that they are included when I take a sample of the data to do training on.

Is this what you are talking about?

http://en.wikipedia.org/wiki/Importance_sampling

Cheers,
Jean-Charles.

keith

That seems similar in concept to what I described, although the details of the algorithm escape me a bit. What I was envisioning was a way to ensure that the "shape" of the sample distribution matches the shape of the example set, rather than its density.

Upon further inspection, the Kennard-Stone sampling node in RM seems to be pretty much what I want. You get a flat sample that spans the space of the data points. I wish there was a way to add back some of the "density" into the sample, but it's probably good enough for my needs at the moment.
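
For anyone unfamiliar with it, the classical Kennard-Stone selection rule can be sketched in a few lines of Python. This is my own minimal version of the textbook algorithm (start with the two most distant points, then repeatedly add the point farthest from everything already selected); I have not checked how the RM node implements it, and the function name is made up.

```python
from itertools import combinations
from math import dist  # requires Python 3.8+


def kennard_stone(points, n_samples):
    """Return the indices of n_samples points chosen by the Kennard-Stone rule.

    `points` is a list of coordinate tuples, e.g. [(0.0,), (5.0,)] for 1-D data.
    """
    if not 2 <= n_samples <= len(points):
        raise ValueError("n_samples must be between 2 and len(points)")
    # Start with the two points that are farthest apart.
    first, second = max(
        combinations(range(len(points)), 2),
        key=lambda pair: dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [min(dist(p, points[first]), dist(p, points[second]))
                for p in points]
    while len(selected) < n_samples:
        # Add the unselected point that is farthest from the selected set.
        candidate = max(
            (i for i in range(len(points)) if i not in selected),
            key=lambda i: min_dist[i],
        )
        selected.append(candidate)
        min_dist = [min(d, dist(p, points[candidate]))
                    for d, p in zip(min_dist, points)]
    return selected
```

Because each step only looks at distances, dense regions get no extra weight, which is why the resulting sample is "flat".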

Interestingly, this article mentions some problems with Kennard-Stone in creating training and test data sets for validation, and proposes a modification that addresses it:

http://www.vub.ac.be/fabi/multi/pcr/chaps/chap10.html

steffen

Hello

jdouet wrote:

Is this what you are talking about?

http://en.wikipedia.org/wiki/Importance_sampling

As far as I can see, importance sampling can help you estimate the properties of a distribution, but it does not give you a sample at all. So ...

Why not simply discretize the numerical attribute and then use the discretized attribute for stratified sampling? This is of course not a way to get a 100% probability that certain data points are selected into the sample ... but if you want that guarantee, we should skip the words "random" and "sampling".
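
A rough Python sketch of the discretize-then-sample idea (not RapidMiner operators). The equal-width binning, the fixed number of examples per bin, and the function name are assumptions chosen for illustration; a proportional stratified sample would instead draw from each bin according to its size.

```python
import random
from collections import defaultdict


def binned_stratified_sample(values, n_bins, per_bin, seed=None):
    """Return sorted indices with up to `per_bin` examples per equal-width bin."""
    rng = random.Random(seed)
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (high - low) / n_bins or 1.0
    bins = defaultdict(list)
    for index, value in enumerate(values):
        # Clamp so that the maximum value falls into the last bin.
        bin_id = min(int((value - low) / width), n_bins - 1)
        bins[bin_id].append(index)
    sample = []
    for bin_id in sorted(bins):
        members = bins[bin_id]
        # Small bins contribute all of their members.
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sorted(sample)
```

With a fixed count per bin, every non-empty bin is represented, so the rare high values do show up; which particular examples are drawn within a bin is still random.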

greetings,

Steffen