"Some bugs connected with data sampling"

User: "wokon"
New Altair Community Member
Updated by Jocelyn
While trying to sample data from a large data set, I stumbled over several errors connected with data sampling:
  • 1. Setting rapidminer.general.randomseed = -1 in Tools – Preferences does not have the desired effect: reopening the Preferences window shows that rapidminer.general.randomseed is set to 1 instead of -1. Running ExampleSource with sample_ratio=0.5 always yields the same sequence.
  • 2. Setting rapidminer.general.randomseed = -1 in .rapidminer/ 4_2_0_rapidminerrc.Windows XP works: now we get different samples in each run. However, a warning message appears when opening the Preferences dialog box: “Illegal value '-1' for parameter 'rapidminer.general.randomseed' has been corrected to '1'.” (???) Nevertheless, the system still behaves as if -1 were in effect.
  • 3. Now changing rapidminer.general.randomseed in the Preferences dialog box to any positive value, e.g. 42, and clicking "Apply" & "Save" leaves the random behaviour unchanged (different samples in each run). Only after restarting RapidMiner does the new setting "42" take effect: the same sample A is produced in every run.
  • 4. Changing rapidminer.general.randomseed in the Preferences dialog box to any other positive value, e.g. 84, and clicking "Apply" & "Save" again leaves the random behaviour unchanged (same sample A in each run). Only after restarting RapidMiner does the new setting "84" take effect: a new sample B is produced, and it is the same in every run.
  • 5. With rapidminer.general.randomseed = -1, only sample_ratio<1.0 generates different samples in each run. With sample_ratio=1.0 and sample_size=1000 (in a 50000-record dataset), each run produces the same sequence of 1000 records rather than a different set of 1000 records. So there seems to be no randomness in sample_size.
  • 6. Most disturbing: if I use the Sampling operator and set its parameter local_random_seed to any value other than -1, any incoming dataset is reduced to 0 records on output, irrespective of how large the sample_ratio is! This leaves me rather puzzled.
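To make the expectation behind points 1 to 5 concrete, here is a minimal Python sketch of how I would expect a seed setting to behave when -1 means "pick a fresh seed on every run". This is not RapidMiner code and says nothing about how RapidMiner implements sampling internally; make_rng, sample_by_ratio and sample_by_size are hypothetical names used purely for illustration.

```python
import random


def make_rng(seed: int) -> random.Random:
    """Hypothetical helper: -1 means 'fresh seed', any other value is a fixed seed."""
    if seed == -1:
        return random.Random()  # seeded from the operating system / clock
    return random.Random(seed)


def sample_by_ratio(rows, ratio, rng):
    """Keep each row independently with probability `ratio`."""
    return [row for row in rows if rng.random() < ratio]


def sample_by_size(rows, size, rng):
    """Draw exactly `size` distinct rows, returned in their original order."""
    picked = sorted(rng.sample(range(len(rows)), size))
    return [rows[i] for i in picked]


rows = list(range(50000))  # stand-in for a 50000-record data set

# Fixed seed: the same sample on every run.
fixed_a = sample_by_size(rows, 1000, make_rng(42))
fixed_b = sample_by_size(rows, 1000, make_rng(42))

# Seed -1: a different sample on (practically) every run.
free_a = sample_by_size(rows, 1000, make_rng(-1))
free_b = sample_by_size(rows, 1000, make_rng(-1))

print("fixed seed reproducible:", fixed_a == fixed_b)
print("seed -1 samples differ:", free_a != free_b)
```

In this sketch a fixed seed such as 42 reproduces the same 1000 records, while -1 draws a new set each time for both the ratio-based and the size-based variant. That is the behaviour I expected from sample_size in point 5.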
I'm using RapidMiner 4.2 under Windows XP, and this is the process I use together with the AML and DAT files in the attachment:

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource" breakpoints="after">
        <parameter key="attributes" value="dmc2007_train_small.aml"/>
        <parameter key="sample_ratio" value="0.5"/>
    </operator>
    <operator name="Sampling" class="Sampling">
        <parameter key="sample_ratio" value="0.2"/>
    </operator>
</operator>
Am I really the first one to notice this somewhat strange behaviour, or am I doing something in an unexpected way? Isn't it strange that there is no way to achieve a "random random seed" from the GUI, although the tooltip says that -1 would do it?

Any clarifications are greatly appreciated.

Best regards

Wolfgang

[attachment deleted by admin]
