"Some bugs connected with data sampling"
While trying to draw a sample from a large data set, I ran into several problems related to data sampling:
- 1. Setting rapidminer.general.randomseed = -1 in Tools – Preferences does not have the desired effect: reopening the Preferences window shows that rapidminer.general.randomseed is set to 1 instead of -1. Running ExampleSource with sample_ratio=0.5 then always produces the same sequence.
- 2. Setting rapidminer.general.randomseed = -1 directly in .rapidminer/ 4_2_0_rapidminerrc.Windows XP works: now we get a different sample in each run. However, a warning appears when opening the Preferences dialog box: “Illegal value '-1' for parameter 'rapidminer.general.randomseed' has been corrected to '1'.” (???) Nevertheless, the system still behaves as if -1 were in effect.
- 3. Now changing rapidminer.general.randomseed in the Preferences dialog box to any positive value, e.g. 42, and clicking "Apply" & "Save" leaves the random behaviour unchanged (different samples in each run). The new setting "42" takes effect only after restarting RapidMiner; from then on, the same sample A is produced in every run.
- 4. Changing rapidminer.general.randomseed in the Preferences dialog box to any other positive value, e.g. 84, and clicking "Apply" & "Save" again leaves the behaviour unchanged (same sample A in each run). The new setting "84" takes effect only after restarting RapidMiner; from then on, a new sample B is produced, and it is again the same in every run.
- 5. With rapidminer.general.randomseed = -1, only sample_ratio<1.0 generates different samples in each run. With sample_ratio=1.0 and sample_size=1000 (on a 50000-record dataset), each run produces the same sequence of 1000 records rather than a different set of 1000 records each time. So there seems to be no randomness in sample_size.
- 6. Most disturbing: if I use the operator Sampling and set its parameter local_random_seed to any value other than -1, any incoming dataset is reduced to 0 records on output, irrespective of how large the sample_ratio is! This leaves me rather puzzled (???).
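To make clear what I would have expected from points 1 to 6, here is a minimal Python sketch of the intended semantics. It is not RapidMiner code: the function names `make_rng`, `sample_by_ratio` and `sample_by_size` are hypothetical, and the convention that -1 means "pick a fresh seed for each run" is only my reading of the tooltip.

```python
import random


def make_rng(seed):
    """Return a generator; -1 is assumed to mean 'use a fresh seed each run'."""
    if seed == -1:
        # No explicit seed: Python seeds from OS entropy (or the clock).
        return random.Random()
    # Any other value gives a reproducible sequence.
    return random.Random(seed)


def sample_by_ratio(records, ratio, seed):
    """Keep each record independently with probability `ratio`."""
    rng = make_rng(seed)
    return [r for r in records if rng.random() < ratio]


def sample_by_size(records, size, seed):
    """Draw `size` distinct records without replacement."""
    rng = make_rng(seed)
    return rng.sample(records, size)


records = list(range(50000))  # stand-in for a 50000-record dataset

fixed_a = sample_by_ratio(records, 0.5, 42)
fixed_b = sample_by_ratio(records, 0.5, 42)   # same seed: same sample
random_a = sample_by_size(records, 1000, -1)
random_b = sample_by_size(records, 1000, -1)  # seed -1: a different sample
local_seed = sample_by_ratio(records, 0.2, 7)  # fixed seed: still non-empty
```

Under these assumptions a fixed seed reproduces the sample, -1 varies it for ratio-based and size-based sampling alike, and a fixed local seed never empties the dataset.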
Am I really the first one to notice this somewhat strange behaviour, or am I doing something in an unexpected way? Isn't it strange that there is no way to get a "random random seed" from the GUI, although the tooltip says that -1 would do it?
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource" breakpoints="after">
        <parameter key="attributes" value="dmc2007_train_small.aml"/>
        <parameter key="sample_ratio" value="0.5"/>
    </operator>
    <operator name="Sampling" class="Sampling">
        <parameter key="sample_ratio" value="0.2"/>
    </operator>
</operator>
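For anyone who wants to reproduce this, one way to check whether two runs really delivered the same sample is to export the record IDs of each run and compare them. The helper below is only a sketch: `compare_samples` is a hypothetical name, and it assumes each run is available as a plain list of record IDs.

```python
def compare_samples(run_a, run_b):
    """Summarize how two samples (lists of record IDs) relate to each other."""
    set_a, set_b = set(run_a), set(run_b)
    union = set_a | set_b
    return {
        # Identical records in identical order.
        "same_sequence": run_a == run_b,
        # Identical records, order ignored.
        "same_records": set_a == set_b,
        # Shared records as a fraction of all records seen in either run.
        "overlap": len(set_a & set_b) / len(union) if union else 1.0,
    }
```

With a fixed seed I would expect "same_sequence" to be true for any two runs; with a seed of -1 I would expect the overlap to be well below 1.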
Any clarifications are greatly appreciated.
Best regards,
Wolfgang
[attachment deleted by admin]