RFM - nth selection process to create a test sample in RapidMiner. Can someone assist?
Given a scored RFM master file, I would like to extract an nth-selection test sample. E.g. if the nth selection is 10, then the sample will consist of every 10th record and should create a statistically similar test sample.
A file of 400,000 records will result in a test file of 40,000 examples.
Colin
Best Answers
-
I don't claim efficiency or beauty but the code below ought to work.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
    <process expanded="true">
      <!-- Load the sample data; point repository_entry at the scored RFM file instead -->
      <operator activated="true" class="retrieve" compatibility="6.5.002" expanded="true" height="60" name="Retrieve Deals" width="90" x="179" y="120">
        <parameter key="repository_entry" value="//Samples/data/Deals"/>
      </operator>
      <!-- Number the records with a sequential id attribute -->
      <operator activated="true" class="generate_id" compatibility="6.5.002" expanded="true" height="76" name="Generate ID" width="90" x="380" y="120"/>
      <!-- sampled = id modulo 10; the breakpoint pauses the run here so the new attribute can be inspected -->
      <operator activated="true" breakpoints="after" class="generate_attributes" compatibility="6.5.002" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="120">
        <list key="function_descriptions">
          <parameter key="sampled" value="mod(id,10)"/>
        </list>
      </operator>
      <!-- Keep only the records where sampled equals 0, i.e. every 10th record -->
      <operator activated="true" class="filter_examples" compatibility="6.5.002" expanded="true" height="94" name="Filter Examples" width="90" x="849" y="120">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="sampled.eq.0"/>
        </list>
      </operator>
      <connect from_op="Retrieve Deals" from_port="output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
-
You are probably aware of this, but there is also a "Sample" operator. It doesn't take exactly every nth record, but it has parameters for randomly drawing either an absolute number of records or a percentage, and if you set the random seed the results will be reproducible. For most purposes a random sample is sufficient (and may even be preferable) compared to a sample based on a heuristic such as "every nth record."
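As a rough illustration of the Sample approach, the process below draws a reproducible 10% random sample. It is only a sketch under assumptions: the parameter keys `sample`, `sample_ratio`, `use_local_random_seed` and `local_random_seed` are how I would expect the operator to be configured in process XML, the seed value 1992 is arbitrary, and the Deals sample data stands in for the scored RFM file. Check the names against the operator's parameter panel in your version.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
    <process expanded="true">
      <!-- Stand-in data; replace with the scored RFM master file -->
      <operator activated="true" class="retrieve" compatibility="6.5.002" expanded="true" height="60" name="Retrieve Deals" width="90" x="179" y="120">
        <parameter key="repository_entry" value="//Samples/data/Deals"/>
      </operator>
      <!-- Assumed parameter keys: relative sampling of 10% with a fixed local seed -->
      <operator activated="true" class="sample" compatibility="6.5.002" expanded="true" height="76" name="Sample" width="90" x="380" y="120">
        <parameter key="sample" value="relative"/>
        <parameter key="sample_ratio" value="0.1"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <connect from_op="Retrieve Deals" from_port="output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
```

With a ratio of 0.1, a 400,000-record file would yield about 40,000 examples, the same size as the every-10th-record selection, and rerunning with the same seed should return the same records.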
Answers
-
Thank you very much.
Quite simple: use Generate ID, then generate the sampled attribute with the modulus function, then filter for all records where the result is 0.
Excellent.
Colin
-
Hi,
You can make it a bit more efficient by using the Filter Examples operator's option to filter on an expression directly. That saves the overhead of Generate Attributes and of adding a new column. You simply enter an expression that evaluates to true or false, and in it you can use the mod function on the id as in the example above.
Greetings,
Sebastian
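The expression-based filter described here could look roughly like the process below. This is a sketch, not taken from the post itself: it assumes the Filter Examples parameters are named `condition_class` (set to `expression`) and `parameter_expression`, and that the expression parser accepts `mod(id,10)==0`. The Deals data and the layout values are carried over from the process posted earlier in the thread.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
    <process expanded="true">
      <!-- Stand-in data; replace with the scored RFM master file -->
      <operator activated="true" class="retrieve" compatibility="6.5.002" expanded="true" height="60" name="Retrieve Deals" width="90" x="179" y="120">
        <parameter key="repository_entry" value="//Samples/data/Deals"/>
      </operator>
      <!-- The id attribute is still needed so the expression has something to test -->
      <operator activated="true" class="generate_id" compatibility="6.5.002" expanded="true" height="76" name="Generate ID" width="90" x="380" y="120"/>
      <!-- Assumed parameter keys: filter directly on the expression, no helper column -->
      <operator activated="true" class="filter_examples" compatibility="6.5.002" expanded="true" height="94" name="Filter Examples" width="90" x="581" y="120">
        <parameter key="condition_class" value="expression"/>
        <parameter key="parameter_expression" value="mod(id,10)==0"/>
      </operator>
      <connect from_op="Retrieve Deals" from_port="output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
```

To select every nth record for a different n, only the `10` in the expression should need to change.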
-
Thanks for refining it.