RFM: nth-selection process to create a test sample in RapidMiner. Can someone assist?

cwoo
cwoo New Altair Community Member
edited November 5 in Community Q&A

Given a scored RFM master file, I would like to extract an nth-selection test sample. E.g., if the nth selection is 10, then the sample will consist of every 10th record and should yield a statistically similar test sample.

 

A file of 400,000 records will result in a test file of 40,000 examples.

 

Colin 

 

Best Answers

  • earmijo
    earmijo New Altair Community Member
    Answer ✓

    I don't claim efficiency or beauty but the code below ought to work. 

     

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.5.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.5.002" expanded="true" height="60" name="Retrieve Deals" width="90" x="179" y="120">
            <parameter key="repository_entry" value="//Samples/data/Deals"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="6.5.002" expanded="true" height="76" name="Generate ID" width="90" x="380" y="120"/>
          <operator activated="true" breakpoints="after" class="generate_attributes" compatibility="6.5.002" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="120">
            <list key="function_descriptions">
              <parameter key="sampled" value="mod(id,10)"/>
            </list>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="6.5.002" expanded="true" height="94" name="Filter Examples" width="90" x="849" y="120">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="sampled.eq.0"/>
            </list>
          </operator>
          <connect from_op="Retrieve Deals" from_port="output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓

    You are probably aware of this, but there is also a "Sample" operator. It doesn't take exactly every nth record, but it has parameters for randomly drawing either an absolute number of records or a percentage, and if you set the random seed the results will be reproducible. For most purposes a random sample is sufficient (and may even be preferable) compared with a sample based on a heuristic such as "every nth record."
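
    As a rough illustration of the random-sample alternative, here is a minimal process sketch that draws a reproducible 10% sample. It assumes the same Deals sample data used elsewhere in this thread. The parameter keys (sample, sample_ratio, use_local_random_seed, local_random_seed) and the seed value are written from memory rather than exported from a 6.5 installation, so check them against the operator's parameter panel before relying on this.

    ```xml
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.5.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.5.002" expanded="true" height="60" name="Retrieve Deals" width="90" x="179" y="120">
            <parameter key="repository_entry" value="//Samples/data/Deals"/>
          </operator>
          <!-- Assumption: a relative sample with ratio 0.1 keeps about 10% of the examples, chosen at random -->
          <operator activated="true" class="sample" compatibility="6.5.002" expanded="true" height="76" name="Sample" width="90" x="380" y="120">
            <parameter key="sample" value="relative"/>
            <parameter key="sample_ratio" value="0.1"/>
            <!-- A fixed local seed makes the draw reproducible; the value itself is arbitrary -->
            <parameter key="use_local_random_seed" value="true"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <connect from_op="Retrieve Deals" from_port="output" to_op="Sample" to_port="example set input"/>
          <connect from_op="Sample" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    ```

    Unlike the every-nth approach, the selected rows do not depend on the order of the file, which matters if the master file is sorted by score or has some periodic pattern.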

     

Answers

  • cwoo
    cwoo New Altair Community Member

    Thank you very much.

    Quite simple: use Generate ID, then generate the sample attribute with the modulus function, then keep all records where the modulus is 0.

     

    Excellent 

     

    Colin

  • land
    land New Altair Community Member

    Hi,

     

    You can make it a bit more efficient with the Filter Examples operator's option to use an expression directly. That saves the overhead of Generate Attributes and of adding a new column. You simply enter an expression that evaluates to true or false, in which you can use the mod function on the id as in the example above.

     

    Greetings,

      Sebastian
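
    A minimal sketch of this suggestion, adapted from the process posted earlier in the thread: Generate Attributes is dropped and Filter Examples evaluates the modulus test itself. The condition_class value "expression" and the parameter_expression key are written from memory, not exported from a 6.5 installation, so treat them as assumptions and verify them in the operator's parameter panel.

    ```xml
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.5.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.5.002" expanded="true" height="60" name="Retrieve Deals" width="90" x="179" y="120">
            <parameter key="repository_entry" value="//Samples/data/Deals"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="6.5.002" expanded="true" height="76" name="Generate ID" width="90" x="380" y="120"/>
          <!-- Assumption: the expression condition keeps a row when the expression evaluates to true -->
          <operator activated="true" class="filter_examples" compatibility="6.5.002" expanded="true" height="94" name="Filter Examples" width="90" x="581" y="120">
            <parameter key="condition_class" value="expression"/>
            <!-- Keep every 10th record; change 10 to the desired n -->
            <parameter key="parameter_expression" value="mod(id,10)==0"/>
          </operator>
          <connect from_op="Retrieve Deals" from_port="output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    ```

    With ids starting at 1, this keeps records 10, 20, 30 and so on, so a 400,000-record file yields 40,000 examples without adding a helper column.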

  • cwoo
    cwoo New Altair Community Member

    Thanks for refining it.