Splitting data
frankie
New Altair Community Member
Hi,
I have what I consider a simple problem but due to poor understanding or perhaps poor documentation I cannot figure out how to:
Split a dataset of say 1000 observations into two separate datasets of say 700 and 300 observations respectively. That is, a operator that has two outputs and one input...
Is this done with the "Split Data" operator? If so, what are these "partitions" I need to define?
The split should be random, preferably with a predefined seed for reproducibility.
-frankie
I have what I consider a simple problem but due to poor understanding or perhaps poor documentation I cannot figure out how to:
Split a dataset of say 1000 observations into two separate datasets of say 700 and 300 observations respectively. That is, a operator that has two outputs and one input...
Is this done with the "Split Data" operator? If so, what are these "partitions" I need to define?
The split should be random, preferably with a predefined seed for reproducibility.
-frankie
Tagged:
0
Answers
-
Frankie:
Yes you can do it easily in RM. Take a look at the code below. It uses the operator "Split Data". It splits the iris dataset into 2 partitions: 70/30%. This info is fed to RM clicking the "Edit Enumeration" button. Notice you could have k partitions by adding k ratios.
If you select the option "local random seed" the partitions will be the same in repeated trials.
Hope this helps.<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
<process expanded="true" height="179" width="346">
<operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve" width="90" x="74" y="62">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="split_data" compatibility="5.1.001" expanded="true" height="94" name="Split Data" width="90" x="246" y="75">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
<parameter key="use_local_random_seed" value="true"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_port="result 1"/>
<connect from_op="Split Data" from_port="partition 2" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>0 -
Thank you!0