Complex Data Preparation

Ilya
New Altair Community Member
Hello everyone,
First of all, it's important to say that I've been following this forum for some time now, and it helped me a lot – so thank you!
Now it's finally my turn to ask for help, and I really hope you can help me out.

I'm working on a project requiring machine learning, currently using an SVM model in Weka, while the data preparation is done in code.
Now I am tasked with transferring all the coded data preparations into RM, but I'm having difficulties with it.
I'll try to simplify the problem.
Let's say we are trying to predict which students will be suitable for the high school basketball team, using the age and height as attributes.
Basically, I'm creating features for the SVM from every combination of the attributes, two in this case (in reality I'm currently up to four attributes, possibly more to come…).
foreach student : exampleSet                 // from repository
    foreach age : constAgeArray              // [8, 9, 10]
        foreach height : constHeightArray    // [130, 135, 140]
            if (student.age < age && student.height > height)
                // set feature BASKETBALL_POTENTIAL_{age}_{height} = 1
            else
                // set feature BASKETBALL_POTENTIAL_{age}_{height} = 0

1. I've tried all of the Loop operators to create nested loops, but the process became extremely cumbersome and eventually did not work.
2. Is it possible to define constant arrays in RM? Right now I'm using additional ExampleSets as arrays…
3. Should I even use RM for this kind of data preparation? Or is it best practice to do it by other means and import the result into RM for further use (i.e. classification and regression)?
4. I would be really grateful if someone could give an RM example for the above basketball data preparation.

Thanks in advance!!
Answers
Guys, this is still relevant...
I would really appreciate your help.
Hi,
You can probably do this via existing operators; however, I think the process would be quite complex.
In this case I'd actually recommend the "Execute Script" operator (unless you want to run this on the Server often). I have created a small example process showing how this could look:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="6.1.001" expanded="true" height="60" name="Retrieve CustomFeatureCreationData" width="90" x="45" y="30">
<parameter key="repository_entry" value="CustomFeatureCreationData"/>
</operator>
<operator activated="true" class="execute_script" compatibility="6.1.001" expanded="true" height="94" name="Execute Script" width="90" x="179" y="30">
<parameter key="script" value="import java.util.LinkedList;&#10;import java.util.List;&#10;&#10;import com.rapidminer.example.Attribute;&#10;import com.rapidminer.example.ExampleSet;&#10;import com.rapidminer.example.table.AttributeFactory;&#10;import com.rapidminer.example.table.DoubleArrayDataRow;&#10;import com.rapidminer.example.table.MemoryExampleTable;&#10;import com.rapidminer.tools.Ontology;&#10;&#10;// grab input data&#10;ExampleSet exampleSet = input[0];&#10;&#10;// define constants over which to loop below&#10;int[] ageArray = new int[3];&#10;ageArray[0] = 8;&#10;ageArray[1] = 9;&#10;ageArray[2] = 10;&#10;int[] heightArray = new int[3];&#10;heightArray[0] = 130;&#10;heightArray[1] = 135;&#10;heightArray[2] = 140;&#10;&#10;// loop over all examples (aka rows) in the data&#10;for (Example example : exampleSet) {&#10;&#9;// loop over all constant arrays&#10;&#9;for (int i=0; i&lt;ageArray.length; i++) {&#10;&#9;&#9;for (int j=0; j&lt;heightArray.length; j++) {&#10;&#9;&#9;&#9;// grab data from example&#10;&#9;&#9;&#9;int age = (int) example.getValue(exampleSet.getAttributes().get(&quot;Age&quot;));&#10;&#9;&#9;&#9;int height = (int) example.getValue(exampleSet.getAttributes().get(&quot;Height&quot;));&#10;&#10;&#9;&#9;&#9;// check if attribute (aka column) already exists&#10;&#9;&#9;&#9;String attName = &quot;BASKETBALL_POTENTIAL_&quot; + ageArray[i] + &quot;_&quot; + heightArray[j];&#10;&#9;&#9;&#9;Attribute newAtt = exampleSet.getAttributes().get(attName);&#10;&#9;&#9;&#9;if (newAtt == null) {&#10;&#9;&#9;&#9;&#9;// does not yet exist, create it&#10;&#9;&#9;&#9;&#9;newAtt = AttributeFactory.createAttribute(attName, Ontology.ATTRIBUTE_VALUE_TYPE.NUMERICAL);&#10;&#9;&#9;&#9;&#9;exampleSet.getExampleTable().addAttribute(newAtt);&#10;&#9;&#9;&#9;&#9;exampleSet.getAttributes().addRegular(newAtt);&#10;&#9;&#9;&#9;}&#10;&#10;&#9;&#9;&#9;// fill the attribute for this threshold pair with the desired value&#10;&#9;&#9;&#9;if (age &lt; ageArray[i] &amp;&amp; height &gt; heightArray[j]) {&#10;&#9;&#9;&#9;&#9;example.setValue(newAtt, 1);&#10;&#9;&#9;&#9;} else {&#10;&#9;&#9;&#9;&#9;example.setValue(newAtt, 0);&#10;&#9;&#9;&#9;}&#10;&#9;&#9;}&#10;&#9;}&#10;}&#10;&#10;// return input data&#10;return exampleSet;"/>
</operator>
<connect from_op="Retrieve CustomFeatureCreationData" from_port="output" to_op="Execute Script" to_port="input 1"/>
<connect from_op="Execute Script" from_port="output 1" to_port="result 1"/>
<connect from_op="Execute Script" from_port="output 2" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
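For anyone who wants to check the feature logic outside RapidMiner, here is a minimal standalone Java sketch of the same nested loop. It does not use any RapidMiner classes; the class name `BasketballFeatures` and the method `features` are made up for illustration. It assumes each feature compares against one threshold pair (age below the age threshold, height above the height threshold), which is what the pseudocode in the question describes and what the result table in this thread is consistent with. The sample rows are the five students from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch of the threshold-feature loop; no RapidMiner classes are used.
class BasketballFeatures {
    // Constant threshold arrays, with the values given in the question.
    static final int[] AGES = {8, 9, 10};
    static final int[] HEIGHTS = {130, 135, 140};

    // Returns one 0/1 feature per (age, height) threshold pair, in loop order.
    static Map<String, Integer> features(double age, double height) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int a : AGES) {
            for (int h : HEIGHTS) {
                String name = "BASKETBALL_POTENTIAL_" + a + "_" + h;
                result.put(name, (age < a && height > h) ? 1 : 0);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] names = {"Robert Hanson", "Dennis Muller", "Joe Stevens", "Marc Bold", "Bill Holmes"};
        double[][] rows = {{8, 120}, {9, 150}, {9, 110}, {7, 135}, {8, 110}};
        for (int r = 0; r < rows.length; r++) {
            // Print the student name followed by the nine feature values.
            System.out.println(names[r] + " " + features(rows[r][0], rows[r][1]).values());
        }
    }
}
```

Adding a third attribute would mean another constant array and another nested loop, so with four or more attributes a generic cross-product over a list of arrays is probably the cleaner design.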
Input data:
"Name" "Age" "Height"
"Robert Hanson" 8.0 120.0
"Dennis Muller" 9.0 150.0
"Joe Stevens" 9.0 110.0
"Marc Bold" 7.0 135.0
"Bill Holmes" 8.0 110.0
Result:
"Name" "Age" "Height" "BASKETBALL_POTENTIAL_8_130" "BASKETBALL_POTENTIAL_8_135" "BASKETBALL_POTENTIAL_8_140" "BASKETBALL_POTENTIAL_9_130" "BASKETBALL_POTENTIAL_9_135" "BASKETBALL_POTENTIAL_9_140" "BASKETBALL_POTENTIAL_10_130" "BASKETBALL_POTENTIAL_10_135" "BASKETBALL_POTENTIAL_10_140"
"Robert Hanson" 8.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
"Dennis Muller" 9.0 150.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
"Joe Stevens" 9.0 110.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
"Marc Bold" 7.0 135.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
"Bill Holmes" 8.0 110.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Regards,
Marco
Thank you!
I'll check this as soon as I get to work.