Dear all,
My name is Jon, and I came across RM while researching my Software Engineering thesis at the University of Sydney, Australia. I can see that RM is very powerful. Briefly, my thesis application should monitor web services, collect metrics, transform the data, detect outliers, and notify administrators if any "faults" are detected. In my thesis application, I would like to perform the following operations on my data:
0. Feature collection (collecting various data)
1. RM operator: Data transformation (using PCA, or ICA, or Kernel PCA) - my application will select one of these feature extraction techniques based on how well it performs.
2. RM operator: Outlier detection (using any of the 4 operators, or any new operators that I write) - again, select an operation based on how well it detects outliers.
3. Identify outliers and notify administrators - i.e. get the results of outlier detection.
As you can see, I would like to run the operators like this: 0 --> 1 --> 2 --> 3
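To make the flow concrete, here is a minimal sketch of how I picture 0 --> 1 --> 2 --> 3 in plain Java, independent of RM. Every name in it (PipelineSketch, Transformer, OutlierDetector, Notifier) is hypothetical, and the identity transform and the distance-from-mean detector are simple placeholders, not PCA and not one of RM's outlier operators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are RapidMiner classes.
final class PipelineSketch {

    /** Step 1: maps collected rows to transformed rows (placeholder for PCA, ICA, Kernel PCA). */
    interface Transformer { double[][] transform(double[][] rows); }

    /** Step 2: returns the indices of the rows judged to be outliers. */
    interface OutlierDetector { List<Integer> detect(double[][] rows); }

    /** Step 3: receives the outlier indices, e.g. to alert an administrator. */
    interface Notifier { void report(List<Integer> outlierRows); }

    /** Placeholder transformer that passes the data through unchanged. */
    static final Transformer IDENTITY = rows -> rows;

    /** Placeholder detector: flags rows whose Euclidean distance from the column means exceeds the threshold. */
    static OutlierDetector distanceFromMean(final double threshold) {
        return rows -> {
            List<Integer> outliers = new ArrayList<>();
            if (rows.length == 0) {
                return outliers;
            }
            int cols = rows[0].length;
            double[] mean = new double[cols];
            for (double[] row : rows) {
                for (int c = 0; c < cols; c++) {
                    mean[c] += row[c] / rows.length;
                }
            }
            for (int i = 0; i < rows.length; i++) {
                double squared = 0;
                for (int c = 0; c < cols; c++) {
                    double diff = rows[i][c] - mean[c];
                    squared += diff * diff;
                }
                if (Math.sqrt(squared) > threshold) {
                    outliers.add(i);
                }
            }
            return outliers;
        };
    }

    /** Runs steps 1 -> 2 -> 3 on data that step 0 has already collected. */
    static List<Integer> run(double[][] collected, Transformer transformer,
                             OutlierDetector detector, Notifier notifier) {
        List<Integer> outliers = detector.detect(transformer.transform(collected));
        if (!outliers.isEmpty()) {
            notifier.report(outliers);
        }
        return outliers;
    }

    public static void main(String[] args) {
        // Made-up metrics: three similar rows and one obviously different row.
        double[][] metrics = { {1.0, 1.1}, {0.9, 1.0}, {1.1, 0.9}, {9.0, 9.5} };
        run(metrics, IDENTITY, distanceFromMean(5.0),
            rows -> System.out.println("Possible faults in rows " + rows));
    }
}
```

Because each step sits behind an interface, choosing a different feature extraction or outlier detection technique at runtime would only mean passing a different implementation to run(). In the real application I would want RM operators behind these interfaces.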
Unfortunately, I don't think the white paper contains the basic tutorial information I need at this early stage (maybe later, for operator creation), and the "using single operators" section of the wiki page
http://rapid-i.com/wiki/index.php?title=Integrating_RapidMiner_into_your_application seems to be outdated (it uses operator.apply, which is deprecated).
OK, so I can do this sequence of operations in the GUI fine, and I can see that I want this process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
    <process expanded="true" height="550" width="949">
      <operator activated="true" class="retrieve" compatibility="5.0.8" expanded="true" height="60" name="Retrieve" width="90" x="236" y="283">
        <parameter key="repository_entry" value="../data/PCATutorial"/>
      </operator>
      <operator activated="true" class="principal_component_analysis" compatibility="5.0.8" expanded="true" height="94" name="PCA" width="90" x="447" y="210"/>
      <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="648" y="210">
        <list key="application_parameters">
          <parameter key="variance_threshold" value="0.95"/>
        </list>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="PCA" to_port="example set input"/>
      <connect from_op="PCA" from_port="original" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="PCA" from_port="preprocessing model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="126"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
BTW, I would rather not build the process directly from this XML file: first, because I want to retrieve the data from elsewhere, and second, because I want to change dynamically which operations are run.
In code I have a file that looks like this:
package thesis.PCA;

import com.rapidminer.tools.OperatorService;
import com.rapidminer.Process;
import com.rapidminer.RapidMiner;
import com.rapidminer.example.Attribute;
import com.rapidminer.example.ExampleSet;
import com.rapidminer.example.table.AttributeFactory;
import com.rapidminer.example.table.DoubleArrayDataRow;
import com.rapidminer.example.table.MemoryExampleTable;
import com.rapidminer.operator.IOContainer;
import com.rapidminer.operator.IOObject;
import com.rapidminer.operator.ModelApplier;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.generator.ExampleSetGenerator;
import com.rapidminer.tools.Ontology;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;

public class RMTest {

    public static void main(String[] args) {
        RapidMiner.init();
        // create process
        Process process = createProcess();
        // print process setup
        System.out.println(process.getRootOperator().createProcessTree(0));
        // create some input from application
        // later, this will be real data
        IOContainer input = createInput();
        try {
            // perform process
            process.run(input);
        } catch (OperatorException e) {
            e.printStackTrace();
        }
    }

    public static Process createProcess() {
        // create process
        Process process = new Process();
        try {
            // create operator to create some example data
            Operator inputOperator =
                    OperatorService.createOperator(ExampleSetGenerator.class);
            // set parameters
            inputOperator.setParameter("target_function", "sum classification");
            // PCA
            Operator pca =
                    OperatorService.createOperator(com.rapidminer.operator.features.transformation.PCA.class);
            // applying the model
            Operator modelApp =
                    OperatorService.createOperator(ModelApplier.class);
            // I believe these 3 lines of code do not connect my operators properly
            process.getRootOperator().getSubprocess(0).addOperator(inputOperator);
            process.getRootOperator().getSubprocess(0).addOperator(pca);
            process.getRootOperator().getSubprocess(0).addOperator(modelApp);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return process;
    }

    // code snippet taken from somewhere else
    private static IOContainer createInput() {
        // 10 real-valued attributes plus a nominal label
        List<Attribute> attributes = new LinkedList<Attribute>();
        for (int a = 0; a < 10; a++) {
            attributes.add(AttributeFactory.createAttribute("a" + a, Ontology.REAL));
        }
        Attribute label = AttributeFactory.createAttribute("class", Ontology.NOMINAL);
        attributes.add(label);
        Random rand = new Random();
        MemoryExampleTable table = new MemoryExampleTable(attributes);
        // Create 8 data instances and fill them with random data
        for (int d = 0; d < 8; d++) {
            double[] data = new double[attributes.size()];
            for (int dim = 0; dim < 10; dim++) {
                data[dim] = rand.nextDouble();
            }
            // the last column is the label: randomly 1 or 0
            if (rand.nextBoolean()) {
                data[attributes.size() - 1] = 1d;
            } else {
                data[attributes.size() - 1] = 0d;
            }
            table.addDataRow(new DoubleArrayDataRow(data));
        }
        ExampleSet exampleSet = table.createExampleSet(label);
        IOContainer container = new IOContainer(new IOObject[]{exampleSet});
        return container;
    }
}
The stack trace of the resulting error suggests that my input data never reaches the input port of the PCA operator:
com.rapidminer.operator.UserError: No data was deliverd at port PCA.example set input.
    at com.rapidminer.operator.ports.impl.AbstractPort.getData(AbstractPort.java:79)
    at com.rapidminer.operator.features.transformation.PCA.doWork(PCA.java:132)
    at com.rapidminer.operator.Operator.execute(Operator.java:768)
    at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
    at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
    at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:368)
    at com.rapidminer.operator.Operator.execute(Operator.java:768)
    at com.rapidminer.Process.run(Process.java:863)
    at com.rapidminer.Process.run(Process.java:770)
    at com.rapidminer.Process.run(Process.java:765)
    at thesis.PCA.RMTest.main(RMTest.java:39)
1. It is clear from the XML file above that I need to connect ports, but I cannot find examples of how to do this.
2. Nor can I find examples of how to retrieve any data once the operations have been performed.
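To show what I mean by "connecting ports" in point 1, here is a toy model in plain Java of what I think is going wrong. These classes are hypothetical and are NOT RapidMiner's. I have not confirmed what the equivalent calls are in RM5, which is exactly what I am asking about.

```java
// Toy model only: OutPort and InPort are hypothetical, not RapidMiner classes.
final class PortWiringSketch {

    /** An output port forwards whatever it is given to the input port it is connected to. */
    static final class OutPort {
        private InPort target;

        void connectTo(InPort in) {
            target = in;
        }

        void deliver(Object data) {
            // Without a connection the data silently goes nowhere.
            if (target != null) {
                target.data = data;
            }
        }
    }

    /** An input port holds the delivered data and fails if nothing ever arrived. */
    static final class InPort {
        private final String name;
        private Object data;

        InPort(String name) {
            this.name = name;
        }

        Object getData() {
            if (data == null) {
                throw new IllegalStateException("No data was delivered at port " + name);
            }
            return data;
        }
    }

    public static void main(String[] args) {
        OutPort generatorOut = new OutPort();
        InPort pcaIn = new InPort("PCA.example set input");

        // Case 1: both ends exist, but nothing connects them (my current situation, I believe).
        generatorOut.deliver("example set");
        try {
            pcaIn.getData();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Case 2: connect the ports first, then deliver.
        generatorOut.connectTo(pcaIn);
        generatorOut.deliver("example set");
        System.out.println(pcaIn.getData());
    }
}
```

If this picture is right, adding the operators to the subprocess corresponds to case 1, and what I am missing is the RM5 way of doing the connection step in case 2, plus a way to read the data back out at the end.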
Quite a lot of the examples in this forum seem to use code prior to RM5.
I'd appreciate any suggestions anyone has.
Many thanks,
Jon