Getting different results when loading a process vs coding it
behrangsa
New Altair Community Member
I want to create the example text clustering process using the Java API. Here's a copy of the original process that comes with the Examples bundle:
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Clustering text documents#ylt#/h3#ygt##ylt#p#ygt#In this experiment, texts from two newsgroups are read and clustered. To make the clusters better comprehensible, three keywords are extracted for each cluster and added to the cluster description.#ylt#/p#ygt#"/>
<parameter key="logverbosity" value="status"/>
<operator name="TextInput" class="TextInput" expanded="yes">
<parameter key="default_content_language" value="english"/>
<list key="namespaces">
</list>
<parameter key="prune_above" value="10"/>
<parameter key="prune_below" value="5"/>
<list key="texts">
<parameter key="graphics" value="../data/newsgroup/graphics"/>
<parameter key="hardware" value="../data/newsgroup/hardware"/>
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
<parameter key="min_chars" value="5"/>
</operator>
<operator name="PorterStemmer" class="PorterStemmer">
</operator>
</operator>
<operator name="KMeans" class="KMeans">
</operator>
<operator name="AttributeSumClusterCharacterizer" class="AttributeSumClusterCharacterizer">
</operator>
</operator>
When I run it using this code:
System.setProperty("rapidminer.home", "C:\\Java\\RapidMiner-4.2");
RapidMiner.init();
Process p = new Process(theProcessFile);
p.run();
The result is:
IOContainer (2 objects):
A cluster model with the following properties:
Cluster 0 [characterization: graphic buffer model]: 11 items
Cluster 1 [characterization: appl memori crabappl]: 9 items
Total number of items: 20
If I run the process multiple times, I get the same result. So I assume that the initial centroids are not selected randomly and the outcome is always the same.
Now I want to create this process using the Java API. Here's my code:
System.setProperty("rapidminer.home", "C:\\Java\\RapidMiner-4.2");
RapidMiner.init();
Process p = new Process();
OperatorChain textInput = (OperatorChain) OperatorService.createOperator("TextInput");
textInput.setParameter(PARAMETER_DEFAULT_CONTENT_LANGUAGE, "english");
textInput.setParameter(PARAMETER_PRUNE_ABOVE, "15");
textInput.setParameter(PARAMETER_PRUNE_BELOW, "5");
List<Object[]> textList = new LinkedList<Object[]>();
textList.add(new Object[] {"graphics","newsgroup/graphics"});
textList.add(new Object[] {"hardware","newsgroup/hardware"});
textInput.setListParameter("texts", textList);
textInput.addOperator(OperatorService.createOperator("StringTokenizer"));
textInput.addOperator(OperatorService.createOperator("EnglishStopwordFilter"));
Operator tlfOperator = OperatorService.createOperator("TokenLengthFilter");
tlfOperator.setParameter("min_chars", "5");
textInput.addOperator(tlfOperator);
textInput.addOperator(OperatorService.createOperator("PorterStemmer"));
p.getRootOperator().addOperator(textInput);
p.getRootOperator().addOperator(OperatorService.createOperator("KMeans"));
p.getRootOperator().addOperator(OperatorService.createOperator("AttributeSumClusterCharacterizer"));
System.out.println(p.getRootOperator().createProcessTree(1));
p.save(new File("Process.xml"));
p.run();
When I save the process to a file, it looks identical to the original process that comes with the examples bundle, the only difference being that it is wrapped inside a <process> element:
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.2">
<operator name="Root" class="Process" expanded="yes">
<operator name="TextInput" class="TextInput" expanded="yes">
<parameter key="default_content_language" value="english"/>
<parameter key="prune_above" value="15"/>
<parameter key="prune_below" value="5"/>
<list key="texts">
<parameter key="graphics" value="newsgroup/graphics"/>
<parameter key="hardware" value="newsgroup/hardware"/>
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
<parameter key="min_chars" value="5"/>
</operator>
<operator name="PorterStemmer" class="PorterStemmer">
</operator>
</operator>
<operator name="KMeans" class="KMeans">
</operator>
<operator name="AttributeSumClusterCharacterizer" class="AttributeSumClusterCharacterizer">
</operator>
</operator>
</process>
However, the result of running the process is different from that of the original process:
IOContainer (2 objects):
A cluster model with the following properties:
Cluster 0 [characterization: graphic buffer memori]: 12 items
Cluster 1 [characterization: appl state problem]: 8 items
Total number of items: 20
Any ideas what is causing this?
Thanks in advance,
Behi
Answers
Hi,
"If I run the process multiple times, I get the same result. So I assume that the initial centroids are not selected randomly and the outcome is always the same."
Yes, the results are always the same, but not because the centroids are chosen deterministically. They are in fact chosen randomly. RM makes repeated runs of a process reproducible by ensuring that the sequence of random numbers used is always the same for a given process. By the way, this behaviour can be changed by setting the random seed parameter of the root operator to -1.
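To illustrate why randomly chosen centroids can still give identical results on every run, here is a minimal sketch in plain Java. It does not use RapidMiner at all: the class `SeededCentroids`, the method `pickInitialCentroids`, and the seed value 2001 are made up for this example, and treating -1 as "use a fresh random sequence" only mimics the convention described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class, not part of RapidMiner.
class SeededCentroids {

    // Picks k distinct indices in [0, n) as initial centroids (requires k <= n).
    // A fixed seed gives a reproducible choice; -1 asks for a fresh
    // random sequence, mimicking the convention mentioned in the thread.
    static List<Integer> pickInitialCentroids(int n, int k, long seed) {
        Random rng = (seed == -1) ? new Random() : new Random(seed);
        List<Integer> chosen = new ArrayList<>();
        while (chosen.size() < k) {
            int idx = rng.nextInt(n);
            if (!chosen.contains(idx)) {
                chosen.add(idx);
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Same seed: both calls print the same two indices.
        System.out.println(pickInitialCentroids(20, 2, 2001L));
        System.out.println(pickInitialCentroids(20, 2, 2001L));
        // Seed -1: the indices may differ from run to run.
        System.out.println(pickInitialCentroids(20, 2, -1L));
    }
}
```

In the API style used in the question, changing the seed would presumably be a `setParameter` call on `p.getRootOperator()`; the exact parameter key is not given in this thread, so check it against your RapidMiner version.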
The reason for the difference could be the value of "prune_above". It's 10 in the original process and 15 in yours.
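Differences like this are easy to miss when comparing two process files by eye. As a rough sketch, the following plain-Java helper lists every `<parameter>` whose value differs between two process XML strings. The class `ProcessParamDiff` and its methods are hypothetical names for this example; it relies only on the JDK's DOM parser and on the `<operator>`/`<parameter>` layout shown in the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of RapidMiner.
class ProcessParamDiff {

    // Collects "OperatorName/key" -> value for every <parameter> element.
    static Map<String, String> parameters(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new TreeMap<>();
            NodeList params = doc.getElementsByTagName("parameter");
            for (int i = 0; i < params.getLength(); i++) {
                Element p = (Element) params.item(i);
                // Walk up to the enclosing <operator> to qualify the key.
                Node n = p.getParentNode();
                while (n != null && !"operator".equals(n.getNodeName())) {
                    n = n.getParentNode();
                }
                String owner = (n instanceof Element) ? ((Element) n).getAttribute("name") : "?";
                result.put(owner + "/" + p.getAttribute("key"), p.getAttribute("value"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse process XML", e);
        }
    }

    // Returns one line per key whose value differs or exists on one side only.
    static List<String> diff(String xmlA, String xmlB) {
        Map<String, String> a = parameters(xmlA);
        Map<String, String> b = parameters(xmlB);
        Set<String> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            if (!Objects.equals(a.get(key), b.get(key))) {
                out.add(key + ": " + a.get(key) + " -> " + b.get(key));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String original = "<operator name='Root'><operator name='TextInput'>"
                + "<parameter key='prune_above' value='10'/></operator></operator>";
        String coded = "<operator name='Root'><operator name='TextInput'>"
                + "<parameter key='prune_above' value='15'/></operator></operator>";
        // prints "TextInput/prune_above: 10 -> 15"
        diff(original, coded).forEach(System.out::println);
    }
}
```

Run against the two full process files from the question, a check like this would also flag the differing `texts` paths and the `logverbosity` parameter that only the original sets.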
Cheers,
Ingo