[NullPointerException] Text classification problem
Hi!
I'm trying to apply the code from https://blog.codecentric.de/en/2013/03/java-based-machine-learning-by-classification/ on my process which tests the classification of Arabic texts. I made the training and testing in two separate processes. Now I only need the testing process.
Here's the XML
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true" height="414" width="762">
<operator activated="true" class="retrieve" compatibility="5.3.000" expanded="true" height="60" name="Retrieve" width="90" x="112" y="75">
<parameter key="repository_entry" value="wordlistAr"/>
</operator>
<operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="30">
<list key="text_directories">
<parameter key="أخبار" value="C:\Users\WINDOWS 7\Desktop\rapid2\AraTest\New folder"/>
</list>
<parameter key="file_pattern" value="*"/>
<parameter key="extract_text_only" value="true"/>
<parameter key="use_file_extension_as_type" value="true"/>
<parameter key="content_type" value="txt"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="none"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<process expanded="true" height="414" width="762">
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:filter_stopwords_arabic" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (Arabic)" width="90" x="45" y="165"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="45" y="255">
<parameter key="max_length" value="1"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Arabic)" to_port="document"/>
<connect from_op="Filter Stopwords (Arabic)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="retrieve" compatibility="5.3.000" expanded="true" height="60" name="Retrieve (2)" width="90" x="447" y="30">
<parameter key="repository_entry" value="modelAr"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.3.000" expanded="true" height="76" name="Apply Model" width="90" x="514" y="165">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Process Documents from Files" to_port="word list"/>
<connect from_op="Process Documents from Files" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
And here is the Java code:
// Path to process-definition
final String processPath =
    "C:/Users/WINDOWS 7/.RapidMiner5/repositories/NewLocalRepository/TestNews.rmp";
// Init RapidMiner
RapidMiner.setExecutionMode(ExecutionMode.COMMAND_LINE);
RapidMiner.init();
try
{
    // Load process
    final com.rapidminer.Process process =
        new com.rapidminer.Process(new File(processPath));
    // Load learned model
    final RepositoryLocation locWordList = new RepositoryLocation(
        "//NewLocalRepository/modelAr.model");
    final IOObject wordlist = ((IOObjectEntry)
        locWordList.locateEntry()).retrieveData(null);
    // Load Wordlist
    final RepositoryLocation locModel = new RepositoryLocation(
        "//NewLocalRepository/wordlistAr.wordlist");
    final IOObject model = ((IOObjectEntry)
        locModel.locateEntry()).retrieveData(null);
    final IOContainer ioInput = new IOContainer(new IOObject[] { wordlist, model });
    process.run(ioInput);
    process.run(ioInput);
    final long start = System.currentTimeMillis();
    final IOContainer ioResult = process.run();
    final long end = System.currentTimeMillis();
    System.out.println("T:" + (end - start));
    // Print some results
    final SimpleExampleSet ses = ioResult.get(SimpleExampleSet.class);
    for (int i = 0; i < Math.min(5, ses.size()); i++) {
        final Example example = ses.getExample(i);
        final Attributes attributes = example.getAttributes();
        final String id = example.getValueAsString(attributes.getId());
        final String prediction = example.getValueAsString(
            attributes.getPredictedLabel());
        System.out.println("Path: " + id + ":\tPrediction:" + prediction);
    }
}
catch(Exception e)
{e.printStackTrace();}
}
It says the problem is with this line:
final IOObject wordlist = ((IOObjectEntry)
    locWordList.locateEntry()).retrieveData(null);
Thank you in advance
Hi!
I did as you said and the process started. But I got the error "Cannot resolve relative repository location 'C:\Users\WINDOWS 7\.RapidMiner5\repositories\NewLocalRepository\wordlistAr'. Process is not associated with a repository."
So I associated the process with the repository and loaded its XML:
RepositoryLocation pLoc = new RepositoryLocation("//NewLocalRepository/TestNews");
ProcessEntry pEntry = (ProcessEntry) pLoc.locateEntry();
String processXML = pEntry.retrieveXML();
Process process = new Process(processXML);
But I still get the same "Cannot resolve relative repository location" error, though the path to "wordlistAr" in the process is not relative! :-\
What should I do?
Regarding the multiple runs, they were in the original code and I forgot to comment them out.
Thank you
This is the XML of the testing process
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true" height="414" width="762">
<operator activated="true" class="retrieve" compatibility="5.3.000" expanded="true" height="60" name="Retrieve" width="90" x="112" y="75">
<parameter key="repository_entry" value="C:\Users\WINDOWS 7\.RapidMiner5\repositories\NewLocalRepository\wordlistAr"/>
</operator>
<operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="30">
<list key="text_directories">
<parameter key="أخبار" value="C:\Users\WINDOWS 7\Desktop\rapid2\AraTest\New folder"/>
</list>
<parameter key="file_pattern" value="*"/>
<parameter key="extract_text_only" value="true"/>
<parameter key="use_file_extension_as_type" value="true"/>
<parameter key="content_type" value="txt"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="none"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<process expanded="true" height="414" width="762">
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:filter_stopwords_arabic" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (Arabic)" width="90" x="45" y="165"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="45" y="255">
<parameter key="max_length" value="1"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Arabic)" to_port="document"/>
<connect from_op="Filter Stopwords (Arabic)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="retrieve" compatibility="5.3.000" expanded="true" height="60" name="Retrieve (2)" width="90" x="447" y="30">
<parameter key="repository_entry" value="C:\Users\WINDOWS 7\.RapidMiner5\repositories\NewLocalRepository\modelAr"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.3.000" expanded="true" height="76" name="Apply Model" width="90" x="514" y="165">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Process Documents from Files" to_port="word list"/>
<connect from_op="Process Documents from Files" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
And this is the Java code:
RapidMiner.setExecutionMode(ExecutionMode.COMMAND_LINE);
RapidMiner.init();
try
{
    RepositoryLocation pLoc = new RepositoryLocation("//NewLocalRepository/TestNews");
    ProcessEntry pEntry = (ProcessEntry) pLoc.locateEntry();
    String processXML = pEntry.retrieveXML();
    Process process = new Process(processXML);
    // Load learned model
    final RepositoryLocation locWordList = new RepositoryLocation(
        "//NewLocalRepository/modelAr");
    final IOObject wordlist = ((IOObjectEntry)
        locWordList.locateEntry()).retrieveData(null);
    // Load Wordlist
    final RepositoryLocation locModel = new RepositoryLocation(
        "//NewLocalRepository/wordlistAr");
    final IOObject model = ((IOObjectEntry)
        locModel.locateEntry()).retrieveData(null);
    final IOContainer ioInput = new IOContainer(new IOObject[] { wordlist, model });
    final IOContainer ioResult = process.run(ioInput);
    // Print some results
    final SimpleExampleSet ses = ioResult.get(SimpleExampleSet.class);
    for (int i = 0; i < Math.min(5, ses.size()); i++) {
        final Example example = ses.getExample(i);
        final Attributes attributes = example.getAttributes();
        final String id = example.getValueAsString(attributes.getId());
        final String prediction = example.getValueAsString(
            attributes.getPredictedLabel());
        System.out.println("Path: " + id + ":\tPrediction:" + prediction);
    }
}
catch(Exception e)
{e.printStackTrace();}
}
I'm getting this error: "Cannot resolve relative repository location 'C:\Users\WINDOWS 7\.RapidMiner5\repositories\NewLocalRepository\wordlistAr'. Process is not associated with a repository."
Thank you very much
Hi,
your "Retrieve" operators do not specify a repository location but an absolute path on the file system. That is not what these operators are for; they only work with repositories. If your process and your data are located in the same folder of the same repository, you can simply change the repository entry values to "wordlistAr" and "modelAr". They will then be looked up right next to the process.
Also, you are passing input data to your process. That is not necessary, because you have not connected the input ports on the left side of the process, which is where the input data would appear. If you load data via operators, no input data is needed.
Regards,
Marco
I changed the repository entry value in the XML for both Retrieve operators as you said, and modified the code so that it no longer passes input:
RapidMiner.setExecutionMode(ExecutionMode.COMMAND_LINE);
RapidMiner.init();
try
{
    RepositoryLocation pLoc = new RepositoryLocation("//NewLocalRepository/TestNews");
    ProcessEntry pEntry = (ProcessEntry) pLoc.locateEntry();
    String processXML = pEntry.retrieveXML();
    Process process = new Process(processXML);
    final IOContainer ioResult = process.run();
    // Print some results
    final SimpleExampleSet ses = ioResult.get(SimpleExampleSet.class);
    for (int i = 0; i < Math.min(5, ses.size()); i++) {
        final Example example = ses.getExample(i);
        final Attributes attributes = example.getAttributes();
        final String id = example.getValueAsString(attributes.getId());
        final String prediction = example.getValueAsString(
            attributes.getPredictedLabel());
        System.out.println("Path: " + id + ":\tPrediction:" + prediction);
    }
}
catch(Exception e)
{e.printStackTrace();}
}
But I still get the same error, though the process and the data are in the same repository folder.
Why does it say "Process is not associated with a repository."? I did associate it with a repository!
Thank you
Hi,
I just noticed you're loading your process but don't set its location. You basically take the XML and build a process from that, which is fine, but that process now knows nothing about where it originally came from. To fix that, add one line after your process creation:
Process process = new Process(processXML);
process.setProcessLocation(new RepositoryProcessLocation(pLoc));
Regards,
Marco
Thank you very much, it works now and I got the results.
But there's one critical problem: I get wrong classification predictions. I calculated the prediction accuracy: when I run the process in the RapidMiner GUI it's about 80%, but when I run the same process in Java it drops sharply to about 11%, though both are testing the same dataset.
Also, I get exactly the same predictions every time I run the process in Java.
Another question, please: I'm planning to integrate the process with an Android application. I know it's not efficient, but I need it as a temporary solution.
I want to take user input (a String) and pass it to the process as input instead of reading from files on the computer. Is there such a thing in RapidMiner? How can I do that?
Thanks a lot

Hi,
1)
- are you using the same version of the Text extension in both GUI mode and your code?
- is the random seed for the process identical in both GUI mode and your code?
2) I think you want to create a document from the user input? If so, probably the easiest way is to use a macro. Replace the "Process Documents from Files" operator with a "Create Document" operator which delivers its data to a "Process Documents" operator. Before executing the process, set the macro like so:
process.getMacroHandler().addMacro("user_input", "yourUserData");
For the process itself, see below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="6.4.000" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="C:\Users\WINDOWS 7\.RapidMiner5\repositories\NewLocalRepository\wordlistAr"/>
</operator>
<operator activated="true" class="retrieve" compatibility="6.4.000" expanded="true" height="60" name="Retrieve (2)" width="90" x="246" y="30">
<parameter key="repository_entry" value="C:\Users\WINDOWS 7\.RapidMiner5\repositories\NewLocalRepository\modelAr"/>
</operator>
<operator activated="true" class="text:create_document" compatibility="6.4.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="120">
<parameter key="text" value="%{user_input}"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="6.4.000" expanded="true" height="94" name="Process Documents" width="90" x="246" y="120">
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="6.4.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
<operator activated="true" class="text:filter_stopwords_arabic" compatibility="6.4.000" expanded="true" height="60" name="Filter Stopwords (Arabic)" width="90" x="179" y="30"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="6.4.000" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="313" y="30">
<parameter key="max_length" value="1"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Arabic)" to_port="document"/>
<connect from_op="Filter Stopwords (Arabic)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="6.4.000" expanded="true" height="76" name="Apply Model" width="90" x="380" y="30">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Process Documents" to_port="word list"/>
<connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model" to_port="model"/>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Regards,
Marco
Hi,
yes, the lib/plugin folder is correct. You can call
try {
    System.out.println(myProcess.getRootOperator().getParameter(ProcessRootOperator.PARAMETER_RANDOM_SEED));
} catch (UndefinedParameterError e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
to check the random seed of the process and compare it with the random seed in the GUI. If that is not the cause, you can send me the data your process uses via PM, and I will have a look at what's going on.
Regards,
Marco
Hi,
I just had a quick glance at what you sent me. I seem to get identical results in GUI and Java execution mode. However I do not have the time to dig deeper into it.
If you do have support access with your Studio license, please contact us at https://support.rapidminer.com/ and we will investigate the issue further.
Otherwise, if you are certain it is a bug on our end, you can file a bug at http://bugs.rapidminer.com/.
Regards,
Marco
Hi,
when building your extension, you have to specify that you depend on the Text Extension. With our new Gradle extension mechanism, this looks like this:
build.gradle
[...]
extensionConfig {
    name 'My Extension'
    namespace 'my_ext'
    dependencies {
        extension namespace: 'text', version: '5.3.002'
    }
}
[...]
The dependency is there so that a) the Text Extension is downloaded when your extension is downloaded from the marketplace and b) you have access to its code at compile time.
A dev kit of sorts is scheduled to be released within the next 30 days, so developing an extension for Studio 6.x should become significantly easier soon.
Regards,
Marco
Hi!
Finally the problem is solved.
The whole issue was the encoding of the testing text files. They are supposed to be UTF-8, because the text is Arabic, but they were saved as ANSI.
If you don't mind, I would like an explanation of the behavior of the following classifiers. I tested 3 classifiers on the same testing data but used 3 different datasets for training, and the accuracy results are as follows:
- Naive Bayes:
Testing accuracy using Data1 for training : 49%
Testing accuracy using Data2 for training: 76%
Testing accuracy using Data3 for training: 89%
- SVM:
Testing accuracy using Data1 for training: 50%
Testing accuracy using Data2 for training: 25%
Testing accuracy using Data3 for training: 72%
- K-NN
Testing accuracy using Data1 for training: 22%
Testing accuracy using Data2 for training: 95%
Testing accuracy using Data3 for training: 47%
Note: Data1 (500 files, short texts), Data2 (500 files, long texts), Data3 (1615 files, long and short texts)
I expected the testing accuracy to keep increasing from Data1 to Data3. However, this is observed only with Naive Bayes, while the other two go up and down.
Thank you,
Duha
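A note on the encoding fix above: the following is only a minimal sketch of how the test files could be checked and converted before the process reads them. It assumes that "ANSI" means windows-1256, the usual code page on Arabic Windows systems; the thread does not say which code page the files actually used. The class and method names (EncodingCheck, isValidUtf8, toUtf8) are made up for illustration and are not part of RapidMiner.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class EncodingCheck {

    /** Returns true if the bytes decode as strict UTF-8 without errors. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Re-encodes bytes from the given legacy charset to UTF-8. */
    static byte[] toUtf8(byte[] bytes, Charset source) {
        return new String(bytes, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the legacy "ANSI" files are windows-1256 (Arabic Windows).
        Charset legacy = Charset.forName("windows-1256");
        for (String arg : args) {
            Path file = Paths.get(arg);
            byte[] raw = Files.readAllBytes(file);
            if (isValidUtf8(raw)) {
                System.out.println("OK (UTF-8): " + file);
            } else {
                // Overwrites the file in place; keep a backup when trying this.
                Files.write(file, toUtf8(raw, legacy));
                System.out.println("Converted to UTF-8: " + file);
            }
        }
    }
}
```

Pure ASCII files also pass the UTF-8 check and are left untouched, so only files containing bytes that are not valid UTF-8 get rewritten.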
Hi,
you are partly right and partly wrong. Usually it is better to add more examples (= texts) to the training. The idea is that more examples mean more information.
In text mining, however, you get more attributes the more examples you add. Most likely a lot of these attributes carry no information about the label. In this case the learner might get confused.
If you think about k-NN you can easily imagine that. If you add more dimensions which are just uniformly distributed, the distance measure will be heavily influenced by those attributes and the k-NN will get confused. For the SVM I would expect that you need a higher C to get the same results.
You should try a feature selection. I would suggest using Weight by SVM with Select by Weights and then training the algorithm afterwards. Pruning (in Process Documents) might also help.
Cheers,
Martin
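To make the k-NN point concrete, here is a small hand-made sketch, not taken from the actual data: with one informative attribute the nearest neighbour is the right one, and after adding three attributes that carry no label information the nearest neighbour flips. All numbers and the class name NoisyDimensions are invented for illustration.

```java
final class NoisyDimensions {

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the training vector closest to the query (1-NN). */
    static int nearest(double[][] train, double[] query) {
        int best = 0;
        for (int i = 1; i < train.length; i++) {
            if (distance(train[i], query) < distance(train[best], query)) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // One informative attribute: example 0 has the query's class, example 1 does not.
        double[][] informative = { {0.9}, {0.1} };
        double[] query = {1.0};
        System.out.println(nearest(informative, query)); // prints 0

        // Same examples plus three made-up attributes without label information.
        double[][] withNoise = { {0.9, 0.9, 0.1, 0.8}, {0.1, 0.2, 0.7, 0.3} };
        double[] noisyQuery = {1.0, 0.1, 0.8, 0.2};
        System.out.println(nearest(withNoise, noisyQuery)); // prints 1
    }
}
```

In the second case the squared distances are 1.50 for example 0 and 0.84 for example 1, so the three uninformative attributes outweigh the one attribute that actually separates the classes. Removing such attributes is what feature selection and pruning aim at.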
Hi Duha,
did you already manage to integrate RapidMiner with Android? Whenever I call RapidMiner.setExecutionMode(RapidMiner.ExecutionMode.APPSERVER); the app crashes right away.
@RM: Apart from the file system access I do not see many obstacles to running RM on Android. Will there be any support for this, or would you recommend another way to perform the Apply Model task?
Best
John
when you debug your code you will realize that locWordList.locateEntry() is null. The reason is that you're trying to load a repository location which most likely does not exist. Check the path and make sure it matches 100% with the one you see in the RapidMiner Studio GUI. If I had to hazard a guess, I'd say it is more likely that the correct path is "//NewLocalRepository/modelAr".
Unrelated: Why do you run the process 3 times in a row, discarding the results of the first 2 executions?
Regards,
Marco