"FeatureExtraction from XML LibSVM Java"
jorno
New Altair Community Member
Hi All,
First of all, I would like to thank the RapidMiner guys for their great product!
Thanks a lot for the examples, the documentation and, of course, the wizards!
I would also like to thank Michael Wurst for the tutorial on his website (nemoz.org)!
----------------------------
I'm a newbie student and I have an assignment to classify URLs.
I read a lot of documentation and searched the forums, but I still have 2 problems (RapidMiner version 4.2).
I created one XML file of features per URL, in 2 folders:
.\train\news\www.news1.de.xml
.\train\news\www.news2.de.xml
.\train\porn\www.porn1.de.xml
.\train\porn\www.porn2.de.xml
Each XML file looks like:
<myXML>
<title> my title </title>
<keywords> my keywords </keywords>
<numberOfPages> 6 </numberOfPages>
</myXML>
----------------------------
1. When I run the process file (below) in RapidMiner with LibSVM, it says:
"Message: This learning scheme does not have sufficient capabilities for the given data set: polynominal attributes not supported"
I tried the "06_ExtractionAndWordVecotor.xml" example, but it gave me the same error.
2. I tried to load the model from Java, but I cannot work out how to load the individual features instead of the whole text
(TextInput instead of SingleTextInput?). The simple example works, but without the features.
I would really appreciate your help!
Thanks a lot for everything!
Jorno
---------------------------------------------
RAPID MINER CONFIGURATION FILE
---------------------------------------------
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.4">
<operator name="Root" class="Process" expanded="yes">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="encoding" value="SYSTEM"/>
<operator name="Extractor" class="FeatureExtraction">
<list key="texts">
<parameter key="news" value=".\train\news"/>
<parameter key="porn" value=".\train\porn"/>
</list>
<parameter key="default_content_type" value=""/>
<parameter key="default_content_encoding" value="UTF-8"/>
<parameter key="default_content_language" value="english"/>
<parameter key="use_content_attributes" value="false"/>
<parameter key="id_attribute_type" value="long"/>
<list key="attributes">
<parameter key="title" value="//*/title/text() "/>
<parameter key="#numberOfPages" value="//*/numberOfPages/text()"/>
<parameter key="keywords" value="//*/keywords/text()"/>
</list>
<list key="namespaces">
</list>
</operator>
<operator name="TextInput" class="TextInput" expanded="yes">
<list key="texts">
<parameter key="news" value=".\train\news"/>
<parameter key="porn" value=".\train\porn"/>
</list>
<parameter key="default_content_type" value=""/>
<parameter key="default_content_encoding" value="UTF-8"/>
<parameter key="default_content_language" value="english"/>
<parameter key="prune_below" value="-1"/>
<parameter key="prune_above" value="-1"/>
<parameter key="vector_creation" value="TFIDF"/>
<parameter key="use_content_attributes" value="false"/>
<parameter key="use_given_word_list" value="false"/>
<parameter key="return_word_list" value="true"/>
<parameter key="output_word_list" value=".\train\training_words.txt"/>
<parameter key="id_attribute_type" value="long"/>
<list key="namespaces">
</list>
<parameter key="create_text_visualizer" value="true"/>
<parameter key="on_the_fly_pruning" value="-1"/>
<parameter key="extend_exampleset" value="true"/>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="2147483647"/>
</operator>
<operator name="PorterStemmer" class="PorterStemmer">
</operator>
</operator>
<operator name="LibSVMLearner" class="LibSVMLearner">
<parameter key="keep_example_set" value="false"/>
<parameter key="svm_type" value="C-SVC"/>
<parameter key="kernel_type" value="linear"/>
<parameter key="degree" value="3"/>
<parameter key="gamma" value="0.0"/>
<parameter key="coef0" value="0.0"/>
<parameter key="C" value="0.0"/>
<parameter key="nu" value="0.5"/>
<parameter key="cache_size" value="80"/>
<parameter key="epsilon" value="0.0010"/>
<parameter key="p" value="0.1"/>
<list key="class_weights">
</list>
<parameter key="shrinking" value="true"/>
<parameter key="calculate_confidences" value="false"/>
<parameter key="confidence_for_multiclass" value="true"/>
</operator>
<operator name="ModelWriter" class="ModelWriter">
<parameter key="model_file" value=".\train\training_model.mod"/>
<parameter key="overwrite_existing_file" value="true"/>
<parameter key="output_type" value="Binary"/>
</operator>
</operator>
</process>
-----------------------------------------------
JAVA CODE
-----------------------------------------------
import java.io.File;
import java.io.IOException;
import com.rapidminer.RapidMiner;
import com.rapidminer.example.Example;
import com.rapidminer.example.ExampleSet;
import com.rapidminer.operator.IOContainer;
import com.rapidminer.operator.Model;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorChain;
import com.rapidminer.operator.OperatorCreationException;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.tools.OperatorService;
public class RapidMinerTextClassifier
{
    private OperatorChain wvtoolOperator;
    private Operator modelApplier;
    private Model model;

    public RapidMinerTextClassifier(File modelFile, File wordListFile)
            throws IOException, OperatorCreationException, OperatorException
    {
        //System.setProperty(RapidMiner.PROPERTY_RAPIDMINER_HOME, "C:\\Program Files\\Rapid-I\\RapidMiner\\lib"); // "rapidminer.home"
        //System.setProperty("rapidminer.home", "D:\\Applications\\RapidMiner-4.2");
        System.setProperty("rapidminer.home", "C:\\Program Files\\Rapid-I\\RapidMiner");
        String pluginDirString = new File("C:\\Program Files\\Rapid-I\\RapidMiner\\lib\\plugins").getAbsolutePath();
        System.setProperty(RapidMiner.PROPERTY_RAPIDMINER_INIT_PLUGINS_LOCATION, pluginDirString);
        RapidMiner.init(false, false, false, true);

        // Create the text input operator and set the path to the word list you stored using RapidMiner.
        // As there is only a single text, we use the SingleTextInput operator.
        wvtoolOperator = (OperatorChain) OperatorService.createOperator("SingleTextInput"); // Do I need TextInput here?
        wvtoolOperator.setParameter("input_word_list", wordListFile.getAbsolutePath());

        // Add additional processing steps.
        // Note: the setup must be the same as the one used when creating the classification model.
        wvtoolOperator.addOperator(OperatorService.createOperator("StringTokenizer"));
        wvtoolOperator.addOperator(OperatorService.createOperator("EnglishStopwordFilter"));
        wvtoolOperator.addOperator(OperatorService.createOperator("TokenLengthFilter"));
        wvtoolOperator.addOperator(OperatorService.createOperator("PorterStemmer"));

        // Create the model applier
        modelApplier = OperatorService.createOperator("ModelApplier");

        // Load the model into a field of the class
        Operator modelLoader = OperatorService.createOperator("ModelLoader");
        modelLoader.setParameter("model_file", modelFile.getAbsolutePath());
        IOContainer container = modelLoader.apply(new IOContainer());
        model = container.get(Model.class);
    }

    public String apply(String text) throws OperatorException
    {
        // Set the text
        wvtoolOperator.setParameter("text", text);
        //wvtoolOperator.setParameter("title", text);
        //wvtoolOperator.setParameter("keywords", text);
        //wvtoolOperator.setParameter("numberOfPages", int);

        // Call the text input operator
        IOContainer container = wvtoolOperator.apply(new IOContainer(model));

        // Call the model applier (the model was added already before calling the text input)
        container = modelApplier.apply(container);

        // Obtain the example set from the IO container. It contains only a single example with our text in it.
        ExampleSet eset = container.get(ExampleSet.class);
        Example e = eset.iterator().next();

        // Print the label mapping and the confidence for each class, then return the predicted label
        System.out.println(eset.getAttributes().getPredictedLabel().getMapping() + " " + e.getConfidence("porn") + " " + e.getConfidence("news"));
        return eset.getAttributes().getPredictedLabel().getMapping().mapIndex((int) e.getPredictedLabel());
    }

    public static void main(String[] args) throws Exception
    {
        // Create a text classifier
        RapidMinerTextClassifier tr = new RapidMinerTextClassifier(
                new File("C:\\Main\\eclipse\\workspace\\octopus\\RapidMiner\\train\\training_model.mod"),
                new File("C:\\Main\\eclipse\\workspace\\octopus\\RapidMiner\\train\\training_words.txt"));

        // Call the classifier with texts
        System.out.println("Test1:" + tr.apply("povrai xflick resolution gif"));
        System.out.println("Test2:" + tr.apply("workstation intel switch"));
        System.out.println("Test3:" + tr.apply("sex porn sex povrai xflick resolution gif"));
    }
}
Answers
-
Hi,
First of all: please update to the current version, 4.4. I do not even remember which problems occurred back then...
And now to your problem: you are trying to load structured information from an XML file, but you are using one of the TextInput operators, which are designed only for unstructured (plain) text. You have two possibilities. The first is to generate comma-separated files from your XML files and use the normal ExampleSource. For example, such a file could look like this:
news, my title,my keywords, 6
news, my title2,my keywords2, 4
...
Another, perhaps easier, method for extracting data from structured files is the FeatureExtraction operator of the text plugin. There you can specify XPath expressions to extract the content of each of your three XML nodes. Each expression is assigned to its own attribute. But then you would have to do this in two steps to generate the correct label, because it is not inside the XML and hence cannot be extracted...
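To make the two options above concrete, here is a minimal sketch in plain Java (standard library only, no RapidMiner classes). It evaluates the same three XPath expressions used in the process file against the sample XML and joins the results into one comma-separated row. The class and method names are made up for illustration, the label is assumed to come from the folder name, and the row is not escaped, so a title containing a comma would need quoting.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of RapidMiner.
class XmlToCsvRow {
    // Evaluate one XPath expression against an XML string and trim the text result.
    static String extract(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate " + expression, e);
        }
    }

    // Build one CSV row: the label (assumed to be the folder name) followed by the three features.
    static String toCsvRow(String label, String xml) {
        String title = extract(xml, "//*/title/text()");
        String keywords = extract(xml, "//*/keywords/text()");
        String pages = extract(xml, "//*/numberOfPages/text()");
        return String.join(",", label, title, keywords, pages);
    }

    public static void main(String[] args) {
        String xml = "<myXML><title> my title </title><keywords> my keywords </keywords>"
                   + "<numberOfPages> 6 </numberOfPages></myXML>";
        System.out.println(toCsvRow("news", xml)); // prints news,my title,my keywords,6
    }
}
```

Running this over every file in .\train\news and .\train\porn, with the folder name as the label, would give a file in the format of the CSV example above.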
Greetings,
Sebastian
-
Thanks a lot for your reply.
I am sorry, I think I really am a newbie, because I didn't understand.
1. As you can see in my configuration file, I used FeatureExtraction with XPath as you said (I am using version 4.4). I would be more than grateful if you could help me understand which operators/parameters I need to change for the model to run.
2. The Java code is a different question: how do I add features to the code?
Thanks a lot and sorry for the trouble,
Thanks again
Jorno.
-
Hi Jorno,
Don't worry about that. It's a very complex field; nobody understands it all at once.
For your first point: you don't need the text input at all. There isn't any plain text to load! All you want is to import the information stored in your XML files. So remove the TextInput operator.
One hint: this is a complex setup to begin with. Try to separate it into substeps: first only load the data, so that all your features are stored as attributes of the appropriate type. Then try to learn something, and finally do a validation. Do one step after the other...
Greetings,
Sebastian
-
Thanks a lot for all your help, Sebastian!
I have spent over a week, days and nights, on this since your last answers, and I think I learned a lot,
and I think it is working now.
What I did (XML attached):
1. I added AttributeSubsetPreprocessing/Nominal2String for all the XPath attributes.
2. I used StringTextInput (because I needed the stemmer etc.) with remove_original_attributes=true.
The problem is that I think it treats all the feature strings as one chunk, not as separate strings for each feature
(e.g. different word weights for words in the "title" and for words in the "description").
Meaning: I think the "title"/"keywords" features should have more influence than the "parseText" (all page text) feature, but I don't see that in the model.
Am I right? How do I do it?
Thanks again !
Jorno
<operator name="Root" class="Process" expanded="yes">
<description text="Octopus"/>
<operator name="Extractor" class="FeatureExtraction">
<list key="texts">
<parameter key="news" value=".\train\news"/>
<parameter key="porn" value=".\train\porn"/>
</list>
<parameter key="default_content_encoding" value="UTF-8"/>
<parameter key="default_content_language" value="english"/>
<list key="attributes">
<parameter key="title" value="//*/title/text() "/>
<parameter key="#redirectCount" value="//*/redirectCount/text()"/>
<parameter key="description" value="//*/description/text()"/>
<parameter key="keywords" value="//*/keywords/text()"/>
<parameter key="parseText" value="//*/parseText/text()"/>
<parameter key="metaAbstract" value="//*/metaAbstract/text()"/>
</list>
<list key="namespaces">
</list>
</operator>
<operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
<parameter key="condition_class" value="attribute_name_filter"/>
<parameter key="attribute_name_regex" value="title|description|keywords|parseText|metaAbstract"/>
<operator name="Nominal2String" class="Nominal2String">
</operator>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="remove_original_attributes" value="true"/>
<parameter key="return_word_list" value="true"/>
<parameter key="output_word_list" value="OctopusWordList.txt"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
<parameter key="min_chars" value="3"/>
</operator>
<operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
<operator name="PorterStemmer" class="PorterStemmer">
</operator>
</operator>
<operator name="LibSVMLearner" class="LibSVMLearner">
<parameter key="keep_example_set" value="true"/>
<parameter key="kernel_type" value="linear"/>
<list key="class_weights">
</list>
<parameter key="calculate_confidences" value="true"/>
</operator>
<operator name="ModelWriter" class="ModelWriter">
<parameter key="model_file" value="OctopusModel.mod"/>
<parameter key="output_type" value="Binary"/>
</operator>
</operator>
-
Hi Jorno,
Nice to hear that you got it! And learning can never be bad.
Back to the topic: your assumption is correct. The StringTextInput treats all string attributes as one. The only way around this is to iterate over all the nominal attributes that should be converted into strings and convert them one after the other. Inside this iteration, you would then use the StringTextInput and afterwards rename all the new features it generated. For example, if you have a string attribute "title", you could rename each word attribute to "title_word".
You will probably have to familiarize yourself with the FeatureIterator operator, regular expressions in general, and ChangeAttributeNamesReplace.
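The effect of the renaming can be sketched outside RapidMiner in plain Java. This is only an illustration of the idea, not the operator chain itself: the class name is invented, and the tokenizer is a simple split on non-word characters with the same minimum length of 3 as the TokenLengthFilter in the process.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of per-field word features.
class FieldPrefixedTokens {
    // Split each field's text into lower-case tokens and prefix every token with the field name,
    // so that "sport" in the title and "sport" in the description become different features.
    static List<String> tokenize(Map<String, String> fields) {
        List<String> features = new ArrayList<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String token : field.getValue().toLowerCase().split("\\W+")) {
                if (token.length() >= 3) { // same threshold as min_chars=3 in the process
                    features.add(field.getKey() + "_" + token);
                }
            }
        }
        return features;
    }

    public static void main(String[] args) {
        Map<String, String> page = new LinkedHashMap<>();
        page.put("title", "Sport news today");
        page.put("description", "All the sport results");
        System.out.println(tokenize(page));
    }
}
```

Because "title_sport" and "description_sport" are separate attributes, a learner can give them different weights, which is what the single combined StringTextInput cannot do.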
If you get this to work, you may call yourself an experienced RapidMiner user.
Greetings,
Sebastian
-
Hi Sebastian, thanks again!
I guess I am not an experienced RapidMiner user (although I have read so much documentation, so many forum posts, etc.).
I tried to build the process as you said, but I see 3 weird issues:
1. After each iteration most of the features change their names, but not all of them.
This happens although I rename every feature whose name does not contain the "feature_" string to "feature_<loop_feature>", using the "^[^feature_].*$" regex.
2. After all the FeatureIterator iterations, I do not get the manipulated ExampleSet but the original ExampleSet with the nominal values.
(I tried playing with the work_on_input parameter, with no success.)
3. I also wonder how it can create the "output_word_list" for all the attributes.
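A possible explanation for issue 1, assuming the operator matches attribute names with standard Java regular expressions: "[^feature_]" is a character class, so the pattern only tests whether the first character is something other than f, e, a, t, u, r or "_". Any word attribute that starts with one of those letters is skipped. A negative lookahead tests for the whole prefix instead. The sketch below compares the two patterns on a few made-up attribute names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two patterns.
class RenameFilterRegex {
    // The pattern from the process: a negated character class on the FIRST character only.
    static final Pattern CHARACTER_CLASS = Pattern.compile("^[^feature_].*$");

    // A negative lookahead rejects only names that start with the literal prefix "feature_".
    static final Pattern NEGATIVE_LOOKAHEAD = Pattern.compile("^(?!feature_).*$");

    public static void main(String[] args) {
        String[] names = {"news", "sport", "football", "title", "feature_title"};
        for (String name : names) {
            System.out.println(name
                + " class=" + CHARACTER_CLASS.matcher(name).matches()
                + " lookahead=" + NEGATIVE_LOOKAHEAD.matcher(name).matches());
        }
    }
}
```

With the character class, "football" and "title" are not matched and so would not be renamed; with the lookahead, everything except "feature_title" is matched.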
I am so desperate...
And to think that afterwards I will also need to call the model from my Java code.
Thank you so much for your help!
jorno
<operator name="Root" class="Process" expanded="yes">
<operator name="Extractor" class="FeatureExtraction">
<list key="texts">
<parameter key="news" value=".\train\news"/>
<parameter key="porn" value=".\train\porn"/>
</list>
<parameter key="default_content_encoding" value="UTF-8"/>
<parameter key="default_content_language" value="english"/>
<list key="attributes">
<parameter key="feature_title" value="//*/title/text() "/>
<parameter key="#feature_redirectCount" value="//*/redirectCount/text()"/>
<parameter key="feature_description" value="//*/description/text()"/>
<parameter key="feature_keywords" value="//*/keywords/text()"/>
<parameter key="feature_parseText" value="//*/parseText/text()"/>
<parameter key="feature_metaAbstract" value="//*/metaAbstract/text()"/>
</list>
<list key="namespaces">
</list>
</operator>
<operator name="FeatureIterator" class="FeatureIterator" expanded="yes">
<parameter key="type_filter" value="nominal"/>
<operator name="Nominal2String on current attribute only" class="AttributeSubsetPreprocessing" expanded="yes">
<parameter key="condition_class" value="attribute_name_filter"/>
<parameter key="attribute_name_regex" value="%{loop_feature}"/>
<parameter key="deliver_inner_results" value="true"/>
<operator name="Nominal2String (2)" class="Nominal2String">
</operator>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="remove_original_attributes" value="true"/>
<parameter key="return_word_list" value="true"/>
<parameter key="output_word_list" value="C:\Main\eclipse\workspace\octopus\RapidMiner\OctopusWordList.txt"/>
<list key="namespaces">
</list>
<parameter key="create_text_visualizer" value="true"/>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
<parameter key="min_chars" value="3"/>
</operator>
<operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
<operator name="PorterStemmer" class="PorterStemmer">
</operator>
</operator>
<operator name="ChangeAttributeNamesReplace" class="ChangeAttributeNamesReplace">
<parameter key="attributes" value="^[^feature_].*$"/>
<parameter key="replace_what" value="^"/>
<parameter key="replace_by" value="%{loop_feature}_"/>
<parameter key="apply_on_special" value="false"/>
</operator>
</operator>
<operator name="LibSVMLearner" class="LibSVMLearner">
<parameter key="keep_example_set" value="true"/>
<parameter key="kernel_type" value="linear"/>
<list key="class_weights">
</list>
<parameter key="calculate_confidences" value="true"/>
</operator>
<operator name="ModelWriter" class="ModelWriter">
<parameter key="model_file" value="OctopusModel.mod"/>
<parameter key="output_type" value="Binary"/>
</operator>
</operator>
-
Hi again,
I didn't think of this behavior. Hm. It's less elegant, but I will post a process below which shows a way around it...
By the way: if you rename the attributes, you have to rename them to something that includes the source attribute. Otherwise the attributes might again have the same name...
<operator name="Root" class="Process" expanded="yes">
<operator name="ExampleSetGenerator" class="ExampleSetGenerator">
<parameter key="target_function" value="sum"/>
</operator>
<operator name="IOStorer" class="IOStorer">
<parameter key="name" value="es"/>
<parameter key="io_object" value="ExampleSet"/>
<parameter key="remove_from_process" value="false"/>
</operator>
<operator name="FeatureIterator" class="FeatureIterator" expanded="yes">
<parameter key="filter" value=".*"/>
<operator name="IOConsumer" class="IOConsumer">
<parameter key="io_object" value="ExampleSet"/>
</operator>
<operator name="IORetriever" class="IORetriever">
<parameter key="name" value="es"/>
<parameter key="io_object" value="ExampleSet"/>
</operator>
<operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" breakpoints="after" expanded="yes">
<parameter key="condition_class" value="attribute_name_filter"/>
<parameter key="attribute_name_regex" value="%{loop_feature}"/>
<operator name="BinDiscretization" class="BinDiscretization">
</operator>
</operator>
<operator name="IOStorer (2)" class="IOStorer">
<parameter key="name" value="es"/>
<parameter key="io_object" value="ExampleSet"/>
</operator>
</operator>
<operator name="IORetriever (2)" class="IORetriever">
<parameter key="name" value="es"/>
<parameter key="io_object" value="ExampleSet"/>
</operator>
</operator>
The output word list is now only one part of the needed preprocessing. You can easily save it using the option in the StringTextInput operator; remember to include the source attribute name in the file name, otherwise you will overwrite it...
When applying the model, you have to use the appropriate list for the given source attribute and afterwards do the renaming again... You might want to put the renaming into a separate process and call it during training and application with the ProcessEmbedder, which turns the process into a sort of preprocessing model...
Greetings,
Sebastian
-
Hi Sebastian,
I am investing most of my time (apart from 6-7 hours of sleep) in RapidMiner,
and I think I am finally getting into it
(although I know nothing about AI/NLP algorithms and I am not a Java expert).
I read the website RSS regularly, so I might even contribute one day soon.
I implemented your IORetriever idea, and it works fine.
I had 2 issues, and I think I have the solutions:
1. It creates 5 word files, one for each feature, so I think I will use the new "Script Operator" to append the feature name to each word and append all the words to one file, so I can use it from my Java code.
2. For the classification process I need to build an ExampleSet (unlike the SingleTextInput classification example), so I thought of using the TextObject2ExampleSet.java file as example code.
I haven't implemented my Java code yet, but I will do it soon.
In the meantime I did some other things, and I think I misunderstand some basic RapidMiner concepts.
1. The word_list concept
--------------------
a. I cannot understand why we need the word_list at all. Why isn't the model file enough?
b. As I understand the word file, it has counters per number of documents, which seems weird to me
(I think it is supposed to be a weighted count of words per category, no? I am probably missing something.)
I think this is the reason that, when the model does not know how to categorize, it gives me the category that contains the most documents
(maybe because it contains the most "http" words?).
c. How would I know the threshold above which the model is sure of its category?
d. Maybe I should create a "general" category that indicates low confidence?
e. Maybe I should take the 2-3 best categories? How do I do it?
2. Performance
--------------
I took the basic SingleTextInput classification example and used 45 classes/categories instead of 2.
The model size was amazing! OctopusModel.mod is 3.6M and OctopusWordList.txt is 422K!
But its apply()/classification Java method is rather slow. Can I do something about it?
I have a good amount of RAM (it takes a lot of it) and configured Java accordingly, so that shouldn't be the problem.
It is not a big deal, but I just wonder if my parameters are OK.
These are the parameters:
StringTextInput: prune_below = 10 (I tried several values to reduce the size)
992 examples
4256 string attributes
Total number of Support Vectors: 809
Bias (offset): -0.321
number of classes: 45
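For questions 1.c to 1.e above, one simple approach is to work directly on the per-class confidences, which the Java code in the first post already reads with e.getConfidence(label). The sketch below is plain Java with invented names; the threshold 0.5 and the fallback label "general" are assumptions for illustration, not RapidMiner settings, and a sensible threshold would have to be found on held-out data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper working on a label -> confidence map.
class TopCategories {
    // Return up to k labels sorted by descending confidence. If even the best confidence
    // is below minConfidence, return only the fallback label "general".
    static List<String> pick(Map<String, Double> confidences, int k, double minConfidence) {
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(confidences.entrySet());
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        if (sorted.isEmpty() || sorted.get(0).getValue() < minConfidence) {
            result.add("general");
            return result;
        }
        for (int i = 0; i < Math.min(k, sorted.size()); i++) {
            result.add(sorted.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> confidences = new LinkedHashMap<>();
        confidences.put("porn", 0.1);
        confidences.put("news", 0.6);
        confidences.put("sports", 0.3);
        System.out.println(pick(confidences, 2, 0.5)); // prints [news, sports]
        System.out.println(pick(confidences, 2, 0.7)); // prints [general]
    }
}
```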
Thanks a lot for everything!
Jorno
-
Hi Jorno,
I'm not quite sure that I understand what you are going to do. But whatever it is, it seems that you are willing to get there fast.
One small note: the easiest way to get a large amount of text into an example set is to construct an ExampleSet with a string attribute and then use the StringTextInput operator. That's probably easier than implementing a new operator...
Unfortunately I cannot give you the theoretical background for understanding why all the data in the word list and the model must be saved; it would simply exceed the scope of this forum. If you'd like, you could participate in one of our seminars or webinars for more detailed information. In short: it is all needed for calculating different things...
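One way to picture why a word list may be needed next to the model, as a rough sketch rather than a description of the text plugin's internals: the model only knows attribute positions and weights, so a new text has to be turned into a vector over exactly the same words, with the same document-frequency weighting as during training. The code below assumes the word list holds each word with the number of training documents that contain it (as Jorno describes it) and uses one common TF-IDF variant, tf * ln(N / df). The exact formula used by the plugin is not stated in this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical vectorizer; the word list is modelled as word -> document count.
class WordListVectorizer {
    // Map the tokens of a new text onto a FIXED vocabulary and weight them by TF-IDF.
    // Words that are not in the word list are ignored, because the model has no attribute for them.
    static double[] vectorize(String[] tokens, Map<String, Integer> documentFrequency, int totalDocuments) {
        double[] vector = new double[documentFrequency.size()];
        int position = 0;
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            int count = 0;
            for (String token : tokens) {
                if (token.equals(entry.getKey())) {
                    count++;
                }
            }
            double tf = tokens.length == 0 ? 0.0 : (double) count / tokens.length;
            double idf = Math.log((double) totalDocuments / entry.getValue());
            vector[position++] = tf * idf;
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordList = new LinkedHashMap<>();
        wordList.put("intel", 10);   // appears in 10 of 100 training documents
        wordList.put("switch", 50);
        wordList.put("http", 100);   // appears everywhere, so its weight becomes 0
        double[] v = vectorize(new String[] {"intel", "switch", "http", "unknown"}, wordList, 100);
        System.out.println(Arrays.toString(v));
    }
}
```

Under this weighting, a word such as "http" that occurs in every training document contributes nothing, and the document counts are per word rather than per category.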
OK, finally I will give you at least one small piece of theoretical information: an SVM can only distinguish between two classes, because it uses a separating hyperplane. If you have 45 classes as in your case, you have to find a way to transform the problem into one with only 2 classes. One major approach is to learn 45 models: one for every class against all other classes. During application you have to apply these 45 models and assign the example to the class that has the highest confidence when predicted against all others. Everything clear?
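The "one model per class against all others" scheme can be sketched in a few lines of plain Java. Each class gets its own linear scorer (weights and bias, here simply made up by hand rather than trained), and prediction picks the class whose scorer returns the highest value. As a side note, LIBSVM itself is documented as handling more than two classes internally with pairwise (one-against-one) models, which may be why the 45-class model above trained without extra steps; the sketch shows the one-against-all variant described in the post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical one-vs-rest classifier with hand-written linear models.
class OneVsRest {
    // Linear decision value of one "class vs. all others" model: w . x + b.
    static double score(double[] weights, double bias, double[] x) {
        double sum = bias;
        for (int i = 0; i < x.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    // Apply every per-class model and return the label with the highest decision value.
    static String predict(Map<String, double[]> weightsByClass, Map<String, Double> biasByClass, double[] x) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> model : weightsByClass.entrySet()) {
            double s = score(model.getValue(), biasByClass.get(model.getKey()), x);
            if (s > bestScore) {
                bestScore = s;
                best = model.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> weights = new LinkedHashMap<>();
        Map<String, Double> bias = new LinkedHashMap<>();
        weights.put("news", new double[] {1.0, -0.5});
        bias.put("news", 0.0);
        weights.put("sports", new double[] {-0.5, 1.0});
        bias.put("sports", 0.0);
        System.out.println(predict(weights, bias, new double[] {0.2, 0.9})); // prints sports
    }
}
```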
Greetings,
Sebastian
-
Thanks a lot, Sebastian,
I didn't try anything unusual. My tutor's assignment is to classify websites as news, porn, entertainment, sports, etc.
So I just took the http://nemoz.org/joomla/content/view/65/53/lang,de/ example and used 45 "classes"/"groups". The SVM seems to classify the 45 groups rather OK on the full page text. Forgive me, but I don't understand the theory behind it (word_list etc.).
Then I tried to do it not for the full page text but for each "feature" (title/keywords...), similar to the XML I posted in this thread, and it seems very hard: (1) loading the "features" into an ExampleSet in Java, (2) the word_list for each feature, etc.
Believe me, one of my biggest dreams is to take one of these courses:
http://rapid-i.com/content/view/73/148/
http://rapid-i.com/content/view/87/149/
http://rapid-i.com/content/view/125/150/
But it costs too much for someone like me, plus flights. I searched for a webinar on your site but found nothing. Can you ask questions in the webinar? How much does it cost? I would be happy to know.
I really understand your consulting model, and I appreciate it a lot!
You gave the world open source, and I, and I believe the whole community, thank you!
Maybe you could think about a business model for "small" questions (like http://www.liveperson.com)? Just a small thought.
In any case, thank you so much for your great product and your help so far!
Jorno
-
Hi Jorno,
I think if you only use one of these features, it will not contain enough information to be sure about the class. You might try it yourself: look only at the content of this feature and guess the correct class. You probably won't be too successful...
We are currently working with a webinar provider to build up the infrastructure. Webinars will be announced soon.
In fact we do have something for smaller problems: telephone consulting, billed per hour. And less than an hour isn't really enough for such a complex field...
Greetings,
Sebastian