[SOLVED] Really basic question, I think I'm applying models wrong.

My first Read Database gets all of the values from the documents (20k).
My second Read Database (1k documents) has a value isGood = 1 if the value is good, -2 if the value is bad, and a bunch of other really bad ideas... I set isGood as the label. Should I actually only be passing true/false, or is an integer okay?
I use Nominal to Text to get the "data" field as text.
I then process the document, looking for word frequencies etc.
Is my Naive Bayes even in the right place?
My end goal is that I feed it 1000 known good documents and it can find very similar documents from the first Read Database... I want my confidence score to be based on document similarity.
I am getting an output that contains confidence, but I'm not sure how to present my output. I don't come from a statistical background, so I'm learning on my feet. I appreciate I have a lot to learn, so in 3 weeks' time I'm going to read some books/content about how to use RapidMiner and ML in general. I can only apologize for my ignorance!
TL;DR:
Can I use an integer as a label?
Am I using Naive Bayes and Apply Model correctly?
How can I view my data in an easy-to-interpret way? Ideally something like a list of document IDs with their confidence rating.
Thanks guys!
Hi,
Some additions from my side:
- Are you sure that your training data contains more than one value for isGood? If it contains only examples of one class, that could cause the error message.
- For Text Processing it is very important to use the same word list for training and application. Thus you have to connect the "wor" (word list) output of the Process Documents operator in the training branch to the "wor" input in the application branch. That way it is guaranteed that the training and application example sets contain the same word vectors.
- Do your integer values in isGood imply an order, or are they actually categories? In the latter case you should convert the label to a nominal value, so Naive Bayes will perform a classification. If it is left as Integer, it will perform a regression.
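To make the word-list point concrete outside RapidMiner, here is a minimal Python sketch (standard library only). It is not how Process Documents is implemented; the function names `build_word_list` and `vectorize` and the sample texts are made up for illustration. It only shows the idea that the vocabulary is fixed from the training documents and reused unchanged for the new documents, so both sets end up with the same columns.

```python
from collections import Counter

def tokenize(text):
    """Lower-case and split on whitespace (a stand-in for the real tokenizer chain)."""
    return text.lower().split()

def build_word_list(training_docs):
    """Collect the sorted vocabulary from the training documents only."""
    return sorted({tok for doc in training_docs for tok in tokenize(doc)})

def vectorize(doc, word_list):
    """Count each word-list entry in the document; words not in the list are dropped."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in word_list]

# Hypothetical example texts, not taken from the thread's database.
training_docs = ["school holiday dates", "term dates and holiday list"]
new_docs = ["holiday dates for the summer term"]

word_list = build_word_list(training_docs)               # built once, from training data
train_vectors = [vectorize(d, word_list) for d in training_docs]
new_vectors = [vectorize(d, word_list) for d in new_docs]  # same word list reused

print(word_list)
print(new_vectors[0])
```

If the second call built its own word list from `new_docs` instead, the columns would no longer line up with what the model was trained on, which is the situation the word list connection avoids.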
Best,
Marius
Hey guys, so I made some progress.
I extended my DB structure to support a label field and set any that are known positive matches as true and any known negatives as false.
I use these MySQL select queries:
SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = "true" OR isGood = -2 AND school_list_holiday_sources.label = "false" LIMIT 0,50
This select gets the items labelled true and false. Naive Bayes learns from these.
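One detail of that WHERE clause is easy to misread: AND binds more tightly than OR, so the condition is evaluated as `label = 'true' OR (isGood = -2 AND label = 'false')`. The Python sketch below checks this with an in-memory SQLite database rather than MySQL (both follow the same AND-before-OR precedence). The single table `docs` and its rows are hypothetical stand-ins for the joined tables in the real query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER, label TEXT, isGood INTEGER)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        (1, "true", 1),    # labelled true
        (2, "true", 0),    # labelled true, isGood not set
        (3, "false", -2),  # labelled false and marked bad
        (4, "false", 0),   # labelled false but not marked bad
        (5, None, 0),      # no label yet
    ],
)

# Same shape as the training query's WHERE clause, without parentheses.
implicit = """
    SELECT id FROM docs
    WHERE label = 'true' OR isGood = -2 AND label = 'false'
    ORDER BY id
"""
# The grouping the database actually applies, written out explicitly.
explicit = """
    SELECT id FROM docs
    WHERE label = 'true' OR (isGood = -2 AND label = 'false')
    ORDER BY id
"""

implicit_ids = [row[0] for row in conn.execute(implicit)]
explicit_ids = [row[0] for row in conn.execute(explicit)]
print(implicit_ids, explicit_ids)
```

Since the training query has a LIMIT but no ORDER BY, it may also be worth counting rows per label to confirm that both classes are among the 50 rows returned.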
SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != 1 AND isGood = 0 ORDER BY score desc LIMIT 0,10
This select gets all of the items that don't have a true or false label.
My output data doesn't have any confidence rating. Should it?
It looks like this:
Thanks!
PS: If someone could add me on Skype or another IM service, I'd be happy to screen share and work on this in real time.
Hi,
If you applied the classification model: yes, your output should contain predictions and confidences. It would be helpful if you posted your process as XML here, so we can check the setup. You get the XML code via the XML tab at the top of the process view in RapidMiner. Just copy the text from there into your next answer, and please use the #-button on top of the input box for that.
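For readers wondering what "predictions and confidences" look like as data, here is a small Python sketch (standard library only) that produces a list of document IDs with a prediction and a confidence for the class "true". It is a simplified multinomial Naive Bayes with add-one smoothing, which is an assumption of this sketch and not the RapidMiner operator (the process in this thread has laplace_correction set to false). All names, IDs and texts are made up.

```python
import math
from collections import Counter

def train(docs, labels):
    """Count documents per class and word occurrences per class."""
    vocab = {w for d in docs for w in d.split()}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    return vocab, class_docs, word_counts

def confidences(doc, model):
    """Return {class: confidence}; the confidences sum to 1 per document."""
    vocab, class_docs, word_counts = model
    total_docs = sum(class_docs.values())
    log_scores = {}
    for c, n_docs in class_docs.items():
        total_words = sum(word_counts[c].values())
        score = math.log(n_docs / total_docs)  # class prior
        for w in doc.split():
            if w in vocab:  # words never seen in training are ignored
                # add-one smoothing (assumption of this sketch)
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores[c] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(unnormalised.values())
    return {c: v / norm for c, v in unnormalised.items()}

# Hypothetical training data and unlabelled documents.
train_docs = ["holiday dates term", "holiday list school",
              "buy cheap pills", "cheap offer buy now"]
train_labels = ["true", "true", "false", "false"]
unlabelled = {101: "school holiday dates", 102: "cheap pills offer"}

model = train(train_docs, train_labels)
results = []
for doc_id, text in unlabelled.items():
    conf = confidences(text, model)
    results.append((doc_id, max(conf, key=conf.get), conf["true"]))

# One line per document, highest confidence for "true" first.
for doc_id, prediction, conf_true in sorted(results, key=lambda r: -r[2]):
    print(f"{doc_id}\t{prediction}\tconfidence(true)={conf_true:.3f}")
```

Note that such a confidence is the model's estimated class probability, not a direct document similarity score.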
Best,
Marius
Here ya go (editable here: http://beta.etherpad.org/p/rapidminer):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.014">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
<process expanded="true" height="369" width="835">
<operator activated="true" class="read_database" compatibility="5.1.014" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
<parameter key="connection" value="slave2"/>
<parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 ORDER BY score desc LIMIT 0,10"/>
<enumeration key="parameters"/>
</operator>
<operator activated="true" class="read_database" compatibility="5.1.014" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="75">
<parameter key="connection" value="slave2"/>
<parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,50"/>
<enumeration key="parameters"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.1.014" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="75">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="data"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="75">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<list key="specify_weights"/>
<process expanded="true" height="480" width="815">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
<parameter key="min_chars" value="2"/>
</operator>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="90"/>
<portSpacing port="sink_document 1" spacing="90"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.1.014" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="data"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
<parameter key="min_chars" value="2"/>
</operator>
<connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_role" compatibility="5.1.014" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="75">
<parameter key="name" value="label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="naive_bayes" compatibility="5.1.014" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="75">
<parameter key="laplace_correction" value="false"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.1.014" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="210">
<list key="application_parameters"/>
</operator>
<connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
<connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="90"/>
</process>
</operator>
</process>
Hi there,
please try this one and let me know if it works:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
<process expanded="true" height="369" width="835">
<operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30">
<parameter key="connection" value="slave2"/>
<parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,50"/>
<enumeration key="parameters"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="data"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<list key="specify_weights"/>
<process expanded="true" height="480" width="815">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
<parameter key="min_chars" value="2"/>
</operator>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="90"/>
<portSpacing port="sink_document 1" spacing="90"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="30">
<parameter key="name" value="label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="naive_bayes" compatibility="5.1.017" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="30">
<parameter key="laplace_correction" value="false"/>
</operator>
<operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
<parameter key="connection" value="slave2"/>
<parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 ORDER BY score desc LIMIT 0,10"/>
<enumeration key="parameters"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="data"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
<parameter key="min_chars" value="2"/>
</operator>
<connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="210">
<list key="application_parameters"/>
</operator>
<connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
<connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="180"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Cheers,
Ingo
Hi,
are you sure that you pressed the green check icon after inserting the XML? (I frequently forget this ;D ) The difference is really small: I just connected the word list output port of the first text processing operator to the word list input port of the second one. This is definitely necessary, since otherwise the resulting example sets would differ and a prediction would not be possible. This should actually also be stated in the log, by the way.
Another thing that comes to mind is that your query delivers an attribute label, which gets the role "label" during training but not during testing. Remove it, or also set its role to label before model application.
If these things are not the reason, I am afraid I would have to look into the data and the transformed data (i.e. the two example sets which are actually delivered to the learner): do they really contain regular attributes? Are those the same for training and testing?
Here is the suggested process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
<process expanded="true" height="369" width="835">
<operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30">
<parameter key="connection" value="slave2"/>
<parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,50"/>
<enumeration key="parameters"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="data"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<list key="specify_weights"/>
<process expanded="true" height="480" width="815">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
<parameter key="min_chars" value="2"/>
</operator>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="90"/>
<portSpacing port="sink_document 1" spacing="90"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="30">
<parameter key="name" value="label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="naive_bayes" compatibility="5.1.017" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="30">
<parameter key="laplace_correction" value="false"/>
</operator>
<operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
<parameter key="connection" value="slave2"/>
<parameter key="query" value="SELECT label, data, isGood, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 ORDER BY score desc LIMIT 0,10"/>
<enumeration key="parameters"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="data"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
<parameter key="min_chars" value="2"/>
</operator>
<connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role (3)" width="90" x="447" y="210">
<parameter key="name" value="label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="120">
<list key="application_parameters"/>
</operator>
<connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
<connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Set Role (3)" to_port="example set input"/>
<connect from_op="Set Role (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="90"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Cheers,
Ingo
Hi,
"I didn't check the green check, however when I did I still get the same output."
Even with the second process, the one with the additional Set Role operator? Weird. Now we do indeed have to inspect the data delivered to the learner and to the Apply Model operator (see questions below).
"If you want I can share my screen via skype and we can make modifications in real time?"
Yeah, sounds like fun, but I am out of office and surfing only via my mobile phone. And usually I charge 200 Euro per hour for this type of consulting (but be assured: we have some junior consultants who are less expensive). But let's face it: although we usually do not work on a per-hour basis, at some point this might indeed be the most time-efficient thing to do if you or others here do not find the reason...
If somebody else has more time and wants to dive deeper into this: the next thing I would check is what is delivered to the learner (see my questions below) and to the Apply Model operator, together with the log messages. If the dimension is really high, maybe another learner would also be more appropriate. Just my 2c.
Cheers,
Ingo
I'm happy to PayPal some cash over. I would expect it is only a 5-minute job, as my task is so simple and I'm probably only missing a checkbox somewhere!
Would anyone be willing to do it as a side job, charging not 200 euros per hour but maybe 20 euros for 5 minutes of your time? Alternatively, I could donate some money to charity or to your favorite open source project.

Hi,
as Ingo said above: please check your data, and also your SQL queries. To me it seems a bit odd that you said that you want to use isGood as label, but are fetching a label column from the database. Next, in your screenshot of the data the columns for label and isGood are almost empty. Please check that you are fetching correct data sets by putting a breakpoint on the Read Database operators.
Best,
Marius
At least it does not look wrong... this is the Naive Bayes model created by the Naive Bayes operator. More interesting would be the distribution table of that model (access it via the radio buttons in the results view). But you are probably more interested in the labelled result set. For that, you have to connect the "lab" output of Apply Model to the result output. Anyway, the process Ingo posted should work.
If you still don't get valid results, check the following again:
- Did you connect the word list output of the Process Documents operator in the training branch to the word list input of Process Documents in the apply branch?
- Did you double-check that you read the correct data from both Read Database operators?
- If you don't use isGood, don't retrieve it from the database.
- Find out why the label attribute is empty after Process Documents, and try to fix it. Is it already empty directly after the Read Database operators?
Best, Marius
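For reference, the first checklist item and the "lab" connection correspond to two connect lines in the process XML. This is only a sketch: the operator names are the ones used in the processes posted in this thread, so adjust them if yours are named differently.

```xml
<!-- share the word list built during training with the application branch -->
<connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
<!-- send the scored examples (not just the model) to the results view -->
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
```

If the first line is missing, the two branches build separate word lists, and the model is applied to word vectors it was not trained on.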
I would guess that you changed your SQL statement and no longer fetch a "data" attribute with the text, and that your text attributes are now called "Title" and "Description". If so, the Nominal to Text operators have to be adapted so that they operate not on "data" but on the two new attributes. If you have only text attributes and the label, you could set "filter type" to "all" and uncheck "include special attributes".
Didn't you get a warning or error in the "Problems" view at the bottom of RapidMiner saying something like "The example set must contain at least one text attribute"?
Best, Marius
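As a sketch of the adaptation described above: if the text really did live in "Title" and "Description", Nominal to Text could be pointed at both with a subset filter. The two attribute names are only the ones guessed at in this post, not confirmed column names.

```xml
<operator activated="true" class="nominal_to_text" compatibility="5.2.000" expanded="true" name="Nominal to Text">
  <!-- "subset" takes a |-separated list of attribute names -->
  <parameter key="attribute_filter_type" value="subset"/>
  <parameter key="attributes" value="Title|Description"/>
</operator>
```

The "all" variant mentioned above only leaves the label untouched if its role has already been set before this operator runs.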
SQL statements only get Data.
Include special attributes not checked.
Didn't get any warnings..
View:

XML is this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Process">
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="386" width="835">
<operator activated="true" class="read_database" compatibility="5.2.000" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="30">
<parameter key="connection" value="slave2"/>
<parameter key="query" value="SELECT label, data, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE school_list_holiday_sources.label = &quot;true&quot; OR isGood = -2 AND school_list_holiday_sources.label = &quot;false&quot; LIMIT 0,100"/>
<enumeration key="parameters"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.2.000" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="data"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<list key="specify_weights"/>
<process expanded="true" height="480" width="815">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="120"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="313" y="120"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="120"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="120"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="715" y="120">
<parameter key="min_chars" value="2"/>
</operator>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="90"/>
<portSpacing port="sink_document 1" spacing="90"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="read_database" compatibility="5.2.000" expanded="true" height="60" name="Read Database" width="90" x="45" y="210">
<parameter key="connection" value="slave2"/>
<parameter key="query" value="SELECT data, school_list_holiday_sources.id FROM school_list_holiday_data INNER JOIN school_list_holiday_sources ON school_list_holiday_data.id=school_list_holiday_sources.id WHERE label != &quot;true&quot; AND isGood = 0 LIMIT 0,100"/>
<enumeration key="parameters"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.2.000" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="data"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.1.004" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" name="Extract Content (2)"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" name="Transform Cases (2)"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" name="Tokenize (2)"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" name="Filter Stopwords (2)"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.1.004" expanded="true" name="Stem (2)"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" name="Filter Tokens (2)">
<parameter key="min_chars" value="2"/>
</operator>
<connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_role" compatibility="5.2.000" expanded="true" height="76" name="Set Role" width="90" x="447" y="30">
<parameter key="name" value="label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="naive_bayes" compatibility="5.2.000" expanded="true" height="76" name="Naive Bayes" width="90" x="581" y="30">
<parameter key="laplace_correction" value="false"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="120">
<list key="application_parameters"/>
</operator>
<connect from_op="Read Database (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
<connect from_op="Read Database" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
<connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Set Role" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model (2)" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="90"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Your current XML does not set the label role on the test set, but it does on the training set.
I refer you to earlier posts in this thread from Ingo and to the help for this operator...
Please pay attention to the fact that the application of models will need the same attributes during application on an ExampleSet that were part of the ExampleSet it was trained on. Some minor changes, like adding attributes, might be possible, but might cause severe calculation errors. Please make sure that the attributes' number, order, type and role are consistent during training and application.
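For comparison, the process Ingo posted earlier in this thread puts a Set Role operator on the application branch, between Process Documents from Data (2) and Apply Model (2). The fragment below is a sketch along those lines. It assumes the second Read Database query is also changed to return a "label" column, because the query posted above does not select one and Set Role would then have nothing to act on.

```xml
<operator activated="true" class="set_role" compatibility="5.2.000" expanded="true" name="Set Role (3)">
  <parameter key="name" value="label"/>
  <parameter key="target_role" value="label"/>
  <list key="set_additional_roles"/>
</operator>
<!-- these two connects replace the direct Process Documents (2) -> Apply Model (2) connection -->
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Set Role (3)" to_port="example set input"/>
<connect from_op="Set Role (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
```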
As far as your variables go, I don't think there is a technical reason why you can't use integers; however, the spread of your values is odd. I would use 1 and 0 (1 is good, 0 is not good) if I were using integers. Someone else will need to say whether there needs to be a numeric to nominal step in there on your label. That is how my job is set up.
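If an integer column such as isGood is used as the label, one way to add such a conversion step is a Numerical to Polynominal operator placed before Set Role in the training branch. This is a sketch under that assumption; the operator name and its position in the process are illustrative and not taken from any process posted in this thread.

```xml
<operator activated="true" class="numerical_to_polynominal" compatibility="5.2.000" expanded="true" name="Numerical to Polynominal">
  <!-- convert only the integer label column; leave the word vector untouched -->
  <parameter key="attribute_filter_type" value="single"/>
  <parameter key="attribute" value="isGood"/>
</operator>
```

After this step the values 1 and -2 are treated as category names rather than as numbers.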
Regarding output, what you need to do is save the output of Apply Model, either to a CSV file or to the repository. Then you can extract the fields you need from it (ID and prediction(yes)).
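A minimal sketch of the CSV route, written as process XML like the rest of this thread. The file name is a placeholder, and the port names "input" and "through" are given from memory of RapidMiner 5, so check them against your own installation.

```xml
<operator activated="true" class="write_csv" compatibility="5.2.000" expanded="true" name="Write CSV">
  <!-- hypothetical output path; change to suit -->
  <parameter key="csv_file" value="scored_documents.csv"/>
</operator>
<!-- route the scored examples through Write CSV on their way to the results view -->
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Write CSV" to_port="input"/>
<connect from_op="Write CSV" from_port="through" to_port="result 1"/>
```

These connects take the place of the direct connection from Apply Model (2) to "result 1". The file should then hold one row per document, with the id next to the prediction and confidence columns that Apply Model adds.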
BTW, I'm one step less of a newbie than you are, so I hope others will jump in and correct both of us. However, I am sure about your reads being backwards, so you should start by fixing that.