Text mining classification with multiple classes
Hi,
I am relatively new to data science and therefore I have some questions:
I’m working on a text mining multi-class classification problem for a study assignment. The aim is to build a model that predicts the ‘score’ attribute of textual product reviews. The possible ‘score’ values (classes) are 1, 2, 3, 4 or 5, so it works like a star rating. My dataset contains 6 features:
- ReviewerID, ReviewerName, ReviewText, Score, Summary and the length of my textual review.
- There are 5000 reviews (rows) in my dataset and a few missing values (in ReviewerName).
- 3000 reviews are 5-star reviews, 1000 are 4-star reviews and the rest are 1-, 2- or 3-star reviews, so the classes are imbalanced.
- I've uploaded the dataset
I have used various classification methods (k-NN, naïve Bayes, logistic regression and SVM), but I cannot get the accuracy of my model above 62%. I don’t know whether this is a good accuracy or not; a random guess would be 20%, but I have the idea that there are things I can do to make the model more accurate. If I try to rebalance the dataset, the accuracy drops to at most 40%.
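For context on the 62% figure: with imbalanced classes, always predicting the most frequent class is usually a more informative reference point than a uniform random guess. A minimal sketch using the class counts from the list above; the 1- to 3-star reviews are kept as one combined bucket because their individual counts are not given, and the function name is only illustrative:

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Counts taken from the dataset description; "1-3" is a combined bucket
# because the individual counts for 1, 2 and 3 stars are not stated.
counts = {"5": 3000, "4": 1000, "1-3": 1000}

baseline = majority_baseline(counts)
print(f"majority-class baseline: {baseline:.0%}")
```

Under these counts the baseline comes out at 60%, so 62% accuracy sits only slightly above a model that answers "5 stars" every time.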
The process is: Read CSV (using quotes) -> Numerical to Polynominal -> Set Role (‘score’ as label) -> Nominal to Text -> Select Attributes (reviewer ID is left out) -> Split Data (70%/30%) -> Process Documents (tokenize, stem, filter stop words, transform cases, generate n-grams (2)) -> Cross Validation (10-fold, k-NN) -> Performance
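The text-processing part of this pipeline can be sketched roughly in plain Python. This is only an approximation under stated assumptions: the stop-word list is a tiny made-up sample, stemming is left out because the standard library has no stemmer, and joining n-grams with an underscore is an assumed convention. Lower-casing is done before the stop-word lookup here so that capitalised words such as "The" match the lower-case list; whether that order matters in RapidMiner depends on how its stop-word filter treats case.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "but", "i", "this"}

def preprocess(text, max_n=2):
    """Tokenize, lower-case, drop stop words, then add n-grams up to max_n."""
    tokens = re.findall(r"[A-Za-z]+", text)              # split on non-letters
    tokens = [t.lower() for t in tokens]                 # transform cases
    tokens = [t for t in tokens if t not in STOP_WORDS]  # filter stop words
    terms = list(tokens)                                 # keep the unigrams
    for n in range(2, max_n + 1):
        # Join each run of n neighbouring tokens into one term.
        terms += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return terms

print(preprocess("The sound is great but the cable broke"))
```

Each review then becomes a bag of such terms, which is what the classifier sees as features.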
I don’t know whether I am missing steps in my process, making mistakes, or whether 62% accuracy is simply the maximum. I hope that someone can help me out or give me tips!
Thanks!
Greetings Marijn
Answers
-
Please post your process XML; use the </> option to paste it in.
1 -
62% is not that bad, especially when using review ratings as the main label.
There are a couple of 'traps' when looking at review ratings. Having some experience myself with Amazon review ratings, here are some of my observations:
Culture plays a role: Not sure how your dataset is balanced, but with European data it is, for instance, very obvious that the further south you go (France, Spain, Portugal etc.), the more likely people are to give a 5 even if they are not perfectly happy, whereas the further north you go (Netherlands, Germany etc.), people tend to consider a 3 already a high score, as perfection doesn't exist. It's a bit of a black-and-white picture, but the differences are clear. A 5 in Spain can be like a 4 in Belgium and a 3 in Germany.
Ambiguity is king: People say feature A is great but feature B sucks, but that's OK since they don't use it anyway, so the score is still high. This happens quite a lot and has an impact on your score, since algorithms tend to rate such a review as neutral because the negative compensates for the positive.
Multi-topic: A bit related to the above: people tend to go through the complete feature list, leading again to 'flat scores'.
How we tackled this: We used the ratings to do a first grouping, combining 4 and 5 (mainly positive), 3 as neutral, and 1 and 2 as negative. This should already give better results than the 5-point scale, since that will never work reliably.
Next we worked in 2 flows: first topic analysis to get rid of all the small talk, then sentiment analysis on the topics per review. Since topics can have different weights, this also has an impact on the overall happiness associated with a review. Simply put, when analysing a headphone review, for instance, the sentiment towards the sound will be more important than the sentiment towards the packaging material.
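The idea of weighting topics can be illustrated with a small sketch. Everything here is hypothetical: the topic names, the sentiment scores in [-1, 1] and the importance weights are made up, and a weighted mean is just one simple way to combine them, not necessarily what was used in practice.

```python
def weighted_sentiment(topic_sentiment, topic_weight):
    """Weighted mean of per-topic sentiment scores in [-1, 1]."""
    total_weight = sum(topic_weight[t] for t in topic_sentiment)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(s * topic_weight[t] for t, s in topic_sentiment.items())
    return weighted_sum / total_weight

# Hypothetical headphone review: loves the sound, dislikes the packaging.
sentiment = {"sound": 0.9, "packaging": -0.6}
weights = {"sound": 3.0, "packaging": 1.0}   # made-up importance weights

print(round(weighted_sentiment(sentiment, weights), 3))
```

With these made-up numbers the weighted score stays clearly positive, whereas a plain average of the two topics would land close to neutral.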
Hope this helps a bit, but the best advice is to start by bringing your 5 labels down to 3.
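Bringing the five ratings down to three classes amounts to a simple lookup. A minimal sketch in Python, assuming integer star ratings; the names and the example ratings are illustrative only:

```python
from collections import Counter

# Grouping described above: 1-2 negative, 3 neutral, 4-5 positive.
STAR_TO_SENTIMENT = {
    1: "negative",
    2: "negative",
    3: "neutral",
    4: "positive",
    5: "positive",
}

def collapse_labels(scores):
    """Map 1-5 star ratings onto three sentiment classes."""
    return [STAR_TO_SENTIMENT[s] for s in scores]

# Made-up ratings, only to show the mapping.
example_scores = [5, 4, 3, 1, 2, 5]
labels = collapse_labels(example_scores)
print(Counter(labels))
```

Note that merging classes also changes the class distribution, so any accuracy on three classes should be compared against the new majority-class share rather than the old one.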
2 -
Hi guys,
Thanks for your replies, they are very helpful! Here is my process XML:
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.6.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
<parameter key="csv_file" value="C:\Users\marijn.nieboer\Desktop\Studie\BPMIT\Data analytics\Opdrachten\Task 2\D3\AmazonSampleForStudent.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="reviewerID.true.polynominal.attribute"/>
<parameter key="1" value="reviewerName.true.polynominal.attribute"/>
<parameter key="2" value="reviewText.true.polynominal.attribute"/>
<parameter key="3" value="score.true.numeric.attribute"/>
<parameter key="4" value="summary.true.polynominal.attribute"/>
<parameter key="5" value="len_text.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="numerical_to_polynominal" compatibility="7.6.003" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value="sentiment"/>
<parameter key="attributes" value="len_text|score"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.003" expanded="true" height="82" name="Set Role" width="90" x="246" y="187">
<parameter key="attribute_name" value="score"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.6.003" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="340">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value="sentence"/>
<parameter key="attributes" value="reviewText|summary|reviewerName"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.6.003" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="493">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="score|reviewText|reviewerName|summary|reviewerID"/>
</operator>
<operator activated="true" class="split_data" compatibility="7.6.003" expanded="true" height="103" name="Split Data" width="90" x="380" y="493">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
<parameter key="sampling_type" value="shuffled sampling"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="340">
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_below_percent" value="1.0"/>
<parameter key="prune_above_percent" value="90.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="136"/>
<operator activated="true" class="text:stem_porter" compatibility="7.5.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="179" y="289"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="238"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="238"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="581" y="187"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
<description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="84" y="17">Split words</description>
<description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="445" y="344">Remove Stop Words and put everything to lower-case</description>
<description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="534" y="75">n-grams transformation</description>
</process>
</operator>
<operator activated="false" class="subprocess" compatibility="7.6.003" expanded="true" height="82" name="outlier subproces" width="90" x="514" y="187">
<process expanded="true">
<operator activated="true" class="detect_outlier_distances" compatibility="7.6.003" expanded="true" height="82" name="Detect Outlier (Distances)" width="90" x="45" y="34">
<parameter key="number_of_outliers" value="20"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="7.6.003" expanded="true" height="103" name="Filter Examples" width="90" x="45" y="136">
<list key="filters_list">
<parameter key="filters_entry_key" value="outlier.equals.false"/>
</list>
</operator>
<connect from_port="in 1" to_op="Detect Outlier (Distances)" to_port="example set input"/>
<connect from_op="Detect Outlier (Distances)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="648" y="493">
<parameter key="prune_method" value="percentual"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="136"/>
<operator activated="true" class="text:stem_porter" compatibility="7.5.000" expanded="true" height="68" name="Stem (2)" width="90" x="246" y="136"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="447" y="136"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="581" y="136"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (2)" width="90" x="715" y="136"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
<connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="7.6.003" expanded="true" height="145" name="Cross Validation" width="90" x="648" y="34">
<process expanded="true">
<operator activated="true" class="k_nn" compatibility="7.6.003" expanded="true" height="82" name="k-NN" width="90" x="179" y="34">
<parameter key="k" value="15"/>
</operator>
<operator activated="false" class="naive_bayes" compatibility="7.6.003" expanded="true" height="82" name="Naive Bayes" width="90" x="179" y="238"/>
<connect from_port="training set" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="7.6.003" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.003" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="7.6.003" expanded="true" height="82" name="Apply Model (2)" width="90" x="715" y="340">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.003" expanded="true" height="82" name="Performance (2)" width="90" x="782" y="187">
<list key="class_weights"/>
</operator>
<operator activated="false" class="subprocess" compatibility="7.6.003" expanded="true" height="68" name="Missing values" width="90" x="45" y="238">
<process expanded="true">
<operator activated="true" class="declare_missing_value" compatibility="7.6.003" expanded="true" height="82" name="Declare Missing Value" width="90" x="45" y="34"/>
<operator activated="true" class="filter_examples" compatibility="7.6.003" expanded="true" height="103" name="Filter Examples (2)" width="90" x="45" y="136">
<parameter key="invert_filter" value="true"/>
<list key="filters_list">
<parameter key="filters_entry_key" value="reviewerName.equals.?"/>
</list>
</operator>
<connect from_op="Declare Missing Value" from_port="example set output" to_op="Filter Examples (2)" to_port="example set input"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/>
<connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Cross Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 1"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="result 3"/>
<connect from_op="Performance (2)" from_port="example set" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<description align="center" color="yellow" colored="false" height="77" resized="true" width="187" x="356" y="17">The labels are numerical here, this is why I am doing Numerical to Polynomial</description>
<description align="center" color="yellow" colored="false" height="59" resized="true" width="109" x="358" y="209">'score' is set to be the label</description>
<description align="center" color="yellow" colored="false" height="85" resized="true" width="140" x="86" y="347">I need to transform the cell with the text from nominal to text.</description>
<description align="center" color="yellow" colored="false" height="50" resized="true" width="122" x="376" y="604">SPLIT Training and Testing</description>
<description align="center" color="yellow" colored="false" height="66" resized="true" width="211" x="10" y="163">Removing rows with missing values (reviewerName) does not improve accuracy</description>
<description align="center" color="yellow" colored="false" height="63" resized="true" width="154" x="464" y="120">Removing outliers has negative impact on accuracy</description>
</process>
</operator>
</process>
One more question: which operator can I use to reduce the number of classes (1, 2, 3, 4 and 5) to 3 classes, where:
- 1 and 2 are 'Negative'
- 3 is 'Neutral'
- 4 and 5 are 'Positive'
Greetings Marijn
0 -
Hi @marijn_nbr,
You can use the Discretize (Discretize by User Specification) operator to reduce the number of classes of your label from 5 to 3.
Here is the process with this new operator inserted:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Amazon_Classification\AmazonSampleForStudent.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="reviewerID.true.polynominal.attribute"/>
<parameter key="1" value="reviewerName.true.polynominal.attribute"/>
<parameter key="2" value="reviewText.true.polynominal.attribute"/>
<parameter key="3" value="score.true.integer.attribute"/>
<parameter key="4" value="summary.true.polynominal.attribute"/>
<parameter key="5" value="len_text.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="numerical_to_polynominal" compatibility="8.0.001" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value="sentiment"/>
<parameter key="attributes" value="len_text"/>
</operator>
<operator activated="true" class="discretize_by_user_specification" compatibility="8.0.001" expanded="true" height="103" name="Discretize" width="90" x="246" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="score"/>
<list key="classes">
<parameter key="negative" value="2.5"/>
<parameter key="neutral" value="3.5"/>
<parameter key="positive" value="5.5"/>
</list>
</operator>
<operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="246" y="289">
<parameter key="attribute_name" value="score"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="391">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value="sentence"/>
<parameter key="attributes" value="reviewText|summary|reviewerName"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="544">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="score|reviewText|reviewerName|summary|reviewerID"/>
</operator>
<operator activated="true" class="split_data" compatibility="8.0.001" expanded="true" height="103" name="Split Data" width="90" x="380" y="493">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
<parameter key="sampling_type" value="shuffled sampling"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="340">
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_below_percent" value="1.0"/>
<parameter key="prune_above_percent" value="90.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="136"/>
<operator activated="true" class="text:stem_porter" compatibility="7.5.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="179" y="289"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="238"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="238"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="581" y="187"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
<description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="84" y="17">Split words</description>
<description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="445" y="344">Remove Stop Words and put everything to lower-case</description>
<description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="534" y="75">n-grams transformation</description>
</process>
</operator>
<operator activated="false" class="subprocess" compatibility="8.0.001" expanded="true" height="82" name="outlier subproces" width="90" x="514" y="187">
<process expanded="true">
<operator activated="true" class="detect_outlier_distances" compatibility="8.0.001" expanded="true" height="82" name="Detect Outlier (Distances)" width="90" x="45" y="34">
<parameter key="number_of_outliers" value="20"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.0.001" expanded="true" height="103" name="Filter Examples" width="90" x="45" y="136">
<list key="filters_list">
<parameter key="filters_entry_key" value="outlier.equals.false"/>
</list>
</operator>
<connect from_port="in 1" to_op="Detect Outlier (Distances)" to_port="example set input"/>
<connect from_op="Detect Outlier (Distances)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="648" y="493">
<parameter key="prune_method" value="percentual"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="136"/>
<operator activated="true" class="text:stem_porter" compatibility="7.5.000" expanded="true" height="68" name="Stem (2)" width="90" x="246" y="136"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="447" y="136"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="581" y="136"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (2)" width="90" x="715" y="136"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
<connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="8.0.001" expanded="true" height="145" name="Cross Validation" width="90" x="648" y="34">
<process expanded="true">
<operator activated="true" class="k_nn" compatibility="8.0.001" expanded="true" height="82" name="k-NN" width="90" x="179" y="34">
<parameter key="k" value="15"/>
</operator>
<operator activated="false" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="179" y="238"/>
<connect from_port="training set" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="715" y="340">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="8.0.001" expanded="true" height="82" name="Performance (2)" width="90" x="782" y="187">
<list key="class_weights"/>
</operator>
<operator activated="false" class="subprocess" compatibility="8.0.001" expanded="true" height="68" name="Missing values" width="90" x="45" y="238">
<process expanded="true">
<operator activated="true" class="declare_missing_value" compatibility="8.0.001" expanded="true" height="82" name="Declare Missing Value" width="90" x="45" y="34"/>
<operator activated="true" class="filter_examples" compatibility="8.0.001" expanded="true" height="103" name="Filter Examples (2)" width="90" x="45" y="136">
<parameter key="invert_filter" value="true"/>
<list key="filters_list">
<parameter key="filters_entry_key" value="reviewerName.equals.?"/>
</list>
</operator>
<connect from_op="Declare Missing Value" from_port="example set output" to_op="Filter Examples (2)" to_port="example set input"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/>
<connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Discretize" to_port="example set input"/>
<connect from_op="Discretize" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Cross Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 1"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="result 3"/>
<connect from_op="Performance (2)" from_port="example set" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<description align="center" color="yellow" colored="false" height="77" resized="true" width="187" x="356" y="17">The labels are numerical here, this is why I am doing Numerical to Polynomial</description>
<description align="center" color="yellow" colored="false" height="59" resized="true" width="109" x="358" y="209">'score' is set to be the label</description>
<description align="center" color="yellow" colored="false" height="85" resized="true" width="140" x="86" y="347">I need to transform the cell with the text from nominal to text.</description>
<description align="center" color="yellow" colored="false" height="50" resized="true" width="122" x="376" y="604">SPLIT Training and Testing</description>
<description align="center" color="yellow" colored="false" height="66" resized="true" width="211" x="10" y="163">Removing rows with missing values (reviewerName) does not improve accuracy</description>
<description align="center" color="yellow" colored="false" height="63" resized="true" width="154" x="464" y="120">Removing outliers has negative impact on accuracy</description>
</process>
</operator>
</process>
As predicted by @kayman, the accuracy of your model is significantly better with this transformation.
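The Discretize step in the process above assigns each score to the first class whose upper limit it does not exceed (2.5, 3.5 and 5.5). The same binning can be sketched in Python as follows; this is an illustration of the thresholds only, and how the operator itself treats a value lying exactly on a limit is not something this sketch tries to reproduce:

```python
from bisect import bisect_left

# Upper limits and class names copied from the Discretize operator above.
UPPER_LIMITS = [2.5, 3.5, 5.5]
CLASS_NAMES = ["negative", "neutral", "positive"]

def discretize(score):
    """Return the first class whose upper limit is >= score."""
    index = bisect_left(UPPER_LIMITS, score)
    if index == len(UPPER_LIMITS):
        raise ValueError(f"score {score} is above the highest limit")
    return CLASS_NAMES[index]

print([discretize(s) for s in (1, 2, 3, 4, 5)])
```

Because the scores are integers from 1 to 5, none of them falls on a limit, so 1 and 2 become negative, 3 becomes neutral, and 4 and 5 become positive.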
Regards,
Lionel