Predicting whether a product is a beverage or not using a csv

User: "luiz_vidal"
New Altair Community Member
Updated by Jocelyn

Hi guys,

I am quite new to RapidMiner and here is my "problem".

I have a CSV file with 2 columns (Desc, the description, and Bebidas, 0 or 1), and I want to build a process that predicts from the description whether a product is a beverage ("bebida" in Portuguese). This is what I have so far:

[Image: process.JPG, "My process"]

After this transformation I add a Random Forest algorithm, but somehow I'm not able to tell which column is the prediction column. I also tried Naive Bayes. The choice of algorithm itself isn't the issue: after processing the documents, I would like a way to transform them back into data so that I can use it for the prediction. Can someone help me do this the right way? I'm kind of stuck. Thanks in advance.

The XML of my process follows:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Bebidas_100" width="90" x="45" y="34">
<parameter key="repository_entry" value="../../Workbooks/Bebidas_100"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="8.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="85">
<list key="function_descriptions">
<parameter key="Description" value="lower(Desc)"/>
<parameter key="É Bebida" value="if(Bebida==0,&quot;Não&quot;,&quot;Sim&quot;)"/>
</list>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.0.001" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="136">
<list key="filters_list">
<parameter key="filters_entry_key" value="Bebida.is_not_missing."/>
</list>
</operator>
<operator activated="true" class="replace" compatibility="8.0.001" expanded="true" height="82" name="Replace" width="90" x="447" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Description"/>
<parameter key="attributes" value="Description|É Bebida"/>
<parameter key="regular_expression" value="[a-z]"/>
<parameter key="replace_what" value="[-!0-9&quot;#$%&amp;'()*+,./:;&lt;=&gt;?@\[\\\]_`{|}~]"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Description"/>
<parameter key="attributes" value="Description|É Bebida"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="715" y="136">
<list key="specify_weights"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="916" y="34">
<parameter key="keep_text" value="true"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="514" y="34">
<parameter key="file" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\stopwords.txt"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve Bebidas_100" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

    User: "Thomas_Ott"
    New Altair Community Member
    Accepted Answer

    Your process is not quite what I'm used to seeing for text processing in RapidMiner. I don't understand what the Replace operator is doing. Is it supposed to help the tokenization? If so, you can set the Tokenize operator's mode to 'specify characters' and paste the characters in there.
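
To make the point about Replace versus tokenization concrete, here is a minimal sketch in Python rather than RapidMiner. It assumes the Replace operator deletes every match (no replacement text appears in the XML) and that the tokenizer splits on every non-letter character; the function names and the sample description are made up for illustration.

```python
import re

# Character class copied from the Replace operator's replace_what parameter.
PUNCT_AND_DIGITS = re.compile(r"""[-!0-9"#$%&'()*+,./:;<=>?@\[\\\]_`{|}~]""")


def strip_then_split(text):
    """Assumed Replace behavior: delete matches, then split on whitespace."""
    return PUNCT_AND_DIGITS.sub("", text).split()


def tokenize_non_letters(text):
    """Assumed tokenizer behavior: every run of letters is one token."""
    # [^\W\d_] matches Unicode letters, so accented characters survive.
    return re.findall(r"[^\W\d_]+", text)


sample = "coca-cola 2l (pet)"  # hypothetical product description
print(strip_then_split(sample))      # deleting "-" glues "coca" and "cola" together
print(tokenize_non_letters(sample))  # splitting on "-" keeps them as two tokens
```

Under these assumptions the two approaches differ mainly in how they treat characters inside a word: deleting them merges the neighbours, while tokenizing on them splits the word. Either way, digits and punctuation do not reach the word list, which is why the separate Replace step may be unnecessary.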

     

    Rearranging it, I would do something like this:

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.003" expanded="true" height="68" name="Retrieve Bebidas_100" width="90" x="45" y="34">
    <parameter key="repository_entry" value="../../Workbooks/Bebidas_100"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.6.003" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Bebida.is_not_missing."/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.6.003" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34">
    <list key="function_descriptions">
    <parameter key="Description" value="lower(Desc)"/>
    <parameter key="É Bebida" value="if(Bebida==0,&quot;Não&quot;,&quot;Sim&quot;)"/>
    </list>
    </operator>
    <operator activated="true" class="replace" compatibility="7.6.003" expanded="true" height="82" name="Replace" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Description"/>
    <parameter key="attributes" value="Description|É Bebida"/>
    <parameter key="regular_expression" value="[a-z]"/>
    <parameter key="replace_what" value="[-!0-9&quot;#$%&amp;'()*+,./:;&lt;=&gt;?@\[\\\]_`{|}~]"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.6.003" expanded="true" height="82" name="Nominal to Text" width="90" x="715" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Description"/>
    <parameter key="attributes" value="Description|É Bebida"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="849" y="34">
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
    <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="514" y="34">
    <parameter key="file" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\stopwords.txt"/>
    </operator>
    <operator activated="false" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="85">
    <description align="center" color="transparent" colored="false" width="126">You can save yourself one Generate Attributes entry by using this operator to lower the case of your text</description>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
    <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.003" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
    <parameter key="attribute_name" value="É Bebida"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.003" expanded="true" height="145" name="Validation" width="90" x="983" y="34">
    <parameter key="sampling_type" value="stratified sampling"/>
    <process expanded="true">
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.003" expanded="true" height="82" name="Decision Tree" width="90" x="45" y="34"/>
    <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    <description align="left" color="green" colored="true" height="80" resized="true" width="248" x="37" y="137">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.003" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance" compatibility="7.6.003" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <connect from_op="Performance" from_port="example set" to_port="test set results"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    <description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).&lt;br/&gt;The performance is evaluated and sent to the operator results.</description>
    </process>
    <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
    </operator>
    <connect from_op="Retrieve Bebidas_100" from_port="output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
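    <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Replace" to_port="example set input"/>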
    <connect from_op="Replace" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Validation" to_port="example set"/>
    <connect from_op="Validation" from_port="model" to_port="result 2"/>
    <connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
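
As a rough illustration of how text becomes "data again" in the process above, the following Python sketch builds one row per description, with one column per token and the label carried along. It is not RapidMiner code: it uses plain term counts (an assumption; the operator's vector creation setting may differ), and the sample rows and stopwords are invented.

```python
import re
from collections import Counter


def tokenize(text, stopwords=frozenset()):
    """Lowercase, split on non-letters, and drop stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in stopwords]


def to_word_vectors(rows, stopwords=frozenset()):
    """Turn (description, label) pairs into a header plus one numeric row each."""
    counted = [(Counter(tokenize(desc, stopwords)), label) for desc, label in rows]
    vocabulary = sorted({tok for counts, _ in counted for tok in counts})
    table = [[counts[tok] for tok in vocabulary] + [label] for counts, label in counted]
    return vocabulary + ["label"], table


# Made-up rows for illustration only; the real data lives in Bebidas_100.
rows = [("Suco de Laranja 1L", "Sim"), ("Sabão em pó 1kg", "Não")]
header, table = to_word_vectors(rows, stopwords={"de", "em"})
print(header)
for row in table:
    print(row)
```

The result is an ordinary table, so any learner that accepts numeric attributes and a label column can be trained on it.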

    Of course, swap out the decision tree for whichever algorithm you want. This process passes the label (É Bebida, derived from the 0/1 Bebida column) through the text processing and then trains on the result inside a Cross Validation. You might get horrible accuracy on the first pass, but adjusting the pruning, trying other algorithms, and running parameter optimization all help.
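
For readers who want to see the validation idea outside RapidMiner, here is a small Python sketch of stratified k-fold cross-validation. It is a simplification under stated assumptions: folds are assigned round-robin instead of randomly, the learner is a toy "token vote" classifier standing in for the decision tree, and the tokenized documents are invented.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Spread the row indices of each class evenly over k folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_token_votes(docs, labels):
    """Toy learner: remember how often each token appears with each label."""
    votes = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for token in set(tokens):
            votes[token][label] += 1
    fallback = Counter(labels).most_common(1)[0][0]
    return votes, fallback


def predict_token_votes(model, tokens):
    """Sum the label counts of known tokens; use the majority class if none match."""
    votes, fallback = model
    tally = Counter()
    for token in tokens:
        tally.update(votes.get(token, {}))
    return tally.most_common(1)[0][0] if tally else fallback


def cross_validate(docs, labels, k, train, predict):
    """Train on k-1 folds, test on the held-out fold, and average the accuracy."""
    accuracies = []
    for test_idx in stratified_folds(labels, k):
        if not test_idx:
            continue  # skip empty folds (k larger than the data allows)
        held_out = set(test_idx)
        train_idx = [i for i in range(len(docs)) if i not in held_out]
        model = train([docs[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Invented, already tokenized descriptions; four per class.
docs = [["suco", "uva"], ["suco", "laranja"], ["suco", "manga"], ["suco", "caju"],
        ["sabão", "pó"], ["sabão", "barra"], ["sabão", "coco"], ["sabão", "neutro"]]
labels = ["Sim"] * 4 + ["Não"] * 4
print(cross_validate(docs, labels, 2, train_token_votes, predict_token_votes))
```

Swapping the learner means passing a different pair of train and predict functions, which mirrors replacing the Decision Tree operator inside the Cross Validation.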