🎉Community Raffle - Win $25

An exclusive raffle opportunity for active members like you! Complete your profile, answer questions and get your first accepted badge to enter the raffle.
Join and Win

association rules data preparation help needed

User: "piernik"
New Altair Community Member
Updated by Jocelyn

Hi all, 

I am new to RapidMiner and need some help. I am working on basket analysis and found this dataset 

http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv

but I struggle with data preparation. The attributes should be constructed from all possible item titles and each transaction should have either true or false in the right column. E.g I need to convert this:

1 milk pastry
2 milk sausage

Into this:

  milk pastry sausage
1 TRUE TRUE FALSE
2 TRUE FALSE TRUE

 

I will appreciate any help.

Regards

Rob

Sort by:
1 - 5 of 51
    User: "lionelderkrikor"
    New Altair Community Member

    Hi @piernik

     

    You had to add the attribute name of your list of grocery in the .CSV file (I named it "Att1" / See attached file)

     

    and here the process for the data preparation : 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\groceries.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Att1.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="nominal_to_binominal" compatibility="8.1.003" expanded="true" height="103" name="Nominal to Binominal" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Att1"/>
    <parameter key="transform_binominal" value="true"/>
    <parameter key="use_underscore_in_name" value="true"/>
    </operator>
    <operator activated="false" class="rename_by_replacing" compatibility="8.1.003" expanded="true" height="82" name="Rename by Replacing" width="90" x="447" y="34">
    <parameter key="replace_what" value=".+\.Att1_($|\n)"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
    <connect from_op="Nominal to Binominal" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    I hope it helps,

     

    Regards,

     

    Lionel 

    User: "lionelderkrikor"
    New Altair Community Member

    Hi again @piernik

     

    Here the process with the renamed attributes (ex : Att1_butter --> butter) : 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\groceries.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Att1.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="nominal_to_binominal" compatibility="8.1.003" expanded="true" height="103" name="Nominal to Binominal" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Att1"/>
    <parameter key="transform_binominal" value="true"/>
    <parameter key="use_underscore_in_name" value="true"/>
    </operator>
    <operator activated="true" class="rename_by_replacing" compatibility="8.1.003" expanded="true" height="82" name="Rename by Replacing" width="90" x="514" y="34">
    <parameter key="replace_what" value="\bAtt1_"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
    <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
    <connect from_op="Rename by Replacing" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Regards,

     

    Lionel

    User: "piernik"
    New Altair Community Member
    OP

    Hi Lionel,

     

    Many thanks for the solution. Your idea of renaming column is pretty smart! I did not know this operator.

    The solution works well if there is only one attribute. The source .csv file contains multiple transaction lines. I noticed that RapidMiner processes only the first one. For example, some transactions may look like this:

    1. milk, egg, sausage

    2. butter

    3. egg, sugar, cake, water

    When I read the csv file, RapidMiner sees only the first attribute:

    1. milk

    2. butter

    3. egg

    But this is not an issue. I have reformatted the file as a proper comma delimited and added the columns headers 'PurchaseLine01'... (see file attached) Now RapidMiner sees all the attributes.

     

    The problem is how to create a column for each item from all PurchaseLines. So for the example, there are three transactions below, where some customers purchased two, three or four products:

    PurchaseLine01 PurchaseLine02 PurchaseLine03 PurchaseLine04
    milk egg sausage  
    butter  water    
    egg sugar cake water

     

    The output should be:

      milk egg sausage butter sugar cake water
    1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE
    2 FALSE FALSE FALSE TRUE FALSE FALSE TRUE
    3 FALSE TRUE FALSE FALSE TRUE TRUE TRUE

    I think it could be something along these lines:

    1. Create a list of all distinctive items from all transactions and purchase lines

    2. Using the above list, create a column for each product

    3. Map items with TRUE/FALSE depending on what was purchased in a given transaction

     

    I hope this makes sense.

    Many thanks

    Rob

     

     

    User: "lionelderkrikor"
    New Altair Community Member

    Hi @piernik,

     

    Unfortunately, I failed to achieve the transformation of your dataset with RapidMiner ' s native operators.

    So, I used a Python script : 

    Data_Preparation.png

    To execute this process, you have to install the Python environment on your computer and to install

    the Execute Python operator (from the MarketPlace).

    The process : 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="8.1.003" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\Purchases.xlsx"/>
    <parameter key="imported_cell_range" value="A1:D4"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="PurchaseLine01.true.polynominal.attribute"/>
    <parameter key="1" value="PurchaseLine02.true.polynominal.attribute"/>
    <parameter key="2" value="PurchaseLine03.true.polynominal.attribute"/>
    <parameter key="3" value="PurchaseLine04.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="246" y="34">
    <parameter key="script" value="import pandas as pd&#10;import numpy as np&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;&#10;def rm_main(data):&#10;&#10; def get_series(string): &#10; return (data == string).T.any() &#10;&#10; cols = np.unique(data.stack().values).tolist() &#10; data_2 = pd.DataFrame(columns=cols, index=range(len(data))) &#10; &#10; for col in cols: &#10; data_2[col] = get_series(col)&#10;&#10; &#10; return data_2"/>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    I hope it helps,

     

    Regards,

     

    Lionel

     

    NB : My intimate conviction is that this dataset transformation is possible with RapidMiner (without scripts), so if someone has an idea, I'll be curious to know it.

     

    User: "383742481"
    New Altair Community Member
    o,can you solve this problem now? Please tell me how to do this?thank U