association rules data preparation help needed

piernik
piernik New Altair Community Member
edited November 5 in Community Q&A

Hi all, 

I am new to RapidMiner and need some help. I am working on basket analysis and found this dataset 

http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv

but I struggle with data preparation. The attributes should be constructed from all possible item titles and each transaction should have either true or false in the right column. E.g I need to convert this:

1 milk pastry
2 milk sausage

Into this:

  milk pastry sausage
1 TRUE TRUE FALSE
2 TRUE FALSE TRUE

 

I will appreciate any help.

Regards

Rob

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @piernik

     

    You had to add the attribute name of your list of grocery in the .CSV file (I named it "Att1" / See attached file)

     

    and here the process for the data preparation : 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\groceries.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Att1.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="nominal_to_binominal" compatibility="8.1.003" expanded="true" height="103" name="Nominal to Binominal" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Att1"/>
    <parameter key="transform_binominal" value="true"/>
    <parameter key="use_underscore_in_name" value="true"/>
    </operator>
    <operator activated="false" class="rename_by_replacing" compatibility="8.1.003" expanded="true" height="82" name="Rename by Replacing" width="90" x="447" y="34">
    <parameter key="replace_what" value=".+\.Att1_($|\n)"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
    <connect from_op="Nominal to Binominal" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    I hope it helps,

     

    Regards,

     

    Lionel 

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi again @piernik

     

    Here the process with the renamed attributes (ex : Att1_butter --> butter) : 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\groceries.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Att1.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="nominal_to_binominal" compatibility="8.1.003" expanded="true" height="103" name="Nominal to Binominal" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Att1"/>
    <parameter key="transform_binominal" value="true"/>
    <parameter key="use_underscore_in_name" value="true"/>
    </operator>
    <operator activated="true" class="rename_by_replacing" compatibility="8.1.003" expanded="true" height="82" name="Rename by Replacing" width="90" x="514" y="34">
    <parameter key="replace_what" value="\bAtt1_"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
    <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
    <connect from_op="Rename by Replacing" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Regards,

     

    Lionel

  • piernik
    piernik New Altair Community Member

    Hi Lionel,

     

    Many thanks for the solution. Your idea of renaming column is pretty smart! I did not know this operator.

    The solution works well if there is only one attribute. The source .csv file contains multiple transaction lines. I noticed that RapidMiner processes only the first one. For example, some transactions may look like this:

    1. milk, egg, sausage

    2. butter

    3. egg, sugar, cake, water

    When I read the csv file, RapidMiner sees only the first attribute:

    1. milk

    2. butter

    3. egg

    But this is not an issue. I have reformatted the file as a proper comma delimited and added the columns headers 'PurchaseLine01'... (see file attached) Now RapidMiner sees all the attributes.

     

    The problem is how to create a column for each item from all PurchaseLines. So for the example, there are three transactions below, where some customers purchased two, three or four products:

    PurchaseLine01 PurchaseLine02 PurchaseLine03 PurchaseLine04
    milk egg sausage  
    butter  water    
    egg sugar cake water

     

    The output should be:

      milk egg sausage butter sugar cake water
    1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE
    2 FALSE FALSE FALSE TRUE FALSE FALSE TRUE
    3 FALSE TRUE FALSE FALSE TRUE TRUE TRUE

    I think it could be something along these lines:

    1. Create a list of all distinctive items from all transactions and purchase lines

    2. Using the above list, create a column for each product

    3. Map items with TRUE/FALSE depending on what was purchased in a given transaction

     

    I hope this makes sense.

    Many thanks

    Rob

     

     

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @piernik,

     

    Unfortunately, I failed to achieve the transformation of your dataset with RapidMiner ' s native operators.

    So, I used a Python script : 

    Data_Preparation.png

    To execute this process, you have to install the Python environment on your computer and to install

    the Execute Python operator (from the MarketPlace).

    The process : 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="8.1.003" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\Purchases.xlsx"/>
    <parameter key="imported_cell_range" value="A1:D4"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="PurchaseLine01.true.polynominal.attribute"/>
    <parameter key="1" value="PurchaseLine02.true.polynominal.attribute"/>
    <parameter key="2" value="PurchaseLine03.true.polynominal.attribute"/>
    <parameter key="3" value="PurchaseLine04.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="246" y="34">
    <parameter key="script" value="import pandas as pd&#10;import numpy as np&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;&#10;def rm_main(data):&#10;&#10; def get_series(string): &#10; return (data == string).T.any() &#10;&#10; cols = np.unique(data.stack().values).tolist() &#10; data_2 = pd.DataFrame(columns=cols, index=range(len(data))) &#10; &#10; for col in cols: &#10; data_2[col] = get_series(col)&#10;&#10; &#10; return data_2"/>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    I hope it helps,

     

    Regards,

     

    Lionel

     

    NB : My intimate conviction is that this dataset transformation is possible with RapidMiner (without scripts), so if someone has an idea, I'll be curious to know it.

     

  • 383742481
    383742481 New Altair Community Member
    o,can you solve this problem now? Please tell me how to do this?thank U