association rules data preparation help needed
Hi all,
I am new to RapidMiner and need some help. I am working on basket analysis and found this dataset
http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv
but I struggle with data preparation. The attributes should be constructed from all possible item titles and each transaction should have either true or false in the right column. E.g I need to convert this:
1 | milk | pastry |
2 | milk | sausage |
Into this:
milk | pastry | sausage | |
1 | TRUE | TRUE | FALSE |
2 | TRUE | FALSE | TRUE |
I will appreciate any help.
Regards
Rob
Answers
-
Hi @piernik
You had to add the attribute name of your list of grocery in the .CSV file (I named it "Att1" / See attached file)
and here the process for the data preparation :
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\groceries.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Att1.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="nominal_to_binominal" compatibility="8.1.003" expanded="true" height="103" name="Nominal to Binominal" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Att1"/>
<parameter key="transform_binominal" value="true"/>
<parameter key="use_underscore_in_name" value="true"/>
</operator>
<operator activated="false" class="rename_by_replacing" compatibility="8.1.003" expanded="true" height="82" name="Rename by Replacing" width="90" x="447" y="34">
<parameter key="replace_what" value=".+\.Att1_($|\n)"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
<connect from_op="Nominal to Binominal" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>I hope it helps,
Regards,
Lionel
0 -
Hi again @piernik
Here the process with the renamed attributes (ex : Att1_butter --> butter) :
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\groceries.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Att1.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="nominal_to_binominal" compatibility="8.1.003" expanded="true" height="103" name="Nominal to Binominal" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Att1"/>
<parameter key="transform_binominal" value="true"/>
<parameter key="use_underscore_in_name" value="true"/>
</operator>
<operator activated="true" class="rename_by_replacing" compatibility="8.1.003" expanded="true" height="82" name="Rename by Replacing" width="90" x="514" y="34">
<parameter key="replace_what" value="\bAtt1_"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
<connect from_op="Nominal to Binominal" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
<connect from_op="Rename by Replacing" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>Regards,
Lionel
0 -
Hi Lionel,
Many thanks for the solution. Your idea of renaming column is pretty smart! I did not know this operator.
The solution works well if there is only one attribute. The source .csv file contains multiple transaction lines. I noticed that RapidMiner processes only the first one. For example, some transactions may look like this:
1. milk, egg, sausage
2. butter
3. egg, sugar, cake, water
When I read the csv file, RapidMiner sees only the first attribute:
1. milk
2. butter
3. egg
But this is not an issue. I have reformatted the file as a proper comma delimited and added the columns headers 'PurchaseLine01'... (see file attached) Now RapidMiner sees all the attributes.
The problem is how to create a column for each item from all PurchaseLines. So for the example, there are three transactions below, where some customers purchased two, three or four products:
PurchaseLine01 PurchaseLine02 PurchaseLine03 PurchaseLine04 milk egg sausage butter water egg sugar cake water The output should be:
milk egg sausage butter sugar cake water 1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE 2 FALSE FALSE FALSE TRUE FALSE FALSE TRUE 3 FALSE TRUE FALSE FALSE TRUE TRUE TRUE I think it could be something along these lines:
1. Create a list of all distinctive items from all transactions and purchase lines
2. Using the above list, create a column for each product
3. Map items with TRUE/FALSE depending on what was purchased in a given transaction
I hope this makes sense.
Many thanks
Rob
0 -
Hi @piernik,
Unfortunately, I failed to achieve the transformation of your dataset with RapidMiner ' s native operators.
So, I used a Python script :
To execute this process, you have to install the Python environment on your computer and to install
the Execute Python operator (from the MarketPlace).
The process :
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.1.003" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\Purchases.xlsx"/>
<parameter key="imported_cell_range" value="A1:D4"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="PurchaseLine01.true.polynominal.attribute"/>
<parameter key="1" value="PurchaseLine02.true.polynominal.attribute"/>
<parameter key="2" value="PurchaseLine03.true.polynominal.attribute"/>
<parameter key="3" value="PurchaseLine04.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="246" y="34">
<parameter key="script" value="import pandas as pd import numpy as np # rm_main is a mandatory function, # the number of arguments has to be the number of input ports (can be none) def rm_main(data): def get_series(string): return (data == string).T.any() cols = np.unique(data.stack().values).tolist() data_2 = pd.DataFrame(columns=cols, index=range(len(data))) for col in cols: data_2[col] = get_series(col) return data_2"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>I hope it helps,
Regards,
Lionel
NB : My intimate conviction is that this dataset transformation is possible with RapidMiner (without scripts), so if someone has an idea, I'll be curious to know it.
0 -
o,can you solve this problem now? Please tell me how to do this?thank U0