association rules data preparation help needed

New Altair Community Member

Apr 30, 2018

Updated Nov 5, 2024 by Jocelyn

Hi all,

I am new to RapidMiner and need some help. I am working on basket analysis and found this dataset

http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv

but I struggle with data preparation. The attributes should be constructed from all possible item titles and each transaction should have either true or false in the right column. E.g I need to convert this:

1	milk	pastry
2	milk	sausage

Into this:

	milk	pastry	sausage
1	TRUE	TRUE	FALSE
2	TRUE	FALSE	TRUE

I will appreciate any help.

Regards

Rob

Find more posts tagged with

AI Studio

Association Rules + Mining

Getting Started

Sort by:

1 - 5 of 51

lionelderkrikor

New Altair Community Member

Apr 30, 2018

Hi @piernik

You had to add the attribute name of your list of grocery in the .CSV file (I named it "Att1" / See attached file)

and here the process for the data preparation :

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
        <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\groceries.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Att1.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="nominal_to_binominal" compatibility="8.1.003" expanded="true" height="103" name="Nominal to Binominal" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Att1"/>
        <parameter key="transform_binominal" value="true"/>
        <parameter key="use_underscore_in_name" value="true"/>
      </operator>
      <operator activated="false" class="rename_by_replacing" compatibility="8.1.003" expanded="true" height="82" name="Rename by Replacing" width="90" x="447" y="34">
        <parameter key="replace_what" value=".+\.Att1_($|\n)"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
      <connect from_op="Nominal to Binominal" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I hope it helps,

Regards,

Lionel

groceries.csv

lionelderkrikor

New Altair Community Member

Apr 30, 2018

Hi again @piernik

Here the process with the renamed attributes (ex : Att1_butter --> butter) :

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
        <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\groceries.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Att1.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="nominal_to_binominal" compatibility="8.1.003" expanded="true" height="103" name="Nominal to Binominal" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Att1"/>
        <parameter key="transform_binominal" value="true"/>
        <parameter key="use_underscore_in_name" value="true"/>
      </operator>
      <operator activated="true" class="rename_by_replacing" compatibility="8.1.003" expanded="true" height="82" name="Rename by Replacing" width="90" x="514" y="34">
        <parameter key="replace_what" value="\bAtt1_"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
      <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
      <connect from_op="Rename by Replacing" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Regards,

Lionel

piernik

New Altair Community Member

May 1, 2018

Hi Lionel,

Many thanks for the solution. Your idea of renaming column is pretty smart! I did not know this operator.

The solution works well if there is only one attribute. The source .csv file contains multiple transaction lines. I noticed that RapidMiner processes only the first one. For example, some transactions may look like this:

1. milk, egg, sausage

2. butter

3. egg, sugar, cake, water

When I read the csv file, RapidMiner sees only the first attribute:

1. milk

2. butter

3. egg

But this is not an issue. I have reformatted the file as a proper comma delimited and added the columns headers 'PurchaseLine01'... (see file attached) Now RapidMiner sees all the attributes.

The problem is how to create a column for each item from all PurchaseLines. So for the example, there are three transactions below, where some customers purchased two, three or four products:

PurchaseLine01	PurchaseLine02	PurchaseLine03	PurchaseLine04
milk	egg	sausage
butter	water
egg	sugar	cake	water

The output should be:

	milk	egg	sausage	butter	sugar	cake	water
1	TRUE	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE
2	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	TRUE
3	FALSE	TRUE	FALSE	FALSE	TRUE	TRUE	TRUE

I think it could be something along these lines:

1. Create a list of all distinctive items from all transactions and purchase lines

2. Using the above list, create a column for each product

3. Map items with TRUE/FALSE depending on what was purchased in a given transaction

I hope this makes sense.

Many thanks

Rob

groceries.csv

lionelderkrikor

New Altair Community Member

May 4, 2018

Hi @piernik,

Unfortunately, I failed to achieve the transformation of your dataset with RapidMiner ' s native operators.

So, I used a Python script :

To execute this process, you have to install the Python environment on your computer and to install

the Execute Python operator (from the MarketPlace).

The process :

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="8.1.003" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\Purchases.xlsx"/>
        <parameter key="imported_cell_range" value="A1:D4"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="PurchaseLine01.true.polynominal.attribute"/>
          <parameter key="1" value="PurchaseLine02.true.polynominal.attribute"/>
          <parameter key="2" value="PurchaseLine03.true.polynominal.attribute"/>
          <parameter key="3" value="PurchaseLine04.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="246" y="34">
        <parameter key="script" value="import pandas as pd&#10;import numpy as np&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;&#10;def rm_main(data):&#10;&#10;  def get_series(string): &#10;    return (data == string).T.any() &#10;&#10;  cols = np.unique(data.stack().values).tolist() &#10;  data_2 = pd.DataFrame(columns=cols, index=range(len(data))) &#10;  &#10;  for col in cols: &#10;    data_2[col] = get_series(col)&#10;&#10;    &#10;  return data_2"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I hope it helps,

Regards,

Lionel

NB : My intimate conviction is that this dataset transformation is possible with RapidMiner (without scripts), so if someone has an idea, I'll be curious to know it.

Data_Preparation.png

383742481

New Altair Community Member

Oct 20, 2018

o，can you solve this problem now? Please tell me how to do this?thank U