Preprocessing market basket data

RobotGirl · August 2018

Hi,

I m a student from Pakistan. I am not much familiar with Rapidminer. I am given a task of market basket analysis and have almost 10,000 rows of data to apply FP-growth and apriori.

My given data is in the format:

1 cheese, bread, milk

2 milk cake

3 cake, cheese, milk

and for apriori algorithm I need to convert data into binary matrix format like:

TID | cheese bread milk cake

1 | 1 1 1 0

2| 0 0 1 1

3| 1 0 1 1

how can I preprocess my data in rapidminer to get this format

thanks in advance

lionelderkrikor · August 2018

Hi @RobotGirl,

For the moment, I don't know how to perform your data transformation with RapidMiner's native operators.(I will think about it)

So waiting, I propose a Python script :

I assume that your initial dataset is under this form :

By executing the process, you obtain a dataset like that :

the process :

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="9.0.001" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Data_Preparation\Purchases_2.xlsx"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Id.true.integer.attribute"/>
          <parameter key="1" value="B.true.polynominal.attribute"/>
          <parameter key="2" value="C.true.polynominal.attribute"/>
          <parameter key="3" value="D.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Id"/>
        <parameter key="invert_selection" value="true"/>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
        <parameter key="script" value="import pandas as pd&#10;import numpy as np&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;&#10;def rm_main(data):&#10;&#10;  def get_series(string): &#10;    return (data == string).T.any() &#10;&#10;  cols = np.unique(data.stack().values).tolist() &#10;  data_2 = pd.DataFrame(columns=cols, index=range(len(data))) &#10;  &#10;  for col in cols: &#10;    data_2[col] = get_series(col)&#10;&#10;    &#10;  return data_2"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="9.0.001" expanded="true" height="82" name="Generate ID" width="90" x="514" y="34"/>
      <connect from_op="Read Excel" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

To execute this process you need :

- to install Python on your computer

- to install Execute Python operator (from the marketPlace)

I hope it helps,

Regards,

Lionel

RobotGirl · August 2018

thanks for your respond @lionelderkrikor but my task is to use a rapidminer tool without any external coding.

earmijo · August 2018

You can do it directly with the new version of the FP-Growth operator.

Your dataset (CSV file should be like this):

id;basket
1;cheese,bread,milk
2;milk,cake
3;cake,cheese,milk

Please notice the ';'. This is the column separator. So this dataset has only two colums 'id' and 'basket'.

Read it into your repository. It should look like the image below:

Screen Shot 2018-08-18 at 2.01.45 PM.png

Set the first column to the role of ID.

When you use the FP-growth operator make sure that in Input format you select 'items list in a column' and the item separators is set to ','.

Now run the process below:

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.0.001" expanded="true" height="68" name="Retrieve fpgrowth" width="90" x="112" y="85">
        <parameter key="repository_entry" value="//Clases/fpgrowth"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.0.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="85">
        <parameter key="attribute_name" value="id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="concurrency:fp_growth" compatibility="9.0.001" expanded="true" height="82" name="FP-Growth" width="90" x="514" y="85">
        <parameter key="input_format" value="item list in a column"/>
        <parameter key="item_separators" value=","/>
        <parameter key="min_support" value="0.01"/>
        <enumeration key="must_contain_list"/>
      </operator>
      <connect from_op="Retrieve fpgrowth" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="FP-Growth" from_port="example set" to_port="result 1"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

CarlosQ · August 2018

I got curious about this question.

How would you preprocess the original CSV to replace the first ',' with a ';'?

A few minutes later of googling the answer:

1) Open the CSV in any decent editor (atom,ultraedit,notepad++,etc)

2) Find:

^([^,]*),

3) Replace

$1;

Regex, of course. I should learn more Regex.

MartinLiebig · August 2018

and you don't need to use a editor but can use rapidminer's Replace operator for it

Cheers,

Martin

Preprocessing market basket data

Welcome!

Answers

Welcome!

Welcome!

Quick Links

Categories