Generalized Sequential Patterns (GSP) dataset format

abdero
New Altair Community Member
Hello,
I have seen some posts about this subject, but I didn't see any good answer.
Can anyone tell me the format of the input dataset for GSP?
The only format that gives me any results (bad ones) looks like this:
Client_id, time, feature 1, feature 2, ...
1,1,0,1,0,...
1,2,1,1,1,...
2,1,0,0,0
Answers
Hi,
this is already the correct format; you only need to turn the feature 1, feature 2, ... attributes into binominal ones. Use the Numerical to Binominal operator for this.
Greetings,
Sebastian

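To make the target layout concrete, here is a minimal Python sketch (standard library only) of the conversion described above: the client id and time columns stay as they are, and every feature column becomes a true/false flag. The function name `to_binominal`, the column names, and the choice to treat any nonzero value as true are assumptions made for this illustration; this is not RapidMiner code.

```python
def to_binominal(rows, id_columns=("client_id", "time")):
    """Return copies of the rows where every non-id column is a bool (nonzero -> True)."""
    converted = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if name in id_columns:
                new_row[name] = value        # keep id and time untouched
            else:
                new_row[name] = value != 0   # 0 -> False, anything else -> True
        converted.append(new_row)
    return converted


# The three sample rows from the question, limited to the first three features.
rows = [
    {"client_id": 1, "time": 1, "feature 1": 0, "feature 2": 1, "feature 3": 0},
    {"client_id": 1, "time": 2, "feature 1": 1, "feature 2": 1, "feature 3": 1},
    {"client_id": 2, "time": 1, "feature 1": 0, "feature 2": 0, "feature 3": 0},
]
binominal_rows = to_binominal(rows)
```

Each output row still has one line per client and time stamp, with the features as flags.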
abdero,
Can you post the XML of how you got your data in the format:
Client_id, time , feature 1, feature 2, ....
1,1,0,1,0,...
1,2,1,1,1,....
2,1,0,0,0
Every time I try to pivot my data from this format:
Customer, Time, Item
1,1,a
1,1,b
1,2,a
2,1,c
etc
I fail to get your format.
Thanks,
Will

Hi,
unfortunately, the Pivot operator can currently group by only one attribute, so you have to combine client id and time before the Pivot operator and separate them again afterwards. Please have a look at the attached process.
Best regards,
Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.005">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="subprocess" compatibility="5.3.005" expanded="true" height="76" name="Generate Data" width="90" x="45" y="30">
<process expanded="true">
<operator activated="true" class="generate_transaction_data" compatibility="5.3.005" expanded="true" height="60" name="Generate Transaction Data" width="90" x="45" y="30"/>
<operator activated="true" class="set_role" compatibility="5.3.005" expanded="true" height="76" name="Set Role" width="90" x="180" y="30">
<parameter key="name" value="Id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_id" compatibility="5.3.005" expanded="true" height="76" name="Generate ID" width="90" x="315" y="30"/>
<operator activated="true" class="rename" compatibility="5.3.005" expanded="true" height="76" name="Rename" width="90" x="450" y="30">
<parameter key="old_name" value="id"/>
<parameter key="new_name" value="time"/>
<list key="rename_additional_attributes"/>
</operator>
<operator activated="true" class="set_role" compatibility="5.3.005" expanded="true" height="76" name="Set Role (2)" width="90" x="585" y="30">
<parameter key="name" value="time"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<connect from_op="Generate Transaction Data" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="generate_concatenation" compatibility="5.3.005" expanded="true" height="76" name="Generate Concatenation" width="90" x="179" y="30">
<parameter key="first_attribute" value="Id"/>
<parameter key="second_attribute" value="time"/>
</operator>
<operator activated="true" class="pivot" compatibility="5.3.005" expanded="true" height="76" name="Pivot" width="90" x="313" y="30">
<parameter key="group_attribute" value="Id_time"/>
<parameter key="index_attribute" value="Item"/>
<parameter key="skip_constant_attributes" value="false"/>
</operator>
<operator activated="true" class="split" compatibility="5.3.005" expanded="true" height="76" name="Split" width="90" x="447" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Id_time"/>
<parameter key="split_pattern" value="_"/>
</operator>
<connect from_op="Generate Data" from_port="out 1" to_op="Generate Concatenation" to_port="example set input"/>
<connect from_op="Generate Concatenation" from_port="example set output" to_op="Pivot" to_port="example set input"/>
<connect from_op="Pivot" from_port="example set output" to_op="Split" to_port="example set input"/>
<connect from_op="Split" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

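The combine, pivot, split idea from the process above can also be sketched outside RapidMiner. The following Python sketch (standard library only) uses the Customer, Time, Item rows from Will's question; the function name `pivot_transactions` and the output column names are assumptions for this example, and it assumes numeric ids that never contain the separator.

```python
def pivot_transactions(transactions, separator="_"):
    """Pivot (customer, time, item) rows into one row per customer/time pair."""
    items = sorted({item for _, _, item in transactions})
    grouped = {}
    for customer, time, item in transactions:
        key = f"{customer}{separator}{time}"        # step 1: combine id and time
        grouped.setdefault(key, set()).add(item)    # step 2: group by the combined key
    table = []
    for key, present in grouped.items():
        customer, time = key.split(separator)       # step 3: split the key again
        row = {"Customer": int(customer), "Time": int(time)}
        for item in items:
            row[item] = item in present             # one true/false column per item
        table.append(row)
    return table


transactions = [(1, 1, "a"), (1, 1, "b"), (1, 2, "a"), (2, 1, "c")]
table = pivot_transactions(transactions)
```

Here the absent items are filled in as false directly, so there are no missing values left to replace afterwards.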
Marius,
Thanks for the timely response, I will examine the code you provided.
Will

Marius,
I actually applied your logic in my SQL and concatenated the columns before RapidMiner, which speeds up processing.
The trouble I have now is that when I pivot and then attempt to replace missing values, the replacement doesn't work.
The process completes with a green light, but I still have '?' values in my pivot table.
Example of my data:
Time_Customer Item Count
1_9 a 1
2_9 b 1
3_9 c 1
3_9 d 1
3_9 e 1
3_9 f 1
3_9 e 1
3_9 b 1
4_9 c 1
4_9 b 1
1_22 c 1
1_27 c 1
1_27 a 1
1_27 g 1
2_27 c 1
2_27 h 1
2_27 g 1
3_27 c 1
My code is below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.005">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="5.3.005" expanded="true" height="60" name="Read Excel" width="90" x="112" y="30">
<parameter key="excel_file" value="C:\MYFILE"/>
<parameter key="sheet_number" value="2"/>
<parameter key="imported_cell_range" value="A1:C32256"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Time_Customer.true.polynominal.attribute"/>
<parameter key="1" value="Item.true.polynominal.attribute"/>
<parameter key="2" value="Count.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="pivot" compatibility="5.3.005" expanded="true" height="76" name="Pivot" width="90" x="246" y="30">
<parameter key="group_attribute" value="Time_Customer"/>
<parameter key="index_attribute" value="Item"/>
<parameter key="consider_weights" value="false"/>
<parameter key="skip_constant_attributes" value="false"/>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.3.005" expanded="true" height="94" name="Replace Missing Values" width="90" x="447" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Time_Customer"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="default" value="value"/>
<list key="columns"/>
<parameter key="replenishment_value" value="0"/>
</operator>
<operator activated="true" class="split" compatibility="5.3.005" expanded="true" height="76" name="Split" width="90" x="648" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Time_Customer"/>
<parameter key="split_pattern" value="_"/>
</operator>
<operator activated="true" class="rename" compatibility="5.3.005" expanded="true" height="76" name="Rename" width="90" x="782" y="30">
<parameter key="old_name" value="Time_Customer_1"/>
<parameter key="new_name" value="Time"/>
<list key="rename_additional_attributes">
<parameter key="Time_Customer_2" value="Customer"/>
</list>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Pivot" to_port="example set input"/>
<connect from_op="Pivot" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Split" to_port="example set input"/>
<connect from_op="Split" from_port="example set output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
I greatly appreciate any help you all can offer.
Will
Hi,
please examine your Replace Missing Values operator. You are replacing the values of only one attribute (Time_Customer), but you probably want to replace missing values in *all* attributes, right?
Best regards,
Marius

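The difference between filtering on a single attribute and covering all attributes can be illustrated with a small Python sketch (standard library only). The helper `replace_missing` and its `only` parameter are hypothetical names for this example, and `None` stands in for the '?' cells of the pivot table; the two rows are derived from the first lines of the sample data above.

```python
def replace_missing(rows, replacement=0, only=None):
    """Replace None values; `only` restricts the replacement to the named columns
    (None means every column is covered)."""
    result = []
    for row in rows:
        result.append({
            name: (replacement
                   if value is None and (only is None or name in only)
                   else value)
            for name, value in row.items()
        })
    return result


pivoted = [
    {"Time_Customer": "1_9", "a": 1, "b": None, "c": None},
    {"Time_Customer": "2_9", "a": None, "b": 1, "c": None},
]

# Restricting to the key column changes nothing, because that column has no gaps.
single = replace_missing(pivoted, only={"Time_Customer"})

# Covering all columns fills every missing item cell with 0.
everything = replace_missing(pivoted)
```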
Marius,
Thank you for your help, I got it to work. The code is provided below for reference. I do have one more snag: the output of the GSP set displays correctly in a Mac OS X install but not in Windows 7.
On Windows 7, I see summary data in the results overview tab, but when I move to the GSPSet(GSP) tab, all I see are the annotation options. In the Mac OS X instance, everything appears as one would expect.
I'm not sure whether I should submit a bug report.
Thanks for your help!
Will
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="5.3.007" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
<parameter key="excel_file" value="C:myfile.xls"/>
<parameter key="sheet_number" value="2"/>
<parameter key="imported_cell_range" value="A1:C32256"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Time_Customer.true.polynominal.attribute"/>
<parameter key="1" value="Item.true.polynominal.attribute"/>
<parameter key="2" value="Count.true.binominal.attribute"/>
</list>
</operator>
<operator activated="true" class="pivot" compatibility="5.3.007" expanded="true" height="76" name="Pivot" width="90" x="179" y="30">
<parameter key="group_attribute" value="Time_Customer"/>
<parameter key="index_attribute" value="Item"/>
<parameter key="consider_weights" value="false"/>
<parameter key="skip_constant_attributes" value="false"/>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.3.007" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="30">
<parameter key="attribute" value="Time_Customer"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="default" value="value"/>
<list key="columns"/>
<parameter key="replenishment_value" value="0"/>
</operator>
<operator activated="true" class="split" compatibility="5.3.007" expanded="true" height="76" name="Split" width="90" x="45" y="255">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Time_Customer"/>
<parameter key="split_pattern" value="_"/>
</operator>
<operator activated="true" class="rename" compatibility="5.3.007" expanded="true" height="76" name="Rename" width="90" x="179" y="255">
<parameter key="old_name" value="Time_Customer_1"/>
<parameter key="new_name" value="Time"/>
<list key="rename_additional_attributes">
<parameter key="Time_Customer_2" value="Customer"/>
</list>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="5.3.007" expanded="true" height="94" name="Nominal to Numerical" width="90" x="380" y="255">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Time"/>
<parameter key="coding_type" value="unique integers"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="generalized_sequential_patterns" compatibility="5.3.007" expanded="true" height="76" name="GSP" width="90" x="581" y="210">
<parameter key="customer_id" value="Customer"/>
<parameter key="time_attribute" value="Time"/>
<parameter key="min_support" value="0.1"/>
<parameter key="window_size" value="1.0"/>
<parameter key="max_gap" value="18.0"/>
<parameter key="min_gap" value="13.0"/>
<parameter key="positive_value" value="1"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Pivot" to_port="example set input"/>
<connect from_op="Pivot" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Split" to_port="example set input"/>
<connect from_op="Split" from_port="example set output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="GSP" to_port="example set"/>
<connect from_op="GSP" from_port="patterns" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

Hey Will,
are you using RapidMiner 5.3.7 on both of your machines?
Best regards,
Marius

Yes, sir. I updated this morning and it still produces the "error".
Will

I could reproduce that behavior under Windows, and it is obviously a bug. I created an internal bug report for that, so there is no need to submit a bug report from your side.
Best regards,
Marius

Outstanding Marius,
Thank you for your assistance!
Will

Marius,
Another question concerning GSP: I receive the same result sets regardless of my window size, min gap, and max gap settings.
My raw data uses days between events as the time element.
Is this a function of the same bug we previously found?
Thanks,
Will
Marius wrote:
I could reproduce that behavior under Windows, and it is obviously a bug. I created an internal bug report for that, so no need to submit a bug from your side.
Best regards,
Marius

I can't imagine that the two issues are related.
Did you inspect your data and make sure that the entered values would actually make a difference?
Best regards,
Marius

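One rough way to do that inspection is to list the time gaps that actually occur per customer and compare them with the chosen bounds. The Python sketch below (standard library only) does only that; it does not reimplement the GSP operator, and whether the operator treats the bounds as inclusive is not confirmed here. The helper name `consecutive_gaps` is an assumption, the (customer, time) pairs are taken from the Time_Customer sample posted earlier in this thread, and the bounds are the min_gap and max_gap values from the process above.

```python
from collections import defaultdict


def consecutive_gaps(events):
    """Map each customer to the gaps between its consecutive distinct time stamps."""
    times = defaultdict(set)
    for customer, time in events:
        times[customer].add(time)
    gaps = {}
    for customer, stamps in times.items():
        ordered = sorted(stamps)
        gaps[customer] = [later - earlier
                          for earlier, later in zip(ordered, ordered[1:])]
    return gaps


# (customer, time) pairs derived from keys such as 1_9, 2_9, 3_9, 4_9, 1_22, 1_27 ...
events = [(9, 1), (9, 2), (9, 3), (9, 3), (9, 4),
          (22, 1), (27, 1), (27, 2), (27, 3)]
gaps = consecutive_gaps(events)

min_gap, max_gap = 13.0, 18.0
all_gaps = [gap for customer_gaps in gaps.values() for gap in customer_gaps]
within = [gap for gap in all_gaps if min_gap <= gap <= max_gap]
```

If `within` is empty for every plausible setting, or if all gaps fall inside every setting tried, changing the parameters cannot be expected to change the result.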
Hi,
we've just fixed the "empty GSP results" bug. You can either check out the latest SVN version (see here, updated around midnight) and build RapidMiner yourself, or wait for the next release.
Regards,
Marco

Marco,
Thanks for the response, I'll check my updates!
Will

Hello dear Rapid-I developers,
my empty GSP results problem still exists. How can I update my RapidMiner, or do I need to wait for the next official update? Could anyone tell me when that will be?
Thank you!

I am curious as to when the next release will be that covers this as well.
Thanks,
Will

Will, we don't have a release schedule for the general public yet.
Best regards,
Marius

Not to dig up an old topic, but I am still having trouble with the data layout for the GSP operator.
I have combined the time (in day-of-year format) with my customer ID per your instructions. I have a column for the item and a binominal value for the "qty".
When I import the Excel sheet, pivot, replace the missing values with the value "false", and then split, everything looks good.
When I attempt to convert the split columns for time and customer from nominal to numerical, per the GSP operator requirements, my pivot is ruined.
I expect:
Customer, time, item a, item b, ...
1,1,TRUE,FALSE
1,3,TRUE,FALSE
2,4,FALSE,FALSE
etc.
However, it turns time into multiple columns within the pivot as well.
I can provide a larger data example if required for troubleshooting.
Any help that can be provided is appreciated.
Will
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="5.3.015" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
<parameter key="excel_file" value="C:\Users\me\Desktop\input.xls"/>
<parameter key="sheet_number" value="2"/>
<parameter key="imported_cell_range" value="A1:C7768"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="time_customer.true.polynominal.attribute"/>
<parameter key="1" value="Item.true.polynominal.attribute"/>
<parameter key="2" value="Qty.true.binominal.attribute"/>
</list>
</operator>
<operator activated="true" class="pivot" compatibility="5.3.015" expanded="true" height="76" name="Pivot" width="90" x="45" y="120">
<parameter key="group_attribute" value="time_customer"/>
<parameter key="index_attribute" value="Item"/>
<parameter key="consider_weights" value="false"/>
<parameter key="skip_constant_attributes" value="false"/>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.3.015" expanded="true" height="94" name="Replace Missing Values" width="90" x="45" y="210">
<parameter key="include_special_attributes" value="true"/>
<parameter key="default" value="value"/>
<list key="columns"/>
<parameter key="replenishment_value" value="false"/>
</operator>
<operator activated="true" class="split" compatibility="5.3.015" expanded="true" height="76" name="Split" width="90" x="179" y="210">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="time_customer"/>
<parameter key="split_pattern" value="_"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="5.3.015" expanded="true" height="94" name="Nominal to Numerical" width="90" x="313" y="210">
<parameter key="create_view" value="true"/>
<list key="comparison_groups"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Pivot" to_port="example set input"/>
<connect from_op="Pivot" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Split" to_port="example set input"/>
<connect from_op="Split" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

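Note that the Nominal to Numerical operator in the process above has no attribute filter set, so it does not appear to be limited to the two split columns; that may be related to the extra columns. The intended conversion, which turns only the two key columns into numbers and leaves the item columns alone, can be sketched in Python (standard library only). The helper name `parse_key_columns` is hypothetical, and the column names time_customer_1 and time_customer_2 are assumed from the naming seen in the earlier Rename step; the rows follow the expected output listed above.

```python
def parse_key_columns(rows, key_columns=("time_customer_1", "time_customer_2")):
    """Convert only the listed columns to int; every other column is left unchanged."""
    return [
        {name: (int(value) if name in key_columns else value)
         for name, value in row.items()}
        for row in rows
    ]


# time_customer_1 holds the time, time_customer_2 the customer (both still text after the split).
rows = [
    {"time_customer_1": "1", "time_customer_2": "1", "item a": True, "item b": False},
    {"time_customer_1": "3", "time_customer_2": "1", "item a": True, "item b": False},
    {"time_customer_1": "4", "time_customer_2": "2", "item a": False, "item b": False},
]
parsed = parse_key_columns(rows)
```

The number of columns stays the same; only the type of the two key columns changes, and the real time values (1, 3, 4) are preserved rather than recoded.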
Hi,
Is there an operator to apply GSP rules?
Thanks

This is a really good question.