Home
Discussions
Community Q&A
Sampling or filtering transactions down to last X per customer
PaulM
I have a large exampleset with every single transaction completed by each customer over a long time frame. I want to filter this down to the last X transactions that each customer did during their lifetime (e.g. up to the last 5 - if they did fewer than this then at least the ones they did).
I really don't want to have to loop through customer by customer as performance will be dire. Is there a smarter/simpler way of doing this? Each transaction has a timestamp and a client id.
Thanks in advance for any suggestions.
Find more posts tagged with
AI Studio
Sampling
Filtering
Accepted answers
All comments
PaulM
Thanks
@sgenzer
but won't that just get me the last X transactions from all clients rather than the last X transactions done by each client?
I have a process that loops through each client ID, filters the exampleset down to just the transactions for that client and does a filter example range for each one but as you can imagine performance is pretty darn bad.
Any other possible workarounds? (e.g. a smart way of numbering the transaction sequence per client and then just filtering the whole lot for sequence numbers <= X)?
MartinLiebig
Hi
@PaulM
,
this is actually the use case we've build Group Into Collection for. You can first Group by Client ID and then use Loop Collection + Sort and FIlterExamples Range to do this.
If you need to do it in parallel, you can use Extract Macro (Format) to get the size of the collection and then the Normal Loop with Select to process it in parallel manner.
Attached is an example showing how to get the 3 lowest passenger fairs for each Passenger Class of the titanic data set.
Best,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.6.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="34">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="operator_toolbox:group_into_collection" compatibility="2.4.000-SNAPSHOT" expanded="true" height="82" name="Group Into Collection" width="90" x="246" y="34">
<parameter key="group_by_attribute" value="Passenger Class"/>
<parameter key="group_by_attribute (numerical)" value=""/>
<parameter key="sorting_order" value="none"/>
</operator>
<operator activated="true" class="loop_collection" compatibility="9.6.000" expanded="true" height="82" name="Loop Collection" width="90" x="380" y="34">
<parameter key="set_iteration_macro" value="false"/>
<parameter key="macro_name" value="iteration"/>
<parameter key="macro_start_value" value="1"/>
<parameter key="unfold" value="false"/>
<process expanded="true">
<operator activated="true" class="sort" compatibility="9.6.000" expanded="true" height="82" name="Sort" width="90" x="45" y="34">
<parameter key="attribute_name" value="Passenger Fare"/>
<parameter key="sorting_direction" value="decreasing"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="9.6.000" expanded="true" height="82" name="Filter Example Range" width="90" x="179" y="34">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="3"/>
<parameter key="invert_filter" value="false"/>
</operator>
<connect from_port="single" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve Titanic" from_port="output" to_op="Group Into Collection" to_port="exa"/>
<connect from_op="Group Into Collection" from_port="col" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
PaulM
Many thanks
@mschmitz
, sounds like a good solution. My client does not have 9.6 installed yet so I can't test the group_into_collection operator against their datasets just yet, but as soon as we get that sorted out I'll give it a go.
MartinLiebig
Hi
@PaulM
,
group into collection does not requiere 9.6. It is part of Operator Toolbox extension. I think the current requierement is 8.2.
Best,
Martin
PaulM
Hi
@mschmitz
,
Ah ok, good to know - I misinterpreted your process XML. Client is rather strict about installation of extensions as well and that one is not currently installed - will have to go through their process to get it so unfortunately there will be a lag before I can test it.
Thanks again.
Paul
sgenzer
hi
@PaulM
yes sorry I posted a quick solution, realized that it did not answer your question, and then deleted my post.
And of course
@mschmitz
came up with a better solution anyway.
Quick Links
All Categories
Recent Discussions
Activity
Unanswered
日本語 (Japanese)
한국어(Korean)