Cumulative summing
PaulM
New Altair Community Member
I have a large set of examples of financial deposit amounts each with a client id and timestamp. All I want to do is add an attribute to each example that represents the total deposits done by the client up to that timestamp.
The only way I could find to do this was to sort the dataset by timestamp and client id and then loop through it example by example with macros in hold the cumulative total and reset it each time a new client id is encountered. It works but it is VERY VERY slow - it's been running for 15 hours and still going. Most processes on this dataset take seconds....
Surely there has to be a better way of doing this?! (preferably using native operators)
The only way I could find to do this was to sort the dataset by timestamp and client id and then loop through it example by example with macros in hold the cumulative total and reset it each time a new client id is encountered. It works but it is VERY VERY slow - it's been running for 15 hours and still going. Most processes on this dataset take seconds....
Surely there has to be a better way of doing this?! (preferably using native operators)
Tagged:
0
Best Answer
-
Hi,
what you describe is trivial to do in a SQL database:
select client_id, timestamp, amount,
sum(amount) over (partition by client_id order by timestamp) as deposits
from yourtable
The sum() over () construct is called "Window function" and it has been in the SQL standard for ~ 20 years now. The concept is so popular and powerful that even NoSQL systems (those with reasonable query languages) tend to implement it.
Here's an excellent introduction to window functions: https://www.youtube.com/watch?v=D8Q4n6YXdpk (Applicable to other database systems implementing the standard.)
This example builds partitions (sets of rows that belong together) by the client id and by specifying the order by, you automatically select the application of the aggregation function until the current row inside the current partition. This gives you the running sum.
There is unfortunately no easy way to solve your problem in RapidMiner. If you can, create a local database (SQLite, H2, ...), put your data into a table and use a similar query to get them back.
If you'd like to do it in RapidMiner, it might work faster by looping the client IDs, selecting the current client's subset and doing the cumulative summing inside that loop. (The speed mainly depends on the ratio of clients and transactions.)
(There are operators in Rapidminer related to Windowing but their functionality is very different, as they solve different problems.)
Here's an incomplete implementation of Window Functions (groupwise aggregations but without the running sum functionality that you require) in RapidMiner:
https://github.com/bbarany/rapidminer-windowfunctions
Regards,
Balázs1
Answers
-
Hi,
what you describe is trivial to do in a SQL database:
select client_id, timestamp, amount,
sum(amount) over (partition by client_id order by timestamp) as deposits
from yourtable
The sum() over () construct is called "Window function" and it has been in the SQL standard for ~ 20 years now. The concept is so popular and powerful that even NoSQL systems (those with reasonable query languages) tend to implement it.
Here's an excellent introduction to window functions: https://www.youtube.com/watch?v=D8Q4n6YXdpk (Applicable to other database systems implementing the standard.)
This example builds partitions (sets of rows that belong together) by the client id and by specifying the order by, you automatically select the application of the aggregation function until the current row inside the current partition. This gives you the running sum.
There is unfortunately no easy way to solve your problem in RapidMiner. If you can, create a local database (SQLite, H2, ...), put your data into a table and use a similar query to get them back.
If you'd like to do it in RapidMiner, it might work faster by looping the client IDs, selecting the current client's subset and doing the cumulative summing inside that loop. (The speed mainly depends on the ratio of clients and transactions.)
(There are operators in Rapidminer related to Windowing but their functionality is very different, as they solve different problems.)
Here's an incomplete implementation of Window Functions (groupwise aggregations but without the running sum functionality that you require) in RapidMiner:
https://github.com/bbarany/rapidminer-windowfunctions
Regards,
Balázs1 -
Thanks @BalazsBarany - I had a hunch that it might be something that would be easier to move to SQL and it's super helpful to have this confirmed and also for the method. I'll give it a go.
I appreciate that RapidMiner's model reinforces that examples are independent but this feels like quite a common use case in the customer lifetime modelling space so I am surprised there isn't an operator that natively supports this. Would be great addition IMHO.
0 -
I absolutely agree that this functionality would be a great addition to RapidMiner. It's probably a big project, though.
Going through a database (or having data there in the first place) is not a bad solution though.2