Altair RISE
A program to recognize and reward our most engaged community members
Nominate Yourself Now!
Home
Discussions
Community Q&A
Cumulative summing
PaulM
I have a large set of examples of financial deposit amounts each with a client id and timestamp. All I want to do is add an attribute to each example that represents the total deposits done by the client up to that timestamp.
The only way I could find to do this was to sort the dataset by timestamp and client id and then loop through it example by example with macros in hold the cumulative total and reset it each time a new client id is encountered. It works but it is VERY VERY slow - it's been running for 15 hours and still going. Most processes on this dataset take seconds....
Surely there has to be a better way of doing this?! (preferably using native operators)
Find more posts tagged with
AI Studio
Accepted answers
BalazsBaranyRM
Hi,
what you describe is trivial to do in a SQL database:
select client_id, timestamp, amount,
sum(amount)
over (partition by
client_id
order by
timestamp
)
as deposits
from yourtable
The
sum() over ()
construct is called "Window function" and it has been in the SQL standard for ~ 20 years now. The concept is so popular and powerful that even NoSQL systems (those with reasonable query languages) tend to implement it.
Here's an excellent introduction to window functions:
https://www.youtube.com/watch?v=D8Q4n6YXdpk
(Applicable to other database systems implementing the standard.)
This example builds partitions (sets of rows that belong together) by the client id and by specifying the order by, you automatically select the application of the aggregation function until the current row inside the current partition. This gives you the running sum.
There is unfortunately no easy way to solve your problem in RapidMiner. If you can, create a local database (SQLite, H2, ...), put your data into a table and use a similar query to get them back.
If you'd like to do it in RapidMiner, it might work faster by looping the client IDs, selecting the current client's subset and doing the cumulative summing inside that loop. (The speed mainly depends on the ratio of clients and transactions.)
(There are operators in Rapidminer related to Windowing but their functionality is very different, as they solve different problems.)
Here's an incomplete implementation of Window Functions (groupwise aggregations but without the running sum functionality that you require) in RapidMiner:
https://github.com/bbarany/rapidminer-windowfunctions
Regards,
Balázs
All comments
BalazsBaranyRM
Hi,
what you describe is trivial to do in a SQL database:
select client_id, timestamp, amount,
sum(amount)
over (partition by
client_id
order by
timestamp
)
as deposits
from yourtable
The
sum() over ()
construct is called "Window function" and it has been in the SQL standard for ~ 20 years now. The concept is so popular and powerful that even NoSQL systems (those with reasonable query languages) tend to implement it.
Here's an excellent introduction to window functions:
https://www.youtube.com/watch?v=D8Q4n6YXdpk
(Applicable to other database systems implementing the standard.)
This example builds partitions (sets of rows that belong together) by the client id and by specifying the order by, you automatically select the application of the aggregation function until the current row inside the current partition. This gives you the running sum.
There is unfortunately no easy way to solve your problem in RapidMiner. If you can, create a local database (SQLite, H2, ...), put your data into a table and use a similar query to get them back.
If you'd like to do it in RapidMiner, it might work faster by looping the client IDs, selecting the current client's subset and doing the cumulative summing inside that loop. (The speed mainly depends on the ratio of clients and transactions.)
(There are operators in Rapidminer related to Windowing but their functionality is very different, as they solve different problems.)
Here's an incomplete implementation of Window Functions (groupwise aggregations but without the running sum functionality that you require) in RapidMiner:
https://github.com/bbarany/rapidminer-windowfunctions
Regards,
Balázs
PaulM
Thanks
@BalazsBarany
- I had a hunch that it might be something that would be easier to move to SQL and it's super helpful to have this confirmed and also for the method. I'll give it a go.
I appreciate that RapidMiner's model reinforces that examples are independent but this feels like quite a common use case in the customer lifetime modelling space so I am surprised there isn't an operator that natively supports this. Would be great addition IMHO.
BalazsBaranyRM
I absolutely agree that this functionality would be a great addition to RapidMiner. It's probably a big project, though.
Going through a database (or having data there in the first place) is not a bad solution though.
Quick Links
All Categories
Recent Discussions
Activity
Unanswered
日本語 (Japanese)
한국어(Korean)
Groups