Count examples in overlapping time frame

Hi Sven,

Here is a solution using a Python script.
First, at the beginning of the Python script, you have to set:
- the names of your attributes
- the date format

As a result, you obtain an example set like this:

Image: https://us.v-cdn.net/6030995/uploads/editor/wc/0652g5anwbip.png

The process is in the attached file.

Hope this helps,

Regards,

Lionel

Overlaps_Sven.rmp

You're welcome, Sven.

Keep us informed of the results of your work! (If it's not confidential, of course.)

Regards,

Lionel

Hi Sven,

To be honest, I am pessimistic. Here are the results of my investigations and experiments:
I timed the process for 100 examples: it takes 10 seconds (my PC: quad-core, 16 GB RAM).
Your whole dataset has around 69,000 examples, so as a first approximation I would say
that the duration for the whole dataset is (69,000 × 10) / 100 = 6,900 seconds = around 2 hours. ==> This is obviously not the case.
So I suspect the complexity of this algorithm is proportional to N × N (and not proportional to N), where N is the number of your examples. In that case, since the whole dataset is 690 times larger than the 100-example sample, the duration for the whole dataset will be:
10 s × 690 × 690 = 4,761,000 seconds = 1,322.5 hours ≈ 55 days!
That's why the process takes so long...
Moreover, I remember a thread where a user observed that executing a Python script inside RapidMiner is significantly slower than running the same script in a Python notebook. ==> So I will try to execute the script directly in a Python notebook to see whether it runs faster.
And finally, to answer your question: the algorithm needs to go through the entire dataset to find the overlaps, so, from my point of view, it is impossible to run the process in steps...

Regards,

Lionel
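The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are illustrative, and it assumes that the 10 seconds measured on 100 examples scale cleanly with either N or N × N, which ignores fixed start-up costs.

```python
# Figures taken from the post above: 100 examples took 10 seconds,
# and the full dataset has about 69,000 examples.
sample_size = 100
sample_seconds = 10.0
full_size = 69_000

scale = full_size / sample_size                  # dataset is 690x larger

linear_seconds = sample_seconds * scale          # if runtime grew like N
quadratic_seconds = sample_seconds * scale ** 2  # if runtime grew like N x N

print(f"linear:    {linear_seconds / 3600:.1f} hours")
print(f"quadratic: {quadratic_seconds / 3600:.1f} hours "
      f"= {quadratic_seconds / 86400:.1f} days")
```

Under the quadratic assumption, the run that had already lasted almost two days would still be a small fraction of the way through.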

Hi,

this is a problem that you can reduce by executing in batches if you can sort and separate your data.

For example, it doesn't make sense to compare records from different years (unless some patients are in care for years).

Instead of making n^2 comparisons (for a large n), you could make 10 joins of (n / 10)^2 comparisons each, i.e. n^2 / 10 in total, which is about 10 times fewer. With more batches, the saving grows accordingly.

Regards,
Balázs
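To get a feel for the saving from batching, the number of pairwise comparisons can be counted directly. This is a sketch, not a benchmark: `pair_comparisons` is a hypothetical helper, and it assumes the batches are equally sized and fully independent, which is only true if no stay crosses a batch boundary.

```python
def pair_comparisons(n, batches=1):
    """Comparisons needed for `batches` independent self-joins
    of n / batches rows each (assumes n is divisible by batches)."""
    per_batch = n // batches
    return batches * per_batch ** 2

n = 69_000  # approximate dataset size mentioned in this thread
full = pair_comparisons(n)
batched = pair_comparisons(n, batches=10)

print(f"single join : {full:,} comparisons")
print(f"10 batches  : {batched:,} comparisons")
print(f"reduction   : {full / batched:.0f}x")
```

In general, k equal batches reduce the work by a factor of k, so the benefit depends on how finely the data can be separated without cutting through overlapping stays.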

All comments

MartinLiebig

Hi @DocMusher ,

Do I understand correctly that you have something like:

PatientId  DateOfEntry  DateOfExit
1          7AM          9AM
2          7AM          10AM
3          9AM          10AM

and want to know, at each hour, how many patients are in?

Best,

Martin

Hi Martin,
In fact, I want to know, for each patient, how many patients are in at the time of admission and at discharge. This might give me insight into whether length of stay is related to the number of patients admitted.
Sven

Hi,

sounds like you are looking for a "generic join" functionality where you can specify the join criterion. In this case, the criterion would be:
a.id <> b.id and overlaps(a.admission, a.discharge, b.admission, b.discharge)
Assuming that both a and b refer to copies of your patient example set (a self join).

Check out this contribution: https://community.rapidminer.com/discussion/33908/generic-join-script
You can write the overlaps() function yourself and specify the above join criterion to self-join your example set, and then aggregate by the a.id to count the number of the joined patients. You'll need to rename the attributes in the second copy before joining.

Here are some example "overlaps" implementations (depends on your requirement, e.g. discharge time could be missing, meaning that the patient is still there):
https://stackoverflow.com/questions/17106670/how-to-check-a-timeperiod-is-overlapping-another-time-period-in-java

Regards,
Balázs
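The join criterion described above can be sketched in plain Python to show the logic, independent of the generic join script. Everything here is an assumption for illustration: the tuple layout `(id, admission, discharge)`, the helper names, the date used in the sample data, the choice to treat a missing discharge (`None`) as "still admitted", and the choice to count stays that merely touch at an endpoint as overlapping.

```python
from datetime import datetime

def overlaps(a_start, a_end, b_start, b_end):
    """True if the two stays share at least one instant.
    A missing end (None) is treated as 'still admitted'."""
    a_end = a_end or datetime.max
    b_end = b_end or datetime.max
    return a_start <= b_end and b_start <= a_end

def count_overlaps(stays):
    """Self-join: for each stay, count the other stays it overlaps."""
    counts = {}
    for a_id, a_adm, a_dis in stays:
        counts[a_id] = sum(
            1
            for b_id, b_adm, b_dis in stays
            if a_id != b_id and overlaps(a_adm, a_dis, b_adm, b_dis)
        )
    return counts

# Martin's three-patient example, placed on an arbitrary day.
day = (2020, 1, 1)
stays = [
    (1, datetime(*day, 7), datetime(*day, 9)),
    (2, datetime(*day, 7), datetime(*day, 10)),
    (3, datetime(*day, 9), datetime(*day, 10)),
]
counts = count_overlaps(stays)
print(counts)
```

Because the comparison is inclusive, patient 1 (discharged at 9AM) and patient 3 (admitted at 9AM) count as overlapping; replacing `<=` with `<` would exclude such touching stays. Note that this is the same N × N approach, so it is only practical for small example sets.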

Give me a few hours before I can check, but thanks already for spending your time on this!!!
Sven

lionelderkrikor,

Your Python script worked with the example you provided, so I considered this a solution. However, in the dataset I am using, "sehid" is the (group), "aankomstdt" is the admission datetime and "ontslagdt" is the discharge datetime.

I adapted the variables in the Python script, but the final count is zero for all examples. What am I doing wrong here?

Hopefully you can help me?

Cheers

Sven

Overlapping.csv

overlapping.rmp

Hi Sven,

"the final count is zero for all examples"

**** !!! My Python script is... optimistic!

More seriously, yes, of course I can help you. In reality, "sehid" is your "id", not the new "group" variable I introduced.
The "group" variable is used to build groups of patients in order to study the overlaps within each group.
If you use the "id" as the "group", every group contains a single patient, so the count is indeed zero for all examples...
In your case, you have a priori only one group.
The bad news is that the process I shared raises an error when there is only one group.
The good news is that there is a (far-fetched) workaround.
Give me an hour to build a new version of the process and check that it works with the data you provided.

Sorry for the inconvenience..

See you soon !

Regards,

Lionel

Sven,

As I said in my previous post, I found a workaround.
It consists of extracting the first example from your dataset, assigning it to group "B",
and appending it to the end of your dataset (so that the overall dataset has 2 groups).

Given that you have a huge dataset and the computation time is very long, I used a Filter Example Range operator
to execute the process on only a fraction of your data to check the execution. A priori, the process works fine... and gives more relevant results!

To execute the process on your whole dataset, please remove/disable the last Filter Example Range operator (between
the Append operator and the Execute Python operator).

The working process is in the attached file.

Keep me informed !

Regards,

Lionel

Overlaps_Sven_2.rmp

Thanks++++++

Lionel,

I gave the process a try on the full example range. It has now been running for 1 day and 20 hours, with only 1.2 GB consumption, stable over the entire period. What do you think: should I just let it run until it finishes (in that case, how long would it take by your estimate?), or is there a way to run the process in steps?

Cheers

Sven

Hi Lionel and Balazs

Thanks for the reply. I also thought that overlap is only computable in one batch, because patient admission is continuous: any split of the dataset risks missing cases that overlap between the subsets. It is interesting to "feel" the impact of dimensionality on calculation time. I am trying to reconstruct how a human brain would look for overlap, and I wonder whether looping over ascending or descending times could reduce the number of possible combinations. Although theoretically all combinations are possible in overlap, this is only the case if the differences between admission and discharge range from zero to infinity, which is not realistic. Maybe the number of combinations can be reduced by starting from the median or average length of stay, which could already cover x % of the cases in a fraction of the time?

Thanks anyway!!!!

Sven
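The idea of working through the times in ascending order can be taken further: with sorted admission and discharge times, the number of patients present at a given moment is the number admitted so far minus the number already discharged, and no pairwise comparison is needed. The following is a minimal sketch of that idea using Python's standard `bisect` module, not the script from the attached process. It assumes every stay has a discharge time that is not earlier than its admission, that a patient discharged exactly at the moment in question still counts as present, and the function name, tuple layout and sample date are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

def count_present(stays):
    """For each stay (id, admission, discharge), return a pair:
    (others present at admission, others present at discharge)."""
    admissions = sorted(adm for _, adm, _ in stays)
    discharges = sorted(dis for _, _, dis in stays)

    def present_at(t):
        admitted = bisect_right(admissions, t)  # stays with admission <= t
        gone = bisect_left(discharges, t)       # stays with discharge < t
        return admitted - gone

    # Subtract 1 because the patient is always present during their own stay.
    return {
        pid: (present_at(adm) - 1, present_at(dis) - 1)
        for pid, adm, dis in stays
    }

# Martin's three-patient example, placed on an arbitrary day.
day = (2020, 1, 1)
stays = [
    (1, datetime(*day, 7), datetime(*day, 9)),
    (2, datetime(*day, 7), datetime(*day, 10)),
    (3, datetime(*day, 9), datetime(*day, 10)),
]
present = count_present(stays)
print(present)
```

Sorting costs on the order of N log N and each lookup is a binary search, so under these assumptions the whole dataset is processed in one pass without splitting it into subsets.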

Hi Sven,

you can create "fuzzy" batches, like taking 10 % of the data, calculating minimum and maximum entry and exit times, and filtering the candidate dataset accordingly. Then you would remove duplicates that inevitably appear in the result.

Regards,
Balázs
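One way to read the "fuzzy" batch suggestion is sketched below in plain Python. It is an interpretation, not Balázs's implementation: the stays are sorted by admission, cut into batches, and each batch is only compared against the candidates that fall inside the batch's own time window (earliest admission to latest discharge). The function names, the tuple layout `(id, admission, discharge)`, the sample data and the inclusive overlap test are all assumptions.

```python
from datetime import datetime

def overlaps(a_start, a_end, b_start, b_end):
    """Inclusive overlap test for two stays."""
    return a_start <= b_end and b_start <= a_end

def count_overlaps_batched(stays, n_batches=10):
    """Count overlapping stays per patient, one batch at a time."""
    ordered = sorted(stays, key=lambda s: s[1])      # sort by admission
    size = max(1, -(-len(ordered) // n_batches))     # ceiling division
    counts = {}
    for i in range(0, len(ordered), size):
        batch = ordered[i:i + size]
        window_start = min(adm for _, adm, _ in batch)
        window_end = max(dis for _, _, dis in batch)
        # Candidates: every stay that touches the batch's time window.
        candidates = [
            s for s in ordered
            if overlaps(window_start, window_end, s[1], s[2])
        ]
        for a_id, a_adm, a_dis in batch:
            counts[a_id] = sum(
                1
                for b_id, b_adm, b_dis in candidates
                if b_id != a_id and overlaps(a_adm, a_dis, b_adm, b_dis)
            )
    return counts

day = (2020, 1, 1)
stays = [
    (1, datetime(*day, 7), datetime(*day, 9)),
    (2, datetime(*day, 7), datetime(*day, 10)),
    (3, datetime(*day, 9), datetime(*day, 10)),
    (4, datetime(*day, 12), datetime(*day, 13)),
]
batched_counts = count_overlaps_batched(stays, n_batches=2)
print(batched_counts)
```

Because each stay is counted only in the batch it belongs to, this variant produces no duplicate counts; a join that materialises the overlapping pairs per batch would still need the duplicate removal mentioned above. The saving depends on the windows being narrow, so a few very long stays can widen a window and pull in many candidates.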

Hi,

What is your opinion on InterLap (https://brentp.github.io/interlap/)?

Regards

Sven