I teach a Data Mining course in an MBA program. I have done so for several years now, and I use RapidMiner as the main software package.
This year I want to introduce the topic of Survival Analysis in Data Mining. The main application is to model customer retention. I have searched this forum and I have concluded that the standard models for doing SA are not available and will not be available anytime soon.
That was bad news for me because I don't want to use two packages (I could use R). And then... I found this magnificent paper by Singer & Willett on Discrete-Time Survival Analysis.
http://gseacademic.harvard.edu/~willetjo/pdf%20files/Singer%20&%20Willett%201993.pdf Bottom line: all you need is Logistic Regression. So far so good. There is one small problem: the dataset has to be put into a specific format (the so-called person-period format).
I'll explain with an example:
Suppose I have the following dataset:
id,month,event,x1,x2
1,5,0,0.19,0.65
2,6,1,0.41,0.33
3,7,0,0.22,0.79
4,8,1,0.56,0.91
5,9,0,0.71,0.36
id = patient's id
month = months to event or censoring time
event = 1 if the event (death, for instance) occurred, 0 if censored (the event had not taken place by the time the study finished)
x1, x2 are potential explanatory variables.
To run the model suggested by Singer & Willett, I need the dataset in the format below.
id,month,event,x1,x2
1,1,0,0.19,0.65
1,2,0,0.19,0.65
1,3,0,0.19,0.65
1,4,0,0.19,0.65
1,5,0,0.19,0.65
2,1,0,0.41,0.33
2,2,0,0.41,0.33
2,3,0,0.41,0.33
2,4,0,0.41,0.33
2,5,0,0.41,0.33
2,6,1,0.41,0.33
3,1,0,0.22,0.79
3,2,0,0.22,0.79
3,3,0,0.22,0.79
3,4,0,0.22,0.79
3,5,0,0.22,0.79
3,6,0,0.22,0.79
3,7,0,0.22,0.79
4,1,0,0.56,0.91
4,2,0,0.56,0.91
4,3,0,0.56,0.91
4,4,0,0.56,0.91
4,5,0,0.56,0.91
4,6,0,0.56,0.91
4,7,0,0.56,0.91
4,8,1,0.56,0.91
5,1,0,0.71,0.36
5,2,0,0.71,0.36
5,3,0,0.71,0.36
5,4,0,0.71,0.36
5,5,0,0.71,0.36
5,6,0,0.71,0.36
5,7,0,0.71,0.36
5,8,0,0.71,0.36
5,9,0,0.71,0.36
We want to create a separate observation for each period in which each person was observed, up to the period in which the event occurred.
Thus a person who died in month 1 contributes 1 person-period; a person who died in month 6 (like individual 2) contributes 6 person-periods.
For individual 2, the variable event is 0 for the first 5 periods and 1 for the sixth period.
Censored individuals (those who were still alive when the study ended) contribute as many periods as they were observed.
For instance, individual 5 contributes 9 periods. For all the periods observed, the variable event takes the value 0.
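To make the transformation concrete, here is a minimal sketch of the expansion in plain Python. It is not a RapidMiner process, only an illustration of the logic I am after; the function name to_person_period and the inline RAW string are my own placeholders, and the data is the five-row example from above.

```python
import csv
import io
import sys

# The five-row example dataset: one row per individual.
RAW = """id,month,event,x1,x2
1,5,0,0.19,0.65
2,6,1,0.41,0.33
3,7,0,0.22,0.79
4,8,1,0.56,0.91
5,9,0,0.71,0.36
"""

FIELDS = ["id", "month", "event", "x1", "x2"]


def to_person_period(rows):
    """Expand one-row-per-individual records into one row per observed period.

    Hypothetical helper: 'month' is the last observed period and 'event'
    says whether the event occurred in that last period (1) or the
    individual was censored (0).
    """
    expanded = []
    for row in rows:
        last_period = int(row["month"])
        for period in range(1, last_period + 1):
            new_row = dict(row)  # covariates x1, x2 are copied unchanged
            new_row["month"] = period
            # event can only be 1 in the final period of a non-censored individual
            new_row["event"] = int(row["event"]) if period == last_period else 0
            expanded.append(new_row)
    return expanded


rows = list(csv.DictReader(io.StringIO(RAW)))
person_period = to_person_period(rows)

# Write the person-period dataset as CSV to standard output.
writer = csv.DictWriter(sys.stdout, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(person_period)
```

The output has one row per individual per month, with event equal to 1 only in the last row of individuals 2 and 4. A logistic regression of event on month (as a set of dummies) plus x1 and x2 would then be run on this expanded table.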
Help is greatly appreciated.