incorporating control groups

Question

Hi,

I'm a newbie to data mining and I'm trying to figure out how to use a control group in the analysis. For example, I have 120000 customers who are candidates to receive a mailing, of which only 100000 (randomly selected) actually receive it; the remaining 20000 form the control group. The effect that we want to maximize is the difference in response between the target group and the control group. I did find a paper by Victor Lo exploring the issue:

www.sigkdd.org/explorations/issues/4-2-2002-12/lo.pdf

He has some extensive suggestions, the essence of which is quoted below. Can anyone comment on this? Are there better or different ways? Are there tools available in RapidMiner explicitly to help with control group analysis?

Thanks!
Jamie Beerbower
"1. Include data, {Yi,Xi} from both the treatment and control
groups in the analysis data set;
2. Assign a dummy variable Ti to 1 for the treatment group and
0 for the control group;
3. Divide the data set into training and hold-out samples;
4. Further divide the training sample into two sub-samples by
Ti, i.e. one is treatment and the other is control;
5. Choose a variable selection method (or called feature
extraction). In each sub-sample (treatment and control), use
the method to narrow down your list of independent
variables, Xi (often an essential step in data mining as there
are normally hundreds of independent variables);
6. Take the union of the two reduced sets of independent
variables from 5 and thus, the new Xi has only q elements,
where q<original number of independent variables, p;
7. Multiply all independent variables, Xi, (from step 6) by Ti to
form the interaction effects, Xi*Ti;
8. Choose a data mining or statistical technique for supervised
learning;
9. Fit a model using Yi as the dependent variable and Xi, Ti, and
Xi*Ti as independent variables;
10. Use stepwise procedure (or similar model selection
procedure) to determine the best parsimonious model.
After the best model is selected, we propose the following
procedure for validation using the holdout sample:
1. For each individual in the hold-out sample, compute the
predicted values of expected Yi for both the treatment and
control, i.e. predict E(Yi|Xi;treatment) and E(Yi|Xi;control);
2. Subtract the control value from the treatment value to
estimate the treatment and control difference (in order to
achieve objective (3));
3. Rank and decile the entire hold-out sample by the predicted
difference;
4. In each decile, compute the observed mean value of Yi’s in
the treatment group and the observed mean value of Yi’s in
the control group and then take the observed difference;
5. Plot the observed difference between treatment and control
by decile to validate the model;
6. The expected true lift can be measured by how much the top
decile(s) perform better than random using the observed
treatment and control difference from step 6."

From The True Lift Model - A Novel Data Mining Approach to
Response Modeling in Database Marketing
Victor S.Y. Lo
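
For anyone who wants to try the quoted procedure outside of RapidMiner, here is a minimal sketch in Python with NumPy. Several things in it are assumptions and not taken from the paper: the function names are hypothetical, the data is synthetic, and ordinary least squares (a linear probability model) stands in for the "data mining or statistical technique" of step 8. Variable selection and the stepwise procedure (steps 5, 6 and 10) are skipped entirely. It fits Yi on Xi, Ti and Xi*Ti, scores the hold-out sample under both treatment and control, and reports the observed treatment-minus-control difference per decile.

```python
import numpy as np

def design_matrix(X, t):
    """Build the columns [1, Xi, Ti, Xi*Ti] used in steps 2, 7 and 9."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    return np.hstack([np.ones((len(X), 1)), X, t, X * t])

def fit_interaction_model(X, t, y):
    """Least-squares fit of Yi on [1, Xi, Ti, Xi*Ti] (assumed technique)."""
    coef, *_ = np.linalg.lstsq(design_matrix(X, t),
                               np.asarray(y, dtype=float), rcond=None)
    return coef

def predicted_difference(coef, X):
    """E(Yi|Xi;treatment) - E(Yi|Xi;control) for every row of X."""
    n = len(X)
    treated = design_matrix(X, np.ones(n)) @ coef
    control = design_matrix(X, np.zeros(n)) @ coef
    return treated - control

def observed_difference_by_decile(pred_diff, t, y, n_bins=10):
    """Rank by predicted difference (highest first), split into bins and
    return the observed treatment-minus-control mean of Yi per bin.
    Assumes every bin contains both treated and control individuals."""
    pred_diff, t, y = (np.asarray(a) for a in (pred_diff, t, y))
    order = np.argsort(-pred_diff)
    result = []
    for idx in np.array_split(order, n_bins):
        tb, yb = t[idx], y[idx]
        result.append(yb[tb == 1].mean() - yb[tb == 0].mean())
    return np.array(result)

# Synthetic data only (not real campaign results): 120000 candidates,
# each mailed with probability 100000/120000.
rng = np.random.default_rng(0)
n = 120_000
X = rng.normal(size=(n, 2))
t = (rng.random(n) < 100_000 / 120_000).astype(int)
# Assumed ground truth: 5% base response; the mailing adds an uplift
# that grows with the first variable.
uplift = np.clip(0.03 + 0.03 * X[:, 0], 0.0, 0.2)
y = (rng.random(n) < 0.05 + t * uplift).astype(float)

# Step 3: training and hold-out samples.
train = rng.random(n) < 0.7
hold = ~train
coef = fit_interaction_model(X[train], t[train], y[train])

# Validation steps 1 to 4 on the hold-out sample.
pred = predicted_difference(coef, X[hold])
observed = observed_difference_by_decile(pred, t[hold], y[hold])
print(np.round(observed, 3))
```

If the model ranks people sensibly, the printed differences should be larger in the first deciles than in the last ones, which is what the plot in validation step 5 is meant to show. With a real response variable a logistic model would probably be the more natural choice; least squares is used here only to keep the sketch dependency-free.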

Nisa · Answer

Hi Steffen
I found the link you posted very useful.
Thanks for sharing....

jamesbeerbower · Answer

Hi all,

Ingo, thanks for taking the trouble to check whether Dr. Lo's battle plan is feasible. Before I go ahead and implement it, I need to look at the other strategies and, most importantly, get my head around the quality measurement strategy.

Practically, I doubt any interesting new hypothesis can come out of the process: we (like almost everyone else in the world) simply don't have the quantity of data to look at more than one dimension (one factor) at a time. The difference between "treated" and "untreated" is simply too small in the advertising world.

Best regards

Jamie Beerbower
Hochheim am Main

steffen · Answer

Hello Jamie

Thank you for the links :D. This is a really interesting topic.
I want to add this one: http://en.wikipedia.org/wiki/Uplift_modelling, primarily for the terms and groups in customer segmentation.

So little time, so much to learn...

greetings

Steffen