
incorporating control groups

User: "jamesbeerbower"
New Altair Community Member
Updated by Jocelyn
Hi,

I'm a newbie to data mining and I'm trying to figure out how to use a control group in the analysis. E.g., I have 120000 customers who are candidates to receive a mailing, of which only 100000 (randomly selected) do receive the mailing. The effect that we want to maximize is the difference between the control group and the target group response. I did find a paper exploring the issue from Victor Lo:

www.sigkdd.org/explorations/issues/4-2-2002-12/lo.pdf

He has some extensive suggestions -- the essence of which is quoted below. Can anyone comment on this? Are there better or different ways? Are there tools available in RapidMiner explicitly to help with control group analysis?

Thanks!
Jamie Beerbower
"1. Include data, {Yi,Xi} from both the treatment and control
groups in the analysis data set;
2. Assign a dummy variable Ti to 1 for the treatment group and
0 for the control group;
3. Divide the data set into training and hold-out samples;
4. Further divide the training sample into two sub-samples by
Ti, i.e. one is treatment and the other is control;
5. Choose a variable selection method (or called feature
extraction). In each sub-sample (treatment and control), use
the method to narrow down your list of independent
variables, Xi (often an essential step in data mining as there
are normally hundreds of independent variables);
6. Take the union of the two reduced sets of independent
variables from 5 and thus, the new Xi has only q elements,
where q<original number of independent variables, p;
7. Multiply all independent variables, Xi, (from step 6) by Ti to
form the interaction effects, Xi*Ti;
8. Choose a data mining or statistical technique for supervised
learning;
9. Fit a model using Yi as the dependent variable and Xi, Ti, and
Xi*Ti as independent variables;
10. Use stepwise procedure (or similar model selection
procedure) to determine the best parsimonious model.
After the best model is selected, we propose the following
procedure for validation using the holdout sample:
1. For each individual in the hold-out sample, compute the
predicted values of expected Yi for both the treatment and
control, i.e. predict E(Yi|Xi;treatment) and E(Yi|Xi;control);
2. Subtract the control value from the treatment value to
estimate the treatment and control difference (in order to
achieve objective (3));
3. Rank and decile the entire hold-out sample by the predicted
difference;
4. In each decile, compute the observed mean value of Yi’s in
the treatment group and the observed mean value of Yi’s in
the control group and then take the observed difference;
5. Plot the observed difference between treatment and control
by decile to validate the model;
6. The expected true lift can be measured by how much the top
decile(s) perform better than random using the observed
treatment and control difference from step 6."

From The True Lift Model - A Novel Data Mining Approach to
Response Modeling in Database Marketing
Victor S.Y. Lo
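
To make the quoted procedure concrete, here is a minimal Python sketch of how I read it. Several things in it are assumptions and not from the paper: the data is synthetic (made up to match the 120000 / 100000 mailing example), the "supervised learning technique" is plain least squares on a 0/1 response, variable selection and the stepwise step are skipped, and all function names are hypothetical.

```python
import numpy as np

def design_matrix(X, t):
    """Columns: intercept, Xi, Ti, Xi*Ti (modelling steps 2, 7 and 9)."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    return np.hstack([np.ones_like(t), X, t, X * t])

def fit_true_lift(X, t, y):
    """Least-squares fit of y on [1, X, T, X*T]; a stand-in for any learner."""
    coef, *_ = np.linalg.lstsq(design_matrix(X, t), y, rcond=None)
    return coef

def predicted_difference(coef, X):
    """E(Y|X; treatment) - E(Y|X; control) for each row (validation steps 1-2)."""
    n = X.shape[0]
    as_treated = design_matrix(X, np.ones(n)) @ coef
    as_control = design_matrix(X, np.zeros(n)) @ coef
    return as_treated - as_control

def observed_lift_by_decile(pred_diff, t, y, n_bins=10):
    """Rank by predicted difference, then compute the observed treatment
    mean minus control mean inside each bin (validation steps 3-4)."""
    order = np.argsort(-pred_diff)  # largest predicted difference first
    lifts = []
    for idx in np.array_split(order, n_bins):
        treated = t[idx] == 1
        lifts.append(y[idx][treated].mean() - y[idx][~treated].mean())
    return np.array(lifts)

# Synthetic data (assumption): 120000 customers, 100000 mailed at random.
rng = np.random.default_rng(0)
n = 120_000
X = rng.normal(size=(n, 3))
t = np.zeros(n, dtype=int)
t[rng.choice(n, size=100_000, replace=False)] = 1
# Made-up response model: baseline depends on X[:, 0],
# the effect of the mailing depends on X[:, 1].
p = np.clip(0.05 + 0.01 * X[:, 0] + t * (0.03 + 0.02 * X[:, 1]), 0, 1)
y = (rng.random(n) < p).astype(float)

# Modelling step 3: split into training and hold-out samples.
holdout = rng.random(n) < 0.3
coef = fit_true_lift(X[~holdout], t[~holdout], y[~holdout])
pred = predicted_difference(coef, X[holdout])
lifts = observed_lift_by_decile(pred, t[holdout], y[holdout])
print(np.round(lifts, 4))  # observed treatment-minus-control response per decile
```

If the model is picking up a real difference, the observed lift should be highest in the first deciles and fall off toward the last ones, which is what validation step 5 asks you to plot.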
