Donor Analysis

Question

Hi!

I'm doing a donor (customer) analysis for my master's thesis and I hope you can help me, as I'm not very experienced with RapidMiner. I have data from three different departments (dialogue marketing, campaign team, online marketing) of an NPO, as they don't have a central data warehouse yet. I have already matched the three data sheets and done some data preparation.

My problem now is that I don't know which operator should be the final one in my process, and therefore what my next steps are.
I have the following data on the donors: donorID, e-mail, zip code, gender (man/woman/family), creation date, product status (we distinguish 9 products, e.g. "godfather", "member", "protector"), origin (e.g. "internet", "mailing"), total donation, number of donations and date of birth.

I want to find new insights in the data. The complete data has never been analysed. The three departments have different goals. The dialogue marketing team tries to get high donation amounts. The campaign team wants a lot of signatures for petitions. The online marketing team wants people to subscribe to the newsletter. I want to find the donors who donated the most money. Maybe donors who are also subscribed to our newsletter donate more money, or maybe not. Maybe donors who are above 40, signed a petition and are from a specific region donate a lot of money.

Is it better to have different data sheets (e.g. matched donors from the dialogue and online marketing teams) or to use only one big sheet (with columns: newsletter TRUE/FALSE, campaign TRUE/FALSE)? Which operators should I use to analyse the data?
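For illustration, the "one big sheet" layout described above could look like the following sketch in plain Python. The donor IDs, amounts and the helper name `build_wide_table` are made up for the example. It only shows the shape of the table and does not settle which layout is better.

```python
# Hypothetical sample data: the dialogue marketing sheet keyed by donorID,
# and the other two sheets reduced to sets of donor IDs.
dialogue_donors = {
    101: {"total_donation": 250.0},
    102: {"total_donation": 40.0},
    103: {"total_donation": 900.0},
}
newsletter_ids = {101, 104}      # donors subscribed to the newsletter
petition_ids = {102, 103, 104}   # donors who signed a petition


def build_wide_table(donors, newsletter_ids, petition_ids):
    """Return one row per donor with TRUE/FALSE flag columns."""
    rows = []
    for donor_id, attrs in sorted(donors.items()):
        row = {"donorID": donor_id, **attrs}
        row["newsletter"] = donor_id in newsletter_ids
        row["campaign"] = donor_id in petition_ids
        rows.append(row)
    return rows


wide = build_wide_table(dialogue_donors, newsletter_ids, petition_ids)
for row in wide:
    print(row)
```

With this shape, every later step sees one row per donor, and the flags can be used like any other attribute.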

I also have some questions about data preparation. I want to transform the date of birth into an age. Is there an operator that calculates the age using the current date? Is there an operator I can use to generate age groups (e.g. 18-25, 26-35, 36-45, ...)?
The zip codes consist of five digits (Germany). To get a bigger region, I'd like to use only the first two digits. Which operator can I use to cut off the last three digits?
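The three preparation steps asked about above (age from date of birth, age groups, two-digit region) can be sketched in plain Python as follows. This is not a RapidMiner process. The function names are made up, and the "under 18" and "46+" groups are assumptions, since the question leaves the outer groups open.

```python
from bisect import bisect_right
from datetime import date


def age_on(birth_date, today):
    """Full years between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one year if the birthday has not happened yet this year.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Lower bounds of the groups 18-25, 26-35, 36-45 and an assumed open "46+" group.
LOWER_BOUNDS = [18, 26, 36, 46]
LABELS = ["under 18", "18-25", "26-35", "36-45", "46+"]


def age_group(age):
    """Map an age in years to its group label."""
    return LABELS[bisect_right(LOWER_BOUNDS, age)]


def region_from_zip(zipcode):
    """First two digits of a five-digit German zip code, kept as text."""
    return str(zipcode).zfill(5)[:2]


age = age_on(date(1989, 6, 15), date.today())
print(age, age_group(age), region_from_zip("01067"))
```

Note that the zip code is treated as text here. German zip codes can start with a zero (e.g. 01067), which would be lost if the column were read as a number.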

Thanks in advance for your help!
Tim

timgras89 · Answer

I still have problems with my analysis.

1.) For clustering, I have to normalize my data. But how can I normalize text? I have different products, but saying product 1 is "0.1" and product 4 is "0.4" doesn't make sense. How do I handle that? Is it a correct normalization if I set "0" for "men" and "1" for "women"? My region attribute consists of the first two digits of the zip code (Germany). How can I normalize that?
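One common way to handle the problem described in 1.) is dummy (one-hot) coding: each product becomes its own 0/1 column, so no artificial order like "0.1 < 0.4" is introduced. Numeric attributes can then be rescaled separately. The sketch below uses plain Python with made-up sample values. The helper names `one_hot` and `min_max` are only for illustration.

```python
def one_hot(values):
    """Turn a list of nominal values into 0/1 columns, one per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def min_max(values):
    """Rescale numeric values to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample data.
products = ["member", "godfather", "protector", "member"]
money = [50.0, 200.0, 125.0, 50.0]

encoded = one_hot(products)
scaled = min_max(money)
print(encoded[0])
print(scaled)
```

The same coding would apply to the two-digit region, which is a category rather than a quantity, although it produces many columns. A two-valued attribute such as a man/woman flag is the special case where a single 0/1 column is enough.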

2.) I want to find clusters of customers who spend the most money. Do I have to set my "money spent" attribute as the label? When I do that, I can't see how much money these clusters spend on average. How can I see that?
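The average asked for in 2.) is a group-by over the cluster assignment. A minimal sketch in plain Python, assuming each donor already has a cluster ID and that "money spent" is still available as an ordinary column (the cluster names and amounts are made up):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical output of a clustering step: (cluster id, money spent) per donor.
assignments = [
    ("cluster_0", 20.0),
    ("cluster_1", 500.0),
    ("cluster_0", 40.0),
    ("cluster_1", 700.0),
]


def average_by_cluster(assignments):
    """Average money spent for each cluster id."""
    grouped = defaultdict(list)
    for cluster, amount in assignments:
        grouped[cluster].append(amount)
    return {cluster: mean(amounts) for cluster, amounts in grouped.items()}


averages = average_by_cluster(assignments)
print(averages)
```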

3.) I'd like to see a scatter plot to get a graphical overview of my clusters. My process doesn't show one. Why is that?

I have attached my process XML. I only wanted to look at the attributes "money spent", "region" (first two digits of the zip code), "products" and "gender".

Process.xml

SGolbert · Answer

I think that Generalized Linear Models could be a good fit for your problem, provided you have a threshold or categories for the different investment amounts (you may have to categorize the label attribute). You get interpretable coefficients out of it, which is a big plus.

MartinLiebig · Answer

Hi Tim,

I would first consider turning this whole problem into a supervised learning problem. One option might be: predict how much a donor is willing to give.

This information can be used if you recruit a new donor. It might also be used to target "under-performing" donors.
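As a rough sketch of what "predict how much a donor is willing to give" can mean, the following plain-Python example fits a straight line from age to total donation using ordinary least squares. The data are invented and a single predictor is an oversimplification. It is only meant to show the supervised setup (known inputs, known target, prediction for a new donor), not a recommended model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training data: donor age and total donation.
ages = [25, 35, 45, 55]
totals = [50.0, 100.0, 150.0, 200.0]

intercept, slope = fit_line(ages, totals)

# Predicted total donation for a new 40-year-old donor.
predicted = intercept + slope * 40
print(intercept, slope, predicted)
```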

Best,

Martin