Determining new sales regions

Question

Hello,

I'm pretty new to data mining and would like to hear the opinion of the experts here. Maybe you can help.

I've got a scenario where a shop owner from Hamburg wants to open another store in Berlin and wants to know which city district would be suitable for it.

I have a data set about urban districts with values such as employment rate, ages of the inhabitants, marital status, purchasing power, and so on. Here's an example:

| District name | No. inhabitants | No. employed inhabitants | Age 0 - 25 | Age 25 - 50 | Age > 50 | No. singles | No. married |
|---|---|---|---|---|---|---|---|
| Berlin Kreuzberg | 123457 | 60887 | 90447 | 20448 | 10679 | 14359 | 97658 |
| Berlin Mitte | 93457 | 30887 | 46847 | 16748 | 10679 | 56359 | 37618 |
| Berlin Spandau | 120057 | 66887 | 54947 | 19998 | 55689 | 27959 | 93618 |
| Hamburg Altona | 123457 | 60887 | 90447 | 20448 | 10679 | 14359 | 97658 |
| Hamburg St.Pauli | 93457 | 30887 | 46847 | 16748 | 10679 | 56359 | 37618 |
| ... | ... | ... | ... | ... | ... | ... | ... |

There is also a set of customer data from the store in Hamburg (CustomerNo, address, district).

My goal is to determine which district in Berlin is the most suitable for the shop owner to open another shop, based on the district data set and his customer data.

My approach would be:

- get the top district of the customer data (e.g. Hamburg St.Pauli)
- determine via cluster analysis which district in Berlin is similar to Hamburg St. Pauli
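
A minimal Python sketch of the two steps above, using only the standard library. It is a plain nearest-neighbour similarity search rather than a full cluster analysis. The district figures are the ones from the example in this post; the customer records, the function names, and the choice to compare districts by their share of inhabitants in each group are assumptions made for illustration.

```python
from collections import Counter
import math

# Example rows from the post:
# (inhabitants, employed, age 0-25, age 25-50, age >50, singles, married)
districts = {
    "Berlin Kreuzberg": (123457, 60887, 90447, 20448, 10679, 14359, 97658),
    "Berlin Mitte": (93457, 30887, 46847, 16748, 10679, 56359, 37618),
    "Berlin Spandau": (120057, 66887, 54947, 19998, 55689, 27959, 93618),
    "Hamburg Altona": (123457, 60887, 90447, 20448, 10679, 14359, 97658),
    "Hamburg St.Pauli": (93457, 30887, 46847, 16748, 10679, 56359, 37618),
}

# Hypothetical customer records: (CustomerNo, district)
customers = [
    (1, "Hamburg St.Pauli"),
    (2, "Hamburg St.Pauli"),
    (3, "Hamburg Altona"),
]


def top_district(customers):
    """Return the district that appears most often in the customer data."""
    counts = Counter(district for _, district in customers)
    return counts.most_common(1)[0][0]


def shares(row):
    """Convert absolute counts to shares of all inhabitants,
    so that large and small districts become comparable."""
    inhabitants = row[0]
    return [value / inhabitants for value in row[1:]]


def most_similar(target, candidates, districts):
    """Return the candidate whose shares have the smallest
    Euclidean distance to the target district's shares."""
    target_shares = shares(districts[target])
    return min(
        candidates,
        key=lambda name: math.dist(target_shares, shares(districts[name])),
    )


top = top_district(customers)
berlin = [name for name in districts if name.startswith("Berlin")]
best = most_similar(top, berlin, districts)
print(top, "->", best)
```

With these example rows the top district is Hamburg St.Pauli and the closest Berlin district is Berlin Mitte, because the two rows are identical in the example. On real data the choice of distance measure and scaling would need more thought.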

My questions:
1. Would a clustering analysis be a suitable way to solve this problem?
2. If so, which clustering algorithm is suitable for this kind of data?
3. If not, what other methods would be more suitable?
4. The data set with the district data has many attributes. Is a high number of attributes only a performance issue, or is there a danger of having "too much data to analyse"? I have seen that there are some operators in RM5 to remove uninteresting attributes.
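
To make question 4 concrete, here is a small Python sketch of what "removing uninteresting attributes" could mean: dropping attributes that are constant or almost perfectly correlated with one that is already kept. This is not the RM5 operator, only an illustration; the column names, the values, and the 0.95 threshold are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return 0.0
    return cov / (dev_x * dev_y)


def drop_redundant(columns, threshold=0.95):
    """Keep an attribute only if it varies and is not strongly
    correlated with an attribute that was kept earlier."""
    kept = {}
    for name, values in columns.items():
        if len(set(values)) == 1:
            continue  # constant attribute carries no information
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return list(kept)


# Hypothetical attributes for four districts
columns = {
    "employed_share": [0.49, 0.33, 0.56, 0.41],
    "employed_percent": [49, 33, 56, 41],    # same information, other unit
    "single_share": [0.12, 0.60, 0.23, 0.35],
    "country_code": [1, 1, 1, 1],            # constant
}

selected = drop_redundant(columns)
print(selected)
```

Here the percent column and the constant column are dropped, leaving `employed_share` and `single_share`.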

Thanks

Edit: If this is the wrong forum for this question, I apologize.

yogafire · Answer

I'm relatively new to data mining too, but I'll try to give my view.

I think your problem is somewhat suited to classification, but it doesn't matter if you want to employ a clustering technique instead, because RM can map clusters to a classification scheme.

Think of it like this:

Look at your data and decide which districts are suitable for a store in reality and which are not, then add a label (e.g. suitable, not suitable). The labeled data is your training set for classification, and the unlabeled data is your testing set. For example, if you consider Hamburg St. Pauli and Hamburg Altona suitable, give them the label "suitable"; if you consider Hamburg xxx and Hamburg yyy unsuitable, give them the label "not suitable". It should all be based on reality. The Berlin xxx data then goes into the testing set, with the district name as the id.

After that you can build a model and validate it to get the best accuracy. The best model can then easily be used for prediction. Looking at the structure and type of your data, I think it would be wise to use neural net modelling. As for a suitable clustering method, I'm sorry, I'm not good at clustering.
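
Outside RM, the label, train, validate, predict idea could be sketched in Python roughly as below. To stay dependency-free it uses a 1-nearest-neighbour classifier instead of a neural net, and leave-one-out validation for the accuracy check. All district names, feature values, and labels are hypothetical.

```python
import math

# Hypothetical labelled Hamburg districts:
# (employed share, single share) -> label assigned from real-world knowledge
training = [
    ((0.33, 0.60), "suitable"),
    ((0.36, 0.55), "suitable"),
    ((0.49, 0.12), "not suitable"),
    ((0.52, 0.18), "not suitable"),
]

# Hypothetical unlabelled Berlin districts, keyed by id
berlin = {
    "Berlin A": (0.34, 0.58),
    "Berlin B": (0.50, 0.15),
}


def predict(features, examples):
    """Return the label of the training example closest to `features`."""
    nearest = min(examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]


def leave_one_out_accuracy(examples):
    """Predict each example from all the others and return the hit rate."""
    hits = 0
    for i, (features, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        if predict(features, rest) == label:
            hits += 1
    return hits / len(examples)


accuracy = leave_one_out_accuracy(training)
predictions = {name: predict(f, training) for name, f in berlin.items()}
print(accuracy, predictions)
```

With only a handful of labelled districts the accuracy figure means very little, so in practice the model choice should be validated on as many labelled districts as are available.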

I hope this helps. Finally, sorry for my English, I'm still learning. I'm Indonesian. ;D

regards,

Dimas Yogatama