Hello,
I'm pretty new to data mining and would like to hear the opinion of the experts here. Maybe you can help.
I've got a scenario where a shop owner from Hamburg wants to open another store in Berlin and wants to know which city district would be suitable for it.
I have a data set about urban districts with attributes such as employment rate, age of the inhabitants, marital status, purchasing power, ... Here's an example:
*******************
District name | No. inhabitants | No. employed inhabitants | Age 0 - 25 | Age 25 - 50 | Age > 50 | No. singles | No. married
Berlin Kreuzberg123457608879044720448106791435997658
Berlin Mitte93457308874684716748106795635937618
Berlin Spandau120057668875494719998556892795993618
Hamburg Altona123457608879044720448106791435997658
Hamburg St.Pauli93457308874684716748106795635937618
........................
*******************
There is also a set of customer data from the store in Hamburg (CustomerNo, address, district).
My goal is to determine which district in Berlin is the most suitable for the shop owner to open another shop, based on the district data set and his customer data.
My approach would be:
- get the top district of the customer data (e.g. Hamburg St.Pauli)
- determine via cluster analysis which district in Berlin is similar to Hamburg St. Pauli
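To make the idea concrete, here is a minimal sketch in Python of the two steps above. It is only an illustration under several assumptions: "top district" is taken to mean the district with the most customers, all numbers are made up (they are not my real data), and the function names (top_district, standardize, most_similar) are hypothetical. Note that it does not run a clustering algorithm; it simply standardizes the attributes and picks the nearest Berlin district by Euclidean distance.

```python
from collections import Counter
from math import sqrt
from statistics import mean, pstdev

# Hypothetical customer records: (customer_no, district). Made-up values.
customers = [
    (1, "Hamburg St.Pauli"),
    (2, "Hamburg St.Pauli"),
    (3, "Hamburg Altona"),
]

# Hypothetical district attributes expressed as shares (not the real figures).
# Columns: employment rate, share aged 0-25, share aged 25-50, share of singles
districts = {
    "Hamburg St.Pauli": [0.62, 0.30, 0.45, 0.60],
    "Hamburg Altona": [0.70, 0.25, 0.40, 0.45],
    "Berlin Kreuzberg": [0.60, 0.32, 0.44, 0.62],
    "Berlin Mitte": [0.75, 0.20, 0.50, 0.50],
    "Berlin Spandau": [0.68, 0.22, 0.35, 0.35],
}


def top_district(records):
    """Return the district with the most customers (assumed meaning of 'top')."""
    counts = Counter(district for _, district in records)
    return counts.most_common(1)[0][0]


def standardize(table):
    """Z-score each column so that no single attribute dominates the distance."""
    columns = list(zip(*table.values()))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread
    return {
        name: [(v - m) / s for v, m, s in zip(row, means, stds)]
        for name, row in table.items()
    }


def most_similar(reference, table, prefix):
    """Return the district whose name starts with `prefix` closest to `reference`."""
    scaled = standardize(table)
    ref = scaled[reference]
    candidates = [name for name in scaled if name.startswith(prefix)]
    return min(
        candidates,
        key=lambda name: sqrt(sum((a - b) ** 2 for a, b in zip(scaled[name], ref))),
    )


reference = top_district(customers)
best_match = most_similar(reference, districts, "Berlin")
print(reference, "->", best_match)
```

Using shares instead of raw counts keeps large districts from looking different just because of their size, and the standardization puts all attributes on a comparable scale before distances are computed.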
My questions:
1. Would a clustering analysis be a suitable way to solve this problem?
2. If so, which clustering algorithm is suitable for this kind of data?
3. If not, what other methods would be more suitable?
4. The data set with the district data has many attributes. Is a high number of attributes only a performance issue, or is there a risk of having "too much data to analyse"? I have seen that there are some operators in RM5 to remove uninteresting attributes.
Thanks
Edit: If this is the wrong forum for this question, I apologize.