Clustering in RapidMiner

Hello! I am working on a project in RapidMiner and I have a question. After clustering a group of consumers by their product ratings, how can I find the representative consumer of each cluster based on demographic data? I would appreciate it if someone could help me.

MartinLiebig

Hi,

The clustering model contains a centroid table. In this table you can see the center point of each cluster. You might want to use these as representatives (in the end, the centroid is the best representative of a cluster).
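A centroid only has values for the attributes that were used for clustering (here, the ratings). A minimal sketch of one way to get a demographic "representative" per cluster is to aggregate the demographic attributes of each cluster's members. The data, the attribute names and the `demographic_profiles` helper below are hypothetical and only illustrate the idea; this is not RapidMiner output.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy data: one row per consumer, carrying the cluster id that a
# ratings-based clustering assigned, plus demographic attributes.
consumers = [
    {"cluster": 0, "age": 24, "income": 21000, "gender": "f"},
    {"cluster": 0, "age": 28, "income": 25000, "gender": "f"},
    {"cluster": 0, "age": 26, "income": 23000, "gender": "m"},
    {"cluster": 1, "age": 51, "income": 48000, "gender": "m"},
    {"cluster": 1, "age": 47, "income": 52000, "gender": "m"},
]


def demographic_profiles(rows, numeric, nominal):
    """Return {cluster_id: profile} using the mean of numeric attributes
    and the most frequent value (mode) of nominal attributes."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cluster"]].append(row)

    profiles = {}
    for cluster_id, members in groups.items():
        profile = {name: mean(m[name] for m in members) for name in numeric}
        for name in nominal:
            counts = Counter(m[name] for m in members)
            profile[name] = counts.most_common(1)[0][0]
        profiles[cluster_id] = profile
    return profiles


profiles = demographic_profiles(consumers, ["age", "income"], ["gender"])
for cluster_id, profile in sorted(profiles.items()):
    print(cluster_id, profile)
```

In RapidMiner itself, the same kind of per-cluster summary would come from grouping the clustered example set by the cluster attribute.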

If you want an answer to something like "What is the most important attribute for Cluster X?", you might use the cluster ID as the label for a supervised learning algorithm and then do a standard feature selection.
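To illustrate the "cluster ID as label" idea, here is a small sketch that ranks numeric attributes by a one-way ANOVA F-ratio against the cluster IDs. This is a simple filter-style stand-in chosen for illustration, not the specific feature selection the post has in mind, and the data and attribute names are made up.

```python
from collections import defaultdict
from statistics import mean


def f_ratio(values, labels):
    """One-way ANOVA F statistic of one numeric attribute against cluster ids.
    Large values mean the attribute separates the clusters well."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    grand_mean = mean(values)
    k, n = len(groups), len(values)
    between = sum(
        len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values()
    ) / (k - 1)
    within = sum(
        (v - mean(g)) ** 2 for g in groups.values() for v in g
    ) / (n - k)
    return between / within if within > 0 else float("inf")


# Hypothetical data: cluster ids from the clustering, used here as the label.
cluster_ids = [0, 0, 0, 1, 1, 1]
attributes = {
    "age": [24, 28, 26, 51, 47, 49],
    "household_size": [2, 4, 3, 3, 2, 4],
}

scores = {name: f_ratio(vals, cluster_ids) for name, vals in attributes.items()}
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(name, round(score, 3))
```

Attributes at the top of the ranking differ most between clusters, so they are the natural candidates for describing what sets a cluster apart.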

Best,

Martin

nicka

Thank you very much! Your help is really important. I want to ask something else: which products should we propose to a "new" customer for whom we only know the rating of one given product? The only data we have are the users' ratings of the products. I think we should do something with a recommendation system. How can I use recommendation systems in RapidMiner, if this is the right way?

MartinLiebig

The basic question is whether you have a supervised learning problem.

Do you have a data set where you have the "truth"? Then you can simply use a classifier.

Otherwise you might want to find items which are usually bought together. Have a look at the FP-Growth operator and its tutorial in this case.
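As a rough illustration of "items which are usually bought together", the sketch below counts how often each pair of items occurs in the same basket and keeps the pairs above a minimum support. This is a brute-force pair count, not the FP-Growth algorithm itself (FP-Growth finds the same kind of frequent itemsets far more efficiently and for itemsets of any size). The baskets and the `frequent_pairs` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: the set of items each customer bought or rated highly.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]


def frequent_pairs(transactions, min_support):
    """Return {(item_a, item_b): support} for pairs whose support
    (fraction of transactions containing both items) is >= min_support."""
    counts = Counter()
    for items in transactions:
        # sorted() gives each pair one canonical order
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

A customer who is only known to like "bread" could then be offered "butter", because that pair appears together in most baskets.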

nicka

Another solution I have thought of: we could check which cluster the product that the customer has rated belongs to, and recommend the other products in this cluster?
One more question: we did a classification and the accuracy is very low, e.g. 30% +/- 15%, 50% +/- 15%. We have used Naive Bayes, Decision Tree and k-NN, but the accuracy is low with all of them. What can we do to improve our model's accuracy?

MartinLiebig

Hello nicka,

Of course you can analyse the cluster memberships. The question is how to find the "important" attributes. If you use the cluster_id as a label, you can use Weight by SVM to find the key attributes.

For the classification problem, there are several typical things you can do to optimize the performance:

0. Feature generation and preprocessing, e.g. converting dates to useful numbers, calculating differences, etc.
1. Feature selection
2. Choosing a different algorithm. I would try: SVM (with different kernels), Random Forest, Neural Net, Linear Regression, Boosted Decision Tree, LDA.
3. Optimizing the parameters of the algorithm (C for SVM is very, very important).

As described by CRISP-DM (http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining), this is a cycle, so you might go back to the data again.
Data science is nothing like "do that and be happy". Good data science is kind of an art.

Can you share the data and/or the processes? Then someone might have a look at it and give more detailed tips.

Best,

Martin

nicka

Thank you! We have downloaded (manually) a number of recent reviews of a particular hotel from TripAdvisor. We entered the data in an Excel file and marked each review as positive, negative or neutral based on the rating given by the user (ratings of 1-2 count as negative, 3 as neutral and 4-5 as positive).
1. We should apply text processing functions that lead to the largest possible reduction in the number of features (words) in the vectors describing the reviews.
2. We should develop a classification model that can classify new reviews into the three categories (positive, negative, neutral), and evaluate the classification accuracy by trying different algorithms.
Which options would you recommend in order to optimize the performance of the model?
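The labelling rule described above can be written down as a tiny function. This is only a sketch of the mapping as stated in this post (1-2 negative, 3 neutral, 4-5 positive); the function name is made up.

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to the three sentiment classes used in this thread."""
    if rating in (1, 2):
        return "negative"
    if rating == 3:
        return "neutral"
    if rating in (4, 5):
        return "positive"
    raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")


labels = [rating_to_label(r) for r in [5, 1, 3, 4, 2]]
print(labels)
```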

MartinLiebig

Ahhh it's text classification!

Then I would try 3 different algorithms: radial SVM, k-NN with cosine similarity and Naive Bayes.

Did you use stemming and pruning?

nicka

We can find SVM but we can't find Radial SVM. How can we select radial SVM? Is there an option somewhere? The same applies to k-NN with cosine similarity: how can we choose cosine similarity? We have used the stemming operator. Is pruning an operator?

MartinLiebig

Radial means that you use a radial kernel, so simply change the kernel option to radial.
Cosine similarity is a distance measure. When using k-NN you need to define one. Cosine similarity works quite well on text data.
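To make the k-NN with cosine similarity idea concrete, here is a small self-contained sketch on bag-of-words vectors. It is not how RapidMiner implements its operators. The toy reviews and the helper names are made up, and a real process would add stemming, pruning and TF-IDF weighting first.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Majority label among the k training reviews most similar to `text`."""
    query = vectorize(text)
    ranked = sorted(
        train,
        key=lambda example: cosine_similarity(query, vectorize(example[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled reviews.
train = [
    ("great room and friendly staff", "positive"),
    ("lovely breakfast great location", "positive"),
    ("friendly staff lovely room", "positive"),
    ("dirty room and rude staff", "negative"),
    ("noisy dirty and overpriced", "negative"),
    ("rude staff noisy room", "negative"),
]

print(knn_predict(train, "great location friendly staff"))
print(knn_predict(train, "dirty noisy room rude staff"))
```

Cosine similarity compares the direction of the word vectors rather than their length, so a long and a short review that use the same words still count as close.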

Pruning is an option of the Process Documents operator.
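For intuition, pruning can be thought of as dropping words that appear in too few or too many documents, which shrinks the word vector considerably. The sketch below assumes that document-frequency view. The parameter names `min_docs` and `max_fraction` are hypothetical and are not the names of the RapidMiner settings.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, min_docs=2, max_fraction=0.8):
    """Keep words that occur in at least `min_docs` documents and in at most
    `max_fraction` of all documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    limit = max_fraction * len(tokenized_docs)
    return {term for term, df in doc_freq.items() if min_docs <= df <= limit}


# Hypothetical tokenized reviews.
docs = [
    ["the", "room", "was", "clean"],
    ["the", "staff", "was", "rude"],
    ["the", "room", "was", "noisy"],
    ["the", "breakfast", "is", "great"],
    ["the", "staff", "is", "friendly"],
]

vocabulary = prune_vocabulary(docs)
print(sorted(vocabulary))
```

Very rare words cannot generalize to new reviews and words present in almost every review carry little signal, so removing both usually reduces the feature count without hurting accuracy much.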