Hi, I'm looking for the best k for clustering with k-means. I used the performance by distance operator to get the DB (Davies-Bouldin) result. I know that the lower the DB is, the better the k, but I ticked the maximization option. Now, which DB value is better: lower or higher? I saw https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/choose-best-cluster-number/m-p/44992#M29530 but it did not help. Thanks a lot if you can help me.
Hi @m_keshavarz_com,
If you have no idea of the optimal k, you can use the X-Means operator.
Regards,
Lionel
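For readers who want to see what a Davies-Bouldin comparison across several values of k looks like, here is a minimal sketch in plain Python. It is not X-Means and not the RapidMiner operator: it is a simple scan over k with a small hand-written k-means, and the toy points, the function names, and the deterministic farthest-point initialisation are all assumptions made for illustration. By the standard definition of the index, a lower Davies-Bouldin value means more compact, better separated clusters; how a given tool reports the sign when a maximize option is ticked is not covered here.

```python
import math

def centroid(points):
    """Mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with a deterministic farthest-point start (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(centroid(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers

def davies_bouldin(points, labels, centers):
    """Mean over clusters of the worst (s_i + s_j) / d_ij ratio; needs k >= 2."""
    k = len(centers)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Average distance of the members to their center (0 for an empty cluster).
        scatter.append(sum(math.dist(p, centers[j]) for p in members) / len(members)
                       if members else 0.0)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

# Hypothetical toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]

scores = {k: davies_bouldin(points, *kmeans(points, k)) for k in range(2, 6)}
best_k = min(scores, key=scores.get)  # lower Davies-Bouldin is better

for k, score in scores.items():
    print(k, round(score, 3))
print("best k:", best_k)
```

On this toy data the lowest index is reached at k = 3, which matches the three groups that were built into the points. On real data such as tweets the curve is usually much less clear-cut, so the index is best read as one hint among several.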
I suggested applying the "Optimize Parameters" operator to find the best k. That is not feasible; sorry for misdirecting you, I don't know what I was thinking!
To find the best K, you should check your question first and prepare your parameters for exploration. Perhaps an example might be good:
Let's suppose you have a list of customers buying bus seats, but you don't know which ones buy normal seats and which ones buy premium seats. First, take a look at your data and see what attributes you have (ticket id, gender, age, origin, destination and type of seat). Your best starting bet would be k = 2, as you have two types of seats. Then take your next variable, count how many distinct values it has, and multiply the current k by that number. Let's say the next variable is the time required to travel between origin and destination. If it is numerical and varies a lot, you should consider discretizing it first (that helped me in the past). Rinse and repeat for each variable that makes sense to consider in a cluster.
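The arithmetic of that heuristic can be sketched as follows. This is only an illustration of the multiplication described above: the travel times, the bin edges, and the bin names are hypothetical, and the resulting number is a rough starting value for k, not a guaranteed optimum.

```python
def discretize(value, edges, names):
    """Return the name of the first bin whose upper edge is >= value."""
    for edge, name in zip(edges, names):
        if value <= edge:
            return name
    return names[-1]  # anything above the last edge falls into the last bin

# Hypothetical data for the bus example.
seat_types = {"normal", "premium"}
travel_minutes = [35, 50, 95, 110, 240, 260, 40, 100]

# Assumed bin edges: up to 60 min is "short", up to 180 min is "medium", the rest "long".
edges = [60, 180]
names = ["short", "medium", "long"]

# Start with one cluster per seat type.
k_candidate = len(seat_types)

# Multiply by the number of distinct bins that actually occur in the data.
travel_bins = {discretize(m, edges, names) for m in travel_minutes}
k_candidate *= len(travel_bins)

print(sorted(travel_bins))
print("starting k:", k_candidate)
```

Each further variable multiplies k again, so the value grows quickly; it makes sense to include only variables that are meaningful for the grouping you are after.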
Clustering will help you understand what your data looks like, but further analysis is required to get the most out of it. I remember that @mschmitz wrote an article on how to use Decision Trees to understand your clusters, and I would recommend it, but I couldn't find it.
All the best and sorry for my first post.
Hello dear friends, how are you? Thank you, @rfuentealba. Yes, but using Optimize Parameters is time consuming and my computer hangs. Maybe, with the conditions I described, I could use DB? Does a higher value mean that k is better, or a lower one? I want to cluster tweets. Now, in your opinion, which k is better? I did not see the article you mentioned... Thank you.
Hi @m_keshavarz_com
Do not use "Optimize Parameters". That was a mistake on my side. I don't know what your use case is. The value for k comes from the kind of data you are clustering; that is what I tried to explain. If you describe your use case, we might be able to help. I sent you a PM.
All the best,
Hi @rfuentealba,
here it is: https://towardsdatascience.com/understanding-clustering-cf0117148ef4
BR,
Martin
And just a friendly reminder, @m_keshavarz_com: if your computer hangs when you start doing things like Optimize Parameters, you are likely pushing against one or more barriers, such as single-core processing (for a free license). Upgrading your license will likely improve your performance a LOT.
Scott
Hello dear friends, thank you very much for helping me. Dear @mschmitz, I want to cluster tweets and find out which similar tweets end up in the same cluster. Is the decision tree able to find the best k? How do I do that? Sorry, I know I should not ask, but I am a beginner. Could you send me a sample process? If I use the performance by distance operator, what is a better number for DB? I know the DB should be low, but I ticked the maximization option. Also, my system has five cores. How can I prevent it from hanging? Dear @rfuentealba, I want to cluster tweets. Thank you so much, everyone.