Problem with choosing K
Hi, I'm looking for the best k for clustering with k-means.
Among the operators, I used the performance-by-distance operator to get the DB (Davies-Bouldin) value.
I know that the lower the DB is, the better the k.
But I ticked the maximization option.
So which DB value is better now: the lower one or the higher one?
I saw this:
https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/choose-best-cluster-number/m-p/44992#M29530
But it did not help.
Thanks a lot if you can help me.
Answers
Hi @m_keshavarz_com,
If you have no idea of the optimal k, you can use the X-Means operator.
Regards,
Lionel
Hi,
To let RapidMiner help you choose the best K, you might want to use the "Optimize Parameters" operator. I'm travelling and don't have my computer with me but I'm pretty sure I answered something similar yesterday.
Hope this helps. I'll get back to you once I have my Mac back.
Rodrigo.
Hi @m_keshavarz_com,
I suggested applying the "Optimize Parameters" operator to find the best K. That is not feasible; sorry for misdirecting you, I don't know what I was thinking!
To find the best K, you should check your question first and prepare your parameters for exploration. Perhaps an example might be good:
Let's suppose you have a list of customers buying bus seats, but you don't know which ones buy normal seats and which ones buy premium seats. First, take a look at your data and see what parameters you have (ticket id, gender, age, origin, destination and type of seat). Your best bet would be k = 2, as you have two types of seats. Then take your next variable, see how many distinct values it has, and multiply the current K by that number. Let's say that the next variable is the time required to travel between origin and destination. If you have numerical values that vary a lot, you should consider discretizing them (that helped me in the past). Rinse and repeat for each variable that makes sense to consider in a cluster.
Clustering will help you understand what your data looks like, but further analyses are required to fully unleash its power. I remember that @mschmitz wrote an article on how to use Decision Trees to understand your clusters and I'm keen to recommend it, but I couldn't find it.
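As a rough illustration of the discretizing step, here is a minimal Python sketch (outside RapidMiner) using pandas. The column name `travel_minutes`, the sample values and the three bin labels are made up for illustration; they do not come from any real data set.

```python
import pandas as pd

# Hypothetical travel times in minutes; illustrative values only.
travel_minutes = pd.Series(
    [15, 20, 35, 50, 65, 90, 120, 180, 240], name="travel_minutes"
)

# Split the values into three quantile-based bins so that each bin
# holds roughly the same number of rows.
travel_class = pd.qcut(travel_minutes, q=3, labels=["short", "medium", "long"])

# Count how many rows fall into each bin, in bin order.
bin_counts = travel_class.value_counts().sort_index()
print(bin_counts)
```

After this step the travel time has three classes instead of many distinct numbers, which is the count the multiplication rule in the example would use.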
All the best and sorry for my first post.
Hello dear friends,
I hope you are well.
Thank you, rfuentealba.
Yes, but using Optimize Parameters is time-consuming and my computer hangs.
Could I decide based on the DB value I mentioned instead?
Does a higher value mean a better k, or a lower one?
I want to cluster tweets. In your opinion, which K would be better?
I could not find the article you mentioned ...
Thank you
Do not use "Optimize Parameters". It was a mistake on my side. I don't know what your case is. The value for k comes from the kind of data you are clustering; that is what I tried to explain. If you explain your use case, we might be able to help. I sent you a PM.
All the best,
Hi @rfuentealba,
here it is: https://towardsdatascience.com/understanding-clustering-cf0117148ef4
BR,
Martin
And just a friendly reminder, @m_keshavarz_com: if your computer hangs when you start doing things like Optimize Parameters, you are likely pushing against one or more barriers, such as single-core processing (for a free license). Upgrading your license will likely improve your performance a LOT.
Scott
Hello dear friends,
Thank you very much for helping me.
Dear Master @mschmitz, I want to cluster tweets and find out which similar tweets end up in each cluster.
Is the decision tree able to find the best K?
How do I do that? Sorry, I know I should not ask,
but I am a beginner.
Could you send me a sample process?
If I use the performance-by-distance operator, which DB value is better? I know the DB should be low, but I ticked the maximization option.
Also, my system has five cores. How can I prevent it from hanging?
Dear @rfuentealba,
I want to cluster tweets.
Thank you so much, everyone.
Thankful
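On the DB question asked several times in this thread: for the Davies-Bouldin index as it is usually defined, lower values indicate more compact and better separated clusters. The following is a minimal Python sketch with scikit-learn, not a RapidMiner process. It assumes synthetic two-dimensional data from `make_blobs` as a stand-in, and the candidate range k = 2..8 is arbitrary; real tweets would first have to be turned into numeric vectors.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in data: four well-separated groups in two dimensions.
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

# Fit k-means for each candidate k and record the Davies-Bouldin index.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

# Lower Davies-Bouldin values mean more compact, better separated clusters,
# so the preferred k is the one with the smallest score.
best_k = min(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: DB={score:.3f}")
print("k with the lowest DB:", best_k)
```

If a tool flips the sign of the index so that an optimizer can maximize it, the ordering reverses and the value closest to zero is the preferred one. Whether that applies to the maximization option mentioned here is an assumption to check against the operator's documentation.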