Problem with choosing K

m_keshavarz_com
m_keshavarz_com New Altair Community Member
edited November 2024 in Community Q&A

Hi, I'm looking for the best k for clustering with k-means.
I used the performance by distance operator to get the DB (Davies-Bouldin) value of the result.
I know that the lower the DB is, the better the k.
But I checked the maximization option.
So now, which DB value is better: lower or higher?
I saw this thread:
https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/choose-best-cluster-number/m-p/44992#M29530
But it did not help.
Thanks a lot if you can help me.
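
For readers wondering what "lower DB is better" means in practice, here is a minimal sketch of the Davies-Bouldin index in plain Python. The toy points and the helper names (`centroid`, `davies_bouldin`) are made up for illustration and are not taken from RapidMiner. The index itself is never negative, so if a tool shows negative values it is presumably reporting the negated index, in which case the value closest to zero corresponds to the lowest DB.

```python
import math

def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Lower is better: compact clusters that are far apart give values near 0.
    """
    centers = [centroid(c) for c in clusters]
    # Average distance of each cluster's points to its own centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    total = 0.0
    for i in range(len(clusters)):
        # Worst (largest) scatter-to-separation ratio against any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters))
            if j != i
        )
    return total / len(clusters)

# Hypothetical toy data: two obvious groups in the plane.
left = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
right = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

good = davies_bouldin([left, right])  # groups kept apart
bad = davies_bouldin([left[:2] + right[:1], left[2:] + right[1:]])  # groups mixed

print(f"good split: DB = {good:.3f}, negated = {-good:.3f}")
print(f"bad split:  DB = {bad:.3f}, negated = {-bad:.3f}")
```

The well-separated split gets the smaller DB value. After negation the ordering flips, so the same split has the larger (less negative) value.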

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @m_keshavarz_com,

    If you have no idea of the optimal k, you can use the X-Means operator.

    Regards,

    Lionel

  • rfuentealba
    rfuentealba New Altair Community Member
    Hi,

    To let RapidMiner help you choose the best K, you might want to use the "Optimize Parameters" operator. I'm travelling and don't have my computer with me but I'm pretty sure I answered something similar yesterday.

    Hope this helps. I'll get back to you once I get my Mac back.

    Rodrigo.
  • rfuentealba
    rfuentealba New Altair Community Member

    Hi @m_keshavarz_com,

    I suggested applying the "Optimize Parameters" operator to find the best K. That is not feasible; sorry for misdirecting you, I don't know what I was thinking!

    To find the best K, you should first look at the question you want to answer and prepare your parameters for exploration. Perhaps an example might help:

    Let's suppose you have a list of customers buying bus seats, but you don't know which ones buy normal seats and which ones buy premium seats. First, take a look at your data and see what parameters you have (ticket id, gender, age, origin, destination and type of seat). A good starting point would be k = 2, as you have two types of seats. Then take your next variable, count how many distinct values it has, and multiply the current K by that number. Let's say that the next variable is the time required to travel between origin and destination. If you have numerical values that vary a lot, you should consider discretizing them (that helped me in the past). Rinse and repeat for each variable that makes sense to consider in a cluster.
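
The rule of thumb above can be sketched in a few lines of plain Python. This is only an illustration: the ticket records, the `discretize` helper and the cut points at 60 and 180 minutes are all assumptions made up for this example, and the resulting number is just a starting value for k, not a proven optimum.

```python
from math import prod

def discretize(value, edges):
    """Return the index of the bin that `value` falls into, given sorted bin edges."""
    return sum(value >= e for e in edges)

# Hypothetical ticket records: (seat type, travel time in minutes).
tickets = [
    ("normal", 35), ("premium", 40), ("normal", 95),
    ("premium", 180), ("normal", 200), ("normal", 50),
]

# Assumed cut points: under 1 hour, 1 to 3 hours, 3 hours or more.
travel_time_edges = [60, 180]

seat_types = {seat for seat, _ in tickets}
time_bins = {discretize(minutes, travel_time_edges) for _, minutes in tickets}

# Multiply the number of distinct values of each variable considered.
k_start = prod([len(seat_types), len(time_bins)])
print(f"seat types: {len(seat_types)}, travel-time bins: {len(time_bins)}, starting k: {k_start}")
```

Without discretizing, the travel time would contribute one "value" per distinct number of minutes, and the product would grow far too quickly to be useful.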

     

    Clustering will help you understand what your data looks like, but further analyses are required to fully unleash its power. I remember that @mschmitz wrote an article on how to use Decision Trees to understand your clusters, and I'd recommend it, but I couldn't find it.

     

    All the best and sorry for my first post.

     

  • m_keshavarz_com
    m_keshavarz_com New Altair Community Member

    Hello dear friends,
    I hope you are well.
    Thank you, @rfuentealba.

    Yes, but using Optimize Parameters is time-consuming and my computer hangs.
    Could you help with my DB question under the conditions I described?
    Does a higher value mean a better k, or a lower one?
    I want to cluster tweets. In your opinion, which k is better?

    I could not find the article you mentioned ...
    Thank you

  • rfuentealba
    rfuentealba New Altair Community Member

    Hi @m_keshavarz_com

     

    Do not use "Optimize Parameters". It was a mistake on my side. I don't know what your use case is. The value for k comes from the kind of data you are clustering; that is what I tried to explain. If you explain your use case, we might be able to help. I sent you a PM.

     

    All the best,

     

  • rfuentealba
    rfuentealba New Altair Community Member
    Awesome! Thank you, Martin! @m_keshavarz_com, there you have it.

    Have fun!
  • sgenzer
    sgenzer
    Altair Employee

    Just a friendly reminder, @m_keshavarz_com: if your computer is hanging when you start doing things like Optimize Parameters, you are likely pushing against one or more barriers, such as single-core processing (for a free license). Upgrading your license will likely improve your performance a LOT.  :)

     

    Scott

     

  • m_keshavarz_com
    m_keshavarz_com New Altair Community Member

    Hello dear friends,
    Thank you very much for helping me.

    Dear Master @mschmitz, I want to cluster tweets and find out which similar tweets end up in each cluster.
    Is the decision tree able to find the best K?
    How do I do that? Sorry, I know I should not ask,
    but I am a beginner.
    Could you send me a sample process?
    If I use the performance by distance operator, which DB value is better? I know the DB should be low, but I checked the maximization option.

    Also, my system has five cores. How can I prevent it from hanging?
    Dear @rfuentealba,

    I want to cluster tweets.

    Thank you so much, everyone.