"Cluster analysis of unknown data"

User: "shaihulud"
New Altair Community Member
Updated by Jocelyn
Hello Community,

I want to read data from CSV files.
Each line represents an instance with a name and a couple of attributes. The attributes AND attribute values are mostly strings, and they can be ARBITRARY.
I need to find a way to identify some representatives for each "group" of instances in the data without knowing the groups (in other words, I don't know which classes/concepts/clusters they need to be mapped to because, as I said, the data can be arbitrary).
I need to narrow the mass of instances down to representatives as best I can.
Even though the data is arbitrary, many instances have similar or equal attribute values.

I think cluster analysis is the right approach.

I have already experimented with some clustering methods and the results look promising. Nevertheless, I would love to know whether you already have experience with such a scenario, so that you can give me a heads-up on which cluster analysis method(s) to focus on or start from.

Another question: do you have ideas other than cluster analysis for solving this problem?

I would appreciate any help on the topic.

Greetings

    User: "shaihulud"
    New Altair Community Member
    OP
    OK, I've read a bunch of stuff today and kind of have an idea of what I need to do.
    It's impossible for me to avoid using several algorithms in the cluster creation cycle.

    So if anybody has specific insight into that scenario, I would appreciate any help or cooperation, but the groundwork is pretty clear to me.
    User: "land"
    New Altair Community Member
    Hi,
    just a hint: if you are going to process arbitrary texts, I would recommend using the Text Processing Extension to build a word vector from the texts before clustering them. Otherwise there's no information about any distance between two arbitrary strings.

    Greetings,
      Sebastian
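To make the hint above concrete, here is a minimal sketch in plain Python of what "building a word vector" means. It is not the Text Processing Extension itself; the helper names (`tokenize`, `build_word_vectors`) and the sample rows are made up for illustration, and it assumes the attribute values of each instance have been joined into one string.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_word_vectors(texts):
    """Return (vocabulary, vectors) with one term-count vector per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    # The vocabulary is every distinct word seen in any text, in sorted order.
    vocabulary = sorted(set().union(*counts))
    # Each text becomes a list of counts, one position per vocabulary word.
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


# Hypothetical rows: the attribute values of each instance joined into a string.
rows = [
    "red metal chair",
    "red wooden chair",
    "blue metal table",
]
vocabulary, vectors = build_word_vectors(rows)
print(vocabulary)
for row, vector in zip(rows, vectors):
    print(row, vector)
```

Once every instance is a numeric vector like this, any clustering method that works on numeric data can compute distances between instances.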
    User: "shaihulud"
    New Altair Community Member
    OP
    Hi Sebastian

    Thanks for the hint, but I don't quite understand it. Why is preparing vectors from the attribute values different from just taking the values for clustering? Can you elaborate or direct me to a paper/article that explains it?

    Thanks
    shai
    User: "land"
    New Altair Community Member
    Hi,
    the problem is that there is no numerical distance defined between the strings "Hallöchen" and "Hi", except for the fact that they aren't equal.
    If you want to do a good clustering of texts, you will need a distance measure that somehow captures the similarity of texts. This is usually done by forming a bag of words. You can google that term and you will probably find many sources.

    Greetings,
      Sebastian
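As a rough illustration of the point about distances, the sketch below compares strings through their bags of words using cosine similarity. Cosine similarity is only one common choice and is an assumption here, not something prescribed in this thread; the function names and the sample strings other than "Hallöchen" and "Hi" are hypothetical.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Count how often each lowercase word token occurs in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two strings' bags of words (0.0 to 1.0)."""
    bag_a, bag_b = bag_of_words(a), bag_of_words(b)
    # Only words present in both bags contribute to the dot product.
    dot = sum(bag_a[word] * bag_b[word] for word in bag_a)
    norm_a = math.sqrt(sum(v * v for v in bag_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with anything
    return dot / (norm_a * norm_b)


# No shared words: the strings are simply "not equal".
print(cosine_similarity("Hallöchen", "Hi"))
# Two of three words shared: a graded similarity between 0 and 1.
print(cosine_similarity("red metal chair", "red wooden chair"))
```

Single-word strings such as "Hallöchen" and "Hi" still come out as completely dissimilar, which is the limitation described above; the graded values only appear once instances share some of their words or attribute values.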