"clusteranalysis of unknown data"

New Altair Community Member

Nov 9, 2010

Updated Nov 5, 2024 by Jocelyn

Hello Community,

I want to read data from CSV files.
Each line represents an instance with a name and a couple of attributes. The attributes AND attribute values are mostly strings, and they can be ARBITRARY.
I need to find a way to identify some representatives for each "group" of instances in the data without knowing the groups (in other words: I don't know which classes/concepts/clusters they need to be mapped to because, as I said, the data can be arbitrary).
I need to narrow the mass of instances down to representatives as best I can.
Even though the data is arbitrary, the instances have many similar or equal attribute values.

I think cluster analysis is the right approach.

I have already experimented with some clustering methods, and the results look promising. Nevertheless, I would love to know whether you already have experience with such a scenario, so that you can give me a heads-up on which cluster analysis method(s) to focus on or start from.

Another question: do you have ideas other than cluster analysis for solving this problem?

I would appreciate any help on the topic.

Greetings

shaihulud

New Altair Community Member

Nov 9, 2010

OK, I've read a bunch of stuff today and kind of have an idea of what I need to do.
It's impossible for me to avoid using several algorithms in the cluster creation cycle.

So if anybody has specific insight into that scenario, I would appreciate any help or cooperation, but the groundwork is pretty clear to me.

land

New Altair Community Member

Nov 10, 2010

Hi,
just a hint: if you are going to process arbitrary texts, I would recommend using the Text Processing Extension to build a word vector from the texts before clustering them. Otherwise there is no information about the distance between two arbitrary strings.
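
To make the idea of a word vector concrete, here is a minimal sketch in plain Python. It is not the Text Processing Extension's API, only an illustration of the transformation: every CSV row becomes a vector of token counts over a shared vocabulary. The sample data, the column names and the helper `rows_to_word_vectors` are made up for this example, and prefixing each token with its attribute name is just one possible design choice.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; the column names are invented for illustration.
SAMPLE_CSV = """name,color,shape
a1,dark red,round
a2,red,round
a3,blue,square
"""


def tokenize(value):
    """Lowercase a string and split it on whitespace."""
    return value.lower().split()


def rows_to_word_vectors(csv_text, name_column="name"):
    """Return (names, vocabulary, vectors) for the rows of a CSV string.

    Each token is prefixed with its attribute name, so 'red' in the
    column 'color' and 'red' in another column count as different words.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    names, bags = [], []
    for row in reader:
        names.append(row[name_column])
        bag = Counter()
        for attribute, value in row.items():
            # Skip the instance name and any surplus, unnamed fields.
            if attribute is None or attribute == name_column:
                continue
            for token in tokenize(value or ""):
                bag[f"{attribute}={token}"] += 1
        bags.append(bag)
    # Shared, sorted vocabulary so that every vector has the same layout.
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[word] for word in vocabulary] for bag in bags]
    return names, vocabulary, vectors


names, vocabulary, vectors = rows_to_word_vectors(SAMPLE_CSV)
print(vocabulary)
for name, vector in zip(names, vectors):
    print(name, vector)
```

Once every instance is a numeric vector of the same length, any clustering method that works on numbers can be applied to it.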

Greetings,
Sebastian

shaihulud

New Altair Community Member

Nov 10, 2010

Hi Sebastian

Thanks for the hint, but I don't quite understand it. Why is preparing vectors from the attribute values different from just taking the values for clustering? Can you elaborate, or point me to a paper or article that does?

Thanks
shai

land

New Altair Community Member

Nov 17, 2010

Hi,
the problem is that there is no numerical distance defined between the strings "Hallöchen" and "Hi", except for the fact that they aren't equal.
If you want a good clustering of texts, you need a distance measure that somehow captures the similarity of texts. This is usually done by forming a bag of words. You can search for that term and will probably find many sources.
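
The difference can be sketched in a few lines of plain Python. With raw strings, the only distance available is "equal or not", so every pair of different strings is equally far apart. With a bag of words, texts that share words end up closer together. The cosine distance used here is one common choice and an assumption of this sketch, not something prescribed above; the example texts and function names are made up.

```python
import math
from collections import Counter


def equality_distance(a, b):
    """The only distance raw strings offer: 0 if identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())


def cosine_distance(bag_a, bag_b):
    """1 minus the cosine similarity of two word-count bags."""
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # An empty text shares nothing with any other text.
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical example texts.
t1 = "red round apple"
t2 = "green round apple"
t3 = "blue square box"

# Raw strings: every pair of different texts is equally far apart.
print(equality_distance(t1, t2), equality_distance(t1, t3))

# Bags of words: t1 is closer to t2 (two shared words) than to t3 (none).
d12 = cosine_distance(bag_of_words(t1), bag_of_words(t2))
d13 = cosine_distance(bag_of_words(t1), bag_of_words(t3))
print(d12, d13)
```

A clustering algorithm can use the graded distances to group `t1` with `t2`, which is impossible when all it knows is that the three strings differ.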

Greetings,
Sebastian