Cluster analysis of unknown data
shaihulud
New Altair Community Member
Hello Community,
I want to read data from CSV files.
Each line represents an instance with a name and a couple of attributes. The attributes AND attribute values are mostly strings, and they can be ARBITRARY.
I need to find a way to identify some representatives for each "group" of instances in the data without knowing the groups (in other words: I don't know which classes/concepts/clusters they need to be mapped to because, as I said, the data can be arbitrary).
I need to narrow the mass of instances down to representatives as best I can.
Even though the data is arbitrary, the instances share many similar or equal attribute values.
I think cluster analysis is the right approach.
I have already experimented with some clustering methods, and the results look promising. Nevertheless, I would love to know if you already have experience with such a scenario, so that you can give me a heads-up on which cluster analysis method(s) to focus on or start from.
Another question: do you have ideas other than cluster analysis for solving this problem?
I would appreciate any help on the topic.
Greetings
Answers
Okay, I've read a bunch of stuff today and kind of have an idea of what I need to do.
It's impossible for me to avoid using several algorithms in the cluster-creation cycle.
So if anybody has specific insight into that scenario, I would appreciate any help or cooperation, but the groundwork is pretty clear to me.
Hi,
just a hint: if you are going to process arbitrary texts, I would recommend using the Text Processing Extension to build a word vector from the texts before clustering them. Otherwise there is no information about the distance between two arbitrary strings.
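To make the idea of a word vector concrete, here is a minimal sketch in plain Python rather than in the Text Processing Extension itself. It assumes a CSV with a `name` column as the identifier; the sample data, the column names, and the helper names (`tokenize`, `row_to_bag`, `build_word_vectors`) are all made up for illustration. Each row becomes a vector of token counts over a shared vocabulary.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample data; the column names are invented for illustration.
SAMPLE_CSV = """name,color,description
item1,red,small round fruit
item2,red,small round berry
item3,grey,large metal machine
"""


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def row_to_bag(row, id_column="name"):
    """Count the tokens of all attribute values except the identifier."""
    bag = Counter()
    for column, value in row.items():
        if column != id_column and value:
            bag.update(tokenize(value))
    return bag


def build_word_vectors(csv_text, id_column="name"):
    """Return (names, vocabulary, vectors) with one count vector per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    bags = [row_to_bag(row, id_column) for row in rows]
    # The vocabulary is the sorted set of all tokens seen in any row.
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[word] for word in vocabulary] for bag in bags]
    names = [row[id_column] for row in rows]
    return names, vocabulary, vectors


names, vocabulary, vectors = build_word_vectors(SAMPLE_CSV)
for name, vector in zip(names, vectors):
    print(name, vector)
```

The resulting numeric vectors can be handed to any clustering method that works on numbers. Whether to also encode the attribute name in each token (for example `color=red`) is a design choice this sketch leaves out.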
Greetings,
Sebastian
Hi Sebastian
Thanks for the hint, but I don't quite understand it. Why is preparing vectors from the attribute values different from just using the values directly for clustering? Can you elaborate or point me to a paper or article that explains it?
Thanks
shai
Hi,
The problem is that there is no numerical distance defined between the strings "Hallöchen" and "Hi"; all you can say is that they aren't equal.
If you want to do a good clustering of texts, you need a distance measure that somehow captures the similarity of texts. This is usually done by forming a bag of words. You can google for that and will probably find many sources.
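As a rough illustration of such a distance measure, the sketch below computes a cosine distance between the bags of words of two texts. It is plain Python, not the extension's implementation; the function names and the example texts are assumptions made for the example, and cosine distance is only one common choice.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to its word counts, ignoring case and punctuation."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the two bags (0.0 = same word mix)."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Texts sharing most words end up close; texts sharing none are far apart.
d_close = cosine_distance("red small round fruit", "red small round berry")
d_far = cosine_distance("red small round fruit", "grey large metal machine")
print(d_close, d_far)
```

Note that two different single-word strings still share no words, so they stay maximally distant under this measure. The bag of words pays off when the attribute values contain several words each.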
Greetings,
Sebastian