Altair RISE
A program to recognize and reward our most engaged community members
Nominate Yourself Now!
Home
Discussions
Community Q&A
"clusteranalysis of unknown data"
shaihulud
Hello Community,
i do want to read data from csv files.
Each line represents an instance with a name and a couple of attributes. The attributes AND attribute values are mostly strings and they can be ARBITRARY.
I need to find a way to identify some representatives for each "group" of instances i have in the data, without knowing the groups ( in other words: i dont know to which classes/concepts/clusters they need to be mapped because, as i said, the data can be arbitrary).
I need to narrow the masses of instances down to representatives as best as i can.
Even though the data is arbitrary they have many similar or equal attribute values.
I think clusteranalysis is the right approach.
I have already experimented with some clustering methods to get some results, it looks promising. Nevertheless i would love to know if you already have experince on such a scenario, so that you can give me a heads up on which clusteranalysis method(s) to focus/start from at best.
Another question would be if you have other ideas than cluster analysis to solve this problem?
I would appreciate any help on the topic.
greetings
Find more posts tagged with
AI Studio
Clustering
Accepted answers
All comments
shaihulud
oki ive read a bunch of stuff today and kinda have an idea of what i need to do.
Its impossible for my to avoid using several algorithms in the cluster creating cycle.
So if anybody has specific insight to that scenario, i would appreciate any help or cooperation, but the groundwork is pretty clear to me.
land
Hi,
just a hint: If you are going to process arbitrary Texts, I would recommend using the Text Processing Extension to build a Word Vector from the texts before clustering them. Otherwise there's no information about any distance between two arbitrary strings.
Greetings,
Sebastian
shaihulud
Hi Sebastian
thx for the hint, but i dont quite understand it. Why is preparing vectors with the attribute values different from just taking the values for clustering? Can you elaborate/ direct me to an elaborating paper/article etc. ?
thx
shai
land
Hi,
the problem is, that there is no numerical distance defined between the strings "Hallöchen" and "Hi", except the fact that they aren't equal.
If you want to do a good clustering of texts, you will need a distance measure that somehow grasps the equalness of texts. And this is usually done by forming a bag of words. You can google for that and you will probably find many sources.
Greetings,
Sebastian
Quick Links
All Categories
Recent Discussions
Activity
Unanswered
日本語 (Japanese)
한국어(Korean)
Groups