"Cluster analysis of unknown data"
Hello Community,
I want to read data from CSV files.
Each line represents an instance with a name and a couple of attributes. Both the attribute names and the attribute values are mostly strings, and they can be ARBITRARY.
I need to find a way to identify some representatives for each "group" of instances in the data without knowing the groups (in other words: I don't know which classes/concepts/clusters the instances should be mapped to because, as I said, the data can be arbitrary).
I need to narrow the mass of instances down to representatives as well as I can.
Even though the data is arbitrary, many instances share similar or equal attribute values.
I think cluster analysis is the right approach.
I have already experimented with some clustering methods, and the results look promising. Nevertheless, I would love to know whether you already have experience with such a scenario, so that you can give me a heads-up on which cluster analysis method(s) to focus on or start from.
Another question: do you have ideas other than cluster analysis for solving this problem?
I would appreciate any help on the topic.
Greetings
Hi,
The problem is that no numerical distance is defined between the strings "Hallöchen" and "Hi", beyond the fact that they aren't equal.
If you want a good clustering of texts, you need a distance measure that somehow captures how similar two texts are. This is usually done by forming a bag of words. You can search for that term and will probably find many sources.
Greetings,
Sebastian
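To make the bag-of-words idea concrete, here is a minimal sketch in Python. It is only one possible reading of the suggestion: it uses a set-of-words variant (word presence rather than word counts) together with the Jaccard distance, and the record names, the sample values, and the helper names `bag_of_words` and `jaccard_distance` are made up for illustration.

```python
import re

def bag_of_words(record):
    """Lower-case every attribute value and collect its word tokens in a set."""
    tokens = set()
    for value in record:
        tokens.update(re.findall(r"\w+", value.lower()))
    return tokens

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|: 0.0 for identical bags, 1.0 for no shared word."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical instances: a name mapped to its string attribute values.
records = {
    "r1": ["red apple", "fruit"],
    "r2": ["green apple", "fruit"],
    "r3": ["steel bolt", "hardware"],
}

bags = {name: bag_of_words(values) for name, values in records.items()}

# r1 and r2 share "apple" and "fruit", so they are closer than r1 and r3.
d12 = jaccard_distance(bags["r1"], bags["r2"])
d13 = jaccard_distance(bags["r1"], bags["r3"])
print(d12, d13)
```

With a distance like this, instances that share many attribute words end up close together, while whole strings such as "Hallöchen" and "Hi" that share no word stay at the maximum distance of 1.0. Any clustering method that accepts a pairwise distance could then be tried on top of it.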
It's impossible for me to avoid using several algorithms in the cluster-creation cycle.
So if anybody has specific insight into that scenario, I would appreciate any help or cooperation, but the groundwork is pretty clear to me.