"KMeans clustering on text data"

Question

Hi,

After applying a string tokenizer, a stopword filter and a token length filter to text data with "Binary occurrence" selected, we get every word as a numerical attribute with binary values. My question is whether we can apply KMeans clustering directly to these numerical attributes. I tried this on my data and got meaningful clusters, but I don't know whether it is a good method for text data. Moreover, compared with KMedoids it takes much less time.
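
For readers who want to see what these steps produce, here is a rough sketch in plain Python. It is not the RapidMiner operators themselves. The stopword list, the length bounds and the function names are all made up for illustration.

```python
import re

# Hypothetical stopword list and token length bounds, chosen only for illustration.
STOPWORDS = {"the", "a", "is", "on", "and", "of"}
MIN_LEN, MAX_LEN = 3, 25

def tokenize(text):
    # Lowercase the text and split on anything that is not a letter.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def preprocess(text):
    # Drop stopwords and tokens outside the allowed length range.
    return [t for t in tokenize(text)
            if t not in STOPWORDS and MIN_LEN <= len(t) <= MAX_LEN]

def binary_occurrence(docs):
    # One attribute per remaining word; 1 if the word occurs in the document, else 0.
    token_sets = [set(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*token_sets))
    rows = [[1 if w in toks else 0 for w in vocab] for toks in token_sets]
    return vocab, rows

docs = ["The cat sat on the mat", "A dog sat on the log", "Cat and dog"]
vocab, rows = binary_occurrence(docs)
print(vocab)
for row in rows:
    print(row)
```

Each row is one document as a vector of 0/1 values, which is the kind of numerical input a clustering operator would receive.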

Thanks
Ratheesan.

land · Answer

Hi,
KMeans uses properties of the Euclidean distance (the mean of a cluster minimizes the sum of squared Euclidean distances to its members) to simplify the KMedoids algorithm. This speeds up the calculation, but restricts the distance measure to Euclidean. Euclidean distance is normally not the best choice for high-dimensional data such as text; cosine similarity is usually used instead. But if you get meaningful results, everything should be fine and you can go ahead with KMeans.
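
To make the difference between the two measures concrete, here is a small sketch in plain Python (not RapidMiner). The three binary vectors and all function names are invented for illustration. It shows a case where Euclidean distance and cosine similarity disagree, and also that after scaling each vector to unit length the squared Euclidean distance equals 2 - 2 * cosine similarity, so on length-normalized vectors the two measures rank neighbours the same way.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l2_normalize(a):
    # Scale the vector to unit length.
    n = norm(a)
    return [x / n for x in a]

# Illustrative binary occurrence vectors over a six-word vocabulary.
short_doc = [1, 1, 0, 0, 0, 0]   # two words
long_doc = [1, 1, 1, 1, 1, 1]    # contains both words of short_doc, plus four more
other_doc = [0, 0, 0, 0, 1, 0]   # shares no word with short_doc

# Raw Euclidean distance puts short_doc closer to the unrelated document.
d_long = squared_euclidean(short_doc, long_doc)
d_other = squared_euclidean(short_doc, other_doc)

# Cosine similarity prefers the document that shares words.
c_long = cosine_similarity(short_doc, long_doc)
c_other = cosine_similarity(short_doc, other_doc)

# On unit-length vectors, squared Euclidean distance is 2 - 2 * cosine similarity.
dn_long = squared_euclidean(l2_normalize(short_doc), l2_normalize(long_doc))
dn_other = squared_euclidean(l2_normalize(short_doc), l2_normalize(other_doc))

print(d_long, d_other)
print(c_long, c_other)
print(dn_long, dn_other)
```

If the identity holds for your data as well, normalizing each document vector to unit length before running KMeans is one way to get behaviour close to clustering by cosine similarity while keeping the speed of KMeans.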

Greetings,
  Sebastian