"Distance measures for text attributes"

Hi

I've read that the distance measure in most cluster analysis algorithms merely checks whether the text attributes of two objects a and b are the same. In other words, it counts how many text attributes have the same value. Do they not take string similarity into account? For example: if object a has an attribute x with the value "car" and object b has the attribute x with the value "cars", are they evaluated as a match?

By the way, is this the right section for this kind of question?

Thanks for the help.

shaihulud

Hi guys,

I would really love to read some answers to my question. Furthermore, I would like to know whether there are distance measures for cluster analysis that take semantics into account. For example, an attribute value 'car' would be matched with an attribute value 'automobile'.

I would really appreciate any help you can give me on these distance measurement topics.

Greetings

el_chief

Generally, what you want to do is calculate the TF-IDF score of each term in a document. This tells you how important a term is to the document it appears in, relative to how common the same term is across the rest of your documents.

Then you would calculate the distance between documents based on their TF-IDF term scores, generally using the cosine similarity measure.
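To make that concrete, here is a minimal sketch in plain Python (not RapidMiner). The function names, the whitespace tokenizer, the unsmoothed log(N/df) weighting, and the three sample documents are all my own assumptions for illustration; real tools usually apply smoothing and better tokenization.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # tf = relative frequency in this document, idf = log(N / df)
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# Made-up sample documents, purely for illustration.
docs = [
    "the red car is fast",
    "the blue car is fast",
    "the cat sat on the mat",
]
vecs = tfidf_vectors(docs)
sim_cars = cosine_similarity(vecs[0], vecs[1])
sim_car_cat = cosine_similarity(vecs[0], vecs[2])
print(sim_cars, sim_car_cat)
```

The two car documents share several weighted terms, so they score higher than the car/cat pair. The word "the" occurs in every document, so its weight is zero and it does not contribute. A distance for clustering can then be taken as 1 minus the similarity.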

But if you're trying to calculate the distance between terms rather than documents, then I would look into the Levenshtein edit distance, which, I believe, is not (yet) implemented in RapidMiner.
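For the "car" vs. "cars" case from the question, a rough sketch of the Levenshtein edit distance in plain Python could look like this. It is the standard dynamic-programming formulation; the function names and the length-based normalization are my own choices, not anything taken from RapidMiner.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Hypothetical helper: map the distance to [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("car", "cars"))            # 1
print(normalized_similarity("car", "cars"))  # 0.75
```

Note that this is purely character based. It treats "car" and "cars" as close, but it sees "car" and "automobile" as very different, so it does not answer the semantic part of the question.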