"distance measures of text attributes"
shaihulud
Hi
I've read that the distance measure used by most cluster analysis algorithms merely checks whether the text attributes of two objects a and b are identical. In other words, it counts how many text attributes have the same value. Do these algorithms not take string similarity into account? For example, if object a has an attribute x with the value 'car' and object b has an attribute x with the value 'cars', are they treated as a match?
By the way, is this the right section for this kind of question?
Thanks for the help.
shaihulud
Hi guys,
I would really love to read some answers to my question. Furthermore, I would like to know whether there are distance measures for cluster analysis that take semantics into account, for example so that an attribute value 'car' is matched with an attribute value 'automobile'.
I would really appreciate any help you can give me on these distance measurement topics.
greez
el_chief
Generally, what you want to do is calculate the TF-IDF score of each term in a document. This tells you how important a term is to the document it appears in, relative to how common the same term is across the rest of your documents.
Then you would calculate the distance between documents based on their TF-IDF term scores, generally using cosine similarity (the corresponding distance is one minus the similarity).
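To make the two steps concrete, here is a minimal sketch in plain Python. It is not RapidMiner code, and the function names `tfidf_vectors` and `cosine_similarity`, the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, and the sample documents are all assumptions chosen for illustration; real toolkits usually add smoothing and better tokenization.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document (hypothetical helper)."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An all-zero vector has no direction to compare.
    return dot / (norm_a * norm_b)


# Made-up sample documents, for illustration only.
docs = [
    "the red car is fast",
    "the red car is slow",
    "a green apple tastes sweet",
]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # shares several terms with doc 0
print(cosine_similarity(vecs[0], vecs[2]))  # shares no terms with doc 0
```

With this weighting, a term that appears in every document gets a weight of zero, so only terms that distinguish documents contribute to the similarity.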
But if you're trying to calculate the distance between terms rather than documents, then I would look into the Levenshtein edit distance, which, I believe, is not (yet) implemented in RapidMiner.
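For the 'car' versus 'cars' example from the question, the Levenshtein distance can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard dynamic-programming approach, not a RapidMiner operator; the names `levenshtein` and `normalized_similarity` are assumptions, and dividing by the longer string's length is just one common way to normalize.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # Distance from a[:i] to the empty string.
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Map the edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("car", "cars"))            # 1
print(normalized_similarity("car", "cars"))  # 0.75
```

Note that an edit distance only captures spelling: 'car' and 'cars' come out close, but 'car' and 'automobile' come out far apart, so it does not address the semantic matching asked about above.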