"Clustering text data"

Question

Hi, I have a table of text data, specifically contact information (e.g. names, addresses, phone numbers etc). All this data has been entered manually, some of it is missing, and there is a good chance that there are duplicate examples with slightly different values, e.g. different capitalization, spaces in the phone numbers, full name vs initials. I need to scan through this and generate a list of examples which appear to be similar.

My first approach was to use the "Data To Similarity" operator to list pairs with a similarity higher than 0.9, but this doesn't quite give the results I expect. This may be partly because I'm not sure which measure type to use, but I think it was also because it didn't take into account things like mismatched case.

My second attempt was to use the text processing tools, processing the data using "Process Documents from Data". However, this appears to concatenate all attributes within an example. I'm pretty sure that this is the approach I need to take, but I am stuck on a few points:

1. How do I deal with missing data? Ideally, examples should not be compared on attributes which are missing.
2. As far as I understand, "Process Documents from Data" concatenates attributes, but I want to compare individual attributes in the examples. E.g. two similar names can match, but a name which is similar to an address shouldn't.
3. What model is appropriate for clustering the output from "Process Documents from Data"? I don't know the number of clusters, since it depends on how similar the examples are, so I can't use k-means. In a previous attempt with only one attribute I used DBSCAN, which worked well, but took a very long time to process.

I don't know how much use this is, but here is the XML as it is at the moment. I have sampled 100 examples for testing.

MariusHelf · Answer

Hi,

Maybe you should treat the attributes one by one, i.e. use a separate Process Documents operator for each attribute. Then you can define custom rules for each attribute, e.g. remove spaces, slashes and dashes from the phone number field, transform names to lowercase, etc.
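To make the per-attribute rules concrete, here is a minimal sketch outside RapidMiner, in plain Python. The attribute names (`name`, `phone`, `address`), the helper functions and the sample record are all hypothetical; the rules themselves are just the ones mentioned above (strip spaces, slashes and dashes from phone numbers, lowercase names). Missing values are kept as `None` rather than turned into text.

```python
import re

def normalize_phone(value):
    """Remove whitespace, slashes and dashes; keep missing values as None."""
    if value is None:
        return None
    cleaned = re.sub(r"[\s/\-]", "", value)
    return cleaned or None

def normalize_name(value):
    """Lowercase and collapse repeated whitespace; keep missing values as None."""
    if value is None:
        return None
    cleaned = " ".join(value.lower().split())
    return cleaned or None

# Hypothetical mapping from attribute name to its own rule.
RULES = {"name": normalize_name, "phone": normalize_phone}

def normalize_record(record):
    """Apply each attribute's rule; attributes without a rule pass through unchanged."""
    return {key: RULES.get(key, lambda v: v)(value) for key, value in record.items()}

record = {"name": "  John  SMITH ", "phone": "020 7946-0123", "address": None}
normalized = normalize_record(record)
print(normalized)
```

Because every attribute has its own rule, a name is never cleaned (or later compared) with the logic meant for a phone number.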

Furthermore, instead of clustering you could also try the Cross-Similarity operator, with the same example set connected to both inputs. That will calculate the similarity of each example to every other example in the set (beware: the new data set will contain n*n examples, where n is the number of examples in the original data set). The similarity operator should ignore missing values.
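The idea of comparing every example with every other one, attribute by attribute and skipping missing values, can be sketched roughly as follows. This is not what the RapidMiner operator does internally, only an illustration in plain Python: the similarity measure (`difflib.SequenceMatcher`), the averaging over attributes, the function names and the sample records are all assumptions. It also visits each unordered pair once, so it produces n*(n-1)/2 comparisons rather than the full n*n table.

```python
from difflib import SequenceMatcher
from itertools import combinations

def record_similarity(a, b):
    """Mean string similarity over attributes present in both records.

    Attributes that are missing (None) in either record are skipped.
    Returns None when the two records share no non-missing attribute.
    """
    scores = []
    for key in a.keys() & b.keys():
        if a[key] is None or b[key] is None:
            continue
        scores.append(SequenceMatcher(None, a[key], b[key]).ratio())
    return sum(scores) / len(scores) if scores else None

def similar_pairs(records, threshold=0.9):
    """Return (i, j, score) for every pair whose similarity reaches the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = record_similarity(a, b)
        if score is not None and score >= threshold:
            pairs.append((i, j, score))
    return pairs

# Hypothetical, already normalized records.
records = [
    {"name": "john smith", "phone": "02079460123"},
    {"name": "john smith", "phone": None},
    {"name": "mary jones", "phone": "02079460999"},
]
pairs = similar_pairs(records)
print(pairs)
```

Note that with plain averaging, two records that agree on a single shared attribute get a perfect score, so in practice you may want to require a minimum number of compared attributes or weight attributes differently.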

Best,
Marius

lovefinearts198 · Answer

Hello,

1.  How do I deal with missing data?  Ideally, examples should not be compared on attributes which are missing.

You can use a component that replaces missing data with a value you specify (numerical or nominal).
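As a rough illustration of that replacement step in plain Python (the function name, the placeholder value and the sample records are hypothetical):

```python
def fill_missing(records, attributes, placeholder="unknown"):
    """Return copies of the records with missing (None or absent) attributes replaced."""
    filled = []
    for record in records:
        filled.append({
            attr: record.get(attr) if record.get(attr) is not None else placeholder
            for attr in attributes
        })
    return filled

records = [
    {"name": "john smith", "phone": None},
    {"name": "mary jones"},
]
filled = fill_missing(records, ["name", "phone"])
print(filled)
```

Keep in mind that two examples which both receive the same placeholder will look identical on that attribute, which may work against the goal of not comparing examples on missing attributes.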

Regards,