[SOLVED] Clustering(K-Means) data from database
Ruca
New Altair Community Member
Hi all,
Sorry if this problem was already solved, but I’m a newbie and I was not able to locate a similar one.
My problem is the following:
I have a table with the following columns: doc_id; term; weight. Basically, each document has several term occurrences, with a weight associated with each term. This means that each document is characterized by a set of (term, weight) pairs.
Example:
Doc_id  term      weight
Doc1    color     0,45
Doc1    height    0,22
Doc1    weight    0,05
Doc2    altitude  0,04
Doc2    weight    0,35
I intend to perform a clustering analysis using k-means, with a predefined number of clusters k, in order to see which documents are most similar to each other.
When I connect the "Read Database" operator to the "Clustering" operator, an error message appears saying that clustering doesn’t accept polynominal attributes. It’s not my intention to change both “doc_id” and “term” attributes to nominal ones. The result that I'm expecting should be something similar to:
Cluster_0 (Doc1, Doc32, Docx,...), Cluster_1(Doc_2, Doc45, Docy,...), etc.
Has anyone come across this problem?
Thank you for your support.
Best regards,
Answers
Hi Ruca,
first of all you have to De-Pivot your data with the equally named operator to get a dataset which contains exactly one document per row, like this:
Doc_id  color  height  weight  altitude
Doc1    0,45   0,22    0,05    0
Doc2    0      0       0,35    0,04
Then define Doc_id as Id with Set Role, and apply the clustering. That's it.
Best, Marius
Thank you Marius for your support. It worked like a charm.
I've used the PIVOT operator instead of the DE-PIVOT.
Regards,
Ruca wrote: I've used the PIVOT operator instead of the DE-PIVOT.
Oh sorry, of course you have to use Pivot oO
Happy Mining!
~Marius
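For readers who want to see the same idea outside RapidMiner, here is a minimal sketch in plain Python of the approach settled on in this thread: pivot the (doc_id, term, weight) rows into one row per document, treat doc_id as an identifier rather than a feature, then run k-means on the weights. It is not what the Pivot, Set Role or Clustering operators do internally. The function names (pivot, k_means, squared_distance), the fixed iteration count and the choice to seed the centroids with the first k documents are all assumptions made for illustration, and the decimal commas from the example are written as decimal points.

```python
from collections import defaultdict

# Long-format rows as in the question: (doc_id, term, weight).
rows = [
    ("Doc1", "color", 0.45),
    ("Doc1", "height", 0.22),
    ("Doc1", "weight", 0.05),
    ("Doc2", "altitude", 0.04),
    ("Doc2", "weight", 0.35),
]


def pivot(rows):
    """Build one weight vector per document; terms a document lacks become 0."""
    terms = sorted({term for _, term, _ in rows})
    index = {term: i for i, term in enumerate(terms)}
    vectors = defaultdict(lambda: [0.0] * len(terms))
    for doc_id, term, weight in rows:
        vectors[doc_id][index[term]] = weight
    return terms, dict(vectors)


def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=20):
    """Basic Lloyd-style k-means; doc_id is only a key, never a feature.

    Illustrative assumption: the first k documents (sorted by id) seed
    the centroids, which keeps the result deterministic.
    """
    doc_ids = sorted(vectors)
    centroids = [list(vectors[d]) for d in doc_ids[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        assignment = {
            d: min(range(k), key=lambda c: squared_distance(vectors[d], centroids[c]))
            for d in doc_ids
        }
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [vectors[d] for d in doc_ids if assignment[d] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


terms, vectors = pivot(rows)
clusters = k_means(vectors, k=2)

# Group document ids by cluster, in the Cluster_0 (...), Cluster_1 (...) style.
by_cluster = defaultdict(list)
for doc_id, cluster in clusters.items():
    by_cluster[cluster].append(doc_id)
for cluster in sorted(by_cluster):
    print(f"Cluster_{cluster} ({', '.join(by_cluster[cluster])})")
```

With only the two example documents and k=2, each document simply ends up in its own cluster; the sketch becomes meaningful once the full table is loaded into rows.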