decision tree vs k-means

I have run a decision tree and k-means in RapidMiner, but the results from the two appear to conflict with each other. I have checked, and my methods appear to be correct.

Is there any possible reason for these contradicting results? I would just like to understand the possible reasoning, so that I can better understand how RapidMiner works.

sgenzer

hi @sim so I'm a little confused. Decision Tree is a supervised learning algorithm; k-means clustering is an unsupervised learning algorithm. They are literally apples and oranges. How are you using these?

Scott

sim

Hi Scott,
Sorry if I'm being unclear. I'm new to RapidMiner and am just trying to understand why my results from these two mechanisms are contradicting each other.

sim

I know that they both belong to different machine learning types, but surely there should be some correlation between the results?

varunm1

Hi @sim

If possible, can you post your XML and sample data so we can check how the results contradict each other? From my understanding, k-means will cluster the data, and a decision tree can then help interpret the clustering. As an unsupervised algorithm, k-means uses only the numerical attributes to divide the data into clusters. Supervised algorithms like decision tree, however, work mainly from the label rather than from all the data at once: they train to fit their output labels. One big difference is that k-means considers all attributes, whereas a decision tree drops those that are not useful in fitting the output (for example, through pruning). You can get similar output if any one attribute is highly related to the output. But as @sgenzer said, a comparison between these two is not really suitable.
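To make the difference concrete, here is a minimal sketch in plain Python (not RapidMiner, and not its operators). Everything in it is an assumption for illustration: the toy rows, the helper names `kmeans2` and `best_stump`, the deterministic k-means start, and the use of a depth-1 tree scored by misclassification count. The data is built so that the label depends only on the first attribute while the dense groups are separated along the second.

```python
from collections import Counter

# Hypothetical toy rows: (x1, x2, label). By construction the label depends
# only on x1, while the two dense groups are separated along x2.
rows = [
    (4.0, 1.0, "no"), (6.0, 1.2, "yes"), (4.5, 0.8, "no"), (5.5, 1.1, "yes"),
    (4.0, 9.0, "no"), (6.0, 9.2, "yes"), (4.5, 8.8, "no"), (5.5, 9.1, "yes"),
]
points = [r[:2] for r in rows]
labels = [r[2] for r in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans2(points, iters=20):
    """Lloyd's algorithm for k=2. Simplifying assumption: a deterministic
    start from the first point and the point farthest from it."""
    centroids = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    assign = None
    for _ in range(iters):
        new_assign = [min((0, 1), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_assign == assign:
            break
        assign = new_assign
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign

def best_stump(points, labels):
    """Depth-1 decision tree: the (attribute, threshold) split with the fewest
    misclassified rows when each side predicts its majority label."""
    best = None
    for attr in range(len(points[0])):
        values = sorted({p[attr] for p in points})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            errors = 0
            for side in (True, False):
                side_labels = [lab for p, lab in zip(points, labels) if (p[attr] <= thr) == side]
                errors += len(side_labels) - Counter(side_labels).most_common(1)[0][1]
            if best is None or errors < best[2]:
                best = (attr, thr, errors)
    return best

clusters = kmeans2(points)                      # never sees the labels
attr, thr, errors = best_stump(points, labels)  # driven entirely by the labels
crosstab = Counter(zip(clusters, labels))
print("k-means clusters:", clusters)
print("tree splits on attribute index", attr, "at", thr, "with", errors, "errors")
print("cluster vs label counts:", dict(crosstab))
```

On this toy data the tree splits on the first attribute and separates the labels perfectly, while each k-means cluster ends up with an even mix of "yes" and "no". Neither result is wrong; they answer different questions about the same rows.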

Thanks
Varun

Telcontar120

What the others are saying is that it is unclear what you mean when you say the algorithms' outputs are "contradicting" each other, since they are not solving the same problem. It would be like saying a recipe for cookies contradicts the instructions for changing the oil in your car. They are not really doing the same thing at all.

In this case, the DT algorithm will look at your label and then generate a set of splits from all your other attributes that best separates the different values of the label.
The k-means algorithm will simply look at all your data and try to partition it into the number of groups that you specify, so that the examples within each group are as similar as possible (based on the similarity metric you select) across all the dimensions together. (And if you have numerical data and don't normalize it, the result can easily get skewed toward the attributes with the largest ranges, but that is another story.)
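The normalization remark can be illustrated with a small sketch in plain Python (again not RapidMiner itself). The ages and incomes are made-up numbers, the helper names are hypothetical, and for this tiny data set k-means is simply restarted from every pair of points, keeping the run with the lowest within-cluster sum of squares.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical toy data: (age in years, income in currency units).
# Ages form two tight groups (20s vs 60s); incomes are spread across both.
rows = [
    (20, 20000), (22, 40000), (24, 60000), (26, 80000),
    (60, 25000), (62, 45000), (64, 65000), (66, 85000),
]

def sq_dist(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, centroids, iters=50):
    """One run of Lloyd's k-means from the given starting centroids.
    Returns (within-cluster sum of squares, assignment list)."""
    centroids = list(centroids)
    assign = None
    for _ in range(iters):
        new_assign = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:
            break
        assign = new_assign
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(mean(col) for col in zip(*members))
    cost = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))
    return cost, assign

def kmeans2(points):
    """k=2 clustering: try every pair of points as the start, keep the cheapest run.
    Cluster ids are relabeled so that the first row is always in cluster 0."""
    runs = (lloyd(points, pair) for pair in combinations(points, 2))
    _, assign = min(runs, key=lambda run: run[0])
    return [a ^ assign[0] for a in assign]

def zscore(points):
    """Rescale every attribute to mean 0 and standard deviation 1."""
    stats = [(mean(col), pstdev(col)) for col in zip(*points)]
    return [tuple((v - m) / s for v, (m, s) in zip(p, stats)) for p in points]

raw_clusters = kmeans2(rows)           # distances dominated by income
norm_clusters = kmeans2(zscore(rows))  # both attributes on the same scale
print("raw:       ", raw_clusters)
print("normalized:", norm_clusters)
```

On the raw values the clusters follow income (lower four incomes vs higher four), because a difference of a few thousand in income swamps a 40-year difference in age. After z-scoring, the same procedure groups the rows by age instead.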

I hope this helps clarify.