Correlation matrix is weak. Decision tree accuracy 70%

mariozupan · October 2012

I have financial performance indicators as independent variables and the financial performance mark as the label. I tried the Correlation Matrix operator and got very weak correlations between the label and the indicators, although the marks (from A to E) are derived from the indicators. Do I need to optimize the parameters, or do some other type of optimization? Do I need to normalize the variables? Do I need discretization?
The same questions apply to the decision tree. I got 70% accuracy with pre-pruning disabled.
I mentioned the correlation matrix before the decision tree because it seems logical to me that I need a very strong correlation before using any learning operator. Correct me if I'm wrong.
Could you please show me the way?

MariusHelf · October 2012

Hi,

you don't necessarily need a strong correlation for good prediction results: correlation measures the (linear) relationship between each single attribute and the label, but it does not capture attribute interactions. Suppose you have a shop and want to find good customers. It may be that the age of a customer alone tells you nothing, and the city alone tells you nothing, but if you combine both attributes, you will see that old customers from New York buy a lot, and so do young customers from Seattle.
So here the predictive strength comes only from the combination of two attributes. This is not represented in the correlation matrix.
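
To make the shop example concrete, here is a small Python sketch (not a RapidMiner process). The data is made up for illustration: `is_old`, `is_new_york`, and `good` are hypothetical 0/1 columns, and a customer is "good" exactly when they are old and from New York, or young and from Seattle. Under that assumption each single attribute has zero correlation with the label, while the combination of both explains it completely.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: 100 customers, all four age/city combinations
# equally frequent. is_old: 1 = old, 0 = young.
# is_new_york: 1 = New York, 0 = Seattle.
rows = [(old, ny) for old in (0, 1) for ny in (0, 1)] * 25
is_old = [old for old, _ in rows]
is_new_york = [ny for _, ny in rows]

# Assumed rule: good customers are old + New York or young + Seattle.
good = [1 if old == ny else 0 for old, ny in rows]

# Each attribute on its own is uncorrelated with the label.
corr_age = pearson(is_old, good)
corr_city = pearson(is_new_york, good)

# A feature that combines both attributes is perfectly correlated.
combined = [1 if old == ny else 0 for old, ny in rows]
corr_combined = pearson(combined, good)

print(f"age vs. label:      {corr_age:.2f}")
print(f"city vs. label:     {corr_city:.2f}")
print(f"combined vs. label: {corr_combined:.2f}")
```

A decision tree can find this pattern by splitting on one attribute and then on the other, which is why a learner may do well even when the correlation matrix looks weak.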

For the accuracy: please keep in mind that it estimates the probability that a new example is classified correctly. If you have two equally frequent classes, anything above 50% is better than random guessing. However, if 70% of your examples are positive and your learner always predicts "positive", you already get an accuracy of 70%. So accuracy must always be interpreted in combination with the class priors.
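
As a rough sketch of that check in Python (again not a RapidMiner process): compute the accuracy of a learner that always predicts the most frequent class and compare your model against it. The label list below is invented to match the 70% example, and `majority_baseline` is a helper name made up for this illustration.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy obtained
    by always predicting that label."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical label column: 70% positives, 30% negatives.
labels = ["positive"] * 70 + ["negative"] * 30

majority_label, baseline_accuracy = majority_baseline(labels)
print(f"always predicting '{majority_label}' gives {baseline_accuracy:.0%} accuracy")
```

The same function works for marks A to E: if the most frequent mark alone covers about 70% of the examples, a tree with 70% accuracy has learned little beyond the class priors.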

Happy Mining!
~Marius
