Data mining case: your advice wanted
Hi guys,
following is a description of a study on a real dataset carried out by the author. I'd like to get feedback on what was done well and what should be improved. I hope the problem is interesting to think about, and thank you in advance for any ideas.
Data overview. A credit data set with 10k examples and 500 attributes. Credits were issued to a wide range of extremely risky customers, and the total default rate is about 50%. Credit applications were submitted online, so a lot of data points from various sources were collected, resulting in 500 attributes with information gain weight > 0.001 and pairwise correlation < 0.95. The attributes come from naturally different data sources; however, we cannot say they are independent. Correlation between attributes from different sources can be described as significant by nature, because good customers have a full set of “positive predictors” and bad customers a full set of “negative predictors”. The task is to build a classification model that better separates good and defaulted customers.
Weightings. If we take the attributes' information gain weights, sort them in descending order and compute a running total, the proportions are as follows: the top 50 attributes contain 50% of the total information, the top 100 about 70%, and the top 250 about 95%. (Summing weights may not be entirely sound, but I hope it gives a reasonable brief overview.) Thus it seems there is almost no chance to improve model accuracy with further attribute selection once the top 250 are included, and we should mostly work with the top 250 attributes.
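For reference, a minimal sketch of how these cumulative proportions can be computed with Weka's Java API (the file name credit.arff is a placeholder of mine, and the class label is assumed to be the last attribute):

```java
import java.util.Arrays;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CumulativeInfoGain {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; the class label is assumed to be the last attribute.
        Instances data = DataSource.read("credit.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Information gain weight of every non-class attribute with respect to the label.
        InfoGainAttributeEval ig = new InfoGainAttributeEval();
        ig.buildEvaluator(data);
        double[] weights = new double[data.numAttributes() - 1];
        double total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            weights[i] = ig.evaluateAttribute(i);
            total += weights[i];
        }

        // Sort descending and print the running share of the total weight.
        Arrays.sort(weights);
        double runningSum = 0.0;
        for (int rank = 1; rank <= weights.length; rank++) {
            runningSum += weights[weights.length - rank];
            if (rank == 50 || rank == 100 || rank == 250) {
                System.out.printf("top %d attributes: %.1f%% of total weight%n",
                        rank, 100.0 * runningSum / total);
            }
        }
    }
}
```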
Some attributes characterize the label only within a segment, so attribute selection by gain ratio may work better. We tried combining the information gain and gain ratio weights, using their average (also with different proportions), maximum and product. Building a model with the top 100 attributes ranked by the product of the weights seemed to work slightly better than ranking by information gain alone, but the difference was not significant.
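A sketch of one way to rank attributes by the product of information gain and gain ratio with the Weka Java API (file name and output format are placeholders, not the actual setup):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CombinedRanking {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; class label assumed to be the last attribute.
        Instances data = DataSource.read("credit.arff");
        data.setClassIndex(data.numAttributes() - 1);

        InfoGainAttributeEval ig = new InfoGainAttributeEval();
        ig.buildEvaluator(data);
        GainRatioAttributeEval gr = new GainRatioAttributeEval();
        gr.buildEvaluator(data);

        // Score each non-class attribute by the product of the two weights.
        List<double[]> scored = new ArrayList<>();
        for (int i = 0; i < data.numAttributes(); i++) {
            if (i == data.classIndex()) continue;
            double product = ig.evaluateAttribute(i) * gr.evaluateAttribute(i);
            scored.add(new double[] { i, product });
        }
        scored.sort(Comparator.comparingDouble((double[] a) -> a[1]).reversed());

        // Print the top 100 attributes under the combined ranking.
        for (int rank = 0; rank < 100 && rank < scored.size(); rank++) {
            int attIndex = (int) scored.get(rank)[0];
            System.out.printf("%3d  %-40s  %.6f%n",
                    rank + 1, data.attribute(attIndex).name(), scored.get(rank)[1]);
        }
    }
}
```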
Base learner. Random Forest (Weka) was selected as the base learner because of its overall efficiency, speed, and lower need for data preparation. SVM and non-linear logistic regression were also tried and tuned, but even with execution time kept within 5x that of Random Forest they showed significantly worse results. Thus all further approaches were tested with Weka's Random Forest.
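For completeness, a rough sketch of how such a comparison can be scripted with the Weka Java API. SMO and Logistic stand in here for the SVM and non-linear logistic regression setups, whose exact configuration is not shown above:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaseLearnerComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        // Default configurations; the actual SVM/logistic setups may have been tuned differently.
        Classifier[] candidates = { new RandomForest(), new SMO(), new Logistic() };
        for (Classifier c : candidates) {
            long start = System.currentTimeMillis();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 4, new Random(1));
            long elapsed = System.currentTimeMillis() - start;
            System.out.printf("%-15s accuracy=%.2f%%  AUC=%.3f  time=%ds%n",
                    c.getClass().getSimpleName(), eval.pctCorrect(),
                    eval.areaUnderROC(1), elapsed / 1000);
        }
    }
}
```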
Modeling. 4-fold cross-validation, looped 5 times and averaged, with a 100-tree Weka Random Forest (default K and depth) gives good results: accuracy improves as we add top-weighted attributes, up to the top 100. After the top 100 attributes, accuracy stops improving, and the question is how to extract the information remaining in the other 150 attributes.
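A minimal sketch of the 5-times-repeated 4-fold cross-validation with a 100-tree Random Forest, assuming the top-100 attribute subset has already been saved to a file (the file name is hypothetical):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RepeatedCrossValidation {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name: the top-100 attribute subset plus the class label.
        Instances data = DataSource.read("credit_top100.arff");
        data.setClassIndex(data.numAttributes() - 1);

        double accSum = 0.0, aucSum = 0.0;
        int repeats = 5, folds = 4;
        for (int seed = 1; seed <= repeats; seed++) {
            RandomForest rf = new RandomForest();
            rf.setNumIterations(100); // number of trees (Weka 3.8; older versions use setNumTrees)
            rf.setSeed(seed);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(rf, data, folds, new Random(seed));
            accSum += eval.pctCorrect();
            aucSum += eval.areaUnderROC(1);
        }
        System.out.printf("mean accuracy over %d x %d-fold CV: %.2f%%, mean AUC: %.3f%n",
                repeats, folds, accSum / repeats, aucSum / repeats);
    }
}
```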
Tuning the RF parameters K or depth does not give any noticeable improvement. The default K for 100 attributes is 7; we tried K from 7 to 20 with 100 and 200 trees. As Breiman described, with independent attributes increasing K should have an effect, but we cannot observe one on our data. Probably the number of trees should be larger when K is increased; would 200 be enough for 250 attributes?
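The K / tree-count sweep described above might look roughly like this with the Weka Java API (again, the input file name is a placeholder):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneRandomForest {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit_top100.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        int[] treeCounts = { 100, 200 };
        for (int trees : treeCounts) {
            for (int k = 7; k <= 20; k++) {
                RandomForest rf = new RandomForest();
                rf.setNumIterations(trees); // Weka 3.8; older versions use setNumTrees
                rf.setNumFeatures(k);       // K: attributes sampled at each split
                rf.setMaxDepth(0);          // 0 = unlimited depth (the default)
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(rf, data, 4, new Random(1));
                System.out.printf("trees=%d K=%d  accuracy=%.2f%%  AUC=%.3f%n",
                        trees, k, eval.pctCorrect(), eval.areaUnderROC(1));
            }
        }
    }
}
```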
PCA approach. As an approach to extracting information from the low-weighted attributes, we try to reduce the space and improve variable variance with Principal Components. PCA is applied separately to each data source so as not to lose information. Even so, it reduces model accuracy compared to the original attributes, so we keep the top N original attributes, apply PCA only to the remaining attributes, and then join the original and transformed attributes. For one data source this approach gave a noticeable result when a model was built on that data source alone. However, joining this data with all the remaining attributes (all data sources) does not improve the overall model accuracy. We varied the parameter N in the hope of finding a good split point between keeping original attributes and generating new ones, but still got no noticeable improvement.
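A compressed sketch of the keep-top-N-plus-PCA-on-the-rest idea. For brevity it applies PCA to all low-weighted attributes at once rather than per data source, and it assumes the attributes are already ordered by weight with the class label last; the file name and N are placeholders:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;
import weka.filters.unsupervised.attribute.Remove;

public class PcaOnTailAttributes {
    public static void main(String[] args) throws Exception {
        // Hypothetical file: attributes sorted by weight, class label last.
        Instances data = DataSource.read("credit_sorted_by_weight.arff");
        data.setClassIndex(data.numAttributes() - 1);
        int n = 100; // number of original attributes to keep as-is
        String classIdx = String.valueOf(data.numAttributes()); // 1-based index of the class

        // Part 1: the top-N original attributes plus the class label.
        Remove keepTop = new Remove();
        keepTop.setAttributeIndices("1-" + n + "," + classIdx);
        keepTop.setInvertSelection(true); // remove everything NOT in the list
        keepTop.setInputFormat(data);
        Instances topPart = Filter.useFilter(data, keepTop);

        // Part 2: the remaining attributes only (class unset), transformed by PCA.
        Instances noClass = new Instances(data);
        noClass.setClassIndex(-1);
        Remove keepTail = new Remove();
        keepTail.setAttributeIndices((n + 1) + "-" + (data.numAttributes() - 1));
        keepTail.setInvertSelection(true);
        keepTail.setInputFormat(noClass);
        Instances tailPart = Filter.useFilter(noClass, keepTail);

        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95); // keep enough components for 95% of the variance
        pca.setInputFormat(tailPart);
        Instances tailPcs = Filter.useFilter(tailPart, pca);

        // Join original and transformed attributes back into one dataset.
        Instances joined = Instances.mergeInstances(topPart, tailPcs);
        joined.setClassIndex(topPart.numAttributes() - 1); // class is last in the first part
        System.out.println("Joined dataset: " + joined.numAttributes() + " attributes");
    }
}
```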
Boosting approach. Maybe it is simply in the nature of things that Random Forest cannot be boosted. However, we tried all the implemented boosting operators to check this, and mostly got significantly worse results. Maybe the boosting was configured improperly: is it possible to boost RF at all?
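One way to wire up boosting around Random Forest in Weka is AdaBoostM1 with a deliberately weakened forest as the base learner. This is only a configuration sketch, not the setup actually used above; the parameter values are guesses:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostedForest {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit_top100.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        // Weakened base learner: a small, shallow forest, so boosting has room to work.
        RandomForest rf = new RandomForest();
        rf.setNumIterations(10); // few trees per boosting round (Weka 3.8; older: setNumTrees)
        rf.setMaxDepth(5);       // shallow trees increase the bias that boosting can reduce

        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(rf);
        boost.setNumIterations(20);   // number of boosting rounds
        boost.setUseResampling(true); // try resampling instead of instance reweighting

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(boost, data, 4, new Random(1));
        System.out.printf("boosted RF: accuracy=%.2f%%  AUC=%.3f%n",
                eval.pctCorrect(), eval.areaUnderROC(1));
    }
}
```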
Optimizing attribute selection. We have not done attribute selection optimization, because of the high computing time. Probably only a full brute-force search could give an improvement, and even that seems unlikely to be significant. Your advice on whether optimizing the selection may give an improvement would be useful.
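If a wrapper-style search is ever affordable, a greedy forward search with the model itself as the subset evaluator is the usual compromise between brute force and pure ranking. A sketch with the Weka Java API, starting from the top-250 subset (file name and parameters are placeholders):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GreedyStepwise;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperSelection {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; starting from the top-250 subset keeps runtime manageable.
        Instances data = DataSource.read("credit_top250.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate candidate subsets by cross-validated performance of the actual model.
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        RandomForest rf = new RandomForest();
        rf.setNumIterations(50); // a smaller forest keeps the search affordable (Weka 3.8 setter)
        wrapper.setClassifier(rf);
        wrapper.setFolds(3);

        // Greedy forward search instead of brute force over all subsets.
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(false);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(wrapper);
        selector.setSearch(search);
        selector.SelectAttributes(data);

        int[] chosen = selector.selectedAttributes(); // selected attribute indices (class appended last)
        System.out.println("selected " + (chosen.length - 1) + " attributes");
    }
}
```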
To conclude: Random Forest seems highly efficient, and our dataset seems well suited to it. But we stop gaining improvement after the top 100 attributes selected by weight. This may mean that no new information is contained in the remaining attributes, but we have no proof of that.
The general question is which approaches should be tried for further improvement, or, if no further improvement is possible, how to prove it.
Since this study was done by a newbie, any feedback, ideas and experience you can share would be very helpful!
Thanks,
and special thanks to the RapidMiner team for a great analytics tool.