Classification Advice
Greetings to all community members!
I am new to RM, which I find very interesting, and I am working on a classification problem. For an experienced user these might be easy questions, which is why I would like your advice.
I have a data set of 13000 rows. I am trying to predict whether a bank is going to sell its product through cellphone or telephone (label = Yes/No), given some data about its clients (age, job, personal loan, house loan, balance, etc.). I have tried Decision Trees, ANN and SVM, but in this question I am only going to ask about Decision Trees.
My attributes are binominal, polynominal and integer. For some of them there are a lot of missing values, which were already filled with “unknown” in the data that I have. One integer attribute, “pdays” (the number of days that passed after a client was last contacted in a previous campaign), is filled with -1 when the client was not previously contacted.
What I did first was to run my algorithms without any preprocessing, using Cross Validation, so that I would have first results to use as a baseline.
After that, having seen that my data is unbalanced (12% Yes and 88% No), I tried both upsampling and downsampling to see which one gives the best results, and then I optimized the parameters. In my case upsampling gave the best results.
But I was thinking that I should do some preprocessing as well. What I was thinking of doing is the following:
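To make the balancing step concrete, here is roughly what I mean by upsampling inside the cross validation, written out in plain Python. This is only a sketch and not my actual RM process: the function names, the toy rows and the fold logic are made up for illustration.

```python
import random

def upsample_minority(rows, label_key="label", seed=0):
    """Randomly duplicate minority-class rows until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, folds[i]

# Toy data (invented) with roughly the 12% / 88% split described above.
data = [{"age": 20 + i % 40, "label": "yes" if i % 8 == 0 else "no"}
        for i in range(200)]

for train_idx, test_idx in k_fold_indices(len(data)):
    train = upsample_minority([data[i] for i in train_idx])  # balanced
    test = [data[i] for i in test_idx]  # kept at its original class ratio
    # A learner would be fitted on `train` and scored on `test` here.
```

The point of the sketch is that only the training part of each fold is balanced, while the test part keeps the original 12/88 ratio.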
1) Missing Values:
Some attributes have many unknown values (80%) and others have just a few (4%). I believe I can’t just remove them, at least not the one with 80%, because I may lose information. But I don’t know whether replacing them with the most frequent value or the average would be a good solution, especially for polynominal attributes. I also don’t know what to do with the -1 value (80% of the values), because I think it is not good for my algorithms to keep it. What approach should I consider?
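To make the -1 question concrete, one encoding I could imagine is to split “pdays” into a flag plus a value that is missing when the client was never contacted, and to leave “unknown” as its own category. I am not sure this is the right approach. The sketch below is plain Python under my own assumptions: `split_pdays`, `previously_contacted` and the toy rows are hypothetical names and values, not anything from RM.

```python
def split_pdays(row):
    """Replace the -1 sentinel in 'pdays' with an explicit flag plus a missing value."""
    out = dict(row)  # copy, so the original row is not modified
    contacted = row["pdays"] != -1
    out["previously_contacted"] = contacted
    out["pdays"] = row["pdays"] if contacted else None
    return out

# Toy rows (invented values); "unknown" is kept as an ordinary category.
rows = [
    {"pdays": -1, "job": "unknown"},
    {"pdays": 92, "job": "technician"},
]
encoded = [split_pdays(r) for r in rows]
```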
2) Outliers:
I think that I should also check for outliers. For example, the attribute “balance” ranges from -8000 to 100000, but only one person has a balance of 100000 and the average is 1350. As I mentioned before, I have polynominal and integer types. Should I check for outliers only in the integer attributes or in all the attributes? And what about -1? Doesn’t it influence my results?
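The kind of check I have in mind for an integer attribute like “balance” is something like the interquartile-range rule. Below is a minimal Python sketch of it. The balances are invented toy values that only borrow the minimum, maximum and average mentioned above, and the factor 1.5 is the usual convention rather than something I have tuned.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Toy balances (invented), including the two extremes from my data.
balances = [-8000, 120, 450, 800, 1350, 1500, 2100, 3000, 100000]
flagged = iqr_outliers(balances)
```

On these toy values the rule flags only the two extremes, -8000 and 100000.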
3) Normalization:
Because my attributes have different ranges, I was thinking of normalizing the integers. I have watched RM tutorials that suggest doing the normalization inside the cross validation, on the training set. I did that, in the same training set where I had balanced my data, but I got worse results.
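As I understand the tutorials, the idea is that the normalization parameters are learned from the training set only and then reused on the test set. A minimal Python sketch of that idea follows, assuming a z-transformation. The function names and the toy ages are made up.

```python
import statistics

def fit_z_score(train_values):
    """Learn mean and standard deviation from the training values only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def apply_z_score(values, mean, std):
    """Scale values with parameters that were learned elsewhere."""
    if std == 0:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - mean) / std for v in values]

# Toy ages (invented) for one fold.
train_age = [25, 35, 45, 55]
test_age = [30, 60]

mean, std = fit_z_score(train_age)
train_scaled = apply_z_score(train_age, mean, std)
# The test fold reuses the training mean and std; it is never refitted.
test_scaled = apply_z_score(test_age, mean, std)
```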
4) Feature Selection:
I used Optimize Selection with backward selection. As I understand it, the attributes that get weight 0 should be taken out, so I removed them and reran the decision tree. The accuracy got slightly better and precision improved by 8%, but the F-measure and recall fell by about 20%. Is this normal?
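To show how I am reading these measures, here is a small Python sketch that computes precision, recall and F-measure from confusion-matrix counts. The counts are invented for illustration and are not my actual results. They only show that precision can go up while recall and F-measure go down, when the model predicts fewer Yes cases.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive (Yes) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts: 100 actual Yes cases in both scenarios.
before = precision_recall_f1(tp=60, fp=60, fn=40)  # many Yes predictions
after = precision_recall_f1(tp=40, fp=29, fn=60)   # fewer Yes predictions
```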
5) Correlation:
I also tried to see whether there is any correlation between the attributes. The attribute “duration” has weight 1, but in the correlation matrix I can’t tell which other attribute it is correlated with, because its column contains only small values (<0.08). I have the weights only for the integer attributes (7 in total). I then tried to include all the other attributes by converting Nominal to Binominal, but the resulting correlation matrix was very complicated (48x48).
From these results I don’t understand which attributes to take out. For example, the matrix has a value of -0.772 between marital = married and marital = single. Should highly correlated attributes be taken out?
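As far as I understand, the dummy columns created from one nominal attribute are mutually exclusive, so they are negatively correlated by construction, and that may be all the -0.772 reflects. The Python sketch below reproduces the effect on invented toy data. The category counts and the extra category “divorced” are assumptions of mine, not values taken from my data set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def one_hot(values):
    """Turn one nominal attribute into one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Toy attribute (invented counts): a row is married OR single OR divorced.
marital = ["married"] * 6 + ["single"] * 3 + ["divorced"] * 1
dummies = one_hot(marital)

# Strongly negative, although both columns come from the same attribute.
r = pearson(dummies["married"], dummies["single"])
```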
I have watched many tutorials and read many books (e.g. Introduction to Data Mining, Data Mining for the Masses), but theory is a little different from real-life problems.
I know my message is long, but I really need some advice on how to approach this problem, and I would like to thank you in advance for your time.
Best regards,
Nikos