"How to test the assumptions of logistic regression in RM?"
Hello! I am getting a bit lost in my analysis and I am stuck between two steps of my methodology. I have a classification problem (churn) with a dataset of 100 variables and 100,000 examples. After removing attributes with too many missing values, attributes too strongly correlated with others (pairs with a correlation above 75%), and attributes with too little variance, I have 46 attributes and 80,000 examples in my training set.
However, I'd like to end up with roughly 10 to 15 attributes, and to get there I'm using backward elimination. This means I already have to choose a predictive model, and for now I am considering either Random Forest or Logistic Regression. My first question is: which model should I use for the backward elimination, given that the results differ a lot depending on whether I run it with Logistic Regression or with Random Forest? Beyond that, I have two main questions:
1. ASSUMPTIONS BEHIND MODELS
For logistic regression, if I choose it, I am not sure how to verify the model's assumptions in RapidMiner:
- I have no idea how to test in RapidMiner the linearity between the independent variables and the log odds of the dependent variable. I read a bit about the Box-Tidwell test but can't find it in RM. Do you have any advice or help to offer?
- Little or no multicollinearity among the variables: is it a problem that some of my attributes are still correlated above 50%?
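To make the two checks above concrete, here is a rough sketch in plain Python, outside RapidMiner. It is only an illustration under some assumptions: the helper names (`invert`, `vif`, `fit_logit`, `box_tidwell_z`) are made up for this example, the Box-Tidwell check is reduced to its common practical form (add an x*ln(x) term for a strictly positive numeric attribute and look at the Wald z-statistic of its coefficient), the VIF is taken as the diagonal of the inverse correlation matrix, and the toy data at the bottom is invented.

```python
import math

def invert(a):
    """Invert a small square matrix (Gauss-Jordan with partial pivoting)."""
    n = len(a)
    # Augment every row with the matching row of the identity matrix.
    m = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (perfect collinearity?)")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]

def vif(corr):
    """Variance inflation factor per attribute, from a correlation matrix."""
    inv = invert(corr)
    return [inv[j][j] for j in range(len(corr))]

def fit_logit(rows, y, iters=25):
    """Newton-Raphson logistic regression: (coefficients, standard errors)."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, yi in zip(rows, y):
            eta = sum(b * v for b, v in zip(beta, r))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * r[i]
                for j in range(k):
                    info[i][j] += w * r[i] * r[j]
        cov = invert(info)  # inverse information matrix
        step = [sum(cov[i][j] * grad[j] for j in range(k)) for i in range(k)]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    return beta, [math.sqrt(cov[i][i]) for i in range(k)]

def box_tidwell_z(x, y):
    """Wald z-statistic of the x*ln(x) term; x must be strictly positive."""
    rows = [[1.0, xi, xi * math.log(xi)] for xi in x]
    beta, se = fit_logit(rows, y)
    return beta[2] / se[2]

# Two attributes with pairwise correlation 0.5: VIF = 1 / (1 - 0.5**2).
print(vif([[1.0, 0.5], [0.5, 1.0]]))

# Invented toy data: 10 examples per x value, with a rising churn rate.
xs, ys = [], []
for value, ones in zip([1, 2, 3, 4, 5, 6], [2, 3, 5, 6, 7, 8]):
    xs += [float(value)] * 10
    ys += [1] * ones + [0] * (10 - ones)
print(box_tidwell_z(xs, ys))
```

As I understand it, a |z| above roughly 1.96 for the x*ln(x) term would point to a non-linear relation with the log odds for that attribute, and a commonly quoted rule of thumb treats VIF values above 5 or 10 as a sign of problematic multicollinearity. With only two attributes, a 50% correlation gives a VIF of about 1.33, but with 46 attributes the VIF can be much higher than any single pairwise correlation suggests.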
2. CHOOSING THE MODEL - Blackbox models
Finally, after the backward elimination, how can I choose the best model? I know I can compare their ROC curves, AUC or accuracy, but I also need to take into account whether the assumptions are verified and whether the model is easy to interpret for anyone not familiar with data science. For example, I would like to avoid so-called "blackbox" models such as Neural Nets, but would you consider Random Forest a blackbox model?
Thank you all for your help. I'm getting a bit stressed by my upcoming deadline and will appreciate any advice.
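For the AUC part of the comparison, this is a minimal sketch of what I mean, written in plain Python rather than RapidMiner. The function name `auc` and the labels and scores are invented for illustration. It uses the pairwise definition of AUC (the share of positive/negative pairs where the positive example gets the higher score, ties counting one half), which is quadratic in the number of examples and so only suitable for a sample.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count wins of positives over negatives; a tie counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented churn labels and scores from two hypothetical models.
labels = [1, 1, 0, 0]
model_a = [0.9, 0.4, 0.6, 0.1]  # one positive is ranked below a negative
model_b = [0.9, 0.8, 0.3, 0.1]  # every positive is ranked above every negative
print(auc(labels, model_a), auc(labels, model_b))
```

Both models would be scored on the same held-out examples so that the two numbers are comparable.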
