
"How to test the assumptions of logistic regression in RM?"

User: "ophelie_vdp"
New Altair Community Member
Updated by Jocelyn
Hello! I am getting a bit lost in my analysis and am stuck between two steps of my methodology. I have a classification problem (churn) with a dataset of 100 variables and 100,000 examples. After removing attributes with too many missing values, attributes too strongly correlated with others (pairs with a correlation above 75%), and attributes with too little variance, I have 46 attributes and 80,000 examples in my training set.

However, I'd like to end up with roughly 10-15 attributes, so I'm using backward elimination. This means I already have to choose a predictive model, and for now I'm hesitating between Random Forest and Logistic Regression. My first question: which model should drive the backward elimination, given that the results differ considerably depending on whether I run it with logistic regression or with Random Forest? Beyond that, I have two main questions:
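To illustrate why the two selections disagree, here is a minimal sketch of backward elimination with scikit-learn (which could also run inside RapidMiner's Python Scripting / Execute Python operator, if you have that extension). The data, feature counts, and parameters below are made up for the example; each selector keeps the features that *its* model exploits best, so the surviving sets usually differ:

```python
# Hypothetical sketch: backward elimination driven by two different models.
# Synthetic data stands in for the churn set; sizes are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=12, n_informative=6,
                           random_state=0)

# Backward elimination scored by cross-validated AUC, once per model.
lr_sel = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=6,
    direction="backward", cv=3, scoring="roc_auc")
lr_sel.fit(X, y)

rf_sel = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=6, direction="backward", cv=3, scoring="roc_auc")
rf_sel.fit(X, y)

# The two retained feature sets need not coincide.
print("LR keeps features:", np.where(lr_sel.get_support())[0])
print("RF keeps features:", np.where(rf_sel.get_support())[0])
```

A common pragmatic answer is to run the elimination with the same model family you intend to deploy, since the selection is only "optimal" with respect to the model that scored it.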

1. ASSUMPTIONS BEHIND MODELS
For logistic regression, if I choose it, I am not sure how to verify the assumptions linked to this model in RapidMiner:
  • I have no idea how to test in RapidMiner the linearity between the independent variables and the log odds of the dependent variable. I have read a bit about the Box-Tidwell test but can't find it in RM. Do you have any advice or help to offer?
  • Assumption of little or no multicollinearity among the variables: is it problematic that some are still correlated above 50%?
Also, what are the assumptions behind Random Forest or Rule Induction? I'm trying to avoid violating important assumptions when using my models, and my statistics courses are not so fresh in my memory. And for all these models, do they work better after normalisation of the numeric variables? Should I normalise the data so as not to violate the assumptions?

2. CHOOSING THE MODEL - Blackbox models
Finally, after the backward elimination, how can I choose the best model? I know I can compare their ROC curves, AUC, or accuracy, but I also need to take into account whether the assumptions are verified and whether the model is easily interpretable for anyone not familiar with data science. For example, I would like to avoid so-called "black-box" models such as neural nets, but would you consider Random Forest a black-box model?
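For the metric-comparison part of this question, a cross-validated AUC comparison is straightforward to sketch (hypothetical data and model settings; Random Forest is often described as "grey-box" rather than black-box, since feature importances give some global interpretability even though individual predictions are hard to trace):

```python
# Hypothetical sketch: compare candidate models on cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    # Stratified 5-fold cross-validation, scored by area under the ROC curve.
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {auc.mean():.3f} +/- {auc.std():.3f}")
```

Comparing on the same cross-validation folds keeps the metric comparison fair; interpretability then becomes a separate, qualitative criterion layered on top.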

Thank you all for your help; I'm getting a bit stressed by my upcoming deadline and will appreciate any advice :)
