Dear RM community,
Could somebody help me with this in a bit more detail? I know data mining approaches sometimes differ from the way a researcher needs to present results. The paper linked below uses data from the MIMIC II database, a clinical database with 40,000 ICU patients (https://mimic.physionet.org/). I think the authors have done a nice job, and I would like to use their approach to analyse other attributes. My data is preprocessed, but I can't find how to compute a variance inflation factor, apply the Lowess smoothing technique, and finally have the odds ratios calculated and presented in the results.
Hoping someone can help me.
Cheers
Sven
This article is the subject of my question:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0095204

In the methods I read:

Continuous variables were tested for normality by using Kolmogorov–Smirnov test. Data of normal distribution were expressed as mean±SD and compared using t test. Otherwise, Wilcoxon rank-sum test was used for comparison. Categorical variables were expressed as percentage and compared using Chi square test or Fisher's exact test as appropriate. ICU mortality was used as the study endpoint.

To exclude confounding factors that may influence the association of iCa and mortality, logistic regression model was used to adjust for the odds ratios (OR). We built two models separately for Ca0 and Camean during ICU stay. The full model included all variables listed in Table 1.[8] Covariate selection was performed by using stepwise forward selection and backward elimination technique, with Ca0 and Camean remaining in the model. The significance level for selection was predefined as 0.15 and that for elimination was 0.2. After this step the main effect model was built.

Lowess smooth technique was used to examine the relationship between iCa and mortality in logit.[9] To facilitate clinical interpretation of our results and to meet the interests of subject-matter audience, we planned to use linear spline function for model building.[10] The knots were chosen according to conventional classification of iCa ranges: relative to the normal range of 1.15–1.25 mmol/L, we defined hypocalcemia as mild, moderate and severe as 0.9–1.15, 0.8–0.9 and <0.8 mmol/L, respectively. Hypercalcemia was divided into mild, moderate and severe as 1.25–1.35, 1.35–1.45 and >1.45 mmol/L, respectively.[11], [12]

Potential multicollinearity between covariates in the model were quantified by using variance inflation factor (VIF) which provided an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity.[13] As a common rule of thumb, a VIF>5 was considered for the existence of multicollinearity.

Furthermore, iCa was categorized into intervals and incorporated into regression models as design variable. Design variable, also known as dummy variable, is one that takes the value of 0 or 1 to indicate the presence or absence of some categorical effect that is expected to shift the outcome. It is frequent used for categorical variables with more than two categories. Normal range between 1.15 and 1.25 mmol/l was used as reference and ORs were reported for other intervals. Receiver operating characteristic curve (ROC) was depicted to show the diagnostic performance of fitted logistic regression models.
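
To make the question more concrete, here is a rough Python sketch of three of the steps quoted above: coding iCa as design (dummy) variables with the normal range as reference, fitting a logistic regression to get odds ratios with 95% confidence intervals, and computing a VIF per covariate. It is only an illustration under assumptions, not the authors' code. The function names (`design_matrix`, `fit_logit`, `vif`), the category labels, the handling of values that fall exactly on a cut-off, the extra covariate `age10`, and the simulated data are all made up for this example. Only the cut-off values come from the paper. Lowess smoothing, stepwise selection and the ROC curve are not covered.

```python
import numpy as np

# Cut-offs in mmol/L as quoted in the methods; the labels are hypothetical names.
EDGES = [0.8, 0.9, 1.15, 1.25, 1.35, 1.45]
LABELS = ["severe_hypo", "moderate_hypo", "mild_hypo", "normal",
          "mild_hyper", "moderate_hyper", "severe_hyper"]


def design_matrix(ica, reference="normal"):
    """Dummy-code iCa into 0/1 columns, leaving out the reference category."""
    idx = np.digitize(ica, EDGES)  # 0..6, position in LABELS (assumed edge handling)
    names = [label for label in LABELS if label != reference]
    cols = [(idx == LABELS.index(n)).astype(float) for n in names]
    return np.column_stack(cols), names


def fit_logit(X, y, n_iter=25):
    """Plain Newton-Raphson logistic regression with an intercept.

    Returns (coefficients, standard errors); element 0 is the intercept.
    """
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        info = X1.T @ (X1 * (p * (1.0 - p))[:, None])  # Fisher information
        beta = beta + np.linalg.solve(info, X1.T @ (y - p))
    # Recompute the information matrix at the final estimate for the SEs.
    p = 1.0 / (1.0 + np.exp(-X1 @ beta))
    info = X1.T @ (X1 * (p * (1.0 - p))[:, None])
    return beta, np.sqrt(np.diag(np.linalg.inv(info)))


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), taken from the diagonal of the inverse
    correlation matrix of the covariates."""
    return np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))


if __name__ == "__main__":
    # Simulated data only, NOT MIMIC II values.
    rng = np.random.default_rng(0)
    n = 4000
    ica = rng.normal(1.15, 0.15, size=n)
    age10 = rng.normal(0.0, 1.5, size=n)  # hypothetical covariate: centred age in decades
    D, names = design_matrix(ica)
    X = np.column_stack([D, age10])
    risk = 0.10 + 0.15 * D[:, [0, 5]].sum(axis=1)  # higher risk in the two severe bands
    died = (rng.random(n) < risk).astype(float)

    beta, se = fit_logit(X, died)
    for name, b, s, v in zip(names + ["age10"], beta[1:], se[1:], vif(X)):
        lo, hi = np.exp(b - 1.96 * s), np.exp(b + 1.96 * s)
        print(f"{name:15s} OR={np.exp(b):5.2f}  95% CI=({lo:.2f}, {hi:.2f})  VIF={v:.2f}")
```

Each odds ratio is the exponentiated coefficient of a dummy column, so it compares that iCa interval against the normal range while adjusting for the other covariates in `X`. In practice a statistics library would replace the hand-written parts. For example, statsmodels provides `variance_inflation_factor` and `lowess`. What I am really looking for is the equivalent inside RapidMiner.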