question marks in linear regression output

AD2019
AD2019 New Altair Community Member
edited November 2024 in Community Q&A
I ran a linear regression model with 18 independent variables and feature selection turned off. For some of the independent variables, there were question marks in place of the standard error of the estimate, and therefore also for the t-statistic and p-value of the coefficient. I ran the model again with feature selection turned on and got the same question marks. What do these question marks mean? They cannot have anything to do with missing values, as the regression would not have run to completion in that case. I am baffled about what these "?" symbols might mean. Help...

Answers

  • Telcontar120
    Telcontar120 New Altair Community Member
    Can you post your process XML? Do you have the bias parameter or 'exclude collinear features' checked in the LR operator? There are several options that can affect the output.

  • AD2019
    AD2019 New Altair Community Member
    Hi, I have attached my process .rmp file. 'Exclude collinear features' is unchecked, and you are correct about the bias: if 'use bias' is checked, I do not get question marks; if it is unchecked, I do. I did all this with 'feature selection' turned off. Something else is also strange. I then turned on feature selection and used T_Test as the selection method with alpha set to 0.05. I got a solution that included independent variables with p-values much, much higher than 0.05. I am confused about why these IVs were not trimmed from the output. Thanks in advance for your help.
  • AD2019
    AD2019 New Altair Community Member
    By the way, regardless of the cause, I would like to know what the question mark in the regression output is trying to communicate to the user. Does it mean a computational underflow or overflow, a computational error, or something else?
  • sgenzer
    sgenzer
    Altair Employee
    Hi @AD2019, I'm picking up this thread here. I have your process (thank you) but not the data set, so I cannot run the process. Can you please post it?
  • AD2019
    AD2019 New Altair Community Member
    My apologies for the delay in posting the data file. Please see attached. When I run the regression without bias, I get question marks in the regression model. What does that mean? The process file was posted earlier (RM-houseprice-process.rmp).
  • sgenzer
    sgenzer
    Altair Employee
    hi @AD2019 do you mean these ? marks?



    So the simple answer is that ? marks are used in RapidMiner when values are missing. The better question is why they are missing. My educated guess here (please correct me, @varunm1 @mschmitz, if my stats are wrong) is that there can be no std. coefficient or tolerance for the intercept of a LinReg model, as it's a computed value. All of your actual data (the other attributes) have std. coefficients, which makes sense. But my stats are a wee bit rusty, so I look to these other smart folks to correct me. ;)

    Scott

  • AD2019
    AD2019 New Altair Community Member
    Hi Scott:
    If you run the process with bias turned off, you will get question marks for some of the independent variables as well, not just the intercept. Since there is a question mark for the standard error of these variables, the t-statistic and p-value also show question marks. So it is not just an issue with the intercept. The data set does not have missing values, so I could not figure out what the question marks were trying to say. The only thing I could think of was numerical overflow or underflow when calculating the standard error of the associated variable, but then I could not see how the coefficients would have been computed.
    Amit
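
Amit's point that a missing standard error drags the t-statistic and p-value along with it can be sketched in a few lines. This is only an illustration, not RapidMiner's implementation: the function name `t_and_p` is hypothetical, NaN stands in for the "?" shown in the output, and the p-value uses a normal approximation instead of the t-distribution a regression table would normally use.

```python
import math

def t_and_p(coef, std_err):
    """Return (t, p) for a coefficient, or (nan, nan) if std_err is unusable.

    Assumption: two-sided p-value from a normal approximation,
    chosen only to keep the sketch free of third-party libraries.
    """
    nan = float("nan")
    if std_err is None or math.isnan(std_err) or std_err == 0.0:
        # No standard error means no t-statistic and no p-value either.
        return nan, nan
    t = coef / std_err
    p = math.erfc(abs(t) / math.sqrt(2.0))  # two-sided tail probability
    return t, p

# A missing standard error propagates to both derived columns.
print(t_and_p(2.0, float("nan")))
# A usable standard error gives ordinary values.
print(t_and_p(1.96, 1.0))
```

So a "?" in the standard error column is enough to explain the "?" in the two columns derived from it; the open question is only why the standard error itself is missing.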
  • sgenzer
    sgenzer
    Altair Employee
    hi Amit -

    Ah I understand. Good point. It's been a while since I've played with all of this (we normally use the GLM modeler instead of LinReg as it is far more versatile and robust). Let me investigate.

    Scott

  • AD2019
    AD2019 New Altair Community Member
    Thanks, Scott. Let me play around with GLM and see if I can get rid of the ? marks.
  • varunm1
    varunm1 New Altair Community Member
    edited November 2019 Answer ✓
    Hello @sgenzer and @AD2019

    I looked for H2O documentation on linear regression but unfortunately found none. For GLM to report p-values, H2O requires the following parameter settings; with them, the values are reported without "?" (unknown):

    1. You should uncheck the "Use Regularization" option.
    2. You should select "Add intercept".
    3. You should select "compute p-values".
    4. You should select "remove collinear columns".

    If these are set, then you will get the p-values, std. error, etc. without question marks. In this case, you will get question marks only when the coefficient is 0.

    I will see if I can find any information on linear regression.
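
One plausible reason the "remove collinear columns" setting matters is sketched below. The thread does not confirm that this is the cause in the original data set. The usual standard-error formula needs the inverse of X'X, and that inverse does not exist when one column is an exact multiple of another. The function name `std_errors_two_predictors`, the singularity tolerance, and the toy data are all assumptions made for illustration. The sketch covers only a no-intercept model with two predictors so that it can use a closed-form 2x2 inverse.

```python
import math

def std_errors_two_predictors(x1, x2, y):
    """No-intercept OLS with two predictors.

    Returns ((b1, b2), (se1, se2)); every value is NaN when X'X is
    singular, i.e. when x1 and x2 are perfectly collinear.
    """
    n = len(y)
    a = sum(u * u for u in x1)               # X'X[0][0]
    b = sum(u * v for u, v in zip(x1, x2))   # X'X[0][1] == X'X[1][0]
    d = sum(v * v for v in x2)               # X'X[1][1]
    det = a * d - b * b
    nan = float("nan")
    # Hypothetical tolerance for "numerically singular".
    if abs(det) <= 1e-12 * max(a * d, 1.0):
        return (nan, nan), (nan, nan)
    c1 = sum(u * t for u, t in zip(x1, y))   # X'y[0]
    c2 = sum(v * t for v, t in zip(x2, y))   # X'y[1]
    b1 = (d * c1 - b * c2) / det
    b2 = (a * c2 - b * c1) / det
    rss = sum((t - b1 * u - b2 * v) ** 2 for u, v, t in zip(x1, x2, y))
    sigma2 = rss / (n - 2)                   # residual variance estimate
    # diag((X'X)^-1) is (d/det, a/det) for a 2x2 matrix.
    return (b1, b2), (math.sqrt(sigma2 * d / det), math.sqrt(sigma2 * a / det))

# x2 is exactly 2 * x1, so the standard errors are undefined.
print(std_errors_two_predictors([1, 2, 3, 4], [2, 4, 6, 8], [1.0, 2.1, 2.9, 4.2]))
# Independent columns give finite standard errors.
print(std_errors_two_predictors([1, 0, 1, 0], [0, 1, 0, 1], [1.0, 2.0, 1.2, 1.8]))
```

A real solver may still report coefficients in the collinear case, for example by using a pseudo-inverse, which would match the observation above that coefficients were computed while the standard errors were not.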
  • AD2019
    AD2019 New Altair Community Member
    thank you Varun.