Columns with too many values

Chemical_eng

Hello.

I am using AutoModel for a regression problem (my target is continuous). I have three categorical input parameters: one has 27 values, another has 16, and the third has 107. I have toggled off the option of "Remove columns with too many values". Does this ensure that one-hot encoding is performed correctly for the column with 107 values?

What does it mean when the generalized linear model shows a coefficient of 0 for many categories? Is it not taking their impact into account?

Thanks

Accepted answers

YYH

Hi @Chemical_eng,

Thanks for sharing your experience using AutoML for a regression problem.

I have toggled off the option of "Remove columns with too many values". Does this ensure that one-hot encoding is performed correctly for the column with 107 values?

Yes and no. By default, RapidMiner AutoML uses "Target Encoding" only to remove attributes with too many values; no encoding is performed at that step. However, the GLM algorithm itself handles categorical columns directly by one-hot encoding them internally, so you don't have to convert nominal attributes to numerical ones beforehand for GLM. We strongly recommend against one-hot encoding categorical columns with many levels into many binary columns, as this is very inefficient. That is why the Target Encoding step runs before GLM's internal one-hot encoding.
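To make the difference between the two encodings concrete, here is a minimal plain-Python sketch. It is a generic illustration of one-hot encoding and mean target encoding, not the RapidMiner operators or what GLM does internally. The function names and the 107-level toy data are hypothetical.

```python
from collections import defaultdict


def one_hot_encode(values):
    """Return (levels, rows): one binary column per distinct category."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


def target_encode(values, targets):
    """Replace each category with the mean target observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, y in zip(values, targets):
        sums[v] += y
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]


# Hypothetical data: 107 distinct category labels, two rows per label.
values = [f"cat_{i % 107}" for i in range(214)]
targets = [float(i % 107) for i in range(214)]

levels, one_hot_rows = one_hot_encode(values)
encoded = target_encode(values, targets)

print(len(levels))           # number of binary columns created by one-hot encoding
print(len(one_hot_rows[0]))  # width of a single one-hot row
print(len(encoded))          # target encoding yields one numeric value per row
```

One-hot encoding widens the table by one column per category, while target encoding keeps a single numeric column. This naive version uses the same rows to compute the means and to encode, so a real workflow would need smoothing or a hold-out scheme to avoid leaking the target.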

I tested the Titanic data in AutoML to predict the passenger fare.
Open the process here:

Image: https://us.v-cdn.net/6038102/uploads/editor/7c/6wqmyp3mx4vg.png

In Design view, you can locate the operator that handles nominal attributes (another tip: activate the Tree view). Here it is:

Image: https://us.v-cdn.net/6038102/uploads/editor/z0/b9tgi8236zcq.png

Inside the subprocess "Basic Feature Engineering", you will find "Target Encoding" rather than one-hot encoding, as shown in my example. If you turn on "Remove columns with too many values" with the maximum number of values set to 10, the Target Encoding model removes the attribute "Life boat" but, by default, performs no encoding. You can customize this step by replacing it with one-hot encoding operators.

What does it mean when the generalized linear model shows a coefficient of 0 for many categories? Is it not taking their impact into account?

The large number of zero coefficients usually comes from the regularization in GLM. Simply put, regularization reduces the number of predictors in the model in order to reduce the variance of the prediction error, handle correlated predictors, and avoid overfitting. https://en.wikipedia.org/wiki/Regularization_(mathematics)

Image: https://us.v-cdn.net/6038102/uploads/editor/q8/qbcxa0ow49p5.png
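As a rough illustration of why regularization can set category coefficients to exactly zero, here is a small plain-Python sketch. It assumes the penalty includes an L1 (lasso-type) term, and it fixes the intercept at the overall target mean for simplicity. It is not the GLM implementation used by AutoML; the function names and data are made up.

```python
from collections import defaultdict


def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 if |value| <= threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def l1_category_effects(categories, targets, penalty):
    """L1-penalized effect per category, intercept fixed at the overall mean.

    Minimizes 0.5 * sum((y - intercept - effect[cat])**2)
              + penalty * sum(abs(effect[cat])).
    One-hot columns never overlap, so each effect has a closed form:
    the soft-thresholded mean residual of that category.
    """
    intercept = sum(targets) / len(targets)
    residuals = defaultdict(list)
    for cat, y in zip(categories, targets):
        residuals[cat].append(y - intercept)
    effects = {}
    for cat, res in residuals.items():
        mean_residual = sum(res) / len(res)
        effects[cat] = soft_threshold(mean_residual, penalty / len(res))
    return intercept, effects


# Hypothetical data: categories A and B sit close to the overall mean.
categories = ["A", "A", "B", "B", "C", "C", "D", "D"]
targets = [10.0, 12.0, 10.5, 11.5, 20.0, 22.0, 9.0, 11.0]

intercept, unpenalized = l1_category_effects(categories, targets, penalty=0.0)
_, penalized = l1_category_effects(categories, targets, penalty=6.0)

print(unpenalized)  # every category has a non-zero effect
print(penalized)    # effects smaller than the threshold become exactly 0.0
```

In this sketch, a zero coefficient means the category's effect was too small to survive the penalty, so predictions for it fall back to the intercept. The column is still used; only those categories are treated as the baseline.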

Again, in the process view, you can toggle off the option of regularization.

Hope it helps.

Cheers,
YY

All comments

Chemical_eng

Many thanks for this answer

Chemical_eng

I performed the procedure, but when I open the results of the model simulator operator, it shows one input variable per category (as if it had kept the encoding). This is not what I want.

YYH

Thank you @Chemical_eng! The model simulator from AutoML uses the data as it is before the one-hot encoding handled by GLM.

As the screenshot shows, there is a dropdown list of all possible values of the categorical variable.

If you are available for a follow-up, I could walk you through the details in a quick call.

Image: https://us.v-cdn.net/6038102/uploads/editor/6t/q0yfudanqahc.png

Chemical_eng

Yes, I would like to have a call, because after switching to one-hot encoding my simulator does not show it like that. How can we arrange this? Thanks.