Why do we need to normalise data and group them together?

filan New Altair Community Member
edited November 2024 in Community Q&A

Hello fellow practitioners,

 

I have a statistics question, and hopefully someone can explain it to me.

 

I am trying to solve a linear regression problem and trying to impute missing values. This is a setup created by my professor, and we are required to figure out the intent behind it.

 

This is his setup: Impute Missing Values -> Optimize Parameters (Grid) -> Cross Validation.

 

[Screenshot of the process: Screen Shot 2017-05-27 at 10.54.40 PM.png]

 

According to my understanding, this setup essentially uses k-NN to locate the k nearest data points and then creates a value to fill in the missing columns. What I do not understand is why we need to normalize the data first and then pass the preprocessing model, together with the output of k-NN, into the Group Models operator. I believe the same goal can be achieved without both the Normalize and Group Models operators, right?
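For context, here is roughly what I think the imputation step does, sketched in Python with scikit-learn's KNNImputer as a stand-in (this is just my own illustration, not the actual RapidMiner operator):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with a missing value in the second column
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])

# For each missing cell, find the k nearest complete rows and
# fill the cell with the mean of their values (uniform weights by default)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```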

 

Or is it trying to obtain the best k-value?

 


Best Answer

  • Thomas_Ott New Altair Community Member
    Answer ✓
    The setup of your professor is correct. k-NN is susceptible to scaling issues, which is why he is normalizing the data. In order to honestly evaluate whether the model is good, you're using cross-validation. You put the Normalize and the Group Models operators on the training side because this normalizes the training data to zero mean and unit variance and then applies the same transformation to the testing data before measuring performance. The Group Models operator essentially applies the models in the same order: first you create a Normalize preprocessing model, then you create a k-NN model, so you apply the Normalize model to your test data first, then apply the k-NN model, and then measure performance.
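If it helps, here is a rough scikit-learn analogy of that idea (my own sketch, not an export of your professor's process): the Pipeline plays the role of Group Models, so the scaler fitted on each training fold is re-applied unchanged to the corresponding test fold before the k-NN model runs, and the grid search plays the role of Optimize Parameters (Grid) by picking the best k.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Pipeline = "Group Models": the scaler fitted on the training fold
# is re-applied, unchanged, to the test fold before the k-NN model runs.
pipeline = Pipeline([
    ("normalize", StandardScaler()),   # zero mean, unit variance
    ("knn", KNeighborsRegressor()),
])

# GridSearchCV = "Optimize Parameters (Grid)": search for the best k
# using an inner cross-validation on the training data only.
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Outer cross-validation = the Cross Validation operator: an honest
# estimate of how the whole (normalize -> k-NN) procedure performs.
scores = cross_val_score(search, X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
```

The key point is that the test fold never influences the normalization statistics, which is exactly what putting Normalize and Group Models on the training side achieves.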

Answers

  • filan New Altair Community Member

    k-NN is susceptible to scaling issues?

  • Telcontar120 New Altair Community Member

    Definitely. Most of the distance metrics for numerical attributes (Euclidean, Manhattan, etc.) are calculated in the raw units in which the attributes are measured. So if you have two (or more) attributes in the dataset with very different scales, the total distance function will be dominated by the attributes with the highest absolute values. You can easily verify this for yourself by taking a dataset, running k-NN on it both normalized and un-normalized, and observing the differences in predictions.
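A quick way to see it (my own toy example in Python, with made-up numbers): give one attribute a much larger scale than the other and watch the nearest neighbour change once you normalize.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Two attributes on very different scales:
# column 0 lives in 0-10, column 1 in the tens of thousands.
X = np.array([
    [9.0, 50_100.0],   # far from the query in column 0, very close in column 1
    [1.0, 60_000.0],   # identical to the query in column 0, 10k away in column 1
    [5.0, 10_000.0],
])
query = np.array([[1.0, 50_000.0]])

# Raw units: column 1 dominates the Euclidean distance, so row 0 "wins".
nn_raw = NearestNeighbors(n_neighbors=1).fit(X)
print(nn_raw.kneighbors(query)[1])                            # -> [[0]]

# After normalization both columns contribute comparably and row 1 wins instead.
scaler = StandardScaler().fit(X)
nn_scaled = NearestNeighbors(n_neighbors=1).fit(scaler.transform(X))
print(nn_scaled.kneighbors(scaler.transform(query))[1])       # -> [[1]]
```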

     

  • filan New Altair Community Member

    Thanks for the thorough explanation.