Trying to understand MLP output
herbert12345
New Altair Community Member
Hi,
I am currently trying to understand the output of the W-MultilayerPerceptron operator. Let us consider a toy model without hidden layers. Output might look like this.
From my understanding this should be equivalent to a linear regression. So I train a LinearRegression model on the same input data, using the predictions of the above "MLP" as the label (in order to rule out differences in the fitting algorithm). The results show that the linear model indeed reproduces the "MLP" output perfectly. The coefficients, however, are completely different:
Linear Node 0
    Inputs            Weights
    Threshold         0.4052907755005098
    Attrib O3        -0.2617907901506467
    Attrib NO2       -0.05083306647141619
    Attrib Altitude  -0.14881316186685326
    Attrib z          0.35660878655615114
    Attrib sza_rad   -0.44846864905805994
Class
    Input
    Node 0
I assume that this is because of the normalization done in the MLP operator. So here is the question: assume I want to implement the above "MLP" in my own code: how must I process my data and the results?
- 0.0000070221 * O3
- 0.0000717637 * NO2
- 0.0004435178 * Altitude
+ 0.0003188475 * z
- 0.0040543204 * SZA*pi/180.
+ 0.0145570907
Thanks for your reply
Answers
From my understanding Linear Regression and a Single Layer Perceptron should produce different weight values.
A single layer perceptron starts with random weights, then:
Takes a single data point.
Propagates the input forward through the network.
Calculates the error.
Computes the gradient of the error with respect to the weights.
Moves the weights against the gradient, scaled by the learning rate.
Repeats.
Linear regression, by contrast, calculates the optimal weights in closed form.
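The contrast can be sketched in a few lines of NumPy. The synthetic data, learning rate, and epoch count below are my own choices for illustration, not something from the operators; with a linear output node and a noise-free target, both routes land on essentially the same weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 1.0  # noise-free linear target

# Closed-form least squares (what linear regression does)
Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
w_closed = np.linalg.lstsq(Xb, y, rcond=None)[0]

# Single-layer perceptron with a linear output, trained by per-sample SGD
w_sgd = rng.normal(scale=0.1, size=3)  # random initial weights (w1, w2, bias)
lr = 0.01                              # learning rate
for epoch in range(200):
    for xi, yi in zip(Xb, y):
        err = xi @ w_sgd - yi          # forward pass and error
        w_sgd -= lr * err * xi         # step against the gradient

print(w_closed)  # ≈ [ 2. -3.  1.]
print(w_sgd)     # converges towards the same weights
```

So the training procedures differ, but on a well-behaved problem the SGD weights drift towards the closed-form solution.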
About data normalisation:
The Neural Net operator has an option to turn off the data normalisation.
Alternatively, I think you could normalise your data yourself beforehand, so the operator's normalisation changes nothing,
using: (value - min) / (max - min)
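As a minimal sketch of that min-max scaling in Python (the function name is mine):

```python
def minmax(values):
    """Scale a list of numbers to [0, 1] via (value - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(minmax([10.0, 15.0, 20.0]))  # [0.0, 0.5, 1.0]
```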
Thank you for your reply.
I understand that they might take different routes to obtain their weights. But assuming a fair amount of convergence, the weights should end up being about the same, up to normalization that is. Indeed I managed to make them the same by turning on the "I" and "C" options in the W-MLP operator.
I think I have managed to understand how things work by now. The problem was in part caused by a misunderstanding of mine as to how things work. Still it troubles me that the W-MLP output is not complete in the sense that the normalization employed is not documented. (I believe now that it normalizes both attributes and labels to the interval [-1,1] using 2*(value-min)/(max-min)-1).
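Assuming that guess about the normalisation is right, applying the "MLP" weights in your own code would mean mapping each input into [-1, 1] using the training min/max, and mapping the network's output back to the original label scale. A sketch of the two maps (function names are mine):

```python
def to_pm1(v, vmin, vmax):
    """Map v from [vmin, vmax] to [-1, 1]: 2*(v - vmin)/(vmax - vmin) - 1."""
    return 2.0 * (v - vmin) / (vmax - vmin) - 1.0

def from_pm1(v, vmin, vmax):
    """Inverse map: take a value in [-1, 1] back to the [vmin, vmax] scale."""
    return (v + 1.0) / 2.0 * (vmax - vmin) + vmin

# Round trip: normalise an attribute value, then recover it.
x = to_pm1(320.0, 250.0, 450.0)
print(from_pm1(x, 250.0, 450.0))  # ≈ 320.0
```

The important point is that vmin and vmax must come from the training data, otherwise new data is mapped onto a different scale than the one the weights were fitted on.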
What bothers me though is that my final model (i.e. with hidden layers) appears to have a certain bias. Well, I guess I can fix that.
Thanks for helping.
I believe this is standard when the tanh sigmoid function is used:
2*(value - min)/(max - min) - 1, which maps to [-1, 1].
When the normal sigmoid, 1 / (1 + exp(-x)), is used, the data is normalised to
(value - min)/(max - min), which maps to [0, 1].
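In other words, the normalisation range is chosen to match the output range of the activation function. A quick check in Python of the two saturation ranges:

```python
import math

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Logistic sigmoid saturates towards 0 and 1, so [0, 1] scaling matches it.
print(sigmoid(-20.0), sigmoid(20.0))

# tanh saturates towards -1 and 1, so [-1, 1] scaling matches it.
print(math.tanh(-20.0), math.tanh(20.0))
```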
This is indeed poorly documented.
Should I take a look in WEKA's source code? Or the RM source code?
What do you mean that the final model has a certain bias?
Don't all learners have a certain bias?
Edit:
This link briefly mentions normalisation:
http://en.wikiversity.org/wiki/Learning_and_neural_networks
This kind of makes sense. Although through experimentation I found that the only way to get things right is to normalize to [-1, 1] and use standard sigmoid nodes, as in 1/(1+exp(-x)). Maybe a look into the source code would help to clear things up.
About the bias: Looking closer I see that for some reason the prediction is actually off by a linear map, that is, I get good correlations (as in 0.999...) but scatter plots show that the model is rather off. This could easily be fixed by applying a linear model in post of course, but I think it is strange.
Edit: My fault. Shouldn't wonder about offsets if training data and validation data are processed in different ways ... :-[