A Deployable Non-Linear Mathematical Optimization Example with SLC
This example is based on a real customer requirement in which a GAM model has already been developed to capture the non-linearities in many of the independent variables. The aim is to build a mathematical optimization around the GAM model such that the optimization can be deployed to production.
The requirements to have it in production is that the lower and upper bounds of the decision variables can be controlled on a website by the user. This means the values for these bounds need to be passed back into the optimization algorithm and the optimized decision variables need to be returned for display.
GAM Non-Linear model
The model development will be described using a generic example instead of Market Optimization. This facilitates developing a generic prototype as well as avoiding sensitive information.
The raw dataset:
title 'Patterns of Diabetes'; data diabetes; @; logCP = log(CPeptide); datalines; 5.2 -8.1 4.8 8.8 -16.1 4.1 10.5 -0.9 5.2 10.6 -7.8 5.5 10.4 -29.0 5.0 1.8 -19.2 3.4 12.7 -18.9 3.4 15.6 -10.6 4.9 5.8 -2.8 5.6 1.9 -25.0 3.7 2.2 -3.1 3.9 4.8 -7.8 4.5 7.9 -13.9 4.8 5.2 -4.5 4.9 0.9 -11.6 3.0 11.8 -2.1 4.6 7.9 -2.0 4.8 11.5 -9.0 5.5 10.6 -11.2 4.5 8.5 -0.2 5.3 11.1 -6.1 4.7 12.8 -1.0 6.6 11.3 -3.6 5.1 1.0 -8.2 3.9 14.5 -0.5 5.7 11.9 -2.0 5.1 8.1 -1.6 5.2 13.8 -11.9 3.7 15.5 -0.7 4.9 9.8 -1.2 4.8 11.0 -14.3 4.4 12.4 -0.8 5.2 11.1 -16.8 5.1 5.1 -5.1 4.6 4.8 -9.5 3.9 4.2 -17.0 5.1 6.9 -3.3 5.1 13.2 -0.7 6.0 9.9 -3.3 4.9 12.5 -13.6 4.1 13.2 -1.9 4.6 8.9 -10.0 4.9 10.8 -13.5 5.1 ; |
The initial GAM model:
TITLE 'Results of GAM model'; ODS GRAPHICS ON; PROC GAM DATA=diabetes plots(unpack)=components(commonaxes clm); MODEL logCP = spline(Age) spline(BaseDeficit); OUTPUT OUT=outputSet PREDICTED LINP; RUN; ODS GRAPHICS OFF; |
The use of the GAM model captures non-linear effects from the variables (Age, BaseDeficit). The actual result of the GAM model has the form of a regression but with the non-linear effects captured by the splines. The model can be summarized as:
y = logCP(Age,BaseDeficit) = Intercept +Linear(Age)⋅Age + Linear(BaseDeficit)⋅Deficit +…..
…. Spline(Age)+Spline(BaseDeficit)….
The output dataset called 'outputSet', contains the necessary information in the formula above to calculate the result. The Intercept, Linear(Age) and Linear(BaseDeficit) are the coefficients of regression found in the table of 'Parameter Estimates'. Note that Spline(Age) and Spline(BaseDeficit) are NOT coefficients, but they are the non-linear effects added to the regression equation above. Essentially, each {Spline(x)} vector is the same length as the input variable vector {x}. The Spline(x) data is found in the ' outputSet' as column predicted_X.
Regression Model
Optimization requires the model to be used for scoring purposes alone, i.e. without model fitting each time. The GAM procedure however, requires model fitting and scoring to be done in any one call. Hence this section is about developing a second model that has an independent scoring process than the model fitting.
The results from the above GAM procedure is:
Obs | Age | BaseDeficit | CPeptide | logCP | predicted | linp | predicted_AGE | predicted_BASEDEFICIT |
1 | 5.2 | -8.1 | 4.8 | 1.56862 | 1.54240 | 1.49082 | 0.0640784089 | -0.012491911 |
2 | 8.8 | -16.1 | 4.1 | 1.41099 | 1.49619 | 1.47014 | 0.045848366 | -0.01980206 |
3 | 10.5 | -0.9 | 5.2 | 1.64866 | 1.68099 | 1.63213 | 0.0265245723 | 0.0223361485 |
4 | 10.6 | -7.8 | 5.5 | 1.70475 | 1.58433 | 1.57112 | 0.0250266292 | -0.01181451 |
The results of the GAM procedure has transformed Dependent Variable logCP, the independent variables (Age, BaseDeficit) and the splines (predicted_Age, predicted_BaseDefect). To build the new Scoring Model, in addition to the raw data (Age, BaseDeficit, CPeptide), the splines data is also needed. Hence the GAM model will also include the SCORE keyword as shown below:
The initial GAM model:
TITLE 'Results of GAM model'; ODS GRAPHICS ON; PROC GAM DATA =diabetes plots(unpack)=components(commonaxes clm); MODEL logCP = spline(Age) spline(BaseDeficit); SCORE DATA=diabetes OUT=outputScore; OUTPUT OUT=outputSet PREDICTED LINP; RUN; ODS GRAPHICS OFF; |
The outputScore dataset from the GAM model above, will now be used in a Regression model fitting, so that the scoring can be done independently using SCORE Procedure.
PROC REG DATA = outputScore OUTEST=RegOut; OxyHat: MODEL logCP = Age BaseDeficit p_AGE p_BASEDEFICIT; TITLE 'Regression Scoring Example'; RUN; PROC PRINT DATA=RegOut; TITLE2 'OUTEST= Data Set from PROC REG'; RUN; |
Note that the Regression model uses outputScore as the input to fit the Regression. This dataset is generated from the SCORE statement of the GAM procedure. In other words, we can use Design Of Experiments to create a much larger set of input variables, having only {Age, BaseDeficit} as independent variables, and the resultant outputScore will have the splines effects {p_AGE, pBASEDEFICIT} which can then be used to fit/train the Regression model.
Scoring
The PROC SCORE is used here with the model created above, 'RegOut’, to calculate predictions using the input data 'outputScore'.
PROC SCORE DATA = outputScore SCORE =RegOut OUT=RScoreP TYPE=parms; VAR Age BaseDeficit p_AGE p_BASEDEFICIT; RUN; PROC PRINT DATA =RScoreP; TITLE2 'Predicted Scores for Regression'; RUN; |
A snapshot of results is shown below. The columns to compare here are P (prediction of logCP from GAM procedure) with the end result OXYHAT. In production, the SCORE DATA line in GAM Procedure will be using a new set of input variables, thus leading to an entirely different outputScore. Here, the SCORE DATA line is using the original raw data so that we can compare results at the end to check this proposed model.
Obs | Age | BaseDeficit | CPeptide | logCP | P | P_LINK | LINP | P_AGE | P_BASEDEFICIT | OXYHAT |
1 | 5.2 | -8.1 | 4.8 | 1.56862 | 1.54240 | 1.54240 | 1.49082 | 0.0640784089 | -0.012491911 | 1.54680 |
2 | 8.8 | -16.1 | 4.1 | 1.41099 | 1.49619 | 1.49619 | 1.47014 | 0.045848366 | -0.01980206 | 1.49566 |
3 | 10.5 | -0.9 | 5.2 | 1.64866 | 1.68099 | 1.68099 | 1.63213 | 0.0265245723 | 0.0223361485 | 1.68876 |
4 | 10.6 | -7.8 | 5.5 | 1.70475 | 1.58433 | 1.58433 | 1.57112 | 0.0250266292 | -0.01181451 | 1.58413 |
5 | 10.4 | -29.0 | 5.0 | 1.60944 | 1.59458 | 1.59458 | 1.37638 | 0.0278986807 | 0.1902954402 | 1.63298 |
Mathematical Optimization
The general components of mathematical optimization are:
- Objective Function
- Decision Variables
- Limits / Constraints
Machine Learning and Predictive Models are independent in concept to mathematical optimization. The predictive model may come into the optimization setup; usually in the objective function. From discussions above, the predictive model is the Regression model for the target variable logCP. In this case it happens that the Objective Function is exactly the predictive model. (Keep in mind the underlying non-linearities has been captured by the GAM model, on top of which the Regression model was created.)
/* ***** Initial DUMMY Data ***** */ /* REPLACE these with Inputs from SLC HUB REST API */ DATA xOptMatrix_ReplaceMe; INPUT xInitial xLower xUpper; /* x-variables are {Age BaseDeficit P_Age P_BASEDEFICIT} */ DATALINES; 5.2 1 16 -8.1 -30 0 0.064 -0.2 0.1 -0.0125 -0.05 0.20 ; RUN; |
The Decision Variables for mathematical optimization are defined in the matrix xOptMatrix_ReplaceMe above. It consists of 3 columns which are the initial values, lower bounds and upper bounds respectively for each of the decision variables. In this case there are 4 decision variables, hence 4 rows.
/* The section below with 'New Data vector dataset', 'outputScore', need to be replaced by REAL input data */ PROC IML;
/* Converting SAS dataset into IML Matrix */ * Regression model dataset; USE work.Regout; vars_Intcpt = {Intercept Age BaseDeficit P_Age P_BASEDEFICIT}; READ POINT 1 VAR vars_Intcpt INTO regModelMatrix; CLOSE work.Regout; PRINT regModelMatrix;
* New Data vector dataset; * SLC Hub Deployment, DON'T NEED to change this section. Just change "DATA xOptMatrix_ReplaceMe" definitaion above; USE xOptMatrix_c; vars_xInitLowerUpper = {xInitial xLower xUpper}; READ ALL VAR vars_xInitLowerUpper INTO xInitLowerUpper; CLOSE xOptMatrix_ReplaceMe; PRINT xInitLowerUpper;
START F_REGSCORE(x) GLOBAL(regModelMatrix); * f = dot_product(model coefficients , new data) + INTERCEPT; f = SUM(regModelMatrix[1,2:5] # x) + regModelMatrix[1,1]; return(f); FINISH F_REGSCORE; |
The snippet above contains 3 sections described below. Some operations below involve reading the datasets and converting to SAS/IML matrix form required by the SAS/IML optimization routines.
Section 1. Using the stored/saved regression model 'work.Regout’, it converts the single row (POINT 1) representing the model, into a matrix called regModelMatrix. These are essentially the coefficients of the regression model.
Section 2. Converts the input dataset regModelMatrix into a matrix called xInitLowerUpper.
Section 3. The objective function is defined here as F_REGSCORE. In this business problem, the objective function is exactly equal to the prediction of Regression model. However, PROC SCORE cannot be used to calculate the prediction because this code is inside the function definition (START F_REGSCORE .... FINISH F_REGSCORE) which must be passed into an optimization function later. To overcome this problem, we simply use the definition of the linear regression formula, which is to multiply the vector of variables with the vector of coefficients (plus the Intercept of course).
* TESTING ONLY New Data vector dataset; USE outputScore; vars1 = {Age BaseDeficit P_Age P_BASEDEFICIT}; READ POINT 1 VAR vars1 INTO testRowData; CLOSE outputScore; PRINT testRowData;
fscore = F_REGSCORE(testRowData); PRINT fscore; * Initial value should be: 1.546704616 ; |
Having defined the components that are necessary to feed into the optimization routines, we can now do a quick test to check the code integrity, i.e. to make sure the data are in the correct structures, etc. The test above takes a row of variables from the dataset outputScore, representing just a single data point, and feeds into the objective function F_REGSCORE.
/* Optimization: Test example using Predictive Model, function F_REGSCORE*/ xinit = xInitLowerUpper[,1]`; * 1st column, then TRANSPOSE into row vector.; rc = 0; * Return code; xres = j(1, 4, 0); * Empty row vector of 5 elements; opt={0 3}; * Options, see official docs on non-linear opt.;
/* HARD CODING below because this blc non-linear constraint can be quite general; ie it can be expanded later for more constraints con = { xInitLowerUpper[1,2] xInitLowerUpper[2,2] xInitLowerUpper[3,2] xInitLowerUpper[4,2], xInitLowerUpper[1,3] xInitLowerUpper[2,3] xInitLowerUpper[3,3] xInitLowerUpper[4,3] };*/ con = j(2,4,0); * 2 rows, 4 cols; con[1,] = xInitLowerUpper[, 2]`; * con(1st row) = lower bounds; con[2,] = xInitLowerUpper[, 3]`; * con(2nd row) = upper bounds; PRINT con;
CALL nlpnms(rc, xres, "F_REGSCORE", xinit, opt, con, , , , );
PRINT xres; RUN; |
Finally, this last snippet implements the mathematical optimization process. There are many different non-linear optimization solvers and the one chosen here is nlpnms() - the Nelder-Mead simplex optimization. This routine is one of the few routines that does not need an algebraic formulation, especially the gradients, hessians, jacobians of the objective function. For more information, please search the internet with the keyword nlpnms.
The challenge of using the objective functions are the numerous options. For more details, please refer to the official documentation of the nlpnms function. One thing to note is the very specialised format to capture the constraint equations and the limits of the decision variables, in the variable con. In the implementation above, only the lower and upper bounds of the decision variables are considered. The structure of the con is known as the 'blc matrix’ - please search the internet for more details on this.
The result of the optimization process is found in xres, which is the set of values for the decision variables, that optimizes the objective function.