"Using RM to optimize R hyperparameters"
keith
New Altair Community Member
Hi,
I'm interested in using RapidMiner to find optimal hyperparameter values when tuning an R model. In particular, I'd like to use EvolutionaryOptimization to do so, but I've run into several issues I can't quite figure out myself.
I've got a simple test case that demonstrates what I want to do. An R script builds a model using the "penalized" function from the R package "penalized", which takes a parameter lambda2 that controls how severe a penalty is applied. The goal of the process is to optimize the value of lambda2. I use 10-fold cross-validation to estimate the generalization performance for each penalty factor tried. The example works, selecting 100 as the best parameter on the list. But I can't get it to run using evolutionary parameter optimization, primarily because I can't seem to construct and pass a numeric parameter into the R script.
Questions:
1) How can I specify a numeric parameter to be used inside the R code? The grid optimization uses a list of values to set a macro named "lambda2" inside the validation; I can then use the macro inside the R code to vary the penalty. But if I try to replace the grid optimization with evolutionary optimization, I am not permitted to specify a range, because the macro value could be a string rather than a numeric. I couldn't see any way other than the macro approach to pass a parameter value into the R code.
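To illustrate the R side of this, here's the coercion I'd expect to need inside the script. This is only a sketch: it assumes the macro is substituted as raw text before the script runs (which matches what I see in the grid setup), and it doesn't make the optimizer itself treat the parameter as numeric.
# %{lambda2} is RapidMiner macro syntax; it is replaced with the macro's
# text before R sees the script, so quoting it and coercing it with
# as.numeric() tolerates "100" as well as "100.0" from the optimizer.
lambda2 <- as.numeric("%{lambda2}")
stopifnot(is.finite(lambda2))  # fail fast if the macro wasn't numeric text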
2) In cross-validating, the R script "Build Training Model" returns a generic R object, not a model, so I couldn't directly connect the port to pass it to the testing side. I got around this by storing the R object in the repository and retrieving it on the testing side. This seems awkward, but I couldn't figure out how else to pass an R object around. Then, in order to get RM to accept the process, I had to connect the R object to the model port on the training side, even though it complains that they aren't compatible objects. Is there a better way to do this?
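One alternative that might avoid the repository entirely is to pass the model through the file system. A sketch only: the path is illustrative, the lm() fit stands in for the penalized() model, and it assumes both Execute Script (R) operators can see the same directory (if each script runs in its own R session, tempdir() would differ, so a fixed path would be needed).
# training side: persist the fitted model to a file instead of the repository
model.file <- file.path(tempdir(), "penalized_model.RData")  # illustrative path
mod.penalized <- lm(dist ~ speed, data = cars)               # stand-in for the penalized() fit
save(mod.penalized, file = model.file)

# testing side: restore mod.penalized into the workspace
load(model.file)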
3) There doesn't appear to be a way to delete an object from the repository from within a process. I'm temporarily storing an R object in the repository during cross-validation and wanted to remove it when completed, but the only two operators are Store and Retrieve. If I could solve 2) without using the repository, this concern would go away for now, although I can see the functionality being pretty important. Did I miss something obvious?
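(If the file-based sketch above worked, cleanup would also reduce to one line at the end of the testing-side script:)
# remove the temporary model file once the fold has been evaluated;
# unlink() is silent if the file is already gone
unlink(file.path(tempdir(), "penalized_model.RData"))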
4) Because RM doesn't seem to know about applying R models, I manually construct an example set from the testing label and the R-generated predictions within another R script, and calculate performance from that. This seems to work, although RM complains about unspecified metadata when I connect the constructed example set to the labelled data port of the Performance operator. Not a big deal, but I thought it worth mentioning in case there's a cleaner way to do this.
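For a regression label, the number itself is easy to reproduce in R as a sanity check. A minimal sketch with stand-in data; the Performance operator of course offers more criteria than this:
# stand-in actual/predicted columns like those built in "Evaluate vs test data"
actual    <- c(1.0, 2.0, 3.0)
predicted <- c(1.1, 1.9, 3.2)
rmse <- sqrt(mean((actual - predicted)^2))  # root mean squared error
print(rmse)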
5) The "results.label <- column_name" trick for setting roles on an R data frame when it is converted back to an RM data table worked for the label, but not for the prediction, which is why the "Change role" operator is in the process.
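For anyone reproducing this, the convention as I understand it is sketched below; the companion variable names must match the name of the returned data frame:
# the data frame handed back to RapidMiner...
results <- data.frame(actual = c(1, 2, 3), predicted = c(1.1, 1.9, 3.2))
# ...plus companion character variables naming the role columns:
results.label      <- "actual"     # this one took effect on conversion
results.prediction <- "predicted"  # this one did not, hence the Set Role operator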
Note that you'll need the R package "penalized" installed for this test case to work.
Any suggestions would be welcome. I want to use RM to do a lot of this kind of parameter tuning, since I find similar capabilities in R somewhat lacking. Thanks for any help.
Keith
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
<process expanded="true" height="295" width="681">
<operator activated="true" class="generate_data" compatibility="5.1.001" expanded="true" height="60" name="Generate Data" width="90" x="112" y="165">
<parameter key="target_function" value="non linear"/>
</operator>
<operator activated="true" class="optimize_parameters_grid" compatibility="5.1.001" expanded="true" height="94" name="Optimize Parameters (Grid)" width="90" x="380" y="120">
<list key="parameters">
<parameter key="Set lambda2.value" value="10,100,1000"/>
</list>
<process expanded="true" height="313" width="1005">
<operator activated="true" class="x_validation" compatibility="5.1.001" expanded="true" height="112" name="Validation" width="90" x="447" y="38">
<parameter key="sampling_type" value="shuffled sampling"/>
<process expanded="true" height="313" width="477">
<operator activated="true" class="set_macro" compatibility="5.1.001" expanded="true" height="76" name="Set lambda2" width="90" x="45" y="30">
<parameter key="macro" value="lambda2"/>
<parameter key="value" value="1000"/>
</operator>
<operator activated="true" class="r:execute_script_r" compatibility="5.1.000" expanded="true" height="76" name="Build Training Model" width="90" x="180" y="30">
<parameter key="script" value="library(penalized) library(e1071) print(paste("lambda2 is:",%{lambda2})) mod.penalized <- penalized( 			 label ~ att1 + att2 + att3 + att4 + att5 			, data=my.data 			, standardize=TRUE 			, lambda1=10 			, lambda2=%{lambda2} 			) "/>
<enumeration key="inputs">
<parameter key="name_of_variable" value="my.data"/>
</enumeration>
<list key="results">
<parameter key="mod.penalized" value="Generic R Result"/>
</list>
</operator>
<operator activated="true" class="store" compatibility="5.1.001" expanded="true" height="60" name="Store" width="90" x="328" y="30">
<parameter key="repository_entry" value="PenalizedModel_temp"/>
</operator>
<connect from_port="training" to_op="Set lambda2" to_port="through 1"/>
<connect from_op="Set lambda2" from_port="through 1" to_op="Build Training Model" to_port="input 1"/>
<connect from_op="Build Training Model" from_port="output 1" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="313" width="496">
<operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="PenalizedModel_temp"/>
</operator>
<operator activated="true" class="r:execute_script_r" compatibility="5.1.000" expanded="true" height="112" name="Evaluate vs test data" width="90" x="180" y="30">
<parameter key="script" value="results <- cbind.data.frame( 			 actual = my.data$label 			, predicted = predict(model, data=my.data)[,1] 		) results.prediction <- "predicted" results.label <- "actual" "/>
<enumeration key="inputs">
<parameter key="name_of_variable" value="my.data"/>
<parameter key="name_of_variable" value="model"/>
<parameter key="name_of_variable" value="ignore_me"/>
</enumeration>
<list key="results">
<parameter key="results" value="Data Table"/>
</list>
</operator>
<operator activated="true" class="set_role" compatibility="5.1.001" expanded="true" height="76" name="Change role of prediction to prediction" width="90" x="315" y="30">
<parameter key="name" value="predicted"/>
<parameter key="target_role" value="prediction"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="performance_regression" compatibility="5.1.001" expanded="true" height="76" name="Performance" width="90" x="396" y="30"/>
<connect from_port="model" to_op="Evaluate vs test data" to_port="input 3"/>
<connect from_port="test set" to_op="Evaluate vs test data" to_port="input 1"/>
<connect from_op="Retrieve" from_port="output" to_op="Evaluate vs test data" to_port="input 2"/>
<connect from_op="Evaluate vs test data" from_port="output 1" to_op="Change role of prediction to prediction" to_port="example set input"/>
<connect from_op="Change role of prediction to prediction" from_port="example set output" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 2"/>
<connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
Hi Keith,
I'm glad to hear that the R Extension is being used so broadly. At the same time I'm a little sad, because it's not yet where it could be. Why? Because all the needed components are already inside the extension: learning methods can be specified in an XML dialect where you can enter not only the R script for training a model, but also the script for applying it. You can even specify parameters and capabilities just as RapidMiner operators use them.
Our plan is to give people like you the possibility to encapsulate R code completely transparently within a RapidMiner operator, so that it can interact easily with other RM operators. Your R code library would then turn into a reusable operator library in RapidMiner, shareable with others...
If you are interested in this level of detail, I would recommend participating in the Special Interest Group for the R integration. I would personally prefer to take the discussion there. If that's OK, send me your email address and I will send you an invitation.
Greetings,
Sebastian
PS: Yes, there's still no operator for deleting repository entries... But for temporary objects you could replace Store/Retrieve with the faster Remember/Recall operators, which keep the object in memory for the duration of the process instead of writing it to the repository.