"How to carry out symbolic regression?"

mzn
New Altair Community Member
Are there any tutorials or examples on how to use RM to carry out symbolic regression?
Answers
Hi @mzn

Traditional approaches to symbolic regression often suffered from a phenomenon called feature bloat, which is why they are hardly used any longer. They have been replaced by a combination of linear regression (for assigning coefficients) and automatic feature generation. In RapidMiner, you would combine the operators Generalized Linear Model and Automatic Feature Engineering for this. The multi-objective optimization approach keeps the feature bloat in check and therefore reduces the risk of overfitting. I have attached a small demo process below.

I gave a presentation in London last week which also covered this to some degree. For that discussion I used data similar to the data in the example process mentioned above. I attached a couple of relevant slides showing a simple linear regression model, a decision tree model, a GBT model, and a model consisting of linear regression combined with automatic feature engineering. As in symbolic regression, the resulting formula can be read directly; in this case it was prediction(y) = 10,550 * |x| + 7,565 * x * |x|^2 + 705 / |x| + 17,394.

Here are some relevant links:
- https://docs.rapidminer.com/latest/studio/operators/modeling/optimization/automatic_feature_engineering.html
- https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/functions/generalized_linear_model.html
- https://rapidminer.com/resource/automatic-feature-engineering/ - check out the video from minute 30 on
And finally, the little demo process is below.

Hope this helps,
Ingo

<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="9.2.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="289">
        <parameter key="target_function" value="one variable non linear"/>
        <parameter key="number_examples" value="3000"/>
        <parameter key="number_of_attributes" value="1"/>
        <parameter key="attributes_lower_bound" value="-25.0"/>
        <parameter key="attributes_upper_bound" value="25.0"/>
        <parameter key="gaussian_standard_deviation" value="10.0"/>
        <parameter key="largest_radius" value="10.0"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="1977"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="add_noise" compatibility="9.2.000" expanded="true" height="103" name="Add Noise" width="90" x="179" y="289">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="random_attributes" value="0"/>
        <parameter key="label_noise" value="0.01"/>
        <parameter key="default_attribute_noise" value="0.0"/>
        <list key="noise"/>
        <parameter key="offset" value="0.0"/>
        <parameter key="linear_factor" value="1.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.2.000" expanded="true" height="103" name="Split Data (2)" width="90" x="313" y="289">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="9.2.000" expanded="true" height="82" name="Generate ID" width="90" x="581" y="442">
        <parameter key="create_nominal_ids" value="false"/>
        <parameter key="offset" value="0"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply" width="90" x="447" y="187"/>
      <operator activated="true" class="model_simulator:automatic_feature_engineering" compatibility="9.2.000" expanded="true" height="103" name="Automatic Feature Engineering" width="90" x="581" y="34">
        <parameter key="mode" value="feature selection and generation"/>
        <parameter key="balance for accuracy" value="1.0"/>
        <parameter key="show progress dialog" value="true"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="use optimization heuristics" value="false"/>
        <parameter key="maximum generations" value="100"/>
        <parameter key="population size" value="30"/>
        <parameter key="use multi-starts" value="true"/>
        <parameter key="number of multi-starts" value="5"/>
        <parameter key="generations until multi-start" value="10"/>
        <parameter key="use time limit" value="true"/>
        <parameter key="time limit in seconds" value="60"/>
        <parameter key="use subset for generation" value="false"/>
        <parameter key="maximum function complexity" value="6"/>
        <parameter key="use_plus" value="false"/>
        <parameter key="use_diff" value="false"/>
        <parameter key="use_mult" value="true"/>
        <parameter key="use_div" value="true"/>
        <parameter key="reciprocal_value" value="true"/>
        <parameter key="use_square_roots" value="true"/>
        <parameter key="use_exp" value="false"/>
        <parameter key="use_log" value="false"/>
        <parameter key="use_absolute_values" value="true"/>
        <parameter key="use_sgn" value="false"/>
        <parameter key="use_min" value="false"/>
        <parameter key="use_max" value="false"/>
        <process expanded="true">
          <operator activated="true" class="split_data" compatibility="9.2.000" expanded="true" height="103" name="Split Data" width="90" x="45" y="136">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.3"/>
            </enumeration>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="true"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="h2o:generalized_linear_model" compatibility="9.2.000" expanded="true" height="124" name="Generalized Linear Model" width="90" x="179" y="34">
            <parameter key="family" value="AUTO"/>
            <parameter key="link" value="family_default"/>
            <parameter key="solver" value="AUTO"/>
            <parameter key="reproducible" value="false"/>
            <parameter key="maximum_number_of_threads" value="4"/>
            <parameter key="use_regularization" value="false"/>
            <parameter key="lambda" value="1.0"/>
            <parameter key="lambda_search" value="false"/>
            <parameter key="number_of_lambdas" value="0"/>
            <parameter key="lambda_min_ratio" value="0.0"/>
            <parameter key="early_stopping" value="true"/>
            <parameter key="stopping_rounds" value="3"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="alpha" value="1.0"/>
            <parameter key="standardize" value="true"/>
            <parameter key="non-negative_coefficients" value="false"/>
            <parameter key="add_intercept" value="true"/>
            <parameter key="compute_p-values" value="false"/>
            <parameter key="remove_collinear_columns" value="false"/>
            <parameter key="missing_values_handling" value="MeanImputation"/>
            <parameter key="max_iterations" value="0"/>
            <parameter key="specify_beta_constraints" value="false"/>
            <list key="beta_constraints"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model" width="90" x="380" y="136">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_regression" compatibility="9.2.000" expanded="true" height="82" name="Performance" width="90" x="514" y="136">
            <parameter key="main_criterion" value="root_mean_squared_error"/>
            <parameter key="root_mean_squared_error" value="true"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="false"/>
            <parameter key="prediction_average" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <connect from_port="example set source" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Generalized Linear Model" to_port="training set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Generalized Linear Model" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_performance sink" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply (2)" width="90" x="715" y="34"/>
      <operator activated="true" class="model_simulator:apply_feature_set" compatibility="9.2.000" expanded="true" height="82" name="Apply Feature Set" width="90" x="849" y="187">
        <parameter key="handle missings" value="true"/>
        <parameter key="keep originals" value="false"/>
        <parameter key="originals special role" value="true"/>
      </operator>
      <operator activated="true" class="h2o:generalized_linear_model" compatibility="9.2.000" expanded="true" height="124" name="Generalized Linear Model (2)" width="90" x="983" y="187">
        <parameter key="family" value="AUTO"/>
        <parameter key="link" value="family_default"/>
        <parameter key="solver" value="AUTO"/>
        <parameter key="reproducible" value="false"/>
        <parameter key="maximum_number_of_threads" value="4"/>
        <parameter key="use_regularization" value="false"/>
        <parameter key="lambda_search" value="false"/>
        <parameter key="number_of_lambdas" value="0"/>
        <parameter key="lambda_min_ratio" value="0.0"/>
        <parameter key="early_stopping" value="true"/>
        <parameter key="stopping_rounds" value="3"/>
        <parameter key="stopping_tolerance" value="0.001"/>
        <parameter key="standardize" value="true"/>
        <parameter key="non-negative_coefficients" value="false"/>
        <parameter key="add_intercept" value="true"/>
        <parameter key="compute_p-values" value="false"/>
        <parameter key="remove_collinear_columns" value="false"/>
        <parameter key="missing_values_handling" value="MeanImputation"/>
        <parameter key="max_iterations" value="0"/>
        <parameter key="specify_beta_constraints" value="false"/>
        <list key="beta_constraints"/>
        <parameter key="max_runtime_seconds" value="0"/>
        <list key="expert_parameters"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply (3)" width="90" x="715" y="442"/>
      <operator activated="true" class="model_simulator:apply_feature_set" compatibility="9.2.000" expanded="true" height="82" name="Apply Feature Set (2)" width="90" x="849" y="340">
        <parameter key="handle missings" value="true"/>
        <parameter key="keep originals" value="false"/>
        <parameter key="originals special role" value="true"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="1117" y="340">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <operator activated="true" class="concurrency:join" compatibility="9.2.000" expanded="true" height="82" name="Join" width="90" x="1251" y="442">
        <parameter key="remove_double_attributes" value="true"/>
        <parameter key="join_type" value="inner"/>
        <parameter key="use_id_attribute_as_key" value="true"/>
        <list key="key_attributes"/>
        <parameter key="keep_both_join_attributes" value="false"/>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Add Noise" to_port="example set input"/>
      <connect from_op="Add Noise" from_port="example set output" to_op="Split Data (2)" to_port="example set"/>
      <connect from_op="Split Data (2)" from_port="partition 1" to_op="Multiply" to_port="input"/>
      <connect from_op="Split Data (2)" from_port="partition 2" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Multiply (3)" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Automatic Feature Engineering" to_port="example set in"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Apply Feature Set" to_port="example set"/>
      <connect from_op="Automatic Feature Engineering" from_port="feature set" to_op="Multiply (2)" to_port="input"/>
      <connect from_op="Multiply (2)" from_port="output 1" to_op="Apply Feature Set" to_port="feature set"/>
      <connect from_op="Multiply (2)" from_port="output 2" to_op="Apply Feature Set (2)" to_port="feature set"/>
      <connect from_op="Apply Feature Set" from_port="example set" to_op="Generalized Linear Model (2)" to_port="training set"/>
      <connect from_op="Generalized Linear Model (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Multiply (3)" from_port="output 1" to_op="Apply Feature Set (2)" to_port="example set"/>
      <connect from_op="Multiply (3)" from_port="output 2" to_op="Join" to_port="right"/>
      <connect from_op="Apply Feature Set (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Join" to_port="left"/>
      <connect from_op="Apply Model (2)" from_port="model" to_port="result 1"/>
      <connect from_op="Join" from_port="join" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="315"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
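For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the two ingredients described in the reply above: build a few candidate features from x, then let ordinary least squares assign the coefficients. It is not what the Automatic Feature Engineering operator does internally (there is no evolutionary search and no multi-objective selection here). The candidate features, the simulated target function, its coefficients, and every function name are assumptions chosen only for illustration.

```python
import random


def solve_linear_system(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical candidate features, fixed by hand rather than searched for.
CANDIDATES = [
    ("1", lambda x: 1.0),
    ("|x|", lambda x: abs(x)),
    ("x * |x|^2", lambda x: x * abs(x) ** 2),
    ("1 / |x|", lambda x: 1.0 / abs(x)),
]

# Assumed coefficients used to simulate the data, in CANDIDATES order.
TRUE_COEFFICIENTS = [3.0, 2.0, 0.5, 4.0]


def make_data(n=500, seed=1992):
    """Simulate noisy (x, y) pairs from the assumed formula."""
    rng = random.Random(seed)
    xs = []
    while len(xs) < n:
        x = rng.uniform(-5.0, 5.0)
        if abs(x) > 0.5:  # stay away from 0 so 1 / |x| is well behaved
            xs.append(x)
    ys = [
        sum(c * f(x) for c, (_, f) in zip(TRUE_COEFFICIENTS, CANDIDATES))
        + rng.gauss(0.0, 0.01)
        for x in xs
    ]
    return xs, ys


def fit_formula(xs, ys):
    """Expand x into the candidate features, then fit one coefficient each."""
    rows = [[f(x) for _, f in CANDIDATES] for x in xs]
    return fit_least_squares(rows, ys)


xs, ys = make_data()
coefficients = fit_formula(xs, ys)
print("prediction(y) = " + " + ".join(
    f"{c:.3f} * {name}" for c, (name, _) in zip(coefficients, CANDIDATES)))
```

The model stays linear in its coefficients, so the fitted formula can be read off directly, while the nonlinearity lives entirely in the generated features. A real feature search would also have to decide which candidates to keep, which is the part the operator handles.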
Thanks a lot Ingo. I am interested in the following:
1. I have a set of data points (x1, x2, x3...) with a corresponding output (y1)
2. I need to derive a relation (in the form of an equation) that links x1, x2, x3 to y1, so that I can predict the output for any input variables.
3. Can I do this in RM? If yes, is there a simple example I/my graduate students can follow?
4. Your YouTube videos are very helpful! Thanks!
Hi @mzn

Thanks for your kind words.

The process above is a cool example, but maybe not simple enough. Pretty much all machine learning models in RapidMiner can be used for this task, but I would start with a simple linear regression. The process below shows a simple example of this. If you use the Model Simulator as I do in this example, the students can even play around with some of the inputs and see how the model reacts. You can see the Simulator in this video (around minute 6:40): https://academy.rapidminer.com/learn/video/auto-model-classification

More helpful videos on this can be found here: https://academy.rapidminer.com/catalog?label=search&value=regression

Hope this helps,
Ingo

<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="9.2.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
        <parameter key="target_function" value="sum"/>
        <parameter key="number_examples" value="1000"/>
        <parameter key="number_of_attributes" value="5"/>
        <parameter key="attributes_lower_bound" value="-10.0"/>
        <parameter key="attributes_upper_bound" value="10.0"/>
        <parameter key="gaussian_standard_deviation" value="10.0"/>
        <parameter key="largest_radius" value="10.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="add_noise" compatibility="9.2.000" expanded="true" height="103" name="Add Noise" width="90" x="179" y="34">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="random_attributes" value="5"/>
        <parameter key="label_noise" value="0.05"/>
        <parameter key="default_attribute_noise" value="0.0"/>
        <list key="noise"/>
        <parameter key="offset" value="0.0"/>
        <parameter key="linear_factor" value="1.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.2.000" expanded="true" height="103" name="Split Data" width="90" x="313" y="187">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="linear_regression" compatibility="9.2.000" expanded="true" height="103" name="Linear Regression" width="90" x="447" y="34">
        <parameter key="feature_selection" value="none"/>
        <parameter key="alpha" value="0.05"/>
        <parameter key="max_iterations" value="10"/>
        <parameter key="forward_alpha" value="0.05"/>
        <parameter key="backward_alpha" value="0.05"/>
        <parameter key="eliminate_colinear_features" value="true"/>
        <parameter key="min_tolerance" value="0.05"/>
        <parameter key="use_bias" value="true"/>
        <parameter key="ridge" value="1.0E-8"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model" width="90" x="581" y="238">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <operator activated="true" class="model_simulator:model_simulator" compatibility="9.2.000" expanded="true" height="103" name="Model Simulator" width="90" x="782" y="136"/>
      <connect from_op="Generate Data" from_port="output" to_op="Add Noise" to_port="example set input"/>
      <connect from_op="Add Noise" from_port="example set output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Linear Regression" to_port="training set"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Linear Regression" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Linear Regression" from_port="exampleSet" to_op="Model Simulator" to_port="training data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Model Simulator" to_port="test data"/>
      <connect from_op="Apply Model" from_port="model" to_op="Model Simulator" to_port="model"/>
      <connect from_op="Model Simulator" from_port="simulator output" to_port="result 1"/>
      <connect from_op="Model Simulator" from_port="model output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="105"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
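For comparison, the same kind of fit from x1, x2, x3 to y can be sketched in plain Python. This only illustrates what a linear regression gives you (an explicit equation plus predictions for new inputs) and does not reproduce the RapidMiner process. The data-generating rule, the noise level, and all function names are made up for the example.

```python
import random


def solve_linear_system(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit_linear_regression(inputs, outputs):
    """Return [intercept, w1, w2, ...] that minimise the squared error."""
    rows = [[1.0] + list(x) for x in inputs]  # leading 1.0 models the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outputs)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, x):
    """Evaluate the fitted equation for one input vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def as_equation(weights):
    """Render the fitted model as a readable equation."""
    terms = [f"{weights[0]:.3f}"]
    terms += [f"{w:+.3f} * x{i}" for i, w in enumerate(weights[1:], start=1)]
    return "y = " + " ".join(terms)


# Simulated data from an assumed rule: y = 4 + 1.5*x1 - 2*x2 + 0.5*x3 + noise.
rng = random.Random(2001)
inputs = [[rng.uniform(-10.0, 10.0) for _ in range(3)] for _ in range(1000)]
outputs = [
    4.0 + 1.5 * x1 - 2.0 * x2 + 0.5 * x3 + rng.gauss(0.0, 0.05)
    for x1, x2, x3 in inputs
]

weights = fit_linear_regression(inputs, outputs)
print(as_equation(weights))
print(predict(weights, [1.0, 2.0, 3.0]))
```

With real measurements, `inputs` and `outputs` would come from your own table. A purely linear equation like this is only a starting point if the true relation is nonlinear.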