"Question to pairwise T-test in online tutorial"
christian1983
New Altair Community Member
Hello everybody,
in order to understand the use of rapid miner 5.0 in detail, i performed the online tutorial of rapid miner 5.0. In step 25 of the tutorial, the pairwise t-test for model comparison is explained. Here both validation processes don´t use local random seed, but global random seed, so that the splitting of the data at hand is expected to differ from each other. But according to data mining literature, performing of pairwise t-test assumes the same manner of data splitting, so i am wondering , if the same local random seed should be better used, to get the same split in training and test data.
Maybe i didn´t understand the idea behind local and global random seed, so i hope you colud help me concerning pairwise t-test and choosing global or local random seed.
Thank you very much.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Root">
<description><p>Many RapidMiner operators can be used to estimate the performance of a learner, a preprocessing step, or a feature space on one or several data sets. The result of these validation operators is a performance vector collecting the values of a set of performance criteria. For each criterion, the mean value and standard deviation are given. </p> <p>The question is how these performance vectors can be compared? Statistical significance tests like ANOVA or pairwise t-tests can be used to calculate the probability that the actual mean values are different. </p> <p> We assume that you have achieved several performance vectors and want to compare them. In this process we use the same data set for both cross validations (hence the IOMultiplier) and estimate the performance of a linear learning scheme and a RBF based SVM. </p> <p> Run the process and compare the results: the probabilities for a significant difference are equal since only two performance vectors were created. In this case the SVM is probably better suited for the data set at hand since the actual mean values are probably different.</p><p>Please note that performance vectors like all other objects which can be passed between RapidMiner operators can be written into and loaded from a file.</p></description>
<process expanded="true" height="584" width="915">
<operator activated="true" class="generate_data" compatibility="5.0.0" expanded="true" height="60" name="ExampleSetGenerator" width="90" x="45" y="30">
<parameter key="target_function" value="one variable non linear"/>
<parameter key="number_examples" value="80"/>
<parameter key="number_of_attributes" value="1"/>
<parameter key="attributes_lower_bound" value="-40.0"/>
<parameter key="attributes_upper_bound" value="30.0"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.0.0" expanded="true" height="94" name="IOMultiplier_1" width="90" x="179" y="30"/>
<operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="XValidation" width="90" x="315" y="30">
<parameter key="sampling_type" value="shuffled sampling"/>
<process expanded="true">
<operator activated="true" class="support_vector_machine_libsvm" compatibility="5.0.0" expanded="true" name="LibSVMLearner">
<parameter key="svm_type" value="nu-SVR"/>
<parameter key="C" value="10000.0"/>
<list key="class_weights"/>
</operator>
<connect from_port="training" to_op="LibSVMLearner" to_port="training set"/>
<connect from_op="LibSVMLearner" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" name="ModelApplier">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_regression" compatibility="5.0.0" expanded="true" name="RegressionPerformance">
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="absolute_error" value="true"/>
</operator>
<connect from_port="model" to_op="ModelApplier" to_port="model"/>
<connect from_port="test set" to_op="ModelApplier" to_port="unlabelled data"/>
<connect from_op="ModelApplier" from_port="labelled data" to_op="RegressionPerformance" to_port="labelled data"/>
<connect from_op="RegressionPerformance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="XValidation (2)" width="90" x="313" y="165">
<parameter key="sampling_type" value="shuffled sampling"/>
<process expanded="true">
<operator activated="true" class="linear_regression" compatibility="5.0.0" expanded="true" name="LinearRegression"/>
<connect from_port="training" to_op="LinearRegression" to_port="training set"/>
<connect from_op="LinearRegression" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" name="ModelApplier (2)">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_regression" compatibility="5.0.0" expanded="true" name="RegressionPerformance (2)">
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="absolute_error" value="true"/>
</operator>
<connect from_port="model" to_op="ModelApplier (2)" to_port="model"/>
<connect from_port="test set" to_op="ModelApplier (2)" to_port="unlabelled data"/>
<connect from_op="ModelApplier (2)" from_port="labelled data" to_op="RegressionPerformance (2)" to_port="labelled data"/>
<connect from_op="RegressionPerformance (2)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="t_test" compatibility="5.0.0" expanded="true" height="112" name="T-Test" width="90" x="514" y="120"/>
<operator activated="false" class="anova" compatibility="5.0.0" expanded="true" height="76" name="Anova" width="90" x="715" y="165"/>
<connect from_op="ExampleSetGenerator" from_port="output" to_op="IOMultiplier_1" to_port="input"/>
<connect from_op="IOMultiplier_1" from_port="output 1" to_op="XValidation" to_port="training"/>
<connect from_op="IOMultiplier_1" from_port="output 2" to_op="XValidation (2)" to_port="training"/>
<connect from_op="XValidation" from_port="averagable 1" to_op="T-Test" to_port="performance 1"/>
<connect from_op="XValidation (2)" from_port="averagable 1" to_op="T-Test" to_port="performance 2"/>
<connect from_op="T-Test" from_port="significance" to_port="result 2"/>
<connect from_op="T-Test" from_port="performance 1" to_port="result 1"/>
<connect from_op="T-Test" from_port="performance 2" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="18"/>
<portSpacing port="sink_result 2" spacing="90"/>
<portSpacing port="sink_result 3" spacing="18"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
in order to understand the use of rapid miner 5.0 in detail, i performed the online tutorial of rapid miner 5.0. In step 25 of the tutorial, the pairwise t-test for model comparison is explained. Here both validation processes don´t use local random seed, but global random seed, so that the splitting of the data at hand is expected to differ from each other. But according to data mining literature, performing of pairwise t-test assumes the same manner of data splitting, so i am wondering , if the same local random seed should be better used, to get the same split in training and test data.
Maybe i didn´t understand the idea behind local and global random seed, so i hope you colud help me concerning pairwise t-test and choosing global or local random seed.
Thank you very much.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Root">
<description><p>Many RapidMiner operators can be used to estimate the performance of a learner, a preprocessing step, or a feature space on one or several data sets. The result of these validation operators is a performance vector collecting the values of a set of performance criteria. For each criterion, the mean value and standard deviation are given. </p> <p>The question is how these performance vectors can be compared? Statistical significance tests like ANOVA or pairwise t-tests can be used to calculate the probability that the actual mean values are different. </p> <p> We assume that you have achieved several performance vectors and want to compare them. In this process we use the same data set for both cross validations (hence the IOMultiplier) and estimate the performance of a linear learning scheme and a RBF based SVM. </p> <p> Run the process and compare the results: the probabilities for a significant difference are equal since only two performance vectors were created. In this case the SVM is probably better suited for the data set at hand since the actual mean values are probably different.</p><p>Please note that performance vectors like all other objects which can be passed between RapidMiner operators can be written into and loaded from a file.</p></description>
<process expanded="true" height="584" width="915">
<operator activated="true" class="generate_data" compatibility="5.0.0" expanded="true" height="60" name="ExampleSetGenerator" width="90" x="45" y="30">
<parameter key="target_function" value="one variable non linear"/>
<parameter key="number_examples" value="80"/>
<parameter key="number_of_attributes" value="1"/>
<parameter key="attributes_lower_bound" value="-40.0"/>
<parameter key="attributes_upper_bound" value="30.0"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.0.0" expanded="true" height="94" name="IOMultiplier_1" width="90" x="179" y="30"/>
<operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="XValidation" width="90" x="315" y="30">
<parameter key="sampling_type" value="shuffled sampling"/>
<process expanded="true">
<operator activated="true" class="support_vector_machine_libsvm" compatibility="5.0.0" expanded="true" name="LibSVMLearner">
<parameter key="svm_type" value="nu-SVR"/>
<parameter key="C" value="10000.0"/>
<list key="class_weights"/>
</operator>
<connect from_port="training" to_op="LibSVMLearner" to_port="training set"/>
<connect from_op="LibSVMLearner" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" name="ModelApplier">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_regression" compatibility="5.0.0" expanded="true" name="RegressionPerformance">
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="absolute_error" value="true"/>
</operator>
<connect from_port="model" to_op="ModelApplier" to_port="model"/>
<connect from_port="test set" to_op="ModelApplier" to_port="unlabelled data"/>
<connect from_op="ModelApplier" from_port="labelled data" to_op="RegressionPerformance" to_port="labelled data"/>
<connect from_op="RegressionPerformance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="XValidation (2)" width="90" x="313" y="165">
<parameter key="sampling_type" value="shuffled sampling"/>
<process expanded="true">
<operator activated="true" class="linear_regression" compatibility="5.0.0" expanded="true" name="LinearRegression"/>
<connect from_port="training" to_op="LinearRegression" to_port="training set"/>
<connect from_op="LinearRegression" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" name="ModelApplier (2)">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_regression" compatibility="5.0.0" expanded="true" name="RegressionPerformance (2)">
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="absolute_error" value="true"/>
</operator>
<connect from_port="model" to_op="ModelApplier (2)" to_port="model"/>
<connect from_port="test set" to_op="ModelApplier (2)" to_port="unlabelled data"/>
<connect from_op="ModelApplier (2)" from_port="labelled data" to_op="RegressionPerformance (2)" to_port="labelled data"/>
<connect from_op="RegressionPerformance (2)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="t_test" compatibility="5.0.0" expanded="true" height="112" name="T-Test" width="90" x="514" y="120"/>
<operator activated="false" class="anova" compatibility="5.0.0" expanded="true" height="76" name="Anova" width="90" x="715" y="165"/>
<connect from_op="ExampleSetGenerator" from_port="output" to_op="IOMultiplier_1" to_port="input"/>
<connect from_op="IOMultiplier_1" from_port="output 1" to_op="XValidation" to_port="training"/>
<connect from_op="IOMultiplier_1" from_port="output 2" to_op="XValidation (2)" to_port="training"/>
<connect from_op="XValidation" from_port="averagable 1" to_op="T-Test" to_port="performance 1"/>
<connect from_op="XValidation (2)" from_port="averagable 1" to_op="T-Test" to_port="performance 2"/>
<connect from_op="T-Test" from_port="significance" to_port="result 2"/>
<connect from_op="T-Test" from_port="performance 1" to_port="result 1"/>
<connect from_op="T-Test" from_port="performance 2" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="18"/>
<portSpacing port="sink_result 2" spacing="90"/>
<portSpacing port="sink_result 3" spacing="18"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
Tagged:
0
Answers
-
Hi,
well I think you are right. Seems to be a possibility for improvement in the tutorial processes:/
Greetings,
Sebastian0