Statistical Significance
ema
New Altair Community Member
Hi all,
I am doing a regular classification validation, shown below:
<operator name="Root" class="Process" expanded="yes">
  <description text="#ylt#p#ygt#This process is very similar to the process #yquot#03_XValidation_Numerical.xml#yquot#. The basic process setup is exactly the same, i.e. the first inner operator must produce a model from the given training data set and the second inner operator must be able to handle this model and the test data and must provide a PerformanceVector. #ylt#/p#ygt# In contrast to the previous process we now use a classification learner (J48) which is evaluated by several nominal performance criteria.#ylt#/p#ygt# #ylt#p#ygt# The cross validation building block is very common for many (more complex) RapidMiner processes. However, there are several more validation schemes available in RapidMiner which will be dicussed in the next sample processes. #ylt#/p#ygt#"/>
  <parameter key="logfile" value="C:\knn.txt"/>
  <operator name="TextInput (4)" class="TextInput" expanded="no">
    <list key="texts">
      <parameter key="b" value=".."/>
      <parameter key="P" value=".."/>
    </list>
    <parameter key="default_content_encoding" value="utf8"/>
    <parameter key="default_content_language" value="utf8"/>
    <parameter key="prune_below" value="3"/>
    <list key="namespaces">
    </list>
    <parameter key="create_text_visualizer" value="true"/>
    <operator name="StringTokenizer (4)" class="StringTokenizer">
    </operator>
    <operator name="TokenLengthFilter (4)" class="TokenLengthFilter">
      <parameter key="min_chars" value="3"/>
    </operator>
  </operator>
  <operator name="XValidation (3)" class="XValidation" expanded="yes">
    <operator name="NearestNeighbors" class="NearestNeighbors">
      <parameter key="k" value="3"/>
      <parameter key="measure_types" value="NumericalMeasures"/>
      <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <operator name="OperatorChain (3)" class="OperatorChain" expanded="yes">
      <operator name="ModelApplier (3)" class="ModelApplier">
        <list key="application_parameters">
        </list>
      </operator>
      <operator name="ClassificationPerformance (3)" class="ClassificationPerformance">
        <parameter key="accuracy" value="true"/>
        <parameter key="classification_error" value="true"/>
        <parameter key="kappa" value="true"/>
        <parameter key="weighted_mean_recall" value="true"/>
        <parameter key="weighted_mean_precision" value="true"/>
        <parameter key="spearman_rho" value="true"/>
        <parameter key="kendall_tau" value="true"/>
        <parameter key="absolute_error" value="true"/>
        <parameter key="relative_error" value="true"/>
        <parameter key="relative_error_lenient" value="true"/>
        <parameter key="relative_error_strict" value="true"/>
        <parameter key="normalized_absolute_error" value="true"/>
        <parameter key="root_mean_squared_error" value="true"/>
        <parameter key="root_relative_squared_error" value="true"/>
        <parameter key="squared_error" value="true"/>
        <parameter key="correlation" value="true"/>
        <parameter key="squared_correlation" value="true"/>
        <parameter key="cross-entropy" value="true"/>
        <parameter key="margin" value="true"/>
        <parameter key="soft_margin_loss" value="true"/>
        <parameter key="logistic_loss" value="true"/>
        <list key="class_weights">
        </list>
      </operator>
    </operator>
  </operator>
</operator>
My question is: other than XValidation, does RapidMiner have any ability to calculate "statistical significance"?
Thank you
Answers
Hi Ema,
RapidMiner provides operators for checking whether one result is statistically significantly better than another: see the operators in the Validation / Significance group. Specifically, it provides an ANOVA and a T-Test operator for comparing performance vectors.
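For illustration, the comparison the T-Test operator makes on two performance vectors can be sketched by hand: collect the per-fold performance values of two learners evaluated on the same cross-validation folds and compute a paired t statistic. The fold accuracies below are made-up example numbers, not output of the process above:

```python
from math import sqrt

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two matched samples, e.g. per-fold
    accuracies of two learners evaluated on the same CV folds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / sqrt(var / n)

# Hypothetical per-fold accuracies from two 10-fold cross-validations
knn_acc = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
j48_acc = [0.76, 0.77, 0.79, 0.75, 0.80, 0.74, 0.78, 0.76, 0.81, 0.75]

t = paired_t_statistic(knn_acc, j48_acc)
# Two-sided critical value of Student's t for alpha = 0.05, df = n - 1 = 9
significant = abs(t) > 2.262
```

If |t| exceeds the critical value, the difference between the two learners is significant at the chosen alpha level, which is the kind of decision the T-Test operator reports for a pair of performance vectors.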
Is that what you were looking for?
Greetings,
Sebastian
Hi,
I just downloaded RapidMiner and was impressed by all the data mining methods in it. However, is there another way to test significance, such as Fisher's exact test? For example, consider a rule:
A1 => A0, i.e., prob(A0|A1) > prob(A0)
Multiplying both sides by prob(A1), we can rewrite it as
prob(A0|A1) * prob(A1) > prob(A0) * prob(A1)
prob(A0 & A1) > prob(A0) * prob(A1)
Therefore, we can test the null hypothesis
H0: prob(A0 & A1) = prob(A0) * prob(A1)
against the alternative hypothesis
H1: prob(A0 & A1) != prob(A0) * prob(A1)
If H0 cannot be rejected, then A1 => A0 is not a statistically significant rule.
Is there any functionality for this test?
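For reference, the independence hypothesis H0: prob(A0 & A1) = prob(A0) * prob(A1) can be tested with Fisher's exact test on the 2x2 contingency table of co-occurrence counts of A0 and A1. A minimal plain-Python sketch (the function name and example counts are mine, not a RapidMiner API):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Tests H0: the row attribute and the column attribute are
    independent. Returns the p-value.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def hyper(x):
        # Probability of a table with x in the top-left cell, given
        # fixed margins (hypergeometric distribution).
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = hyper(a)
    lo = max(0, row1 + col1 - n)  # smallest feasible top-left count
    hi = min(row1, col1)          # largest feasible top-left count
    # Two-sided p-value: sum over all tables at least as extreme as
    # (i.e. no more probable than) the observed one.
    return sum(hyper(x) for x in range(lo, hi + 1)
               if hyper(x) <= p_obs * (1 + 1e-9))
```

For example, `fisher_exact_two_sided(8, 2, 1, 5)` gives roughly 0.035, so at alpha = 0.05 independence would be rejected for that table and the corresponding rule would count as significant.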
Hi,
Where would you like to add this feature? Should it apply to Association Rules or to the Rule model? Testing general data mining models that way could be difficult, since we don't have probabilities there. Or am I misunderstanding something?
Greetings,
Sebastian
Yes, I think it would be useful to add it to the RuleLearner.
Hello lindawu,
I understand what you are implying. Speaking as a Bayesian: you want to test whether the occurrence of an attribute (or of a specific attribute value) is independent of the occurrence of another attribute (or of another specific attribute value). This is in general a good idea; however:
- most learners are constructed so that only significant combinations receive more weight than insignificant ones, to improve overall quality and reduce overfitting
- I would not mind a model containing only insignificant rules (in the sense of a statistical hypothesis test), as long as it delivers well-tested (!) low error rates
happy mining,
steffen