Inferential Statistics - R, Python or Extension
As a partner, I am looking to use RapidMiner to integrate inferential statistical methods such as hypothesis testing, confidence intervals, chi-square, etc., as part of a client implementation. I see there is a paid extension that does this work, but given the simplicity of these methods and the unwanted burden of managing a paid subscription for only occasional use, is there a no-charge library of operators available, or do I need to leverage R or Python and create my own? We only need a few methods for occasional use, and I'd like to know whether there are other options besides R, Python, or the paid extension. Thanks!
Hi Michael,
I've just added (last Thursday) an operator called 'Compare Distributions' to the SMILE extension. It provides a KS test, chi-square test, F-test, and t-test. Would this already help?
BR,
Martin
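For anyone who does end up going the Python route instead of an extension, the same tests are available at no charge in SciPy's `stats` module. A minimal sketch (the sample data here is made up for illustration, not from the thread):

```python
import numpy as np
from scipy import stats

# Two synthetic samples to compare (illustrative only)
rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.2, scale=1.2, size=200)

ks_stat, ks_p = stats.ks_2samp(a, b)    # two-sample KS test
t_stat, t_p = stats.ttest_ind(a, b)     # two-sample t-test

# Chi-square goodness-of-fit: observed vs. expected counts
chi2, chi_p = stats.chisquare([18, 22, 20, 40], f_exp=[25, 25, 25, 25])

print(f"KS : D={ks_stat:.3f}, p={ks_p:.4f}")
print(f"t  : t={t_stat:.3f}, p={t_p:.4f}")
print(f"chi2: stat={chi2:.3f}, p={chi_p:.4f}")
```

Each call returns the test statistic and a p-value, which covers the occasional-use cases described in the question without a subscription.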
Hi Michael,
so the idea is to get the number of standard deviations from the mean? I think we don't have that yet.
But the Tukey Test in the Operator Toolbox is fairly similar, and in my opinion superior. It's defined as:
For each selected attribute, a confidence of the Tukey Test is calculated. This confidence is defined as the distance between the current value and the median, divided by the distance of the lower/upper 'Tukey Test boundary' to the median.
So instead of mean and standard deviation we take the interquartile range and median. The median is more robust to outliers than the mean, so I and many stats people prefer it.
Can you have a look at the Tukey Test? We may just write the same operator with mean and std_dev if that's what you need.
Cheers,
Martin
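One possible reading of that confidence, sketched in Python. This assumes the 'Tukey Test boundaries' are the usual Tukey fences at Q1 − 1.5·IQR and Q3 + 1.5·IQR; the actual Operator Toolbox implementation may differ, so treat this as an illustration of the idea, not the operator's source:

```python
import numpy as np

def tukey_confidence(values, k=1.5):
    """Distance of each value from the median, scaled by the distance of the
    matching Tukey fence (assumed Q1 - k*IQR / Q3 + k*IQR) to the median."""
    values = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    conf = np.empty_like(values)
    below = values < med
    conf[below] = (med - values[below]) / (med - lower)
    conf[~below] = (values[~below] - med) / (upper - med)
    return conf

scores = tukey_confidence([1, 2, 3, 4, 5, 100])
# Values with a confidence above 1 lie outside the fences, i.e. likely outliers.
print(scores)
```

With this reading, a confidence near 0 means the value sits at the median, and anything above 1 falls outside the fence on its side.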
I normally calculate the z test statistic by taking the sample mean (or median) minus the null hypothesis value (what I'm testing), all divided by the standard error, assuming the constraints of the central limit theorem hold. For the SE I usually use the sample standard deviation divided by the square root of the sample size. I then compare this result with the critical z value (1.65 for a one-tailed test at a 5% significance level) to decide whether to reject or fail to reject the null hypothesis. The math is quite simple; I was just looking for a simple operator to automate the work, given how important testing our data and results is to our particular use cases. I believe I can make all of this work with your suggestions above.
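The calculation described above is easy to automate in a Python scripting operator. A minimal sketch of a one-sample, upper-tailed z-test (the sample data and the 1.645 critical value for alpha = 0.05 are illustrative):

```python
import math
from statistics import mean, stdev

def one_sample_z(sample, mu0, z_crit=1.645):
    """z = (sample mean - mu0) / (s / sqrt(n)); upper-tailed decision at z_crit."""
    n = len(sample)
    se = stdev(sample) / math.sqrt(n)   # SE from the sample standard deviation
    z = (mean(sample) - mu0) / se
    return z, z > z_crit                # True -> reject H0 at the chosen level

sample = [5.1, 4.9, 5.3, 5.6, 5.0, 5.2, 5.4, 4.8, 5.5, 5.1]
z, reject = one_sample_z(sample, mu0=5.0)
print(f"z = {z:.2f}, reject H0: {reject}")
```

Swapping in the median, a different critical value, or a two-tailed comparison is a one-line change each.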
Hi @CB123,
in the KS test, the KS statistic and p-value are returned, as Dr. Martin mentioned above. What significance level do you usually use in practice?
The common alpha values (significance levels) of 0.05 and 0.01 are simply based on tradition.
When a p-value is less than or equal to the significance level, you reject the null hypothesis. So we take the p-value from a statistical test and compare it to the common significance levels: for example, a p-value of 0.03112 is statistically significant at an alpha level of 0.05, but not at the 0.01 level.
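That comparison in code, using the p = 0.03112 figure from the example above:

```python
# Compare one p-value against the two traditional significance levels.
p_value = 0.03112
for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    print(f"alpha={alpha}: {decision}")
# prints "alpha=0.05: reject H0" then "alpha=0.01: fail to reject H0"
```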
KStest http://haifengl.github.io/api/java/smile/stat/hypothesis/KSTest.html
Hope it helps.
YY