Bringing arbitrary mathematical functions to RapidMiner for generating data sets
Hello,
I have just started a thesis at my university in which I am supposed to analyze the t-test. My first assignment is to get more familiar with the t-Test operator implemented in RapidMiner and with how it actually works.
I should mention right away that my knowledge of statistics and hypothesis testing is still rather limited, because I am studying mechanical engineering and statistics is not a big part of our curriculum.
So what I would like to do right now is:
1) generate a data set using a mathematical test function of my choice
2) add noise to that previously created data set
3.1) build estimation models using the already implemented learners (linear regression, polynomial regression, etc.) and use cross-validation for performance evaluation
3.2) also import the mathematical function that was used to generate the data, so that it can be included in the performance measurement
4) perform a t-test using the performance results provided by the different cross-validation operators
So basically I want to generate data from a mathematical function, add some noise to that data, and then see how good the estimation performance is when the very function that generated the data is used as the model in the performance evaluation.
Let me explain step by step and point out where I need help:
1) generate a data set using a mathematical test function of my choice
I know that I could also do this using Excel and then import the Excel sheet into RapidMiner, but I would like to know if there is a way to directly import/implement a mathematical function.
For example the Rosenbrock function, F(x,y) = (a-x)² + b*(y-x²)²,
or the three-hump camel function, F(x,y) = 2x² - 1.05*x⁴ + x⁶/6 + x*y + y².
I found the operator "Generate Data by User Specification", but unfortunately it creates exactly one example. Looping it didn't really work either, because I could not find a way to collect all the examples generated by the looping operator into one big Excel sheet.
The standard "Generate Data" operator lets me choose from a range of preset functions. I thought about tweaking the Java code of that operator to replace one of the presets with the function I want, but I am not that familiar with Java, and I don't know how I would have to change the code so that I can set two different value ranges for the two variables x and y. The "Generate Data" operator only allows one range for all attributes.
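To make clear what I am after in this step, here is a minimal sketch in plain Python (outside RapidMiner) of the data set I would like to end up with. The helper names, the sample size, the two value ranges, and the parameter defaults a=1 and b=100 are only assumptions for illustration, and the resulting CSV file would still have to be imported into RapidMiner afterwards.

```python
import csv
import random


def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock function F(x, y) = (a - x)^2 + b * (y - x^2)^2.

    a=1 and b=100 are assumed defaults, not values fixed by my assignment.
    """
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def three_hump_camel(x, y):
    """Three-hump camel function F(x, y) = 2x^2 - 1.05x^4 + x^6/6 + xy + y^2."""
    return 2 * x ** 2 - 1.05 * x ** 4 + x ** 6 / 6 + x * y + y ** 2


def generate_examples(func, n, x_range, y_range, seed=0):
    """Draw n points uniformly, with a separate value range for x and for y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(*x_range)
        y = rng.uniform(*y_range)
        rows.append((x, y, func(x, y)))
    return rows


def write_csv(rows, path):
    """Write all examples into one file with the columns x, y and label."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["x", "y", "label"])
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical ranges: x in [-2, 2], y in [-1, 3].
    examples = generate_examples(rosenbrock, 200, (-2.0, 2.0), (-1.0, 3.0))
    write_csv(examples, "rosenbrock.csv")
```

The point of the sketch is the two independent ranges and the single output table, which are exactly the two things I could not get out of the existing operators.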
2) adding noise to the previously generated data
Here I am planning to use the "Add Noise" operator, so that should not be a problem once I have my data set.
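Just to illustrate what I mean by noise: a minimal Python sketch that adds zero-mean Gaussian noise to the label only. This is my own assumption of what the step should do and not a description of how the RapidMiner operator works internally; the function name, the sigma parameter, and the three example rows are made up for illustration.

```python
import random


def add_gaussian_noise(rows, sigma, seed=1):
    """Return a copy of rows with N(0, sigma^2) noise added to the label.

    rows is a list of (x, y, label) tuples; x and y stay untouched.
    """
    rng = random.Random(seed)
    return [(x, y, label + rng.gauss(0.0, sigma)) for x, y, label in rows]


# Placeholder examples, only to show the call.
clean = [(0.0, 0.0, 1.0), (1.0, 1.0, 0.0), (2.0, 1.0, 901.0)]
noisy = add_gaussian_noise(clean, sigma=0.5)
for (x, y, label), (_, _, noisy_label) in zip(clean, noisy):
    print(f"x={x} y={y} clean={label} noisy={noisy_label:.3f}")
```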
3.1) performance evaluation using already existing regression operators etc.
This should not cause any trouble either, because here I would only use operators that already exist in RapidMiner.
3.2) performance evaluation using the function that was originally used to generate the data set
This is the second part where I need some help. I know that there is a function called "import model" with which I can import, for example, an XML file that contains my previously used function as a model. But how exactly can I generate such a model XML file in RapidMiner? Is there some sort of tool or operator that directly "converts" a mathematical function into an equivalent model?
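What I want from this "model" is roughly the following, sketched in plain Python rather than RapidMiner: the generating function itself acts as the predictor and is scored against the noisy labels. The choice of RMSE as the performance measure, the noise level sigma, the sample size, and the value ranges are assumptions I picked for illustration.

```python
import math
import random


def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock function with assumed defaults a=1, b=100."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def rmse(predict, rows):
    """Root mean squared error of predict(x, y) against the stored labels."""
    squared = [(predict(x, y) - label) ** 2 for x, y, label in rows]
    return math.sqrt(sum(squared) / len(squared))


# Noisy data set: true function value plus Gaussian noise (assumed sigma).
rng = random.Random(42)
sigma = 0.5
rows = []
for _ in range(1000):
    x = rng.uniform(-2.0, 2.0)
    y = rng.uniform(-1.0, 3.0)
    rows.append((x, y, rosenbrock(x, y) + rng.gauss(0.0, sigma)))

# The true function used as the "model": its error should be close to sigma.
baseline = rmse(rosenbrock, rows)
print(f"RMSE of the generating function: {baseline:.3f}")
```

Since the only error left for the true function is the noise itself, its RMSE should land near sigma, which would give me a reference value to compare the learned regression models against.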
4) performing a t-test
Here I might also need help, but that depends on the outcome of the previous steps, so it doesn't make much sense to cover it right now.
I would really appreciate some help, and I hope my attempt to explain what I am trying to do was comprehensible enough.