"Non parametric regression"

Question

Hi Everybody,

I have a question about whether it is feasible to get something like partial derivatives for certain data quantiles in RapidMiner. Before I explain my idea and problem, I will give you a brief background on my data.

I am currently doing an analysis of aggregated data on structural change in German agriculture. The use of aggregate data implies some drawbacks, e.g. the cause-effect relations actually exist only at the individual level. Therefore, at the aggregate level, a closed theoretical model (especially regarding the functional form of the relation) between the dependent and independent variables is not available. Furthermore, some information at the aggregate level can at best be regarded as a rough proxy for the factor influencing the individual decision.
The linear and nested linear regressions show that for some variables the relation between the independent and dependent variable is clearly non-linear and may even show some breaks.

My idea for the analysis is the following:
a) take the data set and remove the outliers, at least the most dramatic ones, based on an indicator, e.g. Cook's D.
b) conduct a non-parametric regression using either an SVM or a nearest-neighbor approach (which looks to me like the closest equivalent to what is generally referred to as kernel-based regression, in the sense of: http://en.wikipedia.org/wiki/Kernel_regression).
c) get the information on the partial derivatives (first order would be sufficient) across the range of the variable.
d) investigate these derivatives for marked non-linearities.
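
To make steps b) to d) concrete outside of RapidMiner, here is a minimal sketch in plain Python for a single explanatory variable. It assumes a Nadaraya-Watson kernel regression with a Gaussian kernel and a central finite difference for the derivative; the function names, the bandwidth and the step size eps are illustrative choices, not RapidMiner operators or settings.

```python
import math

def kernel_regression(x_train, y_train, x0, bandwidth):
    """Nadaraya-Watson estimate at x0 using a Gaussian kernel (assumed choice)."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in x_train]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, y_train)) / total

def range_grid(values, n_points):
    """Evenly spaced points across the observed range of one variable."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_points - 1)
    return [lo + i * step for i in range(n_points)]

def derivative_on_grid(x_train, y_train, grid, bandwidth, eps=1e-3):
    """Central finite difference of the fitted curve at each grid point."""
    return [
        (kernel_regression(x_train, y_train, g + eps, bandwidth)
         - kernel_regression(x_train, y_train, g - eps, bandwidth)) / (2 * eps)
        for g in grid
    ]

# Synthetic example: y = x^2, so the true derivative is 2x.
xs = [i * 0.01 for i in range(201)]
ys = [x * x for x in xs]
grid = range_grid(xs, 5)
derivs = derivative_on_grid(xs, ys, grid, bandwidth=0.1)
```

A derivative that changes strongly across the grid, as it does for this synthetic quadratic, is the kind of non-linearity step d) would look for. The estimates at the two ends of the range are affected by boundary bias and should be read with caution.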

a) and b) are quite straightforward, but is it possible to do c) and d) in RapidMiner, and if so, how?

Best Norbert

land · Answer

Hi Norbert,
Once again the day starts with an interesting issue :)
If I understood everything correctly (it's a complex problem, so I might have got something wrong), this should be feasible with RapidMiner. But it will need a complex process with nested ExampleIteration, AttributeConstruction, MacroDefinition and ParameterIteration and several Learners...

Unfortunately, designing this process exceeds the scope of this forum, as the whole topic already did in some way. I would love to, but I cannot spend a few hours of my working time on this for free. Since our software is open source, we make our living from consulting... If you are interested in consulting or another of our services, please email or phone us.

But now enough cheap advertising :)

Greetings,
  Sebastian

Senecio · Answer

Hi Sebastian,

Actually, I'm not half as ambitious as you assume. For now, I would be glad to have just the main effects, i.e. assuming all interaction terms between the variables are zero. Perhaps one could (should) extend the setting to some simple interactions between two variables. As a result, one would change only one variable at a time, while for the remaining variables the calculation of the dependent variable would be based on the real values and an interpolation (average) of several observations. So the data demand would not be quite as challenging.
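
One possible reading of this main-effects idea, sketched in plain Python: set one variable to each grid value, keep all other variables at their observed values, and average the model's predictions over a sample of observations. This is only an illustration; `predict` stands for any fitted model (a hypothetical callable, not a RapidMiner operator), and the names `partial_effect` and `slopes` are made up for this sketch.

```python
def partial_effect(predict, rows, var_index, grid):
    """Average prediction when one variable is set to each grid value
    while all other variables keep their observed values."""
    curve = []
    for g in grid:
        total = 0.0
        for row in rows:
            modified = list(row)      # copy so the original data stays intact
            modified[var_index] = g
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def slopes(grid, curve):
    """First differences of the averaged curve between neighbouring grid points."""
    return [(curve[i + 1] - curve[i]) / (grid[i + 1] - grid[i])
            for i in range(len(grid) - 1)]

# Toy stand-in for a fitted model: non-linear in variable 0, linear in variable 1.
toy_predict = lambda r: r[0] ** 2 + 3 * r[1]
rows = [[0.0, 1.0], [5.0, 3.0]]
grid = [0.0, 1.0, 2.0]
curve = partial_effect(toy_predict, rows, 0, grid)
effect_slopes = slopes(grid, curve)
```

For the toy model the slopes grow along the grid, which reflects the quadratic main effect of the first variable. With interactions present, the averaged curve would mix them into the main effect, which is the simplification assumed here.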
To get back to your setting:
For each of the 20 variables one would take 100 measurements (each separated by a 1% quantile of the respective range), and each measurement would be based on a sample of 100 observations.
This results in 10 million differences (20 * 100 * 100 * 50 (OK, it should be 49.5)  ;) ) to calculate. Personally, I think basing the estimate of the partial effect at each quantile point on 5000 measurements is really not necessary. A few points fewer should suffice.  ;)
So I think a modern computer should be able to handle the problem.
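
The count can be checked with a few lines of Python. This sketch assumes the differences are taken between every unordered pair of observations within a sample, which is where the factor 49.5 comes from; the variable names are illustrative.

```python
import math

n_variables = 20        # variables in the data set
n_grid_points = 100     # measurement points per variable
sample_size = 100       # observations per measurement point

# Unordered pairs within one sample: C(100, 2) = 100 * 49.5 = 4950
pairs_per_point = math.comb(sample_size, 2)
total_differences = n_variables * n_grid_points * pairs_per_point

print(pairs_per_point)     # 4950
print(total_differences)   # 9900000
```

So the exact figure is 9.9 million differences, with 4950 per measurement point.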

Best

Norbert

land · Answer

Hi Norbert,
If you have 20 or more attributes, I doubt you could store the derivatives anyway. A symbolic form probably won't fit into a human's brain, and a 20-dimensional lattice of numerical derivative values would either be hard for humans to comprehend, too, or simply exceed the available memory. If you use only 100 points on each dimension's range, this would be 100^20 values, or, put differently, a little less than 2^133 values. Even modern 64-bit machines could struggle here :)
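
As a quick sanity check on the size of such a full lattice, here is a short Python sketch. The variable names are illustrative, and the last line shows, for comparison only, the size of a grid that varies one dimension at a time.

```python
import math

points_per_dimension = 100
n_dimensions = 20

full_lattice = points_per_dimension ** n_dimensions      # 100^20 = 10^40 values
bits = math.log2(full_lattice)                            # exponent of the power of two
one_at_a_time = points_per_dimension * n_dimensions       # grid without interactions

print(full_lattice == 10 ** 40)   # True
print(round(bits, 1))             # 132.9
print(one_at_a_time)              # 2000
```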

Or did I misunderstand something in your setting?

Greetings,
  Sebastian