"RapidMiner and R: where is an integration necessary?"

EikeS
EikeS New Altair Community Member
edited November 5 in Community Q&A
Hello,
I wrote my bachelor's thesis about the integration of R in RapidMiner and its potential for data mining.

While working on this, and even after finishing the thesis, one question remains unanswered.

Are there any processes that cannot be done with RapidMiner operators alone, so that integrating R (via the Execute R operator) is a must? If so, which ones?

Do you have any examples for a situation like this?

best regards

Answers

  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi,

    I think there is never a "must", because you can always write native RM operators. However, experience shows that the barrier to writing R/Python is lower than for native Java.

    For me, the use cases for R or Python are either a file format for which no native RM operator is available (or web services using OAuth2...), or plotting. If you want to produce plots for scientific papers, you might prefer R's ggplot over the standard RM plots.

    ~Martin
  • David_A
    David_A New Altair Community Member
    Hello EikeS,

    Is it possible to get your thesis somehow (either online or as a copy)?
    I'm always interested in academic work about RapidMiner.
  • DocMusher
    DocMusher New Altair Community Member
    We are all interested, I think. An online copy would be great.
    Sven
  • EikeS
    EikeS New Altair Community Member
    Hi again,
    Sorry for the long wait.

    Actually, the thesis is written in German. I guess you guys cannot do much with it.

    Thanks for the replies though.

    If there are any use cases where you think R is much better to use than RM, let me know!

    best regards
    Eike
  • DocMusher
    DocMusher New Altair Community Member
    Hi,
    Even though your thesis is written in German, I am very interested in reading it. (German is my 4th language.)
    Sven
  • EikeS
    EikeS New Altair Community Member
    I sent you a PM.
  • MartinLiebig
    MartinLiebig
    Altair Employee
    I am joining the list! But if you sent it to David, I guess I can get a copy. Maybe I need to offer cookies though :D

    @Sven: I didn't know that you speak German :) Good to know!
  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    There are some things that are possible only with R, even for data mining tasks.

    For example, Decision Trees and Random Forests in RapidMiner only do classification, not regression. The ones in R do both.
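
    As a minimal sketch of the regression case, assuming the rpart package (normally shipped with R) and the built-in mtcars data as a stand-in dataset, a tree on a continuous target could look like this:

    ```r
    library(rpart)

    # mtcars ships with R; mpg is a continuous target, so this is regression
    fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

    # Predictions are numeric values (the mean mpg of the matching leaf),
    # not class labels
    pred <- predict(fit, newdata = mtcars)
    head(pred)
    ```

    The randomForest package behaves similarly: given a numeric response, it fits a regression forest instead of a classification forest.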

    Also, I particularly like the forecast R package, which includes state-of-the-art time series algorithms for automatic forecasting (ARIMA, ETS, ...).
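
    A rough sketch of what "automatic" means here, assuming the forecast package is installed and using the AirPassengers series that ships with R as example data:

    ```r
    library(forecast)

    # AirPassengers is a monthly time series included with R
    fit_arima <- auto.arima(AirPassengers)  # searches over ARIMA orders
    fit_ets   <- ets(AirPassengers)         # selects an exponential smoothing model

    # Forecast the next 12 months with each model
    fc_arima <- forecast(fit_arima, h = 12)
    fc_ets   <- forecast(fit_ets, h = 12)

    # Point forecasts; the forecast objects also carry prediction intervals
    fc_arima$mean
    fc_ets$mean
    ```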

    Spatial statistics for geographic information is another topic that's not supported in stock RapidMiner. I recently tried an approach with the built-in scripting and some libraries with some encouraging results.
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi Eike,

    I got the chance to have a look at your thesis. I would like to make one remark:
    On page 21 you show how you implement Naive Bayes in RM, and afterwards you show how you do it in R. However, in R you do something different: you learn an NB model on the data and apply it to the same data with no validation. This is wrong if you want an estimate of predictive performance out of it.

    This ties into a thought I have quite often: a program should let you do things the right way easily. I am not sure how X-Val works in R, but apparently it requires some more work. So this leads to quick and dirty rather than correct. This is a major disadvantage of R compared to RM.
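
    To make the difference concrete, here is a hedged sketch (not the code from the thesis) that assumes the caret and klaR packages are installed and uses the built-in iris data as a stand-in. It compares a cross-validated accuracy estimate with the accuracy obtained by applying the model to its own training data:

    ```r
    library(caret)

    set.seed(42)  # make the fold assignment reproducible

    # 10-fold cross-validation: every fold is scored by a model that never saw it
    ctrl <- trainControl(method = "cv", number = 10)
    nb_cv <- train(Species ~ ., data = iris, method = "nb", trControl = ctrl)

    # Cross-validated accuracy of the best tuning setting
    cv_acc <- max(nb_cv$results$Accuracy)

    # Accuracy when the final model is applied to the data it was trained on;
    # this tends to be optimistic and is not a predictive estimate
    resub_acc <- mean(predict(nb_cv, newdata = iris) == iris$Species)

    c(cross_validated = cv_acc, resubstitution = resub_acc)
    ```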
  • carlos_quintani
    carlos_quintani New Altair Community Member
    Martin:

    I am a big fan of both programs. Without any doubt, RapidMiner is easier to use.

    I teach in an MBA program. At some point I used R in my statistics courses, and the students crucified me for subjecting them to such torture.

    I use RapidMiner in a 2nd-year course on data mining, and students have never complained.

    Having admitted that, R obviously has advantages.

    How long does it take for RapidMiner to add a new algorithm to the list of available algorithms?

    I'm waiting for Random Forest for prediction problems (continuous dependent variable). And Random Forest is hardly a new method.

    Just to be fair to R: with the right library, doing machine learning can be fairly straightforward. Example with the wonderful caret library:

    # Load caret library
    library(caret)

    # Load Boston Housing Dataset
    boston = read.csv("boston.csv")

    # Use 10-fold cross-validation
    ctrl = trainControl(method="cv",number=10)

    # Model 1: Plain Vanilla Regression
    model1 = train(MEDV~., data=boston, method="lm", trControl=ctrl)
    # Show results
    model1

    4 lines of code!

    Here I attach code for comparing 4 different models in R (Linear Regression, Stochastic Gradient Boosting, Random Forest and KNN):

    https://s3.amazonaws.com/mirlitus/DM-INCAE-3.R
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi Carlos,

    I am totally on your side. I can fully understand your line of thought. I am myself a big fan of Python, even though it takes me way longer to build things.
    I integrated scikit-learn's RF into RM with Python like this: http://data-analytics.ghost.io/how-to-get-the-best-out-of-python-and-rapidminer/ It should work for R too. I am not an expert in R, so I cannot judge it.

    I am totally convinced that R is capable of all of these things. I am further convinced that standard problems are easily solvable in R. But I encountered the following problem in Python, which drove me nuts:

    I wanted to train a model together with its preprocessing (normalization and PCA) inside an x-val. I wanted to evaluate the method with a custom performance measure that includes example weights and confidence information. Afterwards I wanted to optimize on that measure.
    Most of it was doable, but you needed to dig deep into sklearn's pipelines. The custom scoring function was only possible if you did the x-val in a more manual fashion (KFold in sklearn, your own class, etc.).

    Can you show me how to do the following?
    - Learn normalization, PCA and model TOGETHER in an x-val
    - Use a custom scoring function, let's say weighted accuracy for confidences > 0.75
    - Optimize this

    I think this is a rather complex task. In RM this is easy (at least it feels easy to me).

    Best,
    Martin

    P.S: Thanks for the great discussion :-)
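
    For reference, the three requirements listed above could be approached in R with caret roughly as follows. This is only a sketch under stated assumptions: the iris data, the multinomial model from nnet, the weight vector and the function name weighted_acc_75 are all hypothetical choices for illustration. Only the 0.75 confidence threshold comes from the post.

    ```r
    library(caret)

    # Hypothetical custom metric: weighted accuracy over held-out rows whose
    # highest class probability (the "confidence") exceeds 0.75
    weighted_acc_75 <- function(data, lev = NULL, model = NULL) {
      conf <- apply(data[, lev, drop = FALSE], 1, max)
      # A weights column is expected only if train() was given `weights`
      w <- if ("weights" %in% names(data)) data$weights else rep(1, nrow(data))
      keep <- conf > 0.75
      acc <- if (any(keep)) {
        sum(w[keep] * (data$pred[keep] == data$obs[keep])) / sum(w[keep])
      } else {
        NA_real_  # no confident predictions in this fold
      }
      c(WeightedAcc75 = acc)
    }

    set.seed(42)
    w <- ifelse(iris$Species == "virginica", 2, 1)  # hypothetical example weights

    ctrl <- trainControl(method = "cv", number = 10,
                         classProbs = TRUE,            # needed for confidences
                         summaryFunction = weighted_acc_75)

    # preProcess is re-estimated inside every fold, so centering, scaling and
    # PCA are learned together with the model on the training part only
    fit <- train(Species ~ ., data = iris,
                 method = "multinom",
                 preProcess = c("center", "scale", "pca"),
                 weights = w,
                 trControl = ctrl,
                 metric = "WeightedAcc75", maximize = TRUE,
                 tuneGrid = data.frame(decay = c(0, 0.01, 0.1)),
                 trace = FALSE)                        # silence nnet output

    fit$results   # custom metric per candidate value of decay
    fit$bestTune  # setting chosen by the custom metric
    ```

    The optimization step here is a plain grid search over one hyperparameter; a larger tuneGrid or a different model would follow the same pattern.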