"WEKA Multi-core Extension for RapidMiner doesn't seem to work!"

siamak_want
edited November 5 in Community Q&A
Hi,

I have downloaded a RapidMiner extension named "WEKA Multi-core Extension for Rapidminer" from SourceForge. It contains about 62 multi-core algorithms, and I can see that they have been added to the RM operator list. Unfortunately, when I use these multi-core algorithms, only 1 core of my system is used. (My system has a Core i3 processor.)

What is wrong here? Is this plugin approved by the RM team?

Any help would be appreciated.

Answers

  • MariusHelf
    Hi,

    No, this is not an official plugin supported by Rapid-I, and I personally haven't heard of it so far.
    But maybe you have to set the maximum number of processors to be used somewhere in the preferences (e.g. under Tools->Preferences)?

    Best,
    Marius
  • siamak_want
    Hi,

    Marius, unfortunately I did not find any setting for the number of CPUs in the RM preferences.
    Does anyone know this plugin?

    Thanks
  • wessel
    Hey,

    My advice would be to simply create an ARFF file using RapidMiner
    and then run the WEKA multi-core algorithms standalone in WEKA.

    Best regards,

    Wessel
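For readers who have not seen one, here is a minimal sketch of what an ARFF file looks like. RapidMiner has its own ARFF export, so you would not normally write the file by hand; the class `ArffSketch`, the method `toArff`, and the attribute names (`tf_cat`, `tf_dog`) are made up for illustration only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: builds a tiny ARFF document by hand to show the layout.
class ArffSketch {

    // Renders numeric attributes plus one nominal class attribute as ARFF text.
    static String toArff(String relation, List<String> numericAttributes,
                         List<String> classValues, List<double[]> rows, List<String> labels) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : numericAttributes) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        sb.append("@attribute class {").append(String.join(",", classValues)).append("}\n\n");
        sb.append("@data\n");
        for (int i = 0; i < rows.size(); i++) {
            // One comma-separated line per example, class label last.
            for (double value : rows.get(i)) {
                sb.append(value).append(',');
            }
            sb.append(labels.get(i)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String arff = toArff("docs", List.of("tf_cat", "tf_dog"), List.of("pos", "neg"),
                List.of(new double[] {0.5, 0.0}, new double[] {0.0, 1.5}),
                List.of("pos", "neg"));
        Path out = Files.createTempFile("docs", ".arff");
        Files.writeString(out, arff, StandardCharsets.UTF_8);
        System.out.println("Wrote " + out);
        System.out.print(arff);
    }
}
```

The resulting file can then be opened in standalone WEKA. For a data set with tens of thousands of mostly zero features, a dense layout like this one gets large quickly.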
  • siamak_want
    Hi Wessel,

    As you know, RM itself includes the Weka algorithms through the "Weka Plugin". Can I use the Weka plugin's operators in RM for multi-core purposes? Or do you mean that the Weka operators included in RM's Weka plugin are not multi-core, even though they have been implemented in parallel in the original Weka?

    Please guide me, I feel a little confused.

    Thanks in advance

  • wessel
    As far as I am aware the Weka plugin implements the normal Weka algorithms, not the parallel Weka algorithms.
  • siamak_want
    Thanks for your clear answer, Wessel. It seems that the RM developers haven't cared about parallel processing at all. They have only implemented parallel decision tree algorithms in the "Parallel Processing Plugin"! Anyway, your suggestion raised 3 key questions for me:


    1) Can I take a parallel algorithm from WEKA and turn it into a parallel (multi-threaded) operator in RM? If yes, why have the RM developers not implemented such an important feature? If no, what prevents the developers from doing this?

    2) How can I find out which algorithms are parallel in Weka? Should I read the Weka documentation?

    3) I have read many discussions comparing WEKA and RM, but as far as I remember, no one has mentioned that RM lacks parallel processing capability compared to WEKA. I mean, if WEKA supported multi-core but RM did not, people would have mentioned this as an advantage of WEKA over RM.

    Thanks, Wessel and everyone else who has info about the preceding topics.

  • wessel
    siamak_want wrote:

    Thanks for your clear answer, Wessel. It seems that the RM developers haven't cared about parallel processing at all. They have only implemented parallel decision tree algorithms in the "Parallel Processing Plugin"! Anyway, your suggestion raised 3 key questions for me:

    There are actually a lot of things you can run in parallel in RapidMiner.
    It is not so much about care. Parallel programming is very complicated, and it is not at all easy to integrate.

    I might be able to answer your questions, but there seems to be a very negative underlying tone to them.
    It would probably be a good idea to rephrase some of them in a more neutral tone so the RapidMiner team does not get angry.

  • siamak_want
    Thanks Wessel,

    I have found the RM developers very welcoming of criticism. This is a key attitude for the developers of such well-defined and popular software. However, I think my questions are important and somewhat philosophical, and their answers may help this nice software improve. So I don't think the RM developers will become angry as you said. :)

    Thanks again.
  • wessel
    Just to get this clear, which parallel implementation of Weka are you referring to?
    This one?
    http://weka-parallel.sourceforge.net/readme.txt

    "This version of Weka was created with the intention of being able to run
    the cross-validation portion of any given classifier very quickly."

    Note that you can do exactly this using the parallel cross-validation implementation in RapidMiner.
    So if you have a single computer with many cores, e.g. two quad-core CPUs (8 cores), you can run your cross-validation up to roughly 8 times faster.

    Do you have access to a supercomputer?
    Like lots of fast machines in a network?
    If so, then you can get even bigger speedups using the Weka parallel plugin, by running cross-validation distributed over multiple machines.
    In theory you can also do this with RapidMiner, but don't expect it to be easy.
    If you want to do this and don't already know how, you might get help from the RapidMiner team, but they will probably ask you to pay some money.

    Best regards,

    Wessel
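To make the idea of "one fold per core" concrete, here is a minimal sketch using a plain Java thread pool. It is not how RapidMiner or Weka-Parallel implement it; `ParallelFoldsSketch`, `runFolds`, and `evaluateFold` are hypothetical names, and `evaluateFold` is only a placeholder that counts test rows instead of training a real model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelFoldsSketch {

    // Placeholder for "train on all folds except `fold`, then test on `fold`".
    // Here it only counts how many rows would land in the test fold.
    static double evaluateFold(int fold, int folds, int rows) {
        int testRows = 0;
        for (int row = 0; row < rows; row++) {
            if (row % folds == fold) {
                testRows++;
            }
        }
        return testRows;
    }

    // Submits every fold as its own task to a fixed-size thread pool.
    static List<Double> runFolds(int folds, int rows, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> pending = new ArrayList<>();
            for (int fold = 0; fold < folds; fold++) {
                final int f = fold;
                Callable<Double> task = () -> evaluateFold(f, folds, rows);
                pending.add(pool.submit(task));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> future : pending) {
                results.add(future.get()); // blocks until that fold has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for folds", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fold failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(runFolds(10, 20000, cores));
    }
}
```

The speedup is bounded by both the number of cores and the number of folds: with 10 folds, more than 10 threads cannot help, and every fold that runs at the same time needs its own memory.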
  • haddock
    Nice one, Wess!

    We all know which way is up; sometimes someone needs to remind us.

    Cool.

  • siamak_want
    Hi,

    Thanks for your clear answer. I bought the "How to extend RM 5" tutorial because I thought parallelism issues would be discussed in it. Unfortunately, there was no info about parallelism in this document!

    You know, I have access to a single machine with 16 cores. My data set is a high-dimensional text data set with about 50000 features and 20000 rows. When I run a simple classification algorithm (and also cross-validation) on this data set, both model training and model application take a very long time.

    So I am thinking of multi-threaded implementations to reduce this time. In fact, "I want to extend RM in a multi-threaded fashion".
    As you mentioned, I think I should contact the RM team for this kind of info. But I hope the price I would have to pay for it will be affordable on a limited university budget.

    I would be grateful for any further useful info on these topics.

  • wessel
    Can't you start experimenting with a small subset of your data?

    edit: What you are trying to do is not an easy problem.
    It is possible to use RapidMiner to process huge data, but this is not for beginners.

    The RapidMiner team offers data hosting and analysis as a service; I'm sure they will give you a fair price.
    (A fair price as in: not more expensive than other businesses in the market.)

    Since you have bought the RapidMiner Developer Whitepaper, you can read about how to create your own modules.
    So you can now combine your own code with RapidMiner code.
    But again, threads in Java are not for beginners.
    And maybe your problem is too big to solve on a single machine.
    In the best-case scenario you get a 16 times speedup, at the cost of increased memory usage.
    So you currently have to wait, say, 16 hours instead of 1 hour?

    Why don't you simply launch multiple instances of RapidMiner, start 16 processes, and see what your results look like the next day?
    You then have 16 results instead of 1.

    Best regards,

    Wessel
  • siamak_want
    Hi Wessel,

    Again, thanks for your guidance, but there are some points I need to mention:

    - As you already mentioned, RM's cross-validation is nicely parallelized: it runs the training and testing of each fold on a separate core. But I need an operator that executes training on several cores (and also model application on several cores). I don't think model application is implemented in parallel by default. [Please correct me if I am wrong.]
    - Training an algorithm in parallel is a really demanding and challenging task, but I think model application is inherently parallelizable. So, is there any way to make model application parallel? E.g., should I write my own multi-threaded model applier, or is there an out-of-the-box solution in RM?

    - I don't want to hand the data set to the RM team and just get the results back, because I have to deliver the whole project to my supervisor in order to graduate.

    - And regarding your suggestion of running 16 separate RM processes: how should I combine (ensemble) the results?

    Thanks again, nice guy.
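On the last question, one common way to combine label predictions from several independently trained models is a majority vote per row. Whether that fits the 16 runs depends on what each run actually computes, so the following is only a sketch under that assumption; `MajorityVoteSketch`, `vote`, and `combine` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MajorityVoteSketch {

    // Returns the most frequent label among the votes for one row.
    // On a tie, the label that reached the winning count first is kept.
    static String vote(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String label : labels) {
            int count = counts.merge(label, 1, Integer::sum);
            if (count > bestCount) {
                best = label;
                bestCount = count;
            }
        }
        return best;
    }

    // predictionsPerModel.get(m).get(r) is the label model m predicted for row r.
    static List<String> combine(List<List<String>> predictionsPerModel) {
        int rows = predictionsPerModel.get(0).size();
        List<String> combined = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            List<String> votes = new ArrayList<>();
            for (List<String> model : predictionsPerModel) {
                votes.add(model.get(r));
            }
            combined.add(vote(votes));
        }
        return combined;
    }

    public static void main(String[] args) {
        List<List<String>> runs = List.of(
                List.of("pos", "neg"),
                List.of("pos", "pos"),
                List.of("neg", "neg"));
        System.out.println(combine(runs));
    }
}
```

If the 16 processes instead evaluate different parameter settings on the same data, there is nothing to ensemble: you would simply compare the 16 results and keep the best setting.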



  • wessel
    Indeed, implementing the training phase in parallel is hard.
    But note that RM comes with a parallel decision tree algorithm!
    Parallel:Decision Tree (Parallel)
    Synopsis
    Learns a pruned decision tree which can handle both numerical and nominal attributes. This implementation might distribute the work over several threads for utilizing the today's multicore CPUs.

    Normally there is no need to do model application in parallel, because nearly all models can execute in O(N).
    Only lazy models, for example k-nearest neighbor, need more time to execute.
    So with nearly all learners (with the exception of lazy learners) the training time dwarfs the model application time.

    RapidMiner is just Java, right?
    Are you able to launch RapidMiner twice? (This should be really easy.)
    So you just hit "run" twice (once in each window) and wait for the results.
    Alternatively, you can install a RapidAnalytics server. http://rapid-i.com/content/view/182/196/

    Best regards,

    Wessel
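If model application does turn out to be the bottleneck, each row can be scored independently of the others, which makes it easy to spread over cores. Below is a minimal sketch using Java parallel streams; it does not use any RapidMiner API, and `ParallelApplySketch`, `score`, and `applyParallel` are hypothetical names, with a simple linear scorer standing in for a trained model.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class ParallelApplySketch {

    // Stand-in "model": a linear scorer. Any model that scores one row at a time fits here.
    static double score(double[] weights, double[] row) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * row[i];
        }
        return sum;
    }

    // Scores all rows in parallel; the output keeps the original row order.
    static double[] applyParallel(double[] weights, double[][] rows) {
        return IntStream.range(0, rows.length)
                .parallel()
                .mapToDouble(r -> score(weights, rows[r]))
                .toArray();
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 2.0};
        double[][] rows = {{1.0, 1.0}, {0.0, 3.0}, {2.0, 0.0}};
        System.out.println(Arrays.toString(applyParallel(weights, rows)));
    }
}
```

Parallel streams run on Java's shared fork-join pool, so the number of threads follows the number of available cores by default. The model must be safe to read from several threads at once, which holds here because `weights` is never modified.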
  • siamak_want
    Hi Wessel, the nice guy,

    As you mentioned, decision trees have been nicely parallelized in RM, but unfortunately decision trees do not perform well on high-dimensional data such as text data.

    Again, as you said, I should try RapidAnalytics. I really want to try its power on the multi-core server.

    Thanks again for your answers, which have made this thread interesting for every reader (in my opinion :)).