Load intensive processes and operators - RM Server autoscaling testing

Answers
-
Hey @Nikouy ,
I feel loops are one of the easiest ways to test memory exhaustion in RM, especially if we deactivate parallel execution.
Try the process below. Please also share the results; I am interested in understanding the auto-scaling aspect too.
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="generate_data" compatibility="9.6.000" expanded="true" height="68" name="Generate Data" width="90" x="112" y="34"> <parameter key="target_function" value="random"/> <parameter key="number_examples" value="1000000"/> <parameter key="number_of_attributes" value="50"/> <parameter key="attributes_lower_bound" value="-10.0"/> <parameter key="attributes_upper_bound" value="10.0"/> <parameter key="gaussian_standard_deviation" value="10.0"/> <parameter key="largest_radius" value="10.0"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> </operator> <operator activated="true" class="extract_macro" compatibility="9.6.000" expanded="true" height="68" name="Extract Macro" width="90" x="380" y="34"> <parameter key="macro" value="total_i"/> <parameter key="macro_type" value="number_of_examples"/> <parameter key="statistics" value="average"/> <parameter key="attribute_name" value=""/> <list key="additional_macros"/> </operator> <operator activated="true" class="concurrency:loop" compatibility="9.6.000" expanded="true" height="82" name="Loop" width="90" x="581" y="34"> <parameter key="number_of_iterations" value="%{total_i}"/> <parameter key="iteration_macro"
value="i"/> <parameter key="reuse_results" value="false"/> <parameter key="enable_parallel_execution" value="false"/> <process expanded="true"> <operator activated="true" class="filter_example_range" compatibility="9.6.000" expanded="true" height="82" name="Filter Example Range" width="90" x="380" y="34"> <parameter key="first_example" value="%{i}"/> <parameter key="last_example" value="%{i}"/> <parameter key="invert_filter" value="false"/> </operator> <operator activated="true" class="generate_attributes" compatibility="9.6.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34"> <list key="function_descriptions"> <parameter key="junk" value="att1+att10+att11"/> </list> <parameter key="keep_all" value="true"/> </operator> <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/> <connect from_op="Filter Example Range" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/> <connect from_op="Generate Attributes" from_port="example set output" to_port="output 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <connect from_op="Generate Data" from_port="output" to_op="Extract Macro" to_port="example set"/> <connect from_op="Extract Macro" from_port="example set" to_op="Loop" to_port="input 1"/> <connect from_op="Loop" from_port="output 1" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="147"/> </process> </operator> </process>
-
Hey @hbajpai, thanks for your input. I tried this process on my laptop and it almost fried it! I'll give it a try on my cluster and share my findings here. I am still trying to figure out how to fix the Kubernetes DNS so that the load balancer redirects requests to multiple servers (for high availability).
Community, is there any way I can do something similar using supervised algorithms? I am thinking of using some large dataset from UCI.
Thanks,
Nicolas
-
Hi @Nikouy,
can you maybe explain why you are doing this? We are of course running tests like this internally, but what are you trying to get out of it?
Best,
Martin
-
Hi @mschmitz,
I am currently undertaking a research project as part of my MSc dissertation. RapidMiner is the focus of my project, which answers a call from the scientific and big data community to “develop scalable higher-level models” (Elshawi et al., 2018) and thus helps those who need to automate the flexible scaling of infrastructure (Zhao et al., 2015). I am therefore exploring how to deploy an auto-scalable RapidMiner fleet in the cloud using Kubernetes, and to provide a reference architecture.
Obviously, I will need to test the system after its implementation to demonstrate high availability and scalability, and I would like to do so using real data sets and various algorithms in order to understand how it behaves under different circumstances or test cases.
Thanks,
Nicolas
-
Here is something to keep in mind for your research:
(Apologies in advance for the length of this)
RapidMiner is not an HPC (High Performance Computing) system but rather a Blackboard System. The difference between the two is fundamental to choosing the kind of processes you will need to include. Let’s review the differences:
An HPC system is a distributed system that works at the operating-system level or at a root-enabled service level, depending on the implementation. HPC systems assign processor resources and memory upon creation of the service. If you have a multiprocessing-based implementation of a chess game engine, it will distribute processes until the resources are exhausted. In that case, vertical scaling (e.g. adding RAM and processors) is difficult, because typically not only must the hardware be physically installed but some server reconfiguration is also needed (on Linux, this is typically done by modifying the sysctl values), so it is usually better to add more nodes (horizontal scaling). You will probably not find data science suites that use this kind of system, because most of the code is written directly to make good use of every single processor cycle (cycles that are needed to calculate things as complex as Navier-Stokes equation systems).
A Blackboard system, pattern or architecture works at the user level or at a user-enabled service level. Blackboard systems don’t normally control processor resources or memory; instead they have predefined agents (either inside the same software, as thread pools, or as external resources). Blackboard systems distribute processes through a server that maintains a queue for each agent (normally a database that is checked constantly, or a queuing system like Redis). In that case, vertical scaling (e.g. adding RAM and processors) is a matter of changing a few variables in the agent and restarting. In the case of RapidMiner, the RapidMiner Server controls the job agents and real-time scoring agents through its database. Almost all data science suites use variants of this architecture (there are plenty) because they are easier to maintain (you don’t require a data scientist who is also a super expert senior black belt ninja sensei in parallel processing, which is a dark black unicorn among the unicorns), and since there is a single non-volatile storage available it is easier to work with large data.
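The blackboard pattern described above can be sketched in a few lines. This is a conceptual illustration only, not RapidMiner's actual implementation: a shared job store stands in for the Server's database, and worker threads stand in for job agents that claim jobs from it.

```python
import queue
import threading

# Shared job store: stands in for the Server's job database/queue.
jobs = queue.Queue()
results = []
lock = threading.Lock()

def job_agent(agent_id):
    # Each agent repeatedly claims a job from the shared store until it is empty.
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append((agent_id, job * 2))  # pretend "processing"
        jobs.task_done()

# Post ten jobs on the blackboard before any agent starts.
for j in range(10):
    jobs.put(j)

# Horizontal scaling = starting more agents; the queue itself is unchanged.
agents = [threading.Thread(target=job_agent, args=(i,)) for i in range(3)]
for a in agents:
    a.start()
for a in agents:
    a.join()

print(len(results))  # 10: every job was processed exactly once
```

Note that adding a fourth agent requires no change to the queue or the jobs, which is exactly why this style of system scales more easily than one that binds work to processors at service-creation time.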
Now… what does this mean for you?
Deactivating parallel processing on a single computer only means that all the processing will be done in a second thread inside RapidMiner Studio (so as not to make the GUI unresponsive), and since the process is large, it will probably cause a lot of internal resource blocking; that’s what fries your computer. You should parallelize when you can, for your own sanity. Now, depending on the version of Studio that you have, you should check how many threads can be opened (each thread gets its own core from your processor; therefore on my AMD ThreadRipper with 48 cores I can run 46 calculation threads plus one for the program and one for the operating system, and on my i9 with 16 cores I can only run 14 calculation threads plus the two aforementioned ones).
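The thread budget in that paragraph can be written down as a tiny helper. The "reserve two threads" figure simply follows the examples above (one for the program, one for the operating system); it is not an official RapidMiner rule.

```python
import os

def calculation_threads(total_cores, reserved=2):
    # Cores available for calculation = total cores minus the threads
    # reserved for the program and the operating system (assumption: 2).
    return max(1, total_cores - reserved)

print(calculation_threads(48))  # 46, matching the ThreadRipper example
print(calculation_threads(16))  # 14, matching the i9 example
print(calculation_threads(os.cpu_count() or 1))  # rough budget on this machine
```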
Activating parallel processing on RapidMiner Server won’t assign more resources automatically. Instead, you should look for tasks that you can deliver to your servers through an operator built for exactly that: Schedule Process. Vertical scaling means you can launch more job agents on a given machine, or configure the same job agents with more processors and RAM; horizontal scaling means you can launch more job agents on different machines.
With that said, I would recommend you to:
1. Take time to train and test a model using RapidMiner Studio. It will be painful if you don’t do it well, but since what you want to test is how to scale things, it wouldn’t be a problem to use… I don’t know, a downsampled dataset.
2. Store the model on RapidMiner Server.
3. Create a process that loops over your data and runs one “Schedule Process” operator per record.
4. See how each node is working. Measure things using SNMP if you can, because that will give you a broader picture of consumption.
I would recommend this dataset for it:
https://plg.uwaterloo.ca/~gvcormac/treccorpus07/about.html
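Step 3 above can be sketched as follows. `schedule_process()` is a hypothetical stand-in for RapidMiner's Schedule Process operator (or the Server's job-submission mechanism), and the repository path is made up for illustration; neither is a real API here.

```python
scheduled = []

def schedule_process(process_path, record):
    # In a real setup this would hand the job to RapidMiner Server, which
    # queues it on whichever job agent is free (possibly a freshly
    # autoscaled Kubernetes pod). Here we only record the submission.
    scheduled.append((process_path, record))

# Stand-in for the data you want to score, one job per record.
records = [{"id": i} for i in range(5)]
for record in records:
    schedule_process("/home/admin/score_one_record", record)

print(len(scheduled))  # one scheduled job per record
```

The point of the per-record loop is that each submission becomes an independent job the Server can place on any agent, which is what makes the workload observable under horizontal scaling.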
That’s it, my two cents.
Hope this helps,
Rod.
-
Rodrigo,
Thank you for taking the time to write such a detailed reply and for highlighting the differences between HPC and Blackboard systems. Using parallel processing is something I consider key, which is why I was asking which algorithms (either supervised or unsupervised) would make good use of parallel processing, so I could focus on one or two processes at most.
I didn’t quite get your point number 3, so I’d appreciate it if you could expand on it. What would I be achieving with this?
Thanks again,
Nicolas
-
Sure!
I understand that you are launching more agents with Kubernetes on demand depending on the process, am I right?
When you run a local process that requires parallel work, RapidMiner launches these parallel processes on the same machine. Which processes can do that?
· Looping with “use parallel execution”.
· Cross validation.
· Feature selection.
When you do the same on RapidMiner Server, it behaves identically: the parallel processes run on the same machine, and the same operators apply.
But if you are talking about horizontal scaling (adding more machines), your processes need to be ready to send data to other RapidMiner agents, and that is done by creating a process that can be scheduled through the server. For horizontal scaling you should invoke “Schedule Process” in a loop, and Cross Validation and Feature Selection can no longer be parallelized across many servers.
Basically, that’s why (in my humble opinion) I think you might want to focus on scoring with a previously trained model: it will be easier for you to research horizontal and vertical scaling. If you want to discuss this in private, drop me a line.
All the best,
Rod
-
Thanks Rodrigo, that totally makes sense. I'll probably be reaching out.
@hbajpai, I tried executing the process you suggested on the server, but it looks (for some reason) like Studio ends up picking it up. Please see the screenshot below from my laptop. I did not see any load increase at all on the server.
Thanks,
Nicolas