Using RapidMiner H2O operators with an existing local H2O server
I have noticed the previous discussion on using CUDA and GPUs with RM, so this is a related query. I can run GPU-based TensorFlow, Keras and MXNet from the RM Python interface, and I can access a local H2O server from R and Python. However, using the existing RM H2O operators seems like a very attractive option. So, I wonder if it is possible to configure the RM client to connect to an existing H2O server rather than start a new H2O cluster each time an H2O operator is run (per session). In this way, you could link the RM client with an H2O server supporting GPUs. Would this be possible without installing Hadoop, Spark, etc.?
I am using an educational license of RapidMiner Studio.
Jacob
Answers
-
That is an interesting idea indeed. Do you know if the H2O server has some sort of API to connect to?
0 -
Yes, H2O has a range of APIs for different software, including a REST API, which I think R and Python use to communicate with the H2O server. See:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/api-reference.html
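For instance, a running cluster's status can be read straight off that REST API. A minimal R sketch of my own (assuming the default port 54321, the httr package, and the /3/Cloud endpoint field names as documented in the REST API reference linked above):

    library(httr)

    # Query the cluster status endpoint of a local H2O instance (default port assumed)
    resp <- GET("http://localhost:54321/3/Cloud")
    stop_for_status(resp)

    cloud <- content(resp, as = "parsed")
    cat("H2O version :", cloud$version, "\n")
    cat("Cloud healthy:", cloud$cloud_healthy, "\n")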
Jacob
0 -
I suspect RapidMiner is doing this already, as it starts a session with an H2O cluster and closes the connection when it exits. All that would be needed is to provide the IP address and port of the server, if it exists and is running, possibly as an extra section in the preferences?
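As a rough sketch of what that could look like from the R side (my own example, assuming the cluster listens on the default port 54321): with startH2O = FALSE, h2o.init attaches to an already-running cluster instead of launching a new one.

    library(h2o)

    # Attach to an existing local cluster rather than starting one; adjust ip/port
    # to wherever the (e.g. GPU-backed) cluster is actually listening
    h2o.init(ip = "localhost", port = 54321, startH2O = FALSE)

    h2o.clusterInfo()   # confirms the connection and prints the cluster's version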
For reference, R passes the following (optional) arguments during its initialisation of the H2O session:
ip: Object of class character representing the IP address of the server where H2O is running.
port: Object of class numeric representing the port number of the H2O server.
startH2O: (Optional) A logical value indicating whether to try to start H2O from R if no connection with H2O is detected. This is only possible if ip = "localhost" or ip = "127.0.0.1". If an existing connection is detected, R does not start H2O.
forceDL: (Optional) A logical value indicating whether to force download of the H2O executable. Defaults to FALSE, so the executable will only be downloaded if it does not already exist in the h2o R library resources directory h2o/java/h2o.jar. This value is only used when R starts H2O.
enable_assertions: (Optional) A logical value indicating whether H2O should be launched with assertions enabled. Used mainly for error checking and debugging purposes. This value is only used when R starts H2O.
license: (Optional) A character string value specifying the full path of the license file. This value is only used when R starts H2O.
nthreads: (Optional) Number of threads in the thread pool. This relates very closely to the number of CPUs used. -1 means use all CPUs on the host (default). A positive integer specifies the number of CPUs directly. This value is only used when R starts H2O.
max_mem_size: (Optional) A character string specifying the maximum size, in bytes, of the memory allocation pool to H2O. This value must be a multiple of 1024 greater than 2 MB. Append the letter m or M to indicate megabytes, or g or G to indicate gigabytes. This value is only used when R starts H2O.
min_mem_size: (Optional) A character string specifying the minimum size, in bytes, of the memory allocation pool to H2O. This value must be a multiple of 1024 greater than 2 MB. Append the letter m or M to indicate megabytes, or g or G to indicate gigabytes. This value is only used when R starts H2O.
ice_root: (Optional) A directory to handle object spillage. The default varies by OS.
strict_version_check: (Optional) Setting this to FALSE is unsupported and should only be done when advised by technical support.
proxy: (Optional) A character string specifying the proxy path.
https: (Optional) Set this to TRUE to use https instead of http.
insecure: (Optional) Set this to TRUE to disable SSL certificate checking.
username: (Optional) Username to login with.
password: (Optional) Password to login with.
cookies: (Optional) Vector (or list) of cookies to add to the request.
context_path: (Optional) The last part of the connection URL: http://<ip>:<port>/<context_path>
ignore_config: (Optional) A logical value indicating whether a search for a .h2oconfig file should be conducted or not. Default value is FALSE.
Jacob
0 -
Well, RM does not interact with an off-site H2O cluster; it creates those clusters on your machine during process execution.
1 -
That is true. However, once the H2O cluster is created, as happens right now, and it is accessed via the REST API during its existence, it is possibly no different from a separate server running independently of RM? If so, all that is needed is a connection to that cluster?
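As an illustration (my own sketch, not something from the RapidMiner docs): while an H2O operator is executing, one should be able to attach a second R session to the cluster RapidMiner spun up and inspect it, provided the port it bound to is known (54321 below is only a guess at the default).

    library(h2o)

    # Attach to the cluster RapidMiner started; the port is an assumption, and the
    # local h2o R package version has to match the cluster version (or pass
    # strict_version_check = FALSE, which the H2O docs call unsupported)
    h2o.connect(ip = "localhost", port = 54321)

    h2o.clusterInfo()   # version, uptime, number of nodes
    h2o.ls()            # keys of the frames and models currently in the cluster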
Jacob
0 -
Perhaps in theory it could work, but I'm not the best person to speculate on that. Let's ping @Marco_Boeck and see if he has anything to add to this.
0 -
Hi,
not my area, I'm afraid. Let's continue pinging people! @phellinger would be the expert on that.
Regards,
Marco
1 -
Thank you Marco. I wonder if @phellinger has had a chance to look at this?
Jacob
0 -
Hi,
this is an interesting idea indeed.
However, there is a blocker: core H2O does not support transferring models between different versions (even minor versions!). For example, you cannot apply a model in 3.10.0.7 that was trained in 3.10.0.6; see:
https://0xdata.atlassian.net/browse/PUBDEV-3432
This means that the running H2O cluster would need to be the exact same version as the one RapidMiner integrates.
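To make that constraint concrete, here is a minimal sketch (my own illustration, not part of the product) of the kind of version check an external cluster would have to pass, with a placeholder for whichever H2O version the extension bundles and the default port assumed:

    library(httr)

    # Placeholder: the exact H2O version the RapidMiner extension ships with
    required_version <- "3.x.y.z"

    cloud <- content(GET("http://localhost:54321/3/Cloud"), as = "parsed")
    if (!identical(cloud$version, required_version)) {
      stop("Cluster runs H2O ", cloud$version,
           " but the extension expects ", required_version)
    }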
Version compatibility was actually a big challenge for us during integration, because RapidMiner is committed to your processes behaving the same way even after you upgrade the platform, which is important if you are running processes in production. To achieve that, even models trained with the help of the H2O library become RapidMiner models that work the same way across the entire platform (Studio, Server, Radoop), and they keep their behaviour through upgrades as well. So this is something that RapidMiner brings to the table.
To be able to work with different versions of H2O clusters, we would need to design something complex to still keep this reliability of RapidMiner processes. If the POJO generation of H2O did not have certain issues, it could help solve that.
To summarize, this is a good idea, but it is far more challenging than it sounds at first; we will still keep an eye on our options for solving it.
Best,
Peter
2 -
@phellinger Thanks Peter. How about adding an option in the H2O.ai operator to start the (session) cluster with local GPU support?
Jacob
0