How to run processes from data stored 100% in the cloud?
artavia_eduardo
New Altair Community Member
Hi all.
I've been working with RapidMiner Studio for a while now. Have a little experience working with predictive models and such.
Right now my company is asking me to analyze some medical data from real-world patients. However, because of privacy requirements and laws, I can't have this data stored on my physical computer, not even for a single minute. I know how to connect RapidMiner Studio to a SQL Server and access data from the cloud; however, when I run a process, the data gets downloaded to my computer.
How would you guys recommend I tackle this issue? Is there a way to use RM 100% in the cloud, or have it access data that stays 100% in the cloud? I'm not sure whether RapidMiner Server would help me; I've never used it.
Thank you.
Eduardo.
Best Answer
-
Hi @artavia_eduardo
Is the data not even allowed to be loaded into the memory of your computer (that is, without being stored on disk)? If even loading it into memory is not allowed, it is impossible for a program running on your computer to do anything with the data, because it obviously needs to be able to access it.
If this is the case I have a few suggestions which might work, but have to be investigated:
- You could use the In-Database extension. With this extension you can build complex SQL commands which are then executed in the SQL database. Unfortunately, you will of course be limited to the functionality SQL provides; there is no way to leverage RM-specific functionality through the SQL commands. But you could use it to anonymise your data in the SQL database before loading it onto your PC and applying any RM logic to it. After that you could use the In-Database extension again to update the original data with, for example, scored values. I don't know whether you are allowed to use anonymised data on your computer.
- You can install RM Server on the same cloud hardware where the database is located. Then any RM process executed on this RM Server runs in the same "cloud" as the data itself.
- You can use our "Pay as you Go" licences for RM Server (https://rapidminer.com/pricing/ under RapidMiner Server (Cloud)). This would use an RM Server instance on either Amazon AWS or Microsoft Azure. It would be in the cloud, but probably not in the same cloud infrastructure as your data.
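To illustrate the anonymisation idea from the first suggestion, here is a minimal sketch in Python rather than in RapidMiner itself. It assumes an in-memory SQLite database as a stand-in for the real SQL Server, and the table, column, and view names are all hypothetical. The view drops direct identifiers and coarsens age inside the database, so the client only ever queries the reduced view. Whether this level of reduction counts as anonymisation under your rules is something only your compliance team can decide.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the remote SQL Server.
# All table, column, and view names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY,
        full_name  TEXT,
        age        INTEGER,
        diagnosis  TEXT
    );
    INSERT INTO patients VALUES
        (1, 'Alice Example', 34, 'A'),
        (2, 'Bob Example',   57, 'B'),
        (3, 'Carol Example', 52, 'B');

    -- The view omits patient_id and full_name and bands age into decades.
    CREATE VIEW patients_anon AS
    SELECT (age / 10) * 10 AS age_band, diagnosis
    FROM patients;
""")


def fetch_anonymised(connection):
    """Return only the columns exposed by the reduced view."""
    cur = connection.execute(
        "SELECT age_band, diagnosis FROM patients_anon "
        "ORDER BY age_band, diagnosis"
    )
    return cur.fetchall()


rows = fetch_anonymised(conn)
```

In a real setup the view would be created by the database administrator, and the analyst's account would be granted access to the view only, not to the underlying table.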
If it is allowed to load the data into memory, just don't use Store (or Write) operators. Load the data from SQL, process it, and update the SQL DB again, all in one process.
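The same load, process, and update pattern can be sketched outside RapidMiner as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the remote SQL database, the `visits` table and its columns are hypothetical, and `score` is a toy rule rather than a real model. The point is that the rows live only in process memory between the read and the write-back, and nothing is written to a local file.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the remote SQL database.
# The table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (
        visit_id   INTEGER PRIMARY KEY,
        systolic   INTEGER,
        risk_score REAL
    );
    INSERT INTO visits (visit_id, systolic) VALUES (1, 118), (2, 145), (3, 162);
""")


def score(systolic):
    """Toy scoring rule, purely illustrative (not a real model)."""
    return 1.0 if systolic >= 140 else 0.0


def score_in_one_pass(connection):
    """Read rows, score them in memory, and write the scores back."""
    rows = connection.execute(
        "SELECT visit_id, systolic FROM visits"
    ).fetchall()
    updates = [(score(systolic), visit_id) for visit_id, systolic in rows]
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "UPDATE visits SET risk_score = ? WHERE visit_id = ?", updates
        )
    return len(updates)


updated = score_in_one_pass(conn)
scores = conn.execute(
    "SELECT risk_score FROM visits ORDER BY visit_id"
).fetchall()
```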
Hope this helps
Best regards
Fabian
Answers
-
Does that mean that you cannot explore the data? That would make sense, aside from whether the data sits on the disk or in memory.
I think you can build a procedure on the database to train and test the performance of a model; then you would only receive the results (e.g. a confusion matrix) on your computer. I imagine that you can set up a solution with PostgreSQL and Python, but it needs help from the data provider.
Solutions with RM Server don't seem to apply, unless the provider of the data is allowed to install the server locally. Once you copy data away from the originator, it is the same whether it sits on your computer or in an RM Server in the cloud.
Regards,
Sebastian
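A rough sketch of the "only receive the results" idea, assuming the data provider has already run a model inside the database and stored the outcomes in a table. SQLite in memory stands in for PostgreSQL here, and the `predictions` table with its `actual` and `predicted` columns is hypothetical. The query aggregates inside the database, so only the confusion-matrix counts reach the client, never the patient-level rows.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the provider's PostgreSQL database.
# The predictions table is hypothetical and would be filled by a procedure
# that the data provider runs next to the data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE predictions (actual INTEGER, predicted INTEGER);
    INSERT INTO predictions VALUES
        (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1);
""")


def confusion_matrix(connection):
    """Return {(actual, predicted): count}, aggregated inside the database."""
    cur = connection.execute("""
        SELECT actual, predicted, COUNT(*)
        FROM predictions
        GROUP BY actual, predicted
        ORDER BY actual, predicted
    """)
    return {(actual, predicted): n for actual, predicted, n in cur}


matrix = confusion_matrix(conn)
```

With very small groups, even aggregate counts can reveal information about individuals, so the provider may want to suppress cells below some minimum count.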
-
Hi Fabian, Sebastian,
Do you know how RapidMiner interacts with data stored in Amazon Redshift or in Azure Data Lake? Does it always pull/download this data and load it into memory in order to analyse it?
Thanks
-
Hi @Nikouy,
As I already wrote in my first response, a program (not only RapidMiner) cannot run any analysis on data without accessing it. So, if you want to execute RapidMiner locally, it has to load the data into memory to analyse it. Everything I wrote in my first response is also true for Amazon Redshift or Azure Data Lake. You can use our "Pay as you Go" licences for RM Server (https://rapidminer.com/pricing/ under RapidMiner Server (Cloud)). This would use an RM Server instance on either Amazon AWS or Microsoft Azure and connect to Redshift or Azure Data Lake. Then the execution will happen on the cloud servers of AWS/Azure.
Best regards,
Fabian