Recommended way to load data from own Java code to RapidMiner?
wessel
New Altair Community Member
Dear All,
I have a double[][] in my own Java code.
I wish to load this double[][] in RapidMiner.
Currently I write this double[][] to a text file, and then parse back the numbers in RapidMiner.
Is there a better way to do this?
Maybe write out a binary file and somehow load this in RapidMiner?
This will save a lot of CPU time parsing and disk space, since my double[][] text file can easily be 10GB.
Best regards,
Wessel
Answers
-
Are you using RapidMiner as a library integrated into your code, or do you load the data generated by your application from the normal RapidMiner GUI?
Regards,
Marius
-
I'm loading data into the normal GUI.
-
In this case I fear you have to use one of the standard import methods of RapidMiner. What about using a database as the storage system?
If you want to dive into the code of RapidMiner, you can try to create an ExampleSet programmatically and write it directly as a file into the RapidMiner repository, but this also involves writing files, and feels a bit hackish.
Best, Marius
-
Hey,
I think creating an .ioo file (one you can load using the Retrieve operator) is by far the fastest way.
I don't think it's that hackish. I will try to get working code for this procedure. Can't be that hard, right?
When I write a large double[][] for personal use in my own code I always use ObjectOutputStream.
Does not feel like a hack at all. The code for reading and writing is extremely clean.
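A minimal sketch of what such reading and writing could look like. Note that this produces a plain Java serialization file, not RapidMiner's .ioo repository format; the class name MatrixSerialization and the temp-file name are hypothetical and only there for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: plain Java serialization, NOT the RapidMiner .ioo format.
class MatrixSerialization {

    // Write the whole double[][] as a single serialized Java object.
    static void write(double[][] data, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(data);
        }
    }

    // Read the object back and cast it to double[][].
    static double[][] read(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (double[][]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[][] data = {{1.0, 2.5}, {-3.0, 4.0}};
        Path file = Files.createTempFile("matrix", ".ser");
        write(data, file);
        double[][] back = read(file);
        // Check that the round trip preserved every value.
        System.out.println(Arrays.deepEquals(data, back));
        Files.delete(file);
    }
}
```

The file can only be read back by Java code that deserializes it, so RapidMiner would still need some custom code on its side to consume it.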
Best regards,
Wessel
-
Yes, surely the code will be clean, but it goes against the policy of not messing with the repository data structure by hand, and it will only work with local repositories, not, for example, with a RapidAnalytics server. But as always: as long as the users are satisfied, everything is fine.
Depending on what you are going to do, it might also be worth considering a database, especially if you need random access to your data. Using CSV files, ioo files, or the like always requires you to load the complete dataset.
Happy Coding!
~Marius
-
I'm analyzing run time statistics of search algorithms.
So I measure N doubles at each time step, e.g. output of some heuristic.
Search algorithms can easily take 1M steps to complete.
So now I need to analyze a data-set of 1M * N doubles.
I don't see how a database system would help me here.
I need to analyze the entire data-set not just some small subset.
Right now I'm scaling to a point where just loading the data and parsing all the doubles takes more than a minute.
Simply retrieving the data-set later using the Retrieve operator takes less than a tenth of this time.
Why would creating RM-ioo in your own code not work with RapidAnalytics?
Maybe it's better to create a new "load binary data" Operator instead?
Best regards,
Wessel
-
wessel wrote:
Why would creating RM-ioo in your own code not work with RapidAnalytics?
What I wanted to say is that you can't simply put the ioo file into a folder, because the RapidAnalytics repositories are stored in a database. Of course you can also access the remote RapidAnalytics repository from your code, but it's more complicated than just writing a file. So my statement was maybe a bit misleading.
wessel wrote:
Maybe it's better to create a new "load binary data" Operator instead?
What should that operator do? What should the binary format be?
Best, Marius
-
Format?
A binary file is simply a sequence of bytes, right?
And a double is simply 8 bytes.
This is probably not the same for all programming languages, but Java uses doubleToLongBits and then writes that long value to the underlying output stream as an 8-byte quantity, high byte first.
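As a rough sketch of that format, assuming a file that holds nothing but doubles (no header, no row or column count), DataOutputStream.writeDouble and DataInputStream.readDouble do exactly this high-byte-first encoding. The class name RawDoubleFile is made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for a headerless file of big-endian 8-byte doubles.
class RawDoubleFile {

    // Each value is written as 8 bytes, high byte first.
    static void write(double[] values, Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (double v : values) {
                out.writeDouble(v);
            }
        }
    }

    // The value count is derived from the file size, since there is no header.
    static double[] read(Path file) throws IOException {
        long size = Files.size(file);
        if (size % Double.BYTES != 0) {
            throw new IOException("File size is not a multiple of 8 bytes: " + size);
        }
        double[] values = new double[Math.toIntExact(size / Double.BYTES)];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readDouble();
            }
        }
        return values;
    }
}
```

Because the file carries no metadata, the reader has to assume the layout; here every value ends up in one flat array, which matches the single-attribute idea.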
So the operator should load the file, and create a data-set containing 1 attribute with the corresponding double values.
-
Well, and here's where the problems start: there are only very limited cases where you would want to import a table with exactly one attribute, written by a Java tool. This probably won't work cross platform with data written from other languages because of byte ordering etc.
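To illustrate the byte-ordering concern, here is a small sketch using java.nio.ByteBuffer, which lets the reader state the byte order explicitly. The class and method names are hypothetical; the point is only that the same 8 bytes decode to different values depending on the order the reader assumes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper showing why reader and writer must agree on byte order.
class ByteOrderDemo {

    // Encode doubles as consecutive 8-byte values in the given byte order.
    static byte[] encode(double[] values, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Double.BYTES).order(order);
        for (double v : values) {
            buf.putDouble(v);
        }
        return buf.array();
    }

    // Decode consecutive 8-byte values, interpreting them in the given byte order.
    static double[] decode(byte[] raw, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(order);
        double[] values = new double[raw.length / Double.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getDouble();
        }
        return values;
    }

    public static void main(String[] args) {
        // Data as a little-endian writer (e.g. a typical x86 C program) would produce it.
        byte[] raw = encode(new double[] {1.0, -2.5}, ByteOrder.LITTLE_ENDIAN);
        double[] right = decode(raw, ByteOrder.LITTLE_ENDIAN);
        double[] wrong = decode(raw, ByteOrder.BIG_ENDIAN);
        System.out.println("matching order:   " + right[0] + ", " + right[1]);
        System.out.println("mismatched order: " + wrong[0] + ", " + wrong[1]);
    }
}
```

A file format that fixes the byte order, or records it in a header, avoids this ambiguity.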
Of course you can implement such an operator for your personal use in an extension. If you are capable of writing Java code, this will be rather easy and could probably even be done with the Execute Script operator.
Best, Marius