Hi all,
I want to run text classification tasks from a self-written Java program using RapidMiner. I have already trained an SVM classification model and stored it in the repository. In my Java application, I read ids from a database; these ids point to locations on my HDD where the text data is stored. This data is then passed to RapidMiner. To save memory, the classification is not done for all data at once; instead, the data is processed in blocks. This is basically my application:
import java.io.File;

import com.google.common.collect.ImmutableList;
import com.rapidminer.Process;
import com.rapidminer.RapidMiner;
import com.rapidminer.RapidMiner.ExecutionMode;
import com.rapidminer.example.Example;
import com.rapidminer.example.ExampleSet;
import com.rapidminer.operator.IOContainer;
import com.rapidminer.operator.IOObject;

public class ApplyModel {

    static String process_definition_file = "apply_model.xml";
    static int num_of_domains = 100000;
    static int block_size = 100; // number of examples classified at once
    static boolean debug = true;

    public static void main(String[] args) {
        System.out.println("START OF APPLY MODEL");
        try {
            // run RapidMiner embedded, without the GUI
            RapidMiner.setExecutionMode(ExecutionMode.COMMAND_LINE);

            int start = 0;
            int iteration = 1;
            while (start < num_of_domains) {
                // (re)initialize RapidMiner for this block
                RapidMiner.init();
                // read the process definition
                Process rm = new Process(new File(process_definition_file));

                // fetch at most block_size examples, but never more than remain
                int current_limit = Math.min(block_size, num_of_domains - start);

                // get the data for the current block
                ImmutableList<RapidMiner2Row> data = [...]
                // transform it into an ExampleSet
                ExampleSet ex = new CData2ExampleSet().getExampleSet(data);
                // wrap it as IOObject input for the process
                IOObject ioo = ex;
                IOContainer ioc = new IOContainer(new IOObject[] { ioo });

                // run the RapidMiner process
                IOContainer res_ioc = rm.run(ioc);

                // analyze the results
                if (res_ioc.getElementAt(0) instanceof ExampleSet) {
                    ExampleSet resultSet = (ExampleSet) res_ioc.getElementAt(0);
                    // go through the results
                    for (Example example : resultSet) {
                        [...]
                    }
                }

                start += current_limit;
                iteration++;

                // clean up (explicitly drop references before the next block)
                data = null;
                ex = null;
                ioo = null;
                ioc = null;
                rm = null;
            } // end of while
        }
        catch (Exception e) {
            [...]
        }
        System.out.println("END OF APPLY MODEL");
    }
}
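For completeness: the elided part of the result loop is nothing special, it basically just reads the prediction for each example. A simplified sketch of what happens there (the helper method name and the println are only for illustration):

import com.rapidminer.example.Attribute;
import com.rapidminer.example.Example;
import com.rapidminer.example.ExampleSet;

// Simplified stand-in for the body of the result loop:
// read the predicted class (and, if present, its confidence) for every example.
static void printPredictions(ExampleSet resultSet) {
    Attribute predictedLabel = resultSet.getAttributes().getPredictedLabel();
    for (Example example : resultSet) {
        String prediction = example.getValueAsString(predictedLabel);
        // confidence attributes are created by Apply Model for classification
        double confidence = example.getConfidence(prediction);
        System.out.println(prediction + " (confidence " + confidence + ")");
    }
}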
Although the RapidMiner process is re-created and re-initialized for every data block, I am running into an OutOfMemoryError (GC overhead limit exceeded). The memory problem seems to depend on the total amount of data processed: it makes only a small difference whether I run 100 iterations with 10 examples each or 10 iterations with 100 examples each. Does anyone have an idea what might be causing this?
Regards
Merlot