Hi,
I want to load some vectors with RapidMiner 5.0.001 RC from a file in sparse format for a similarity computation.
The file contains approx. 14 million non-zero entries in 9900 vectors, each of dimension 9900, so about 1400 components per vector are non-zero.
For this task I use a read_sparse operator followed by a data_to_similarity operator. The datamanagement property of read_sparse is set to int_sparse_array to save memory. When the process is started, it gets stuck while reading the sparse file; after waiting for 20 minutes I terminated it.
After switching datamanagement to int_array, the read_sparse operator finished in 30 seconds.
To check whether the read_sparse operator works at all with int_sparse_array, I timed the reading of individual examples in readExamples() in MemoryExampleTable.java. It takes about 80 seconds for 20 lines (roughly 4 seconds per line), so reading all 9900 vectors would take about 11 hours.
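The 11-hour figure is a plain linear extrapolation from the measured sample. A minimal sketch of the arithmetic, assuming the per-line cost stays constant over the whole file (which may not hold if the slowdown depends on the number of non-zero entries per line):

```python
# Extrapolate the total read time from the measured sample.
# All three figures come from the post; constant per-line cost is an assumption.
sample_seconds = 80    # measured time for the sample
sample_lines = 20      # lines read in that time
total_vectors = 9900   # vectors (lines) in vectors.dat

seconds_per_line = sample_seconds / sample_lines
total_hours = seconds_per_line * total_vectors / 3600

print(f"{seconds_per_line:.1f} s per line, about {total_hours:.0f} h in total")
```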
Is the creation of an int_sparse_array really that slow, or am I doing something wrong?
Although the non-sparse datamanagement works, I need the sparse representation because the process will be applied to bigger datasets later.
I also know that even if the sparse reader finishes, the similarity operator will take hours to compute its results, but it might be replaced by a faster algorithm someday.
My computer has a Pentium M 1.6 GHz CPU and 2 GB RAM. According to the System Monitor of RapidMiner, the JVM has reserved 1.1 GB (max = total) but uses only about 10% of it while the process is running (during the first 20 minutes).
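To put these memory figures in perspective, here is a rough back-of-the-envelope estimate of the raw storage needed for the data in dense and sparse form. It assumes 4 bytes per int value and 4 bytes per sparse index and ignores all object overhead; it is not a description of how RapidMiner actually lays out int_array or int_sparse_array internally.

```python
# Rough storage estimate for the data set described in the post.
# Assumptions (not taken from RapidMiner): 4-byte ints, 4-byte sparse indices,
# no per-object or per-row overhead.
num_vectors = 9900
dimension = 9900
nonzero_entries = 14_000_000
bytes_per_int = 4

nonzeros_per_vector = nonzero_entries / num_vectors
density = nonzero_entries / (num_vectors * dimension)

dense_bytes = num_vectors * dimension * bytes_per_int
sparse_bytes = nonzero_entries * (bytes_per_int + bytes_per_int)  # value + index

mib = 1024 * 1024
print(f"non-zeros per vector: {nonzeros_per_vector:.0f}")
print(f"density:              {density:.1%}")
print(f"dense storage:        {dense_bytes / mib:.0f} MiB")
print(f"sparse storage:       {sparse_bytes / mib:.0f} MiB")
```

Under these assumptions both variants fit into a 1.1 GB heap for the current file, which matches the dense run succeeding; the difference only becomes decisive for the bigger datasets.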
The process file looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="566" width="685">
      <operator activated="true" class="read_sparse" expanded="true" height="60" name="Read Sparse" width="90" x="112" y="255">
        <parameter key="format" value="no_label"/>
        <parameter key="attribute_description_file" value="vectors.aml"/>
        <parameter key="datamanagement" value="int_sparse_array"/>
        <list key="prefix_map"/>
      </operator>
      <operator activated="true" class="data_to_similarity" expanded="true" height="76" name="Data to Similarity" width="90" x="313" y="255">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
      </operator>
      <connect from_op="Read Sparse" from_port="output" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
My vectors.aml file:
<?xml version="1.0" encoding="windows-1252" standalone="no"?>
<attributeset default_source="vectors.dat" encoding="windows-1252">
  <id name="id" valuetype="integer"/>
  <attribute name="dim" sourcecol="1" sourcecol_end="9945" valuetype="integer"/>
</attributeset>
And a few sample lines from vectors.dat:
id:1 2:7 3:1 5:2 7:61 8:1 10:1 11:44 12:2 13:1 14:2 16:1 ...
id:2 1:7 3:1 4:27 5:1695 6:268 7:12457 8:961 9:46 10:35 ...
...
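For reference, the line format above can be read and compared outside RapidMiner with a few lines of code. This is only a minimal sketch for cross-checking results on small inputs: the helper names parse_sparse_line and cosine_similarity are hypothetical, the two sample lines are shortened prefixes of the lines shown above, and the code is unrelated to RapidMiner's own reader or similarity operator.

```python
import math


def parse_sparse_line(line):
    """Parse 'id:<n> <index>:<value> ...' into (id, {index: value})."""
    vec_id = None
    components = {}
    for token in line.split():
        key, _, value = token.partition(":")
        if key == "id":
            vec_id = int(value)
        else:
            components[int(key)] = int(value)
    return vec_id, components


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(value * b.get(index, 0) for index, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Shortened prefixes of the two sample lines from vectors.dat.
sample_lines = [
    "id:1 2:7 3:1 5:2 7:61 8:1",
    "id:2 1:7 3:1 4:27 5:1695 6:268",
]
vectors = dict(parse_sparse_line(line) for line in sample_lines)
similarity = cosine_similarity(vectors[1], vectors[2])
print(f"cosine(1, 2) = {similarity:.4f}")
```

Only the indices present in both vectors contribute to the dot product, so the cost per pair depends on the number of non-zero entries rather than on the full dimension.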
Thanks in advance,
Tobias