Slow sparse file loading with sparse datamanagement

hennymcc New Altair Community Member
edited November 5 in Community Q&A
Hi,

I want to load some vectors with RapidMiner 5.0.001 RC from a file in sparse format for a similarity computation.
The file contains approx. 14 million non-zero entries in 9900 vectors. Each vector has a dimension of 9900, so about 1400 components per vector are non-zero.

For this task I used a read_sparse operator followed by a data_to_similarity operator. The datamanagement property of read_sparse is set to int_sparse_array to save memory. When the process is started, it gets stuck while reading the sparse file. After waiting for 20 minutes I terminated the process.
After switching to the int_array datamanagement, the read_sparse operator finished in 30 seconds.
To check whether the read_sparse operator works at all with int_sparse_array, I measured the reading time in readExamples() in MemoryExampleTable.java. It takes about 80 seconds for 20 lines, so reading all 9900 vectors would take about 11 hours.
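The 11-hour figure can be checked with a quick calculation. This is only a sketch that assumes the reading time grows linearly with the number of lines, which the measurement of 20 lines does not prove:

```python
# Extrapolate the total load time from the measured sample.
# Assumption: reading time scales linearly with the number of lines.
measured_seconds = 80   # time measured for the sample (from the post)
measured_lines = 20     # lines read in that time (from the post)
total_vectors = 9900    # vectors in the file (from the post)

seconds_per_line = measured_seconds / measured_lines
total_hours = seconds_per_line * total_vectors / 3600

print(f"{seconds_per_line:.1f} s per line, about {total_hours:.0f} hours in total")
```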

Is the creation of an int_sparse_array really that slow or am I doing something wrong?

Although the non-sparse datamanagement works, I need the sparse representation because the process will be applied to bigger datasets later.
I also know that even if the sparse reader finishes, the similarity operator will take hours to compute its results, but it might be replaced by a faster algorithm someday.

My computer has a Pentium M 1.6 GHz CPU and 2 GB RAM. According to the System Monitor of RapidMiner, the JVM reserved 1.1 GB (max = total) but used only 10% of it while running the process (during the first 20 minutes).
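For context, here is a rough estimate of what the two representations would need for this dataset. The byte sizes are assumptions for illustration (4 bytes per int value, plus one 4-byte index per non-zero entry in the sparse case); they are not taken from RapidMiner's source, and any per-row or per-object overhead is ignored:

```python
# Rough memory estimate for dense vs. sparse storage of the vectors.
# Assumptions (not from RapidMiner's source): 4 bytes per int value, and
# one 4-byte index stored next to each non-zero value in the sparse case.
BYTES_PER_INT = 4
n_vectors = 9900              # from the post
dimension = 9900              # from the post
non_zero_entries = 14_000_000 # approximate count from the post

dense_bytes = n_vectors * dimension * BYTES_PER_INT
sparse_bytes = non_zero_entries * (BYTES_PER_INT + BYTES_PER_INT)

print(f"dense:  {dense_bytes / 2**20:.0f} MiB")
print(f"sparse: {sparse_bytes / 2**20:.0f} MiB")
```

Under these assumptions both variants fit into the reserved heap for the current file, but the dense size grows with the square of the dimension, which is why the sparse representation matters for the bigger datasets.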

The process file looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
 <context>
   <input>
     <location/>
   </input>
   <output>
     <location/>
     <location/>
   </output>
   <macros/>
 </context>
 <operator activated="true" class="process" expanded="true" name="Process">
   <process expanded="true" height="566" width="685">
     <operator activated="true" class="read_sparse" expanded="true" height="60" name="Read Sparse" width="90" x="112" y="255">
       <parameter key="format" value="no_label"/>
       <parameter key="attribute_description_file" value="vectors.aml"/>
       <parameter key="datamanagement" value="int_sparse_array"/>
       <list key="prefix_map"/>
     </operator>
     <operator activated="true" class="data_to_similarity" expanded="true" height="76" name="Data to Similarity" width="90" x="313" y="255">
       <parameter key="measure_types" value="NumericalMeasures"/>
       <parameter key="numerical_measure" value="CosineSimilarity"/>
     </operator>
     <connect from_op="Read Sparse" from_port="output" to_op="Data to Similarity" to_port="example set"/>
     <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
My vectors.aml file:

<?xml version="1.0" encoding="windows-1252" standalone="no"?>
<attributeset default_source="vectors.dat" encoding="windows-1252">
<id name="id" valuetype="integer"/>
<attribute name="dim" sourcecol="1" sourcecol_end="9945" valuetype="integer"/>
</attributeset>
And a few sample lines from vectors.dat:

id:1 2:7 3:1 5:2 7:61 8:1 10:1 11:44 12:2 13:1 14:2 16:1 ...
id:2 1:7 3:1 4:27 5:1695 6:268 7:12457 8:961 9:46 10:35 ...
...
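To make the line format explicit, here is a minimal sketch of how one such line could be parsed outside of RapidMiner. The function name `parse_sparse_line` is hypothetical, and the format is inferred only from the two sample lines above (an `id:<n>` token followed by `<index>:<value>` pairs with integer values); it is not how the read_sparse operator is implemented:

```python
def parse_sparse_line(line):
    """Parse one 'id:<n> <index>:<value> ...' line into (id, {index: value}).

    Hypothetical helper; assumes whitespace-separated tokens and integer values.
    """
    vector_id = None
    values = {}
    for token in line.split():
        key, _, raw = token.partition(":")
        if key == "id":
            vector_id = int(raw)
        else:
            values[int(key)] = int(raw)
    return vector_id, values


# Uses the first tokens of the first sample line (without the elided rest).
vector_id, values = parse_sparse_line("id:1 2:7 3:1 5:2 7:61")
print(vector_id, values)
```

Only indices that occur in the line end up in the dictionary; every other component is implicitly zero.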
Thanks in advance,
Tobias

Answers

  • land
    land New Altair Community Member
    Hi Tobias,
    thank you for this detailed report. I tried to reproduce the problem, but I didn't succeed. I built this process to generate sparse data, but loading was always very fast:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="341" width="815">
          <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="22" y="39">
            <parameter key="number_examples" value="1000"/>
            <parameter key="number_of_attributes" value="50"/>
          </operator>
          <operator activated="true" class="discretize_by_bins" expanded="true" height="94" name="Discretize" width="90" x="179" y="30">
            <parameter key="create_view" value="true"/>
            <parameter key="number_of_bins" value="50"/>
          </operator>
          <operator activated="true" class="nominal_to_binominal" expanded="true" height="94" name="Nominal to Binominal" width="90" x="313" y="30">
            <parameter key="create_view" value="true"/>
          </operator>
          <operator activated="true" class="nominal_to_numerical" expanded="true" height="94" name="Nominal to Numerical" width="90" x="447" y="30">
            <parameter key="create_view" value="true"/>
          </operator>
          <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
            <parameter key="attribute_filter_type" value="value_type"/>
            <parameter key="value_type" value="numeric"/>
          </operator>
          <operator activated="true" class="write_aml" expanded="true" height="60" name="Write AML" width="90" x="715" y="120">
            <parameter key="example_set_file" value="C:\sparse.dat"/>
            <parameter key="attribute_description_file" value="c:\sparse.aml"/>
            <parameter key="format" value="sparse_xy"/>
          </operator>
          <operator activated="true" class="read_sparse" expanded="true" height="60" name="Read Sparse" width="90" x="514" y="210">
            <parameter key="attribute_description_file" value="c:\sparse.aml"/>
            <parameter key="datamanagement" value="int_sparse_array"/>
            <list key="prefix_map"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Discretize" to_port="example set input"/>
          <connect from_op="Discretize" from_port="example set output" to_op="Nominal to Binominal" to_port="example set input"/>
          <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
          <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Write AML" to_port="input"/>
          <connect from_op="Read Sparse" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="162"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Any comments on that?

    Greetings,
      Sebastian
  • hennymcc
    hennymcc New Altair Community Member
    Hi Sebastian,

    your example works, but it generates very little data: 1000x50 values, if I interpret this correctly. The file size of the generated data is 370KB. The file size of my 9500 vectors of dimension 9500 with 85% sparseness (14,000,000 of 9500*9500 values are non-zero) is about 120MB.
    I modified your example a little to be closer to my input file.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="596" width="547">
          <operator activated="true" class="read_sparse" expanded="true" height="60" name="Read Sparse" width="90" x="447" y="300">
            <parameter key="format" value="no_label"/>
            <parameter key="attribute_description_file" value="d:\sparse.aml"/>
            <parameter key="datamanagement" value="int_sparse_array"/>
            <list key="prefix_map"/>
          </operator>
          <operator activated="true" class="generate_massive_data" expanded="true" height="60" name="Generate Massive Data" width="90" x="45" y="165">
            <parameter key="number_examples" value="9500"/>
            <parameter key="number_attributes" value="9500"/>
            <parameter key="sparse_fraction" value="0.85"/>
          </operator>
          <operator activated="true" class="real_to_integer" expanded="true" height="76" name="Real to Integer" width="90" x="179" y="165"/>
          <operator activated="true" class="generate_id" expanded="true" height="76" name="Generate ID" width="90" x="313" y="165"/>
          <operator activated="true" class="write_aml" expanded="true" height="60" name="Write AML" width="90" x="447" y="165">
            <parameter key="example_set_file" value="d:\sparse.dat"/>
            <parameter key="attribute_description_file" value="d:\sparse.aml"/>
            <parameter key="format" value="sparse_no_label"/>
            <parameter key="overwrite_mode" value="overwrite"/>
          </operator>
          <connect from_op="Read Sparse" from_port="output" to_port="result 1"/>
          <connect from_op="Generate Massive Data" from_port="output" to_op="Real to Integer" to_port="example set input"/>
          <connect from_op="Real to Integer" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Write AML" to_port="input"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    The process is not as slow as with my original data, but it is still slow (it will "only" take one hour to read all data).
    Note that an ID is generated for each vector (= data row). The ID attribute seems to cause the long loading time. If it is removed, loading is much faster.
    As some vectors hold zero for all attributes and hence are not written to the data file, I need those IDs. Maybe empty lines can be used instead, I don't know. Anyway, adding IDs should not slow down loading that much.

    Bye
    Tobias
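
To illustrate why the IDs matter when all-zero vectors are left out of the data file, here is a small sketch. The helper `restore_missing_vectors` is hypothetical, and it assumes that IDs are consecutive and start at 1, as the sample lines suggest:

```python
def restore_missing_vectors(rows, total_vectors):
    """Return {id: {index: value}} for ids 1..total_vectors.

    `rows` holds only the vectors present in the file; ids that are
    absent are assumed to be all-zero vectors and get an empty dict.
    Hypothetical helper; assumes consecutive ids starting at 1.
    """
    return {vid: rows.get(vid, {}) for vid in range(1, total_vectors + 1)}


rows = {1: {2: 7, 3: 1}, 3: {1: 4}}  # id 2 was not written to the file
restored = restore_missing_vectors(rows, total_vectors=3)
print(restored)
```

Without the ID column, the row for id 3 could not be told apart from the row for id 2, which is the reason for keeping the IDs in the file.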