EXPORT Sparse Data

anagi · April 2011

Hello....

I am rather new to RapidMiner, so my apologies if this question is too basic.

I am trying to do some text mining on a relatively large dataset (>100 MB) with RapidMiner, and I would like to export the resulting TF-IDF values (after applying a tokenizer, a stemmer, and stop-word removal). The problem is that when I use the "CSV export" or "ARFF export" operators, the file I get is very large (>5 GB), even though the data is very sparse.

I am not sure whether sparse data can be written to CSV, but WEKA can write sparse data in the ARFF file format, and RapidMiner can read sparse data.

My question is: is it possible to instruct RapidMiner to make use of the sparsity of the data when exporting it to a file?

Cheers

IngoRM · April 2011

Hi,

of course this is possible (this is my default answer for all "is X possible"-questions ;D )

The operator "Write Special Format" is your friend. Try the special format "$s[;][:]" for example if you want to separate the columns by ";" and the index of the attributes by ":". The "$s" means "sparse format". You can find more information in the help text of the operator.
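To illustrate the general idea behind such a sparse format, here is a minimal Python sketch (outside RapidMiner) that encodes one dense row as index:value pairs, using ";" between entries and ":" between index and value. The function name `to_sparse_line` and the zero-based indexing are assumptions made for this illustration; the exact layout and index base that "Write Special Format" emits should be checked against the operator's help text.

```python
def to_sparse_line(row, col_sep=";", idx_sep=":"):
    """Encode one dense row as 'index:value' pairs, skipping zeros.

    Hypothetical helper: indices here are zero-based, which may differ
    from what RapidMiner actually writes.
    """
    pairs = [f"{i}{idx_sep}{v}" for i, v in enumerate(row) if v != 0]
    return col_sep.join(pairs)


dense_row = [0, 0, 0.5, 0, 0, 0, 1.25, 0]
print(to_sparse_line(dense_row))  # prints 2:0.5;6:1.25
```

Only the two non-zero entries are written, which is why a sparse export of a TF-IDF matrix can be much smaller than the dense CSV.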

Here is a simple example process:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
    <process expanded="true" height="145" width="279">
      <operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="write_special" compatibility="5.1.001" expanded="true" height="60" name="Write Special Format" width="90" x="179" y="30">
        <parameter key="example_set_file" value="C:\Users\Ingo\Desktop\sparse_result.txt.dat"/>
        <parameter key="special_format" value="$s[;][:]"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Write Special Format" to_port="input"/>
      <connect from_op="Write Special Format" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Have fun!
Ingo

anagi · April 2011

Thank you very much for the quick and helpful reply. If I may be greedy, I would like to ask another related question.

The solution you provided writes the data without the attribute names. (There is an option $v[name], but I am not sure how to use it.)

What should I replace "name" with? If it is the name of an attribute (a column of the TF-IDF matrix), how do I fill in this field without knowing in advance what the attribute names (the terms in the dictionary) are and how many of them there are?

I want to produce a sparse ARFF file that contains the attribute names (similar to the one produced by WEKA). I would have thought that I could connect the output of the ARFF export operator to the input of the "Write Special Format" operator, or the other way around (mimicking a Unix pipe), but that does not produce the required output format.

Any advice for a novice user would be much appreciated and would help me get going with RM.

Cheers
