EXPORT Sparse Data
anagi
New Altair Community Member
Hello,
I am rather new to RapidMiner, so my apologies if this question is too basic.
I am trying to do some text mining on a relatively large dataset (>100 MB) with RapidMiner, and I would like to export the resulting TF-IDF matrix (after applying a tokenizer, a stemmer, and stop-word removal). The problem is that when I use the "CSV export" or "ARFF export" operators, the file I get is very large (>5 GB), even though the data is very sparse.
I am not sure whether sparse data can be written to CSV, but WEKA writes sparse data in the ARFF file format, and RapidMiner can read sparse data.
My question is: is it possible to instruct RapidMiner to exploit the sparsity of the data when exporting it to a file?
Cheers
Answers
Hi,
of course this is possible (this is my default answer for all "is X possible" questions ;D ).
The operator "Write Special Format" is your friend. For example, try the special format "$s[;][:]" if you want to separate the columns with ";" and each attribute index from its value with ":". The "$s" means "sparse format". You can find more information in the help text of the operator.
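To make the idea of a sparse line concrete, here is a minimal Python sketch of how one dense row could be reduced to index/value pairs using ";" and ":" as separators. It only illustrates the principle: the function name `to_sparse_line`, the 1-based indexing, and the choice of zero as the skipped default value are assumptions for this example, and the exact text that "Write Special Format" emits may differ.

```python
def to_sparse_line(row, col_sep=";", index_sep=":", first_index=1):
    """Render only the non-zero entries of a dense row as index/value pairs.

    first_index=1 is an assumption made for this illustration; check the
    operator's help text for the indexing RapidMiner actually uses.
    """
    pairs = [
        f"{i}{index_sep}{value}"
        for i, value in enumerate(row, start=first_index)
        if value != 0  # zero entries are omitted, which is where the space saving comes from
    ]
    return col_sep.join(pairs)


# A TF-IDF row with five terms, of which only two occur in the document.
dense_row = [0.0, 0.42, 0.0, 0.0, 0.13]
print(to_sparse_line(dense_row))  # prints 2:0.42;5:0.13
```

With thousands of terms and only a handful present per document, each line stays short, which is why a sparse export can be far smaller than a dense CSV.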
Here is a simple example process:
Have fun!
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
<process expanded="true" height="145" width="279">
<operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="write_special" compatibility="5.1.001" expanded="true" height="60" name="Write Special Format" width="90" x="179" y="30">
<parameter key="example_set_file" value="C:\Users\Ingo\Desktop\sparse_result.txt.dat"/>
<parameter key="special_format" value="$s[;][:]"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Write Special Format" to_port="input"/>
<connect from_op="Write Special Format" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Ingo
Thank you very much for the quick and helpful reply. If I may be greedy and ask another related question:
The solution you provided writes the data without the attribute names (there is an option $v[name], but I am not sure how to use it).
What should I replace "name" with? And if it is the name of an attribute (a column of the TF-IDF matrix), how do I fill in this field without knowing a priori what the attribute names (the terms in the dictionary) are and how many of them there are?
I want to produce a sparse ARFF file that contains the attribute names (similar to the one produced by WEKA). I would have thought that I could connect the output of an ARFF file operator to the input of the Export Special operator, or the other way around (mimicking a Unix pipe), but that does not produce the required output format.
Any advice for a novice user will be much appreciated and very helpful in getting me going with RM.
Cheers
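For reference, the target layout described above (a sparse ARFF file with an attribute header, as WEKA writes it) can be sketched in a few lines of Python. This is not a RapidMiner operator and not an answer from the thread; it is only an illustration, run outside RapidMiner, of what the file looks like and of how the attribute names can be collected from the data itself rather than known in advance. The function name `write_sparse_arff`, the relation name, and the input shape (one dict of term to weight per document) are assumptions for this example.

```python
import io


def write_sparse_arff(out, relation, docs):
    """Write documents (dicts mapping term -> weight) as a sparse ARFF file.

    The vocabulary is gathered from the documents first, so the attribute
    names do not have to be known beforehand.
    """
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}  # sparse ARFF indices start at 0

    out.write(f"@relation {relation}\n\n")
    for term in vocab:
        # Quote every name so terms with spaces or special characters stay valid.
        escaped = term.replace("\\", "\\\\").replace("'", "\\'")
        out.write(f"@attribute '{escaped}' numeric\n")

    out.write("\n@data\n")
    for doc in docs:
        # Only non-zero weights are listed, ordered by attribute index.
        entries = sorted((index[t], w) for t, w in doc.items() if w != 0)
        out.write("{" + ", ".join(f"{i} {w}" for i, w in entries) + "}\n")
    return vocab


# Two toy documents with hypothetical TF-IDF weights.
buffer = io.StringIO()
write_sparse_arff(buffer, "tfidf", [{"mine": 0.5, "text": 0.2}, {"data": 0.7}])
print(buffer.getvalue())
```

Each data row lists only the attributes that are present, as `{index value, ...}` pairs, while the header still carries every term name once.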