Hello,
I am attempting to do some basic tokenization of text files, which I will then attempt to cluster.
Right now, I am testing with only 200 small text files. RM processes for a while and then gives me an out-of-memory error. I have given 1 GB of memory to RM.
I would eventually like to use RM to cluster batches of 1,000 or even 10,000 files, but I am concerned that I cannot even do the basic tokenization of only 200.
Please let me know if you have any ideas or suggestions.
Thanks!!
---------------------
Below is the XML of my process:
<process version="4.2">
<operator name="Root" class="Process" expanded="yes">
<operator name="TextInput" class="TextInput" expanded="yes">
<parameter key="create_text_visualizer" value="true"/>
<parameter key="default_content_language" value="english"/>
<list key="namespaces">
</list>
<parameter key="on_the_fly_pruning" value="0"/>
<parameter key="prune_below" value="10%"/>
<list key="texts">
<parameter key="News_Articles" value="/Users/noah/Desktop/test_files"/>
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
<parameter key="min_chars" value="3"/>
</operator>
<operator name="TermNGramGenerator" class="TermNGramGenerator">
</operator>
</operator>
</operator>
</process>