"Java Heap space ERROR"

noah977
noah977 New Altair Community Member
edited November 5 in Community Q&A
Hello,

I am attempting to do some basic tokenization of text files.  I will then attempt to cluster them.

Right now, I am testing with only 200 small text files.  RM processes for a while and then gives me an out-of-memory error.  I have given RM 1 GB of memory.

I would eventually like to use RM to cluster batches of 1,000 or even 10,000 files, but am concerned that I cannot even do the basic tokenization of only 200.

Please let me know if you have any ideas or suggestions.

Thanks!!

---------------------

Below is the XML of my process

<process version="4.2">

  <operator name="Root" class="Process" expanded="yes">
      <operator name="TextInput" class="TextInput" expanded="yes">
          <parameter key="create_text_visualizer" value="true"/>
          <parameter key="default_content_language" value="english"/>
          <list key="namespaces">
          </list>
          <parameter key="on_the_fly_pruning" value="0"/>
          <parameter key="prune_below" value="10%"/>
          <list key="texts">
            <parameter key="News_Articles" value="/Users/noah/Desktop/test_files"/>
          </list>
          <operator name="StringTokenizer" class="StringTokenizer">
          </operator>
          <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
          </operator>
          <operator name="TokenLengthFilter" class="TokenLengthFilter">
              <parameter key="min_chars" value="3"/>
          </operator>
          <operator name="TermNGramGenerator" class="TermNGramGenerator">
          </operator>
      </operator>
  </operator>

</process>

Answers

  • land
    land New Altair Community Member
    Hi,
    The process runs fine on my machine with the sample newsgroup texts from the text miner plugin.
    Without the data I can't really say where the error comes from, so I have to ask a few questions.
    Could you describe the memory monitor's behavior before the error, or even post a picture?

    Greetings,
      Sebastian
  • noah977
    noah977 New Altair Community Member
    Sebastian,

    Thanks for the reply.

    I have been able to easily run the sample without any problems.  The sample is only a few files.

    My files are all plain text with no HTML or XML.  They vary in size from 4 KB to 400 KB.  The total size of all 50 test files is 15.8 MB.

    I assigned 1025m to RM before starting.  The memory monitor shows the memory growing and shrinking over time, but mostly growing. 

    I noticed that some of the steps are running "492" times.  This seems odd since I only have 50 files.  Is that a clue??

    I am getting a few strange warnings in the log:
    P Nov 25, 2008 3:07:43 PM: Process:
      Root[1] (Process)
      +- TextInput[1] (TextInput)
          +- StringTokenizer[492] (StringTokenizer)
          +- EnglishStopwordFilter[492] (EnglishStopwordFilter)
          +- TokenLengthFilter[492] (TokenLengthFilter)
          +- TermNGramGenerator[492] (TermNGramGenerator)
    P Nov 25, 2008 3:07:43 PM: [Warning] TextInput: Warning: Encoding  unknown. Using default.
    Last message repeated 906 times.
    P Nov 25, 2008 3:08:15 PM: [Warning] TextInput: The original example example set already contains an attribute named "label". This is likely to cause trouble. Please rename the attribute in the original example set.
    P Nov 25, 2008 3:08:15 PM: [Warning] TextInput: There is a term that equals the class attribute, renaming it
    P Nov 25, 2008 3:09:04 PM: [Warni
  • IngoRM
    IngoRM New Altair Community Member
    Hi,

    > I noticed that some of the steps are running "492" times.  This seems odd since I only have 50 files.  Is that a clue??
    I think so. The log messages also indicate that more than 50 files are being processed:

    P Nov 25, 2008 3:07:43 PM: [Warning] TextInput: Warning: Encoding  unknown. Using default.
    Last message repeated 906 times.
    So there might be some issue with your directory setup, hidden files, etc.
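
One way to check this is to count what is actually in the directory, including hidden entries. The following is a minimal sketch, not part of RM: `count_files` is a hypothetical helper, and it assumes the Unix/macOS convention that hidden files and directories start with a dot. Whether TextInput actually descends into subdirectories or reads hidden files is not confirmed here.

```python
import os


def count_files(root):
    """Return (visible, hidden) file counts under root, including subdirectories.

    Assumption: a file is "hidden" if its own name, or any directory
    between root and the file, starts with a dot.
    """
    visible = hidden = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # rel is "." for root itself, so skip that component.
        in_hidden_dir = any(
            part.startswith(".") for part in rel.split(os.sep) if part != "."
        )
        for name in filenames:
            if in_hidden_dir or name.startswith("."):
                hidden += 1
            else:
                visible += 1
    return visible, hidden
```

Calling `count_files("/Users/noah/Desktop/test_files")` and comparing the two numbers with the expected file count would show whether extra files are being picked up.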

    By the way, we frequently work on text classification of thousands of texts without any problem; for short texts it's even hundreds of thousands of texts. Of course, the parameter settings have a great influence on memory usage. For example, using n-grams or not pruning enough will blow up the number of dimensions a lot for large texts with many different words.
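
The effect of n-grams and pruning on the number of dimensions can be illustrated with a small sketch. This is plain Python, not RM's implementation: `vocabulary` and its `min_doc_fraction` parameter are hypothetical names, and the pruning rule (keep terms that occur in at least a given fraction of documents) is only an assumed stand-in for settings such as `prune_below`.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return all word n-grams of length n as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(docs, max_n=1, min_doc_fraction=0.0):
    """Terms (1..max_n grams) occurring in at least min_doc_fraction of docs."""
    doc_freq = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        terms = set()
        for n in range(1, max_n + 1):
            terms.update(ngrams(tokens, n))
        doc_freq.update(terms)  # each term counted once per document
    threshold = min_doc_fraction * len(docs)
    return {term for term, df in doc_freq.items() if df >= threshold}


docs = [
    "the market fell sharply today",
    "the market rose sharply today",
    "heavy rain fell on the city today",
]
print("unigrams only:      ", len(vocabulary(docs, max_n=1)))
print("unigrams + bigrams: ", len(vocabulary(docs, max_n=2)))
print("same, pruned at 50%:", len(vocabulary(docs, max_n=2, min_doc_fraction=0.5)))
```

Even on three toy sentences, adding bigrams roughly doubles the number of terms, while pruning rare terms brings it back down. On long, varied texts the gap is far larger.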

    Cheers,
    Ingo
  • noah977
    noah977 New Altair Community Member
    Hi,

    I think you're correct about two things:

    1) There was a hidden directory.  I have corrected the problem and now there are REALLY 50 files.

    2) I was attempting to create two-word tokens.  From your answer, I think that I may be creating a large number of features this way.

    I would love to get some help on clustering documents.  If you are ever available to do any consulting, please let me know.

    Thanks!!!
  • IngoRM
    IngoRM New Altair Community Member
    Hi,

    ad 1) good to hear.

    ad 2) Yes. Let's say your texts have a length of 1,000 words on average and are quite different. Then you will end up with up to 1,000,000 attributes. Each attribute carries about 1 KB of metadata, summing to about 1 GB, plus the data size, plus... It is usually worse to have a huge number of attributes than a huge number of examples.

    Hence, using word n-grams is only applicable to short texts with similar words. Something similar holds for character n-grams. But in my experience, the latter only help for short texts anyway.
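
The arithmetic behind that estimate can be written out as a short sketch. The figures (1,000,000 attributes, about 1 KB of metadata each) are the rough numbers from the post above, not measurements, and `metadata_bytes` is a hypothetical helper name.

```python
def metadata_bytes(num_attributes, bytes_per_attribute=1024):
    """Rough metadata footprint, assuming ~1 KB of metadata per attribute."""
    return num_attributes * bytes_per_attribute


attributes = 1_000_000           # upper bound quoted in the post
heap_bytes = 1024 * 1024 ** 2    # a 1 GB heap, i.e. 1024 MB

total = metadata_bytes(attributes)
print(f"metadata alone: {total / 1024 ** 3:.2f} GiB")
print(f"share of heap:  {total / heap_bytes:.0%}")
```

Under these assumptions the attribute metadata alone takes up nearly the entire heap before any actual data is stored.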

    > If you are ever available to do any consulting, please let me know.
    I actually do consulting (never noticed the company "Rapid-I" behind RapidMiner?)  :D. Please check out our web site at http://rapid-i.com or contact us for an offer.

    Cheers,
    Ingo
  • noah977
    noah977 New Altair Community Member
    Ingo,

    1) That makes perfect sense.  Don't know why I didn't see this before.  Without the n-gram step, the process finished MUCH faster.

    2) Now I need to figure out the best way to cluster the documents.  I am trying to find some function that will decide on an ideal number of clusters based on the "similarity" of documents.  (If I were instructing a person, I would tell them, "Group these documents into piles that make sense.  Put documents with similar topics or ideas together.")  Do you know of a function that can do some intelligent grouping based on similarity and find the common "themes"?

    3) I did see that Rapid-I offers some courses, but they are far from me, so I am unable to attend.  I might just want to hire an hour or two of phone time with someone.  Is that possible?

    Thanks!!!

    -N
  • IngoRM
    IngoRM New Altair Community Member
    Hello again,

    > Do you know of a function that can do some intelligent grouping based on similarity - find the common "themes"
    RM provides a lot of clustering methods, and some of them allow you to define the similarity measure. For texts, cosine similarity is usually a good choice. But then you still have to tweak the number of clusters. There are some approaches for this (search, e.g., for the elbow criterion), and you can get the basic ideas from the samples in the clustering directory delivered together with RM.
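
For reference, cosine similarity between two texts can be sketched in a few lines. This is a standalone illustration on raw term counts, not the RM operator: `term_vector` and `cosine_similarity` are hypothetical names, and real setups typically use weighted vectors (e.g. TF-IDF) rather than plain counts.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts from naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = term_vector("stocks fell as the market closed")
doc2 = term_vector("the market closed higher as stocks rose")
doc3 = term_vector("heavy rain flooded several streets")

print(cosine_similarity(doc1, doc2))  # shares several terms
print(cosine_similarity(doc1, doc3))  # shares no terms
```

Because the measure depends only on the angle between the vectors, a long and a short document on the same topic can still score as similar, which is one reason it suits texts of very different lengths.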

    > I might just want to hire an hour or two of phone time with someone.  Is that possible.
    Sure, we also offer phone consulting on an hourly basis. Please contact us via the contact form on our web site.

    Cheers,
    Ingo