"Regarding text mining"

ratheesan
ratheesan New Altair Community Member
edited November 5 in Community Q&A
Hi,

Which text mining operator can we use to extract combinations or patterns of words in RM?
I have used StringTokenizer, EnglishStopwordFilter and TokenLengthFilter, and computed TF-IDF, term frequency, etc.
Can anybody suggest a specific algorithm for solving this problem?
Thanks
Ratheesan

Answers

  • land
    land New Altair Community Member
    Hi,
    you could use BinaryOccurrences instead of TFIDF and then convert the numerical 0's and 1's to binominal values in order to apply FP-Growth. You will get FrequentItemSets containing the words that occur together in documents. With the support threshold you can control how frequently they have to occur together.

    Greetings,
      Sebastian
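
To illustrate the idea outside RapidMiner, here is a minimal Python sketch of binary occurrences plus a support threshold. It is a brute-force count, not FP-Growth itself, and the function name `frequent_word_sets`, its parameters and the sample documents are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_word_sets(documents, min_support=0.5, max_size=2):
    """Return {frozenset(words): support} for word sets that occur
    together in at least `min_support` of the documents."""
    # Binary occurrence: each document becomes the set of words it contains,
    # so repeated words inside one document count only once.
    transactions = [set(doc.lower().split()) for doc in documents]

    counts = Counter()
    for words in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                counts[frozenset(combo)] += 1

    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}


# Hypothetical sample documents.
docs = [
    "claim filed for water damage",
    "water damage claim rejected",
    "claim approved after inspection",
]
result = frequent_word_sets(docs, min_support=2 / 3)
for items, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(items), round(support, 2))
```

Enumerating every combination like this only works for tiny inputs; FP-Growth exists precisely to avoid that enumeration on real data.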
  • ratheesan
    ratheesan New Altair Community Member
    Thanks, Sebastian, for your valued help and advice. I processed the text as you suggested, but I am getting the error message "Process failed. StackOverflowError caught null". I am attaching the XML here.

    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
              <parameter key="b" value="C:\Documents and Settings\ADMIN\Desktop\b"/>
            </list>
            <parameter key="vector_creation" value="BinaryOccurrences"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
            </operator>
        </operator>
        <operator name="Numerical2Binominal" class="Numerical2Binominal">
            <parameter key="min" value="2.0"/>
            <parameter key="max" value="30000.0"/>
        </operator>
        <operator name="FPGrowth" class="FPGrowth">
            <parameter key="keep_example_set" value="true"/>
            <parameter key="min_number_of_itemsets" value="5"/>
        </operator>
    </operator>

    How can I overcome this problem?

    Thanks
    Ratheesan
  • land
    land New Altair Community Member
    Hi,
    if you put a breakpoint after the Numerical2Binominal operator, does the process reach it?
    If yes, I guess the problem is the very memory-intensive FP-Growth operator. Its memory consumption depends heavily on the minimum support, so you might raise that threshold to get the process to finish. Of course you will then get fewer rules, because only rules with a higher support will be included at all.
    Please also take a look at the memory monitor to check that you have assigned RapidMiner enough main memory. It usually uses up to 80% of the RAM.

    Greetings,
      Sebastian
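
A small Python sketch of why the support level matters so much: it counts, by brute force rather than with FP-Growth, how many item sets stay frequent as the threshold rises. The function name `count_frequent_sets` and the toy transactions are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def count_frequent_sets(transactions, min_support):
    """Count the item sets contained in at least `min_support`
    (a fraction between 0 and 1) of the transactions."""
    counts = Counter()
    for items in transactions:
        # Every non-empty subset of a transaction is a candidate item set.
        for size in range(1, len(items) + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] += 1
    needed = min_support * len(transactions)
    return sum(1 for c in counts.values() if c >= needed)


# Hypothetical toy data: four "documents" with up to four distinct words.
transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"a", "b"},
    {"a"},
]
sizes = {s: count_frequent_sets(transactions, s) for s in (0.25, 0.5, 0.75, 1.0)}
print(sizes)
```

With only four distinct items the lowest threshold already keeps all 15 non-empty subsets. Since the number of candidate subsets doubles with every additional item, a vocabulary of thousands of words makes a low support threshold very expensive.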
  • ratheesan
    ratheesan New Altair Community Member
    Hi,
    I applied a decision tree to text data, but I am not getting a proper result. I am attaching the process here. Can you suggest how to proceed with it? If my approach is not correct, could you please suggest an alternative?

    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
              <parameter key="mydata" value="C:\Documents and Settings\ADMIN\Desktop\summary"/>
            </list>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
            </operator>
        </operator>
        <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
            <parameter key="name" value="claimant"/>
            <parameter key="target_role" value="label"/>
        </operator>
        <operator name="DecisionTree" class="DecisionTree">
        </operator>
    </operator>

    Thanks
    Ratheesan
  • sudheendra
    sudheendra New Altair Community Member
    Hi Sebastian,

    I am also getting the same memory problem. I am using Windows with 3 GB of RAM. Is that sufficient to work with? Please advise.

    Thanks,
    Sudheendra
  • land
    land New Altair Community Member
    Hi,
    Text mining usually involves a great number of attributes. A decision tree can become very large if the data is difficult to split. You would probably get much better classification performance with a linear SVM. If your goal is an understandable model, you will have to stick with the tree, but you should limit its maximal depth to avoid the out-of-memory problem. A deeper tree wouldn't help the user anyway, because a full binary tree of depth 10 already has 2047 nodes and loses a lot of its understandability :)

    Greetings,
      Sebastian
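
The 2047 figure can be checked with a short Python sketch. It assumes a full binary tree with the root counted as depth 0; the helper names are made up for illustration.

```python
def full_binary_tree_nodes(depth):
    """Total nodes in a full binary tree whose root is at depth 0."""
    # Level k holds 2**k nodes, so levels 0..depth sum to 2**(depth + 1) - 1.
    return 2 ** (depth + 1) - 1


def full_binary_tree_leaves(depth):
    """Leaves of the same tree, all sitting on the deepest level."""
    return 2 ** depth


for depth in (3, 5, 10):
    print(depth, full_binary_tree_nodes(depth), full_binary_tree_leaves(depth))
```

A real decision tree is rarely full, so this is an upper bound for binary splits, but it shows how quickly each extra level of depth adds to the size of the tree.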