"text mining and EMCluster"

rdmckinney
rdmckinney New Altair Community Member
edited November 5 in Community Q&A
I’m having some issues with the EMClustering operator. I am using StringTextInput to do some text mining. That operator sends about 1,700 variables to the SVDReduction operator which then reduces the data to 15 variables. Then the EMClustering operator attempts to cluster about 500 examples based on the 15 variables from SVDReduction.

There seems to be a trade-off between the number of clusters I can request in EMClustering and the number of variables output by SVDReduction. If I ask EMClustering for just 5 clusters, then I can have SVDReduction output as many as 25 variables. But if I ask EMClustering for 10 clusters, the max number of variables it will accept from SVDReduction is 15. If SVDReduction provides more, say 20 variables, then I get the error message below. I have tried increasing max_runs and max_optimization_steps, and that helps a little, but it doesn’t greatly increase the number of variables that EMClustering will accept as input.

Currently, I’m asking for 10 clusters from EMClustering, with 10 max_runs and 200 max_optimization_steps. The max number of variables that EMClustering will accept from SVDReduction without the fatal error is 10. Any thoughts on this?

I frequently get this error: "Error: Can't compute the covariance of the matrix. Maybe the matrix is singular. Changing option "correlated_attributes" to false." But when I select OK, the program finishes and I get one cluster with every example in it.
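
A rough sketch of one possible explanation, not confirmed anywhere in this thread: if EMClustering fits a full covariance matrix per cluster (an assumption, suggested only by the "correlated_attributes" option named in the error message), the number of values to estimate grows with the number of clusters times the square of the number of variables, while the data stays at about 500 examples. The helper name `gmm_free_parameters` is hypothetical, and the counts are the textbook ones for a Gaussian mixture, not taken from RapidMiner's code.

```python
def gmm_free_parameters(k: int, d: int, full_covariance: bool = True) -> int:
    """Count free parameters of a k-component Gaussian mixture in d dimensions.

    Assumption: full_covariance=True models correlated attributes (one
    symmetric d x d matrix per cluster); False models one variance per
    attribute per cluster.
    """
    means = k * d
    if full_covariance:
        covariances = k * d * (d + 1) // 2  # symmetric matrix entries
    else:
        covariances = k * d                 # diagonal entries only
    weights = k - 1                         # mixing weights sum to 1
    return means + covariances + weights


n_examples = 500  # approximate figure from the post

# The three (clusters, variables) combinations mentioned in the post.
for k, d in [(5, 25), (10, 15), (10, 20)]:
    full = gmm_free_parameters(k, d, full_covariance=True)
    diag = gmm_free_parameters(k, d, full_covariance=False)
    print(f"k={k} d={d} avg examples/cluster={n_examples / k:.0f} "
          f"full={full} diagonal={diag}")
```

For 10 clusters and 20 variables this gives 2309 free parameters under the full-covariance assumption, against about 50 examples per cluster on average. A cluster that ends up with no more examples than variables cannot yield an invertible covariance matrix, which may be what the "matrix is singular" message refers to. The diagonal count for the same setting is 409.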


G Jul 17, 2009 9:20:56 AM: [Fatal] NullPointerException occured in 1st application of EMClustering (EMClustering)
G Jul 17, 2009 9:20:56 AM: [Fatal] Process failed: operator cannot be executed. Check the log messages...
         Root[1] (Process)
         +- ExampleSource[1] (ExampleSource)
         +- StringTextInput[1] (StringTextInput)
         |  +- StringTokenizer[940] (StringTokenizer)
         |  +- EnglishStopwordFilter[940] (EnglishStopwordFilter)
         |  +- TokenLengthFilter[940] (TokenLengthFilter)
         |  +- PorterStemmer[940] (PorterStemmer)
         +- SVDReduction[1] (SVDReduction)
here ==> +- EMClustering[1] (EMClustering)
         +- ExcelExampleSetWriter[0] (ExcelExampleSetWriter)
<operator name="Root" class="Process" expanded="yes">
   <parameter key="logverbosity" value="error"/>
   <operator name="ExampleSource" class="ExampleSource">
       <parameter key="attributes" value="C:\Documents and Settings\rkenney\My Documents\rm_workspace\Comments09_2.aml"/>
   </operator>
   <operator name="StringTextInput" class="StringTextInput" expanded="yes">
       <parameter key="remove_original_attributes" value="true"/>
       <parameter key="default_content_language" value="english"/>
       <list key="namespaces">
       </list>
       <operator name="StringTokenizer" class="StringTokenizer">
       </operator>
       <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
       </operator>
       <operator name="TokenLengthFilter" class="TokenLengthFilter">
           <parameter key="min_chars" value="3"/>
       </operator>
       <operator name="PorterStemmer" class="PorterStemmer">
       </operator>
   </operator>
   <operator name="SVDReduction" class="SVDReduction">
       <parameter key="keep_example_set" value="true"/>
       <parameter key="return_preprocessing_model" value="true"/>
       <parameter key="dimensions" value="15"/>
   </operator>
   <operator name="EMClustering" class="EMClustering">
       <parameter key="k" value="10"/>
       <parameter key="max_runs" value="30"/>
       <parameter key="max_optimization_steps" value="200"/>
   </operator>
   <operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
       <parameter key="excel_file" value="C:\Projects\Memb Sat Survey\2009\Data\RapidMinerOutput\RMClusters.xls"/>
   </operator>
</operator>

Answers

  • land
    land New Altair Community Member
    Hi,
    would it be possible to send me the reduced data set? Then I would be able to reproduce the error and take a look at the code.

    Greetings,
      Sebastian
  • rdmckinney
    rdmckinney New Altair Community Member
    Sebastian, thanks! How would you like me to send it, and in what format? Also, do you want the output from the StringTextInput operator or the SVDReduction operator?
  • land
    land New Altair Community Member
    Hi,
    please save the example set produced by SVDReduction using the ExampleSetWriter, then either upload it somewhere and share the link with me, or compress it and send it to my email address.

    Greetings,
      Sebastian


  • rdmckinney
    rdmckinney New Altair Community Member
    I'll have to send it by email. What email address do you want me to use?
  • land
    land New Altair Community Member
    I'll send it by pm...

    Greetings,
      Sebastian