"Applying Feature Selection on text input"

jebadiah
edited November 5 in Community Q&A
Hello. I am new to RapidMiner, so please excuse my ignorance.

I am trying to perform K-Means Clustering on a set of text files. I have downloaded and installed the plug-in needed to input text files. Now, I want to apply Feature Selection to it. However, when I try to, it seems that it needs an ExampleSet to be able to perform the Feature Selection function. Is there a way for me to apply Feature Selection on text input?

Here is what my XML looks like right now:

<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
          <parameter key="blogs" value="D:\Text-files"/>
        </list>
        <parameter key="vector_creation" value="TermFrequency"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="StopwordFilterFile" class="StopwordFilterFile">
            <parameter key="file" value="D:\stop.txt"/>
        </operator>
        <operator name="StopwordFilterFile (2)" class="StopwordFilterFile">
            <parameter key="file" value="D:\punctuations.txt"/>
        </operator>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="8"/>
    </operator>
</operator>


When I try to add the following:

<operator name="BackwardElimination" class="FeatureSelection" expanded="yes">
            <parameter key="selection_direction" value="backward"/>
</operator>

The following error occurs:

Error in: TextInput (TextInput) Error in experiment setup: com.rapidminer.operator.MissingIOObjectException: The operator needs some input of type com.rapidminer.example.ExampleSet which is not provided


Can anyone please suggest something to help me do this? Thank you very much. :-*

Answers

  • jebadiah
    Hi again. I was able to produce this XML file:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
              <parameter key="blogs" value="D:\Blogs-final"/>
            </list>
            <parameter key="vector_creation" value="TermFrequency"/>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="StopwordFilterFile" class="StopwordFilterFile">
                <parameter key="file" value="C:\Users\Jhermin\Desktop\dyermin\Thesis\src\Files\stop.txt"/>
            </operator>
            <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
                <parameter key="target_function" value="random"/>
            </operator>
        </operator>
        <operator name="BackwardElimination" class="FeatureSelection" breakpoints="after" expanded="yes">
            <parameter key="selection_direction" value="forward"/>
            <parameter key="show_stop_dialog" value="true"/>
        </operator>
        <operator name="KMeans" class="KMeans">
            <parameter key="k" value="8"/>
        </operator>
    </operator>


    but it returns this error:

    Root[1] (Process)
              +- TextInput[1] (TextInput)
              |  +- StringTokenizer[1] (StringTokenizer)
              |  +- StopwordFilterFile[1] (StopwordFilterFile)
              |  +- ExampleSetGenerator[1] (ExampleSetGenerator)
    here ==> |  +- BackwardElimination[1] (FeatureSelection)
              +- KMeans[0] (KMeans)


    I would really appreciate it if anyone has any idea why this error appears. Thanks a lot.
  • jebadiah
    No one? Please? I really need to do this. Thanks in advance.
  • fischer
    Hi,

    well, the approach you are taking is a bit, umh, ... broken. Feature selection does not work this way. An example of a ForwardSelection is in the samples folder under 05_features/10_ForwardSelection.xml. The important point is: you need to have your learner inside the forward selection; otherwise, it does not know how to optimize. In general, the FS takes an ExampleSet and must contain operators that are able to evaluate such an example set by producing a PerformanceVector.

    As an aside, it might turn out that it is a bad idea to try backward elimination on text data.
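
To illustrate why the learner has to sit inside the feature selection, here is a minimal Python sketch of a greedy forward selection wrapper. It is not RapidMiner's implementation: `forward_selection`, `evaluate` and `toy_performance` are hypothetical names, and `evaluate` merely stands in for whatever inner operators (learner, model application, performance evaluation) return a performance value for a candidate attribute subset.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    n_attributes: int,
    evaluate: Callable[[Sequence[int]], float],
) -> Tuple[List[int], float, int]:
    """Greedy forward selection over attribute indices 0..n_attributes-1.

    `evaluate` plays the role of the inner operators: it receives a candidate
    attribute subset and returns a performance value (higher is better).
    Returns (selected attributes, best performance, number of evaluations).
    """
    selected: List[int] = []
    best_score = float("-inf")
    evaluations = 0
    remaining = set(range(n_attributes))
    while remaining:
        round_best_attr = None
        round_best_score = best_score
        # Try adding each unused attribute to the current selection.
        for attr in sorted(remaining):
            score = evaluate(selected + [attr])
            evaluations += 1
            if score > round_best_score:
                round_best_attr, round_best_score = attr, score
        if round_best_attr is None:
            break  # no single addition improves the performance: stop
        selected.append(round_best_attr)
        remaining.remove(round_best_attr)
        best_score = round_best_score
    return selected, best_score, evaluations


# Toy stand-in for "learner + validation" (an assumption for this example):
# attributes 1 and 3 are useful, every other attribute costs a small penalty.
def toy_performance(subset: Sequence[int]) -> float:
    useful = {1, 3}
    return sum(1.0 if a in useful else -0.1 for a in subset)


selected, score, evaluations = forward_selection(5, toy_performance)
print(selected, score, evaluations)
```

Without `evaluate`, the loop has nothing to compare candidate subsets with, which corresponds to a feature selection operator that contains no inner operators producing a PerformanceVector.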

    Best,
    Simon
  • jebadiah
    Hello, thank you for your reply.

    I am currently trying out this XML:
    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
              <parameter key="blogs" value="D:\Blogs"/>
            </list>
            <parameter key="vector_creation" value="TermFrequency"/>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="StopwordFilterFile" class="StopwordFilterFile">
                <parameter key="file" value="C:\Users\Jhermin\Desktop\dyermin\Thesis\src\Files\stop.txt"/>
            </operator>
        </operator>
        <operator name="FS" class="FeatureSelection" expanded="yes">
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="create_complete_model" value="true"/>
                <parameter key="number_of_validations" value="5"/>
                <parameter key="sampling_type" value="shuffled sampling"/>
                <operator name="NearestNeighbors" class="NearestNeighbors">
                    <parameter key="k" value="5"/>
                </operator>
                <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                    <operator name="Applier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="Performance" class="Performance">
                    </operator>
                </operator>
            </operator>
        </operator>
        <operator name="KMeans" class="KMeans">
            <parameter key="k" value="3"/>
        </operator>
    </operator>
    However, it runs very slowly, and it cannot handle about 300 text files: it fails with a Java heap space error. I have tried changing the rapidminerGUI script, but nothing changes. Do you have any idea how I can increase the maximum heap size?

    Thank you very much. You are very helpful.
  • land
    Hi,
    the topic of adjusting the maximum heap size has been discussed in this forum many times. Please use the search button to find one of those discussions and its solutions.

    Greetings,
      Sebastian
  • keith
    Or check the RM Wiki page on the topic: http://rapid-i.com/wiki/index.php?title=Memory_Issues
  • land
    Good hint. It seems I'm not used to the Wiki yet :)
  • Marcello_Sandi
    Hi,

    There is an interesting problem with this model. I ran it on my optimized workstation, which has 7 GB of memory dedicated to the JVM and customized JVM arguments.

    I used the hardware and graphic examples, and a RuntimeException was caught: java.lang.OutOfMemoryError: GC overhead limit exceeded. Very strange for such small data sets.

    On this workstation I have already run a BOW with 9700 words and 8500 lines.
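
For a sense of scale, the following Python sketch estimates how much memory a BOW of that size would need if it were held as a dense table of 8-byte values. Both the dense layout and the 8 bytes per value are assumptions made for this estimate (RapidMiner may store the data differently), and `dense_matrix_bytes` is a hypothetical helper name.

```python
def dense_matrix_bytes(rows: int, columns: int, bytes_per_value: int = 8) -> int:
    """Size in bytes of a dense rows x columns table of fixed-width values."""
    return rows * columns * bytes_per_value


# 8500 lines (documents) by 9700 words, as quoted in this post.
size = dense_matrix_bytes(8500, 9700)
print(f"{size} bytes, about {size / 1024**2:.0f} MiB")
```

Under these assumptions the table itself is well below 7 GB, which suggests that the memory pressure comes from the feature selection run rather than from the raw data size.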


    Watching the process with the top command on Linux, I noticed several java PIDs while the model was running.

    Marcello Sandi
  • land
    Hi Marcello,
    we don't start any other java process, so probably this is an artifact from somewhere else...

    We are aware that feature selection sometimes has problems on example sets with a very large number of attributes. Since such large numbers mostly occur in text mining, and feature selection is of limited use in text mining, the problem has not been a top priority.
    But with the next major release we will add a more memory-efficient variant.
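
As a rough illustration of why a large number of attributes hurts, the sketch below counts model trainings for a greedy forward selection wrapped around k-fold cross-validation. The function name is hypothetical and the figures are only illustrative: 9700 attributes is the BOW size mentioned earlier in this thread, 5 folds matches number_of_validations in the process posted above, and the actual RapidMiner operator may work differently. Only the per-fold trainings are counted.

```python
def forward_selection_trainings(n_attributes: int, rounds: int, folds: int) -> int:
    """Model trainings needed for `rounds` rounds of greedy forward selection.

    Round r (starting at 0) tries each of the n_attributes - r attributes not
    yet selected, and every try is one `folds`-fold cross-validation.
    """
    if not 0 <= rounds <= n_attributes:
        raise ValueError("rounds must be between 0 and n_attributes")
    candidates = sum(n_attributes - r for r in range(rounds))
    return candidates * folds


first_round = forward_selection_trainings(9700, rounds=1, folds=5)
ten_rounds = forward_selection_trainings(9700, rounds=10, folds=5)
print(first_round, ten_rounds)
```

Under these assumptions, the first round alone already requires tens of thousands of model trainings, so long run times on text data are to be expected regardless of the heap size.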

    Greetings,
      Sebastian