"Problem PCA SVM Large DATA set"

nivet
nivet New Altair Community Member
edited November 5 in Community Q&A
Dear all,
I am using RapidMiner 4.5 (64-bit) on Windows Vista 64-bit with a 64-bit JDK.
The machine has 4 GB RAM and an Intel Core 2 Duo 2.0 GHz CPU.
My dataset has ~30,000 attributes and ~12,000 instances.
----------------------------------------------------------------------------------------------------
I tried increasing the memory available to RapidMiner 4.5 by editing two files:
C:\Program Files (x86)\Rapid-I\RapidMiner\scripts\RapidMinerGUI 
## set the maximum amount of memory Java uses here or in an environment variable
#MAX_JAVA_MEMORY=4000
if [ -z "${MAX_JAVA_MEMORY}" ] ; then
    MAX_JAVA_MEMORY=4000
    echo "No maximum Java memory defined, using 4000 Mb..."
fi

C:\Program Files (x86)\Rapid-I\RapidMiner\scripts\RapidMinerGUI.bat 
rem ##########################################
rem ###  Setting Maximal Amount of Memory  ###
rem ##########################################
if "%MAX_JAVA_MEMORY%"=="" set MAX_JAVA_MEMORY=4000


--------------------------------------------------------------------------------------------------

I have some questions:


1. I want to apply a feature selection operator to data that has been transformed with PCA,
keep the top k highest-scoring attributes, and then learn with an SVM. How can I do this?
My XML:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ArffExampleSource" class="ArffExampleSource">
        <parameter key="data_file" value="D:\thairath2.train.arff"/>
        <parameter key="label_attribute" value="29047"/>
    </operator>
    <operator name="ChiSquaredWeighting" class="ChiSquaredWeighting">
    </operator>
    <operator name="AttributeWeightSelection" class="AttributeWeightSelection">
        <parameter key="weight_relation" value="top k"/>
        <parameter key="k" value="3000"/>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <operator name="NaiveBayes" class="NaiveBayes">
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="ClassificationPerformance" class="ClassificationPerformance">
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="true"/>
                <parameter key="weighted_mean_recall" value="true"/>
                <parameter key="weighted_mean_precision" value="true"/>
                <list key="class_weights">
                </list>
            </operator>
        </operator>
    </operator>
</operator>








2. I want to create a new weighting from my dataset thairath2.arff,
e.g. (log2(x + 2))^2 for every attribute value x in my dataset.
How can I do it?
I would like to write the result to a new file and then learn with an SVM.
Please suggest a step-by-step approach.

3. I get "Out of memory" errors on my dataset and the process stops.


If anyone has ideas or suggestions to solve my problem, please let me know.
Regards,
nivet

Answers

  • land
    land New Altair Community Member
    Hi,
    Regarding your first question:
    Simply replace the NaiveBayes operator with an appropriate SVM operator and the ChiSquaredWeighting operator with a PCA.

    I don't fully understand what you mean by your second question. Do you want to transform every single attribute with this function? If so, I would use a FeatureIterator, which stores each attribute name in a macro and executes its child operators. Put an AttributeGeneration operator inside it and use the macro in the generation formula to select the current attribute.
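
Outside RapidMiner, the per-attribute transformation from the second question can be sketched in plain Python. This is only an illustration, not the FeatureIterator process itself: it assumes the formula is read as (log2(x + 2))^2 applied to each numeric value x, and the names `transform_value`, `transform_rows`, and the sample rows are hypothetical.

```python
import math


def transform_value(x):
    """Return (log2(x + 2))^2 for one numeric value (assumed reading of the formula)."""
    return math.log2(x + 2) ** 2


def transform_rows(rows):
    """Apply transform_value to every attribute value of every instance."""
    return [[transform_value(x) for x in row] for row in rows]


# Hypothetical sample: two instances with three numeric attributes each.
# A real dataset would also carry a label column, which should be left untouched.
sample = [[0, 2, 6], [14, 0, 2]]
transformed = transform_rows(sample)
print(transformed)
```

With this reading, the values 0, 2, 6 and 14 map to 1, 4, 9 and 16, because x + 2 is then 2, 4, 8 and 16.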

    The third question is simple:
    Calculating the PCA requires building a covariance matrix. With 30,000 attributes, the covariance matrix alone needs around 7.2 GB of RAM (30,000 x 30,000 x 8 bytes). Needless to say, two such matrices have to be in memory at the same time...
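
The arithmetic behind this estimate can be checked with a short Python sketch. Multiplied out, 30,000 x 30,000 entries at 8 bytes each come to 7.2 x 10^9 bytes, well above the 4 GB of RAM mentioned in the question. The helper name `dense_matrix_bytes` is hypothetical, and the sketch assumes a dense matrix of 8-byte doubles with no other overhead.

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_entry=8):
    """Memory for a dense matrix, assuming 8-byte doubles and no overhead."""
    return n_rows * n_cols * bytes_per_entry


n_attributes = 30_000  # attribute count from the question

# The covariance matrix is n_attributes x n_attributes.
cov_bytes = dense_matrix_bytes(n_attributes, n_attributes)

# Physical RAM of the machine described in the question (4 GB).
ram_bytes = 4 * 1024**3

print(f"one covariance matrix: {cov_bytes / 1e9:.1f} GB")
print(f"two matrices at once:  {2 * cov_bytes / 1e9:.1f} GB")
print(f"fits in 4 GB of RAM:   {cov_bytes <= ram_bytes}")
```

Raising MAX_JAVA_MEMORY cannot help here, since even a single matrix exceeds the physical memory of the machine.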

    Greetings,
      Sebastian
  • nivet
    nivet New Altair Community Member
    Thank you so much.

    I have another question.

    1. I tried to follow this tutorial ---> http://kmandcomputing.blogspot.com/search/label/datamining.
    but I cannot find "Read-input vector" in RapidMiner 4.5 + Text Plugin 4.5.
    I get this error ---->
    Error in: XValidation (XValidation) The operator needs some input of type com.rapidminer.example.ExampleSet which is not provided. Each operator defines which input is desired for applying this operator (these input objects are shown in operator info screen (F1)). Previous operators must load or produce the desired input objects. You can check the correct experiment setup by validating the experiment (via the icon or the menu item).


    ---------------------------------------------
    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="no">
            <list key="texts">
              <parameter key="neg" value="D:\txt_sentoken\neg"/>
              <parameter key="pos" value="D:\txt_sentoken\pos"/>
            </list>
            <parameter key="default_content_language" value="english"/>
            <parameter key="prune_below" value="50"/>
            <parameter key="prune_above" value="1970"/>
            <parameter key="vector_creation" value="BinaryOccurrences"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="StopwordFilterFile" class="StopwordFilterFile">
                <parameter key="file" value="D:\Text Plugin 4.5\stopword.txt"/>
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
            </operator>
            <operator name="PorterStemmer" class="PorterStemmer">
            </operator>
        </operator>
        <operator name="ExampleSetWriter" class="ExampleSetWriter">
            <parameter key="example_set_file" value="D:\movie.dat"/>
            <parameter key="attribute_description_file" value="D:\movie.aml"/>
        </operator>
        <operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="yes">
            <list key="parameters">
              <parameter key="AttributeWeightSelection.k" value="100,300,500,1000,1500,2000,2500"/>
            </list>
            <operator name="InfoGainRatioWeighting" class="InfoGainRatioWeighting">
            </operator>
            <operator name="AttributeWeightSelection" class="AttributeWeightSelection">
                <parameter key="weight_relation" value="top k"/>
                <parameter key="k" value="2500"/>
            </operator>
        </operator>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <operator name="LibSVMLearner" class="LibSVMLearner">
                <parameter key="kernel_type" value="linear"/>
                <list key="class_weights">
                </list>
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="ClassificationPerformance" class="ClassificationPerformance">
                    <parameter key="main_criterion" value="accuracy"/>
                    <parameter key="accuracy" value="true"/>
                    <parameter key="classification_error" value="true"/>
                    <parameter key="weighted_mean_recall" value="true"/>
                    <parameter key="weighted_mean_precision" value="true"/>
                    <list key="class_weights">
                    </list>
                </operator>
            </operator>
        </operator>
    </operator>
    --------------------------------------------------------------------------------
  • nivet
    nivet New Altair Community Member
    2. I want to edit the values in my dataset (.arff) using the formula from ---->
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.59.6314&rep=rep1&type=pdf

    and then either export a new .csv file or pre-process the data for info gain weighting -----> feature selection ----> SVM ---> accuracy...


    How can I do it?

    Regards,
    nivet
  • land
    land New Altair Community Member
    Hi,
    I would suggest switching to RapidMiner 5.0. It makes process design much easier because it drops the implicit data flow and shows explicitly where the data comes from and where it goes.
    Unfortunately, I didn't understand which values you want to change.

    Greetings,
      Sebastian