"Problem PCA SVM Large DATA set"
nivet
New Altair Community Member
Dear all,
I am using RapidMiner 4.5 (64-bit) on Windows Vista 64-bit with a 64-bit JDK.
The machine has 4 GB RAM and an Intel Core 2 Duo 2.0 GHz CPU.
My dataset has ~30,000 attributes and ~12,000 instances.
----------------------------------------------------------------------------------------------------
I tried increasing the memory available to RapidMiner 4.5 by editing two files:
C:\Program Files (x86)\Rapid-I\RapidMiner\scripts\RapidMinerGUI
## set the maximum amount of memory Java uses here or in an environment variable
#MAX_JAVA_MEMORY=4000
if [ -z "${MAX_JAVA_MEMORY}" ] ; then
  MAX_JAVA_MEMORY=4000
  echo "No maximum Java memory defined, using 4000 Mb..."
fi
C:\Program Files (x86)\Rapid-I\RapidMiner\scripts\RapidMinerGUI.bat
rem ##########################################
rem ### Setting Maximal Amount of Memory ###
rem ##########################################
if "%MAX_JAVA_MEMORY%"=="" set MAX_JAVA_MEMORY=4000
--------------------------------------------------------------------------------------------------
I have some questions.
1. I want to apply feature selection to the data using a PCA transformation, keep the top K highest-scoring attributes, and then learn with an SVM. How can I do that?
My XML:
<operator name="Root" class="Process" expanded="yes">
<operator name="ArffExampleSource" class="ArffExampleSource">
<parameter key="data_file" value="D:\thairath2.train.arff"/>
<parameter key="label_attribute" value="29047"/>
</operator>
<operator name="ChiSquaredWeighting" class="ChiSquaredWeighting">
</operator>
<operator name="AttributeWeightSelection" class="AttributeWeightSelection">
<parameter key="weight_relation" value="top k"/>
<parameter key="k" value="3000"/>
</operator>
<operator name="XValidation" class="XValidation" expanded="yes">
<operator name="NaiveBayes" class="NaiveBayes">
</operator>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="ClassificationPerformance" class="ClassificationPerformance">
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<list key="class_weights">
</list>
</operator>
</operator>
</operator>
</operator>
2. I want to create a new weighting from my dataset thairath2.arff,
e.g. (log2(x + 2))^2 for every attribute value x in the dataset.
How can I do that?
I would then write the result to a new file and learn with an SVM.
Please suggest how to do this step by step.
3. I get "Out of memory" errors on my dataset and the process stops.
If anyone has ideas or suggestions to solve this problem, please let me know.
Regards,
nivet
Answers
Hi,
To your first question:
Simply replace the NaiveBayes operator with an appropriate SVM operator and the ChiSquaredWeighting operator with a PCA operator.
I don't fully understand your second question. Do you want to transform every single attribute with this function? If so, I would use a FeatureIterator, which stores each attribute name in a macro and executes its child operators. Put an AttributeGeneration operator inside it and use the macro in the generation formula to select the current attribute.
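To make the intended per-attribute transformation concrete, here is a minimal sketch in plain Python. It is not RapidMiner code and does not use the FeatureIterator or AttributeGeneration operators; it only illustrates the formula (log2(x + 2))^2 from the question applied to every attribute value. The function names and the toy data are assumptions made for illustration, and the label column is assumed to have been removed beforehand.

```python
import math


def reweight(value):
    """Return (log2(value + 2))^2 for a single attribute value."""
    return math.log2(value + 2) ** 2


def reweight_rows(rows):
    """Apply the formula to every attribute value of every example.

    Assumes each row holds only numeric attributes (no label column).
    """
    return [[reweight(v) for v in row] for row in rows]


# Hypothetical toy data: two examples with three numeric attributes each.
rows = [[0, 2, 6], [14, 0, 2]]
transformed = reweight_rows(rows)
print(transformed)
```

The + 2 inside the logarithm keeps the result defined for an attribute value of 0, which maps to 1.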
The third question is simple:
Calculating the PCA requires building a covariance matrix. With 30,000 attributes, the covariance matrix alone needs around 7.2 GB of RAM (30,000 x 30,000 x 8 bytes). Needless to say, two such matrices have to be in memory at the same time...
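The arithmetic behind that estimate can be checked with a short Python sketch. It assumes a dense square matrix of 8-byte double values and ignores any JVM or object overhead, so real usage would be higher. The second figure uses 3,000 attributes, the k value from the AttributeWeightSelection operator in the posted process, purely as a point of comparison.

```python
def covariance_matrix_bytes(n_attributes, bytes_per_value=8):
    """Size of a dense n x n matrix of fixed-width values, in bytes."""
    return n_attributes * n_attributes * bytes_per_value


full = covariance_matrix_bytes(30_000)    # all attributes
reduced = covariance_matrix_bytes(3_000)  # after keeping the top 3,000

print(full / 1e9, "GB")     # 7.2 GB
print(reduced / 1e6, "MB")  # 72.0 MB
```

Because the size grows with the square of the attribute count, reducing the attributes by a factor of 10 shrinks the matrix by a factor of 100.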
Greetings,
Sebastian
Thank you so much.
I have some more questions.
1. I tried to follow this tutorial: http://kmandcomputing.blogspot.com/search/label/datamining
but I cannot find the "Read-input vector" operator in RapidMiner 4.5 with the Text Plugin 4.5.
I get this error:
Error in: XValidation (XValidation) The operator needs some input of type com.rapidminer.example.ExampleSet which is not provided. Each operator defines which input is desired for applying this operator (these input objects are shown in operator info screen (F1)). Previous operators must load or produce the desired input objects. You can check the correct experiment setup by validating the experiment (via the icon or the menu item).
---------------------------------------------
<operator name="Root" class="Process" expanded="yes">
<operator name="TextInput" class="TextInput" expanded="no">
<list key="texts">
<parameter key="neg" value="D:\txt_sentoken\neg"/>
<parameter key="pos" value="D:\txt_sentoken\pos"/>
</list>
<parameter key="default_content_language" value="english"/>
<parameter key="prune_below" value="50"/>
<parameter key="prune_above" value="1970"/>
<parameter key="vector_creation" value="BinaryOccurrences"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="StopwordFilterFile" class="StopwordFilterFile">
<parameter key="file" value="D:\Text Plugin 4.5\stopword.txt"/>
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
</operator>
<operator name="PorterStemmer" class="PorterStemmer">
</operator>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="example_set_file" value="D:\movie.dat"/>
<parameter key="attribute_description_file" value="D:\movie.aml"/>
</operator>
<operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="yes">
<list key="parameters">
<parameter key="AttributeWeightSelection.k" value="100,300,500,1000,1500,2000,2500"/>
</list>
<operator name="InfoGainRatioWeighting" class="InfoGainRatioWeighting">
</operator>
<operator name="AttributeWeightSelection" class="AttributeWeightSelection">
<parameter key="weight_relation" value="top k"/>
<parameter key="k" value="2500"/>
</operator>
</operator>
<operator name="XValidation" class="XValidation" expanded="yes">
<operator name="LibSVMLearner" class="LibSVMLearner">
<parameter key="kernel_type" value="linear"/>
<list key="class_weights">
</list>
</operator>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="ClassificationPerformance" class="ClassificationPerformance">
<parameter key="main_criterion" value="accuracy"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<list key="class_weights">
</list>
</operator>
</operator>
</operator>
</operator>
--------------------------------------------------------------------------------
2. I want to change the values in my dataset (.arff) using the formula from this paper:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.59.6314&rep=rep1&type=pdf
and then either export a new .csv file or use the result as preprocessing for InfoGainWeighting -----> feature selection ----> SVM ---> accuracy.
How can I do that?
Regards,
nivet
Hi,
I would suggest switching to RapidMiner 5.0. It makes process design much easier because it drops the implicit data flow and shows explicitly where the data comes from and where it goes.
Unfortunately, I did not understand which values you want to change.
Greetings,
Sebastian