Mining high-dimensional data
Hello!
I have a high-dimensional data set: about 2000 records, each with about 2000 attributes.
I admit I am still an amateur at mining high-dimensional data. ;D
What I would like to ask is: what strategies are there for mining high-dimensional data with RM5?
Thank you in advance for a quick reply!
Regards,
Dimas Yogatama
Oh, maybe I didn't make it clear enough, sorry.
haddock wrote:
Hi there Dimas,
I'm not clear as to what you want to know, but you should understand that RM can handle much larger datasets than you are talking about. For example, if I run the following to get a 10k * 10k matrix, it doesn't take too long:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="391" width="915">
<operator activated="true" class="generate_massive_data" expanded="true" height="60" name="Generate Massive Data" width="90" x="135" y="90">
<parameter key="sparse_representation" value="false"/>
</operator>
<connect from_op="Generate Massive Data" from_port="output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Just so you can compare: I'm on 64-bit XP, dual quad-core, with 16 GB. For Windows boxes it is the 64-bit part that matters, as 32-bit boxes can only address roughly 3 to 4 GB (you'll have to Google for the exact number).
So the bottom line is that the main strategy is to have lots of memory, if I remember correctly.
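To put the memory point in numbers, here is a rough back-of-the-envelope sketch in Python. It assumes a dense table where every value is stored as an 8-byte double with no overhead; that storage layout is an assumption made for illustration, not something stated in the thread, and the function name is hypothetical.

```python
# Rough size of a dense table stored as 8-byte doubles.
# Assumption: 8 bytes per cell, no per-row or per-object overhead.
# Real tools add overhead on top, so treat the result as a lower bound.

BYTES_PER_DOUBLE = 8


def dense_size_mb(rows: int, cols: int, bytes_per_cell: int = BYTES_PER_DOUBLE) -> float:
    """Return the approximate table size in megabytes (1 MB = 10**6 bytes)."""
    return rows * cols * bytes_per_cell / 10**6


if __name__ == "__main__":
    print(dense_size_mb(2_000, 2_000))    # the data set from the question
    print(dense_size_mb(10_000, 10_000))  # the 10k * 10k example
```

Under these assumptions the 2000 * 2000 data set needs about 32 MB and the 10k * 10k matrix about 800 MB, so the raw table from the question is small compared to the generated example.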
What I mean by strategy is how to optimize accuracy by selecting only the "good" attributes among all the available ones. If the hardware specs are critical: I only have a laptop (Lenovo Y450-310) with a Core 2 Duo processor @ 2.2 GHz and 2 GB of DDR3 RAM. Is that really a problem?

Finally, I would like to apologize for my English, I am still learning.
Regards,
Dimas Yogatama
Hi Dimas,
Of course the amount of memory does make a difference. If the data doesn't fit into memory, the process either fails or you will need to stream the data from a database, which might slow down your process a lot.
Coming back to the strategy question: RapidMiner offers several methods for selecting attributes. You might use either the Forward Selection or the Backward Elimination operator as a simple start. If they do not suit your needs or take too long, you might pick another operator from the package Data Transformation / Attribute Set Reduction and Transformation / Selection and its subpackages.
Greetings,
Sebastian
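The idea behind forward selection can be sketched in a few lines of Python. This is a generic illustration of the greedy wrapper approach, not RapidMiner's implementation; the function names, the `min_gain` parameter, and the toy scoring function are all hypothetical. In practice `score` would train and validate a model on the given attribute subset.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 1e-9,
) -> Tuple[List[str], float]:
    """Greedy forward selection.

    Start with no attributes, repeatedly add the single attribute that
    improves the score the most, and stop as soon as the best addition
    improves the score by no more than min_gain.
    """
    selected: List[str] = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining:
        # Evaluate each remaining attribute on top of the current selection.
        trials = [(score(selected + [attr]), attr) for attr in remaining]
        trial_score, trial_attr = max(trials, key=lambda t: t[0])
        if trial_score - best_score <= min_gain:
            break  # no attribute helps enough, stop here
        selected.append(trial_attr)
        remaining.remove(trial_attr)
        best_score = trial_score
    return selected, best_score


def toy_score(attrs: List[str]) -> float:
    """Hypothetical stand-in for a validated accuracy estimate."""
    useful = {"a1": 0.30, "a2": 0.20}  # made-up contributions
    # Small penalty per attribute so useless attributes never pay off.
    return 0.5 + sum(useful.get(a, 0.0) for a in attrs) - 0.01 * len(attrs)


if __name__ == "__main__":
    print(forward_selection(["a1", "a2", "a3", "a4"], toy_score))
```

Each round evaluates every remaining attribute once, so with about 2000 attributes the first round alone needs about 2000 model evaluations. That is why wrapper methods like this can take too long on high-dimensional data.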
How about attribute weighting? Based on your experience in mining high-dimensional data, how does attribute selection / attribute set reduction compare with attribute weighting in terms of performance?
Sebastian Land wrote:
Hi Dimas,
Of course the amount of memory does make a difference. If the data doesn't fit into memory, the process either fails or you will need to stream the data from a database, which might slow down your process a lot.
Coming back to the strategy question: RapidMiner offers several methods for selecting attributes. You might use either the Forward Selection or the Backward Elimination operator as a simple start. If they do not suit your needs or take too long, you might pick another operator from the package Data Transformation / Attribute Set Reduction and Transformation / Selection and its subpackages.
Greetings,
Sebastian
Then what actually affects the model training time in general, as far as the data is concerned? Is it the number of records or the number of dimensions?
Hello Dimas,
Well, there is no general answer for this. In some settings the complete removal of attributes works better, in others a rescaling based on weights. The same is true when it comes to "weight by wrapper" vs. "weight by filtering". From my experience, I would say that if you have severe problems with data set size and no other option is possible for you, the calculation of weights followed by a weight-based selection can help without losing too much accuracy.
Cheers,
Ingo
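The "calculate weights, then select by weight" idea can be sketched as follows. This is a minimal Python illustration that uses the absolute Pearson correlation with the label as the filter weight; that choice of weight, the function names, and the toy data are assumptions for the example and are not taken from RapidMiner.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def weight_by_correlation(
    columns: Dict[str, Sequence[float]], label: Sequence[float]
) -> Dict[str, float]:
    """Filter step: one weight per attribute, computed without training a model."""
    return {name: abs(pearson(values, label)) for name, values in columns.items()}


def select_top_k(weights: Dict[str, float], k: int) -> List[str]:
    """Selection step: keep the k attributes with the highest weights."""
    ranked = sorted(weights, key=lambda name: weights[name], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    label = [0.0, 0.0, 1.0, 1.0]
    columns = {  # made-up toy data
        "signal": [0.1, 0.2, 0.9, 1.0],
        "noise": [0.5, 0.4, 0.5, 0.4],
        "constant": [1.0, 1.0, 1.0, 1.0],
    }
    weights = weight_by_correlation(columns, label)
    print(select_top_k(weights, 1))
```

Each attribute is scored once and independently, so the cost grows linearly with the number of attributes. The trade-off is that attributes are judged in isolation, which can miss attributes that are only useful in combination.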