suggestions for (pre)processing my datasource
harri678
New Altair Community Member
Hello,
I am new to RapidMiner and maybe somebody can help me with my problem. My data source is the MySQL database of an analysis tool containing measurement data. Every 5 minutes a so-called "transmission" is received. One transmission consists of 1 to 900,000 key-value pairs. Those key-value pairs represent a specific measured property ("descriptor") and the corresponding value ("counter"). For better understanding, here is the MySQL table-create statement:
CREATE TABLE `temp_mem` (
`transmission_id` int(4) unsigned NOT NULL,
`descriptor` int(4) unsigned NOT NULL,
`counter` int(4) unsigned default NULL,
PRIMARY KEY (`transmission_id`,`descriptor`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
The content of this table looks like:
transmission_id descriptor counter
1788115 1 170257725
1788115 7 40078
1788115 32 5933
1788115 33 15167
.....
1788116 1 170255453
1788116 7 5863
.....
The descriptors are all defined in another SQL table, together with a short description. Only descriptors with counter != 0 are submitted in a transmission. A typical real-life transmission consists of about 5,000 to 300,000 different descriptors (and that every 5 minutes).
What I want to do with RapidMiner is analyze the collected data. I suspect there is a lot of redundant information in this measurement system, so I want to try some exploratory research on it and find the essential descriptors. My problem is the format of the stored data. For PCA, for example, I would need a format like:
transmission_id desc_1 desc_2 desc_3 ...... desc_976638
1788115 170257725 0 0 ........ 0
1788116 170255453 44 0 ........ 1
......
I already tried Pivot and Transpose, but it didn't work; RapidMiner has problems with data sets of approx. 900,000 attributes. Can you give me some advice on how to handle this kind of problem?
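Outside of RapidMiner, the pivot from the long (transmission_id, descriptor, counter) layout to a wide layout can be sketched in Python. This is only an illustration with a few made-up rows from the table above; a sparse dict-of-dicts avoids materializing ~900,000 dense columns, since a missing descriptor simply means its counter was 0.

```python
# Sketch: pivot (transmission_id, descriptor, counter) rows into a wide,
# sparse layout. The `rows` list stands in for a query on `temp_mem`.
from collections import defaultdict

rows = [
    (1788115, 1, 170257725),
    (1788115, 7, 40078),
    (1788116, 1, 170255453),
    (1788116, 7, 5863),
]

# Dict-of-dicts keeps only the non-zero counters (descriptors with
# counter == 0 are never transmitted, so absence means 0).
wide = defaultdict(dict)
for transmission_id, descriptor, counter in rows:
    wide[transmission_id][descriptor] = counter

def counter_for(transmission_id, descriptor):
    """Look up a counter, treating missing descriptors as 0."""
    return wide[transmission_id].get(descriptor, 0)

print(counter_for(1788115, 7))   # 40078
print(counter_for(1788115, 99))  # 0 (descriptor 99 not transmitted)
```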
Thanks in advance,
Harald
Answers
Hi Harald,
this is of course a very large table; at least the number of attributes is enormous. You have reached dimensions where you have to make an extra effort just to handle the amount of data at all. A little example:
If you decided to use PCA on a data set with 900,000 attributes, the covariance matrix would have to be estimated. That matrix would consist of 810,000,000,000 entries, each using 8 bytes, which would fill around 6.5 TB (!) of memory. And then you would need enough examples to make an estimate for each of these cells. So PCA is simply unusable here.
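The arithmetic behind that estimate is quick to verify:

```python
# Back-of-the-envelope check of the covariance-matrix size quoted above.
n_attributes = 900_000
bytes_per_entry = 8  # one double-precision float per covariance entry

entries = n_attributes ** 2                       # 810,000,000,000 entries
size_tb = entries * bytes_per_entry / 1e12        # decimal terabytes
print(f"{entries:,} entries, about {size_tb:.2f} TB")  # about 6.48 TB
```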
So what are the ways to cope with this problem?
I would try to load the data in chunks: divide the complete data set into subsets, each containing only a limited number of attributes. You can then analyze each subset and remove attributes, then combine two reduced subsets, analyze them again, and so on.
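A minimal sketch of that chunked reduction, assuming a toy NumPy matrix in place of the real data and a simple variance criterion as a stand-in for whatever per-chunk analysis is chosen:

```python
# Sketch: split the attributes (columns) into chunks, prune each chunk
# independently, then combine the survivors for the next round.
# The data and the variance-based pruning rule are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))   # toy data: 50 transmissions, 20 attributes
X[:, 5] = 0.0                   # a constant (redundant) attribute

def prune(cols, keep_fraction=0.5):
    """Keep the higher-variance half of a column-index chunk."""
    variances = X[:, cols].var(axis=0)
    order = np.argsort(variances)[::-1]          # highest variance first
    n_keep = max(1, int(len(cols) * keep_fraction))
    return [cols[i] for i in order[:n_keep]]

chunks = [list(range(0, 10)), list(range(10, 20))]
survivors = sorted(prune(chunks[0]) + prune(chunks[1]))
print(survivors)  # the zero-variance column 5 never survives pruning
```

The same combine-and-prune step would then be repeated on the survivors until the attribute count is manageable.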
The analysis again depends on what you are going to use the data for.
Greetings,
Sebastian
Thanks for the response Sebastian!
During the afternoon I came to the same conclusion: it's impossible to analyze all measurements at once. I have now started to build groups of related descriptors where mutual analysis makes sense. My idea is to apply PCA to every group (5-10 of them) and reduce each to 3-5 principal components. Is it possible to merge multiple ExampleSets into one based on an index (in my case the "transmission_id")?
e.g.:
group1 --> PCA --> transmission_id + PC1_1 + PC2_1 + PC3_1
group2 --> PCA --> transmission_id + PC1_2 + PC2_2
group3 --> PCA --> transmission_id + PC1_3 + PC2_3 + PC3_3
merged exampleSet: transmission_id + PC1_1 + PC2_1 + PC3_1 + PC1_2 + PC2_2 + PC1_3 + PC2_3 + PC3_3
Is this possible? I'm planning to use a merged exampleSet for further analysis, like clustering or learning.
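What this merge does can be sketched in plain Python: an inner join of the per-group PCA outputs on transmission_id, concatenating each group's component values into one row. The component values below are made up purely for illustration.

```python
# Sketch: join per-group PCA results on transmission_id.
# Values are invented; each dict maps transmission_id -> component tuple.
group1 = {1788115: (0.12, -1.3, 0.7), 1788116: (0.10, -1.1, 0.6)}  # PC1..PC3
group2 = {1788115: (2.4, 0.01),       1788116: (2.2, 0.03)}        # PC1..PC2

# Inner join: keep only ids present in every group, concatenating rows.
common_ids = sorted(set(group1) & set(group2))
merged = {tid: group1[tid] + group2[tid] for tid in common_ids}

print(merged[1788115])  # (0.12, -1.3, 0.7, 2.4, 0.01)
```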
Greetings, Harald
Hi,
the ExampleSetJoin operator will do this for you.
Greetings,
Sebastian
Hello,
I had already figured it out after some experimenting! The last two days and nights I've been working with RapidMiner and reading the manual, and a lot of things are clear now.
It's a really good piece of software -- it speeds up my thesis progress a lot!
Greets,
Harald