"performance - data diet - compress information"
andk
New Altair Community Member
Hi guys, insomniac as I am over the performance issues I am having with my text-mining/sentiment project, I asked myself the following:
I am analyzing 30K documents and creating a term frequency matrix with around 40K attributes, which results in 1.2 billion data points (30,000 examples × 40,000 attributes); stored densely as 8-byte values that is close to 10 GB, which seems too much for my 2 GHz dual-core MacBook with 4 GB of RAM. Even when the results are computed, it takes endless time to load the result perspective. Is there a process which is able to summarize redundant data?

As far as the creation of the term frequency matrix is concerned, I don't think it would make sense to change the pruning factor or something like that; crucial information would be lost. But a lot of documents refer to the same date, for example. Is it somehow possible to summarize the data of one date in a single row? Sorry for my English, it is a little tricky to explain: right now I have 30K rows which refer to just one year, so actually only 365 rows would be necessary.

On the other hand there is of course the problem of tokens that have a similar meaning but appear in different forms in the TF matrix. I have tried stemming, but I am not very happy with the result; most of the words are really crippled by the stemming operators. Are there any stemming dictionaries around which, for example, transform verbs into nouns? This would preserve the logical relation between words but at the same time unify them and reduce the data set.
Any help is appreciated! Have a good night, or good morning (depending on when you read this).
André
Answers
Hi,
the words are crippled, but the engine doesn't care, because it does not "read" the words the way humans do. If stemming reduces the number of attributes, you should go along with it. In any case, you will not be happy with the later training performance if you go for something like an SVM: with a runtime that is roughly cubic in the number of examples, it can take days or weeks to build a single model...
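What you describe, André, sounds less like stemming and more like lemmatization, i.e. looking words up in a dictionary instead of chopping off suffixes. Just as an illustration (this is plain Python with NLTK's WordNet lemmatizer rather than a RapidMiner operator, and the sample words are made up):

```python
# Rough sketch of dictionary-based lemmatization with NLTK.
# Requires: pip install nltk, plus nltk.download("wordnet") for the dictionary data.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Made-up sample tokens that a stemmer would typically cripple.
words = ["announcements", "announced", "rising", "crises"]
for w in words:
    # Look the token up both as a noun and as a verb and keep the shorter
    # lemma -- a crude heuristic to map different surface forms of the
    # same concept onto one attribute in the TF matrix.
    noun = lemmatizer.lemmatize(w, pos="n")
    verb = lemmatizer.lemmatize(w, pos="v")
    print(w, "->", min(noun, verb, key=len))
```

It does not literally turn verbs into nouns, but it keeps readable dictionary forms and still merges many inflected variants into one attribute.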
If you can aggregate the texts to one text per day, that would be fine. You can do this with a RapidMiner process using macros, for example, but I don't have the time to go into details here in the community forum.
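Just to illustrate the idea outside RapidMiner (a pandas/scikit-learn sketch with invented column names and sample texts, not the macro-based process itself):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical input: one row per document with its publication date.
docs = pd.DataFrame({
    "date": ["2011-03-01", "2011-03-01", "2011-03-02"],
    "text": ["markets fell sharply", "fell on weak data", "markets recovered"],
})

# Concatenate all documents of the same day into a single text,
# so at most 365 rows remain for one year of data.
daily = docs.groupby("date")["text"].apply(" ".join)

# Build the term frequency matrix; the result is stored sparsely,
# so empty cells cost no memory.
tf = CountVectorizer().fit_transform(daily)
print(tf.shape)  # (number of days, number of terms)
```

The important point is that the aggregated table has one row per day, so the example count drops from 30K to a few hundred before any learner ever sees the data.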
Greetings,
Sebastian