Standard Data Sets - memory issue

Question

Hi,

I am trying to run some standard datasets such as 20 Newsgroups or Reuters-21578, but unfortunately I run into memory problems. The Reuters data could be used for nearest neighbour but nothing else, and 20 Newsgroups didn't run at all... Maybe I am doing something wrong?!
I use RapidMiner 4.5.

Do you have some hints for me?

Thanks,
Sven

land · Answer

Hi Sven,
it just finished loading the data. The results are somewhat overwhelming: around 46,000 examples with 120,000 attributes. If stored in a standard, non-sparse array, this would consume around 36 GB of RAM. The standard kNN will not work on this. Never. Even if it stored the data in a sparse array, it would have to look through every training example to classify ONE new example, and each time it would have to compute the distance over all these attributes...
So you should simply replace the kNN in the operator tree of your process with the LibSVM or the NaiveBayes operator. This should work then...
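
To get a feeling for these numbers, here is a rough back-of-the-envelope sketch in Python (not RapidMiner code). The function names, the 4 and 8 byte value widths, and the figure of 100 non-zero attributes per document are assumptions for illustration only; the exact total depends on how the values are actually stored.

```python
# Rough memory estimates for an example set, dense vs. sparse.
# All parameters below are illustrative assumptions, not RapidMiner internals.

def dense_bytes(n_examples, n_attributes, bytes_per_value):
    """Memory for a full (non-sparse) matrix: every cell is stored."""
    return n_examples * n_attributes * bytes_per_value

def sparse_bytes(n_examples, avg_nonzero, bytes_per_value, bytes_per_index=4):
    """Memory for a sparse layout: only non-zero cells, each with its column index."""
    return n_examples * avg_nonzero * (bytes_per_value + bytes_per_index)

GB = 10 ** 9  # decimal gigabytes

if __name__ == "__main__":
    for width in (4, 8):  # assumed float / double storage
        size = dense_bytes(46_000, 120_000, width)
        print(f"dense, {width} bytes per value: {size / GB:.1f} GB")
    # Assumed: about 100 distinct terms per document on average.
    size = sparse_bytes(46_000, 100, 8)
    print(f"sparse, ~100 non-zeros per example: {size / GB:.3f} GB")
```

Whatever the exact value width, the dense layout ends up in the tens of gigabytes, while a sparse layout of the same data stays far below one gigabyte under these assumptions.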

Greetings,
  Sebastian

svendeswan · Answer

Dear Sebastian,

thank you for your hints. Maybe I will wait for your results :-) I am trying to build some kind of performance matrix for this dataset (and for Reuters too) using different learners and preprocessing. In my experience kNN has worked well for big data sets in the past, but I never tried this with RapidMiner. Maybe you could paste the process then, so I have the chance to build my matrix by simply exchanging the learner operators :-)

Best,
Sven

land · Answer

Hi Sven,
the TextInput operator always creates a sparse example set if you don't switch on extend_exampleset. If you do switch it on, the result depends on the input example set.
I have downloaded the data set and will try it myself. But I think I already know what the problem is: unlike the data set, kNN does not store the data in a sparse format. That causes the memory consumption to explode. Just think of a matrix of 45000x10000 entries at 4 bytes each (about 1.8 GB) to get an impression of how much data would have to be stored. Nearest neighbours isn't a good idea on text data at all, especially with so many examples, and it becomes completely worthless if you don't switch the distance measure to cosine similarity.
SVMs or NaiveBayes should cope with this amount of data much better and will give better results anyway.
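
As an illustration of why cosine similarity suits sparse text vectors, here is a minimal Python sketch (again not RapidMiner code). It assumes each document is represented as a dictionary mapping terms to weights; that representation and the function name are my own choices for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts.

    Only terms present in both documents contribute to the dot product,
    so the cost depends on the number of non-zero entries, not on the
    size of the whole vocabulary.
    """
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    doc1 = {"memory": 2.0, "sparse": 1.0}
    doc2 = {"memory": 1.0, "kernel": 3.0}
    print(cosine_similarity(doc1, doc2))
```

Because the measure normalises by vector length, a long and a short document about the same topic still come out as similar, which plain Euclidean distance on raw term counts does not give you.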

Greetings,
  Sebastian