"Is weight by Information Gain the right operator for me?"

Hi all,

I am using the "Weight by Information Gain" operator to select the most predictive attributes from a data set with 218000 attributes and 60000 examples. (This is the example set I got from RapidMiner text processing.)

I have been waiting for 4 days so far and the process is still running on a PC with 32 GB of RAM. I am afraid this is not the right operator for my problem. Could you please tell me whether I have done something wrong?

BTW, as far as I understand, the computational complexity of calculating information gain is roughly proportional to "number of attributes" * "number of examples", which in my case is 218000 * 60000 calculations. Do you think this might be intractable on a PC? If so, I would appreciate it if you could propose an alternative solution.
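
To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. The 8-byte dense storage is an assumption for illustration only; term-document data is mostly zeros, so a sparse representation may need far less memory.

```python
# Rough size estimate for the example set described above.
n_attributes = 218_000
n_examples = 60_000

# One value per (example, attribute) pair.
cells = n_attributes * n_examples

# Assumption: dense storage with 8-byte doubles (not confirmed for this process).
dense_gib = cells * 8 / 2**30

print(f"{cells:,} cells")
print(f"{dense_gib:.1f} GiB if every cell were stored densely")
```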

Thanks in advance

MartinLiebig

Hi!

200,000 attributes is really a lot. Even in text mining you usually have fewer.

You might want to batch it: work on one subset of the attributes at a time, write the weights to a file, and use them afterwards. Working on a sample might also be a good solution. Don't forget to use Materialize Data after Select Attributes.
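
The batching idea can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions, not how the Weight by Information Gain operator is implemented: attributes are treated as nominal (for example, term present or absent), and the names `information_gain`, `weights_in_batches`, `columns`, and `out_path` are hypothetical.

```python
import csv
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(column, labels):
    """Information gain of one nominal attribute with respect to the label."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the label after splitting on the attribute value.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def weights_in_batches(columns, labels, batch_size, out_path):
    """Compute weights one batch of attributes at a time and append them to a CSV file.

    columns: dict mapping attribute name -> list of values (assumed layout).
    """
    names = list(columns)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["attribute", "weight"])
        for start in range(0, len(names), batch_size):
            for name in names[start:start + batch_size]:
                writer.writerow([name, information_gain(columns[name], labels)])
            f.flush()  # keep finished batches on disk if a later batch fails


# Toy data: "free" separates the classes perfectly, "hello" tells us nothing.
labels = ["spam", "spam", "ham", "ham"]
columns = {"free": [1, 1, 0, 0], "hello": [1, 0, 1, 0]}
```

Calling `weights_in_batches(columns, labels, batch_size=1, out_path="weights.csv")` would write one row per attribute; the file can then be read back to pick the top-weighted attributes.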

Cheers,
Martin

mohammadreza

Thanks for the answer Martin,

Just, would you please explain what is materialized data?

Thanks again

MartinLiebig

Hi

In RapidMiner an example set is usually held in memory only once. If you select attributes, you do not delete the others, you just deselect them. To get a real copy in memory you need to use the Materialize Data operator.
This is usually not needed. But in this special case you want to be sure you have an example set without those attributes, so I would recommend using it.
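
The difference between deselecting and materializing can be pictured with a toy model in Python. This is a loose analogy, not RapidMiner's actual classes: `ExampleTable`, `AttributeView`, and `materialize` are hypothetical names made up for illustration.

```python
class ExampleTable:
    """Holds every column once in memory (hypothetical stand-in for the data table)."""

    def __init__(self, columns):
        self.columns = columns  # dict: attribute name -> list of values


class AttributeView:
    """A selection over a table; it still references the full table."""

    def __init__(self, table, selected):
        self.table = table
        self.selected = list(selected)

    def materialize(self):
        """Copy only the selected columns into a new, independent table."""
        return ExampleTable(
            {name: list(self.table.columns[name]) for name in self.selected}
        )


full = ExampleTable({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Deselecting "b" and "c" only hides them; the view keeps the whole table alive.
view = AttributeView(full, ["a"])
print(sorted(view.table.columns))  # all three columns are still in memory

# Materializing builds a table that contains the selected column only.
small = view.materialize()
print(sorted(small.columns))
```

Once nothing refers to `full` or `view` any more, the deselected columns can be reclaimed, which is the effect the advice above is after.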

Cheers,
Martin

mohammadreza

Thank you so much, Martin. That's a very technical and wise point that I was not aware of. If I understood you correctly, this trick should solve many of the runtime errors caused by running out of main memory. Am I right? Another question on my mind: if I learn a model on non-materialized data obtained from a feature selection operator, will the resulting model contain the "unselected" features too?

Thanks in advance,