"Is weight by Information Gain the right operator for me?"
mohammadreza
Hi all,
I am using the "Weight by Information Gain" operator to select the most predictive attributes from a data set with 218,000 attributes and 60,000 examples. (This is the example set that results from RapidMiner text processing.)
The process has been running for 4 days so far on a PC with 32 GB of RAM, and it is still not finished. I am afraid this is not the right operator for my problem. Could you please tell me whether I have done something wrong?
BTW, as far as I understand, the computational complexity of calculating information gain is proportional to "number of attributes" * "number of examples", which in my case is 218,000 * 60,000 calculations. Do you think this is intractable on a PC? If so, I would appreciate it if you could propose an alternative solution.
Thanks in advance
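For a rough sense of scale, the estimate in the question can be worked out directly. This is only a back-of-the-envelope sketch: it takes the "attributes * examples" cost from the post at face value, and the 8 bytes per value is an assumption about a dense table of doubles, not something stated in the thread (a sparse representation would need far less).

```python
# Figures taken from the question.
n_attributes = 218_000
n_examples = 60_000

# One visit per (attribute, example) cell, as estimated in the post.
cell_visits = n_attributes * n_examples

# Assumption (not from the thread): a dense table, one 8-byte double per cell.
BYTES_PER_VALUE = 8
dense_bytes = cell_visits * BYTES_PER_VALUE
dense_gib = dense_bytes / 2**30

print(f"cell visits: {cell_visits:,}")
print(f"dense table size: {dense_gib:.1f} GiB")
```

Under that dense-storage assumption the table alone would be larger than the 32 GB of RAM mentioned above, so memory pressure may matter at least as much as the raw number of calculations.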
MartinLiebig
Hi!
200,000 attributes is really a lot. Even in text mining you usually have fewer.
You might want to batch it: work on one subset of the attributes at a time, write the weights to a file, and use them afterwards. Sampling might also be a good solution. Don't forget to use Materialize Data after Select Attributes.
Cheers,
Martin
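The batching idea can be sketched outside RapidMiner in plain Python. This is a minimal illustration, not how the Weight by Information Gain operator is implemented: the function names, the `columns` dictionary, and the CSV layout are all hypothetical, attributes are treated as nominal, and everything is held in memory for simplicity (in practice each batch would be loaded separately).

```python
import csv
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(column, labels):
    """Information gain of one nominal attribute with respect to the label."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the label after splitting on the attribute's values.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def weight_in_batches(columns, labels, batch_size, out_path):
    """Weight attributes one batch at a time and write the weights to a CSV file.

    `columns` maps attribute name -> list of values (one per example).
    """
    names = list(columns)
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["attribute", "weight"])
        for start in range(0, len(names), batch_size):
            for name in names[start:start + batch_size]:
                writer.writerow([name, information_gain(columns[name], labels)])
            handle.flush()  # weights of finished batches survive an interruption
```

Calling `weight_in_batches(columns, labels, batch_size=1000, out_path="weights.csv")` would write one row per attribute; the file can then be read back to keep only the highest-weighted attributes.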
mohammadreza
Thanks for the answer, Martin.
Could you please explain what materialized data is?
Thanks again
MartinLiebig
Hi
In RapidMiner, an example set is usually held in memory only once. If you select attributes, you do not delete the others; you just deselect them. To get a real copy in memory, you need to use the Materialize Data operator.
This is usually not needed. But in this special case you want to be sure to have an example set without those attributes, so I would recommend using it.
Cheers,
Martin
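The difference between deselecting attributes and making a real copy can be illustrated with a small Python analogy. This is not RapidMiner's internal data model; the `ExampleSetView` class and its methods are hypothetical and only mimic the behaviour described above.

```python
class ExampleSetView:
    """Hypothetical stand-in for an example set that hides columns without dropping them."""

    def __init__(self, table, selected):
        self.table = table              # shared reference to the data, nothing is copied
        self.selected = list(selected)  # only these columns are visible

    def column(self, name):
        if name not in self.selected:
            raise KeyError(name)
        return self.table[name]

    def materialize(self):
        """Return a view backed by a fresh table holding only the selected columns."""
        copied = {name: list(self.table[name]) for name in self.selected}
        return ExampleSetView(copied, self.selected)


full = {"free": [1, 1, 0, 0], "hello": [1, 0, 1, 0], "the": [1, 1, 1, 1]}
view = ExampleSetView(full, ["free"])  # "hello" and "the" are hidden but still in memory
materialized = view.materialize()      # backing table now holds only "free"
```

In this sketch `view.table` still holds all three columns, while `materialized.table` holds only the selected one, so the full table can be released once nothing else refers to it.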
mohammadreza
Thank you so much, Martin. That's a very technical and wise point that I was not aware of. If I understood you correctly, this trick should solve many of the runtime errors caused by running out of main memory. Am I right? Another question that has been on my mind: if I learn a model on non-materialized data that comes from a feature selection operator, will the resulting model contain the "unselected" features too?
Thanks in advance,