"Assessing feature performance across different datasets"
Hello,
My question is:
How can I identify the features that work best across several different datasets? In other words, the features have to be robust, transferable, and independent of the specific characteristics of any individual dataset.
My data:
- two-class problem
- 7 datasets that share the same set of about 50 numerical features (value ranges can differ significantly between datasets, but the goal is not to find robust thresholds; it is to identify the key features that perform well across all datasets)
- each dataset has about 5000 instances for training and testing
My ideas so far:
- for each of the 7 datasets, select an optimal feature subset (e.g. with a wrapper feature selection) and simply count how often each feature occurs across all 7 results
- also, calculate the "information gain" of each feature on the individual datasets. The average over all 7 datasets should then (hopefully) reveal the robust features.
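To make the two ideas concrete, here is a minimal sketch in plain Python (outside RapidMiner) on made-up toy data. Several things in it are assumptions for illustration only: the helper names (`entropy`, `info_gain`, `make_dataset`), the synthetic data, binarising each feature at its own per-dataset median so that differing ranges do not matter, and using a simple top-k information gain ranking as a cheap stand-in for a real wrapper selection. It also reports the minimum gain over the datasets, since an average alone can hide a feature that is strong on a few datasets and useless on the rest.

```python
import math
import random
from collections import Counter
from statistics import mean, median

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(values, labels):
    """Information gain of one numeric feature, binarised at its own median.

    Splitting at the per-dataset median is an assumption made here so that
    differing value ranges between datasets do not matter.
    """
    cut = median(values)
    low = [y for v, y in zip(values, labels) if v <= cut]
    high = [y for v, y in zip(values, labels) if v > cut]
    n = len(labels)
    remainder = (len(low) / n) * entropy(low) + (len(high) / n) * entropy(high)
    return entropy(labels) - remainder

def make_dataset(rng, n_rows, n_features, scale, informative):
    """Hypothetical toy data: features in `informative` shift with the class."""
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    columns = []
    for j in range(n_features):
        shift = 2.0 if j in informative else 0.0
        columns.append([scale * (rng.gauss(0, 1) + shift * y) for y in labels])
    return columns, labels

rng = random.Random(0)
n_features, top_k = 6, 2
# Feature 0 is informative everywhere; feature 1 only in the first 3 datasets.
# Each dataset uses a different scale to mimic differing value ranges.
datasets = [
    make_dataset(rng, 400, n_features, scale=10 ** d,
                 informative={0, 1} if d < 3 else {0})
    for d in range(7)
]

selection_counts = Counter()
gains_per_feature = [[] for _ in range(n_features)]
for columns, labels in datasets:
    gains = [info_gain(col, labels) for col in columns]
    for j, g in enumerate(gains):
        gains_per_feature[j].append(g)
    # Idea 1: count how often each feature lands in the per-dataset top-k
    # (a filter ranking used here as a stand-in for a wrapper selection).
    top = sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:top_k]
    selection_counts.update(top)

# Idea 2: average gain per feature, plus the worst case over all datasets,
# which exposes features that only work on some of the datasets.
summary = [
    (j, selection_counts[j], mean(gains_per_feature[j]), min(gains_per_feature[j]))
    for j in range(n_features)
]
for j, count, avg, worst in sorted(summary, key=lambda r: r[3], reverse=True):
    print(f"feature {j}: selected {count}/7, mean IG {avg:.3f}, min IG {worst:.3f}")
```

In this toy setup, feature 1 gets a respectable average because it is strong on three datasets, but its minimum gain is close to zero, which is the kind of dataset-specific feature the question is trying to rule out.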
Do you think these ideas are worth pursuing? Can you point out possible problems, improvements, suitable RapidMiner operators, etc.? I'm relatively new to RM and data mining.
Thanks and Greetings
ollestrat