Hi,
I am currently using the Optimize Selection (Evolutionary) operator to perform feature selection on my dataset. The initial number of attributes is 600. I noticed that the operator tends to generate models containing only 200 to 270 attributes. These models are not optimal or even near-optimal: using forward selection, I found that the optimal number of attributes is less than 30. Next, I tried restricting the maximum number of attributes to 30 in the evolutionary operator. The operator then generates models containing 20 to 30 attributes; it does not generate any models with fewer than 20 attributes.
I found this behavior of the evolutionary operator strange and wondered why it did not explore any models containing fewer attributes. So I delved into the code of com.rapidminer.operator.features.selection.GeneticAlgorithm.java and zoomed in on the part where the initial population is created. The code is given below:
double p = getParameterAsDouble(PARAMETER_P_INITIALIZE);
for (int i = 0; i < weights.length; i++) {
    // each attribute is switched on independently with probability 1 - p
    // (0.5 with the default value of p)
    if (getRandom().nextDouble() < (1.0d - p)) {
        weights[i] = 1.0d;
    }
}
Looking at the code, I knew what the problem was. Each attribute independently has a 50% chance (the default value of p) of being selected, so the number of selected attributes follows a Binomial(600, 0.5) distribution, with mean 300 and a standard deviation of about 12. If you run a Monte Carlo simulation (see the sketch below), you will see that the resulting plot looks Gaussian and that the 90% confidence interval is 280 to 321 attributes (i.e. 280 to 321 attributes will be selected 90% of the time). Thus, it is nearly impossible for the operator to generate any models containing fewer than 30 attributes if the maximum number of attributes is not restricted. Even if you restrict the maximum to 30, the distribution remains bell-shaped and the operator will not explore any models with fewer than 10 attributes.

From my point of view, the size of the individuals in the initial population should be drawn from a uniform distribution, not a Gaussian one. This would ensure that the search space is thoroughly explored across all model sizes; a sketch of such an initialization is given at the end of this post. Thus, I would consider this a bug. What do all of you think?
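For anyone who wants to reproduce this, here is a minimal standalone simulation of the initialization rule above. The class and variable names are mine; only the selection rule is taken from the operator:

import java.util.Random;

public class InitSizeSimulation {

    public static void main(String[] args) {
        final int NUM_ATTRIBUTES = 600; // size of my attribute set
        final double P = 0.5;           // default value of PARAMETER_P_INITIALIZE
        final int RUNS = 100000;
        Random random = new Random();
        int min = NUM_ATTRIBUTES;
        int max = 0;
        long under30 = 0;
        for (int run = 0; run < RUNS; run++) {
            int selected = 0;
            // same rule as the operator's initialization loop
            for (int i = 0; i < NUM_ATTRIBUTES; i++) {
                if (random.nextDouble() < (1.0d - P)) {
                    selected++;
                }
            }
            min = Math.min(min, selected);
            max = Math.max(max, selected);
            if (selected < 30) {
                under30++;
            }
        }
        System.out.println("min = " + min + ", max = " + max
                + ", runs with fewer than 30 selected attributes = " + under30);
    }
}

Even over 100,000 runs you will never see an individual with fewer than 30 selected attributes; the probability of such an outcome under Binomial(600, 0.5) is astronomically small, and even values below about 250 are vanishingly rare.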
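And here is a sketch of the kind of initialization I have in mind. This is my own code, not RapidMiner's, and I am assuming getRandom() supports nextInt(int) like java.util.Random. It first draws the target model size uniformly from 1 to weights.length and then uses selection sampling so that exactly that many attributes are switched on:

// draw the model size uniformly instead of implicitly Binomial(n, 0.5)
int target = 1 + getRandom().nextInt(weights.length); // uniform in [1, weights.length]
int selected = 0;
for (int i = 0; i < weights.length; i++) {
    int remaining = weights.length - i;
    // select attribute i with probability (target - selected) / remaining;
    // this guarantees exactly 'target' attributes are selected overall
    if (getRandom().nextInt(remaining) < target - selected) {
        weights[i] = 1.0d;
        selected++;
    }
}

With this scheme every model size is equally likely in the initial population, so individuals with, say, 5 or 15 attributes actually appear and the genetic search can explore the small-model region of the search space.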