Non-uniform distribution of number of attributes in GeneticAlgorithm operator
pengie
New Altair Community Member
Hi,
I am currently using the Optimize Selection (Evolutionary) operator to perform feature selection for my dataset. My initial number of attributes is 600. I noticed that the operator tends to generate models containing 200 to 270 attributes only. These models are not optimum or near optimum as I have tried using forward selection and the optimum number of attributes is less than 30. Next, I tried restricting the maximum number of attributes to 30 in the evolutionary operator. The operator then generate models containing 20 to 30 attributes. It does not generate any models with less than 20 attributes.
I find this behavior of the evolutionary operator strange and wondered why the evolutionary operator did not explore any models containing less attributes. So I delve into the code com.rapidminer.operator.features.selection.GeneticAlgorithm.java. I zoomed in on the part where the initial population was created. The code is given below
I am currently using the Optimize Selection (Evolutionary) operator to perform feature selection for my dataset. My initial number of attributes is 600. I noticed that the operator tends to generate models containing 200 to 270 attributes only. These models are not optimum or near optimum as I have tried using forward selection and the optimum number of attributes is less than 30. Next, I tried restricting the maximum number of attributes to 30 in the evolutionary operator. The operator then generate models containing 20 to 30 attributes. It does not generate any models with less than 20 attributes.
I find this behavior of the evolutionary operator strange and wondered why the evolutionary operator did not explore any models containing less attributes. So I delve into the code com.rapidminer.operator.features.selection.GeneticAlgorithm.java. I zoomed in on the part where the initial population was created. The code is given below
Looking at the code, I knew what is the problem. Each attribute has a 50% (default value) chance of being selected. If you do a Monte Carlo simulation, you will realized that the resultant plot is a gaussian distribution and the 90% confidence interval is between 280 to 321 attributes (i.e. 280 to 321 attributes will be selected 90% of the time). Thus, it is nearly impossible for the operator to generate any models containing less than 30 attributes if the maximum number of attributes is not restricted. Even if you restrict the maximum to 30, it will still have the gaussian distribution and will not explore any models with less than 10 attributes. From my point of view, the initial population should be drawn from a uniform distribution and not a gaussian distribution. This will ensure that the search space is thoroughly explored. Thus, I would consider this as a bug. What do all of you think?
double p = getParameterAsDouble(PARAMETER_P_INITIALIZE);
for (int i = 0; i < weights.length; i++) {
if (getRandom().nextDouble() < (1.0d - p)) {
weights = 1.0d;
}
}
Tagged:
0
Answers
-
I have currently modified the code to
Also, I am currently setting PARAMETER_P_INITIALIZE to 0.0.
double p = getParameterAsDouble(PARAMETER_P_INITIALIZE);
int maxInitialAttributes = Math.max(1, (int)(numberOfAttributes * getRandom().nextDouble()));
int[] weightsIndices = new int[numberOfAttributes];
for (int i=0, endi=numberOfAttributes; i<endi; ++i)
{
weightsIndices = i;
}
Collections.shuffle(Arrays.asList(weightsIndices));
for (int i=0, endi=maxInitialAttributes; i<endi; ++i)
{
if (getRandom().nextDouble() < (1.0d - p))
{
weights[weightsIndices] = 1.0d;
}
}
BTW, the description for p initialize in the Help file states that it is the initial probability for an attribute to be switched on. From the code, it is actually the initial probability for an attribute to be switched off.0 -
Hi,
thanks for the input!
Your code seems to be a good enhancement for this operator but i can't decide this on my own.
Please open a feature request at http://bugs.rapid-i.com so we will have a look into this later on.
Greetings
Nils
0 -
Sorry, a small bug in my previous modifications. The shuffing isn't working. Below is a working version. Ok, I will open a feature request at the link.
double p = getParameterAsDouble(PARAMETER_P_INITIALIZE);
int maxInitialAttributes = Math.max(1, (int)(numberOfAttributes * getRandom().nextDouble()));
ArrayList<Integer> weightsIndices = new ArrayList<Integer>();
for (int i=0, endi=numberOfAttributes; i<endi; ++i)
{
weightsIndices.add(i);
}
Collections.shuffle(weightsIndices);
for (int i=0, endi=maxInitialAttributes; i<endi; ++i)
{
if (getRandom().nextDouble() < (1.0d - p))
{
weights[weightsIndices.get(i)] = 1.0d;
}
}0