I'm rather new to this whole Machine Learning thing and to RapidMiner specifically, and I'm having a bit of trouble understanding how Feature Selection works. I was wondering if a more experienced RM user would be willing to help me out.
My input is a list of 120 vectors containing 200 features each and tagged with one of 4 classes. Classification performance with Naive Bayes and 10-fold CV is 87.50%.
In an effort to improve this score further, I tried applying (backward) Feature Selection to the vectors first. This improved my score to 92.50%, which made me happy.
I then wanted to find out which features had been selected exactly to see if it would tell me anything about my data, so I added an AttributeWeightsWriter to my process. The full process looks like this:
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.5">
<operator name="Root" class="Process" expanded="yes">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<operator name="ExampleSource" class="ExampleSource">
<parameter key="attributes" value="...\200.aml"/>
<parameter key="sample_ratio" value="1.0"/>
<parameter key="sample_size" value="-1"/>
<parameter key="permutate" value="false"/>
<parameter key="decimal_point_character" value="."/>
<parameter key="column_separators" value=",\s*|;\s*|\s+"/>
<parameter key="use_comment_characters" value="true"/>
<parameter key="comment_chars" value="#"/>
<parameter key="use_quotes" value="true"/>
<parameter key="quote_character" value="&amp;quot;"/>
<parameter key="quoting_escape_character" value="\"/>
<parameter key="trim_lines" value="false"/>
<parameter key="skip_error_lines" value="false"/>
<parameter key="datamanagement" value="int_sparse_array"/>
<parameter key="local_random_seed" value="-1"/>
</operator>
<operator name="FS" class="FeatureSelection" expanded="yes">
<parameter key="normalize_weights" value="true"/>
<parameter key="local_random_seed" value="-1"/>
<parameter key="show_stop_dialog" value="false"/>
<parameter key="user_result_individual_selection" value="false"/>
<parameter key="show_population_plotter" value="false"/>
<parameter key="plot_generations" value="10"/>
<parameter key="constraint_draw_range" value="false"/>
<parameter key="draw_dominated_points" value="true"/>
<parameter key="maximal_fitness" value="Infinity"/>
<parameter key="selection_direction" value="backward"/>
<parameter key="keep_best" value="1"/>
<parameter key="generations_without_improval" value="1"/>
<parameter key="maximum_number_of_generations" value="-1"/>
<operator name="FSChain" class="OperatorChain" expanded="yes">
<operator name="XValidation" class="XValidation" expanded="yes">
<parameter key="keep_example_set" value="false"/>
<parameter key="create_complete_model" value="false"/>
<parameter key="average_performances_only" value="true"/>
<parameter key="leave_one_out" value="false"/>
<parameter key="number_of_validations" value="10"/>
<parameter key="sampling_type" value="stratified sampling"/>
<parameter key="local_random_seed" value="-1"/>
<operator name="KernelNaiveBayes" class="KernelNaiveBayes">
<parameter key="keep_example_set" value="false"/>
<parameter key="laplace_correction" value="true"/>
<parameter key="estimation_mode" value="greedy"/>
<parameter key="bandwidth_selection" value="heuristic"/>
<parameter key="bandwidth" value="0.1"/>
<parameter key="minimum_bandwidth" value="0.1"/>
<parameter key="number_of_kernels" value="10"/>
<parameter key="use_application_grid" value="false"/>
<parameter key="application_grid_size" value="200"/>
</operator>
<operator name="ApplierChain" class="OperatorChain" expanded="yes">
<operator name="Applier" class="ModelApplier">
<parameter key="keep_model" value="false"/>
<list key="application_parameters">
</list>
<parameter key="create_view" value="false"/>
</operator>
<operator name="Evaluator" class="Performance">
<parameter key="keep_example_set" value="false"/>
<parameter key="use_example_weights" value="true"/>
</operator>
</operator>
</operator>
<operator name="ProcessLog" class="ProcessLog">
<list key="log">
<parameter key="generation" value="operator.FS.value.generation"/>
<parameter key="performance" value="operator.FS.value.performance"/>
</list>
<parameter key="sorting_type" value="none"/>
<parameter key="sorting_k" value="100"/>
<parameter key="persistent" value="false"/>
</operator>
</operator>
</operator>
<operator name="AttributeWeightsWriter" class="AttributeWeightsWriter">
<parameter key="attribute_weights_file" value="...\200.wgt"/>
</operator>
</operator>
</process>
And this is the part where I'm stumped: once the process finishes running and I examine the weights in the performance screen or in the .wgt file, I notice that only ONE feature gets a weight of 0 while ALL the others remain at 1. The process still reports the 92.50% score I mentioned before.
But when I remove that one feature from my vectors prior to classification (either manually or by using AttributeWeightsLoader > AttributeWeightsApplier), I only get a score of 87.50%, which is my score without Feature Selection. So what's going on here? FS is obviously doing much more than just turning off one single feature. How do I find out which features it has actually been using, so I can reproduce the results?
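For reference, here is my mental model of what the backward selection is doing, sketched in Python with scikit-learn. None of this is taken from the RapidMiner source, it's just my assumption of how the operator works; GaussianNB stands in for KernelNaiveBayes, and the data is a small synthetic stand-in (12 features instead of 200) so the sketch runs quickly:

```python
# Sketch of the greedy backward elimination I *think* the FeatureSelection
# operator performs (my assumption, not taken from the RM source): start from
# all attributes, and in each round drop the single feature whose removal most
# improves 10-fold CV accuracy; stop as soon as no removal helps.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Small synthetic stand-in for my data: 120 examples, 4 classes, but only
# 12 features here instead of 200 so the example finishes quickly.
X, y = make_classification(n_samples=120, n_features=12, n_informative=6,
                           n_classes=4, random_state=2001)

def cv_accuracy(mask):
    """Mean 10-fold CV accuracy using only the features listed in mask."""
    return cross_val_score(GaussianNB(), X[:, mask], y, cv=10).mean()

kept = list(range(X.shape[1]))          # start with every feature selected
best = cv_accuracy(kept)
while len(kept) > 1:
    # try removing each remaining feature in turn, keep the best candidate
    score, worst = max((cv_accuracy([i for i in kept if i != f]), f)
                       for f in kept)
    if score <= best:                   # no single removal improves CV accuracy
        break
    best, kept = score, [i for i in kept if i != worst]

print("kept features:", kept)
print("10-fold CV accuracy:", round(best, 4))
```

If the operator works roughly like this sketch, I would expect every removed feature to end up with weight 0 in the output, which is why the single zero in the .wgt file confuses me.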
Thanks for your help.