"Feature Selection operator"

Hello there
I am interested in using the Feature Selection tool of RapidMiner, but although I understand the logic, I cannot grasp how this tool works in RM.

To give a little background: I have a tab-delimited file with 115 regular attributes + 1 target attribute and 110 examples. My objective is to select the optimal set of regular attributes that contribute most to the target attribute, so that I can use them in a prediction model.
I don't want to use all the attributes, in order to avoid overtraining.

The tutorial says that I can connect the Feature Selection operator directly to my ExampleSource. But when I actually did that and checked the help, it said I am supposed to use an ExampleSetGenerator in between (no idea why). In the end I still get errors and am not able to do it.

Any help or comments or suggestions are highly appreciated.

Thanks
Emma

TobiasMalbrecht

Hi Emma,

first of all: of course the feature selection needs some data as input. But it does not matter whether this data is loaded (e.g. via an ExampleSource operator) or generated. Let me briefly explain how feature selection works in RapidMiner. The feature selection simply iterates over attribute sets. This means it switches attributes on or off according to a specified strategy. Then it initiates an evaluation of the performance of a learner on the resulting data set (i.e. the current feature subset). This evaluation (and also the learning) is, however, not done by the feature selection itself but by its inner operators. Hence, the inner operator chain should be able to process an example set, learn a model on that example set, then evaluate the model and return a so-called performance vector. What you place inside the feature selection depends on your data and analysis goals. Normally, this would be a cross-validation with an appropriate learner and performance evaluator. You may find an example of such a process in the samples, in the 05_Features/10_ForwardSelection.xml process.
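
(Editor's sketch, not RapidMiner code.) The division of labour described above, where the selection loop only switches attributes on and an inner evaluation returns a performance value, might look roughly like this in Python. The names `forward_selection` and `toy_evaluate` are hypothetical, and `toy_evaluate` is only a stand-in for a real cross-validation with a learner:

```python
from __future__ import annotations

from typing import Callable, Sequence


def forward_selection(
    attributes: Sequence[str],
    evaluate: Callable[[frozenset], float],
) -> tuple[frozenset, float]:
    """Greedy forward selection.

    The loop only decides which attribute to switch on next; scoring a
    subset is delegated to `evaluate`, which plays the role of the inner
    operator chain (learn, apply, measure performance).
    """
    selected: frozenset = frozenset()
    best_score = float("-inf")
    improved = True
    while improved:
        improved = False
        best_candidate = selected
        for attr in attributes:
            if attr in selected:
                continue
            candidate = selected | {attr}
            score = evaluate(candidate)
            if score > best_score:
                best_score, best_candidate = score, candidate
                improved = True
        selected = best_candidate
    return selected, best_score


def toy_evaluate(subset: frozenset) -> float:
    # Hypothetical stand-in for cross-validated performance: two attributes
    # are informative, and every attribute carries a small cost.
    informative = {"a1", "a3"}
    return 0.4 * len(subset & informative) - 0.01 * len(subset)


subset, score = forward_selection(["a1", "a2", "a3", "a4"], toy_evaluate)
print(sorted(subset))  # ['a1', 'a3']
```

The selection stops as soon as adding any further attribute no longer improves the score returned by the inner evaluation.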

Hope that clarifies a little bit how it works. Otherwise you are welcome to ask more questions here ...

Regards,
Tobias

MuehliMan

Hi,

I am looking for a feature selection method too, and I was wondering whether and how you can influence the number of selected features. Ideally there would be a criterion such as the number of features to extract or the correlation that has to be achieved.
Is there a workaround for this?

Greets, Markus

IngoRM

Hi,

both are possible. For some of the feature selection schemes, like brute force or genetic algorithms, there is a parameter "exact_number_of_features". And all feature selection operators should support a maximum fitness, which stops the feature selection once it is reached.

If you cannot see those parameters, you might have to turn on the "Expert" mode by clicking the corresponding icon in the toolbar. You should also have a look at the delivered sample processes, since they cover several different tasks with respect to feature selection / construction / weighting.

Cheers,
Ingo

Legacy User

Hi there,

first of all: I appreciate your work on RapidMiner, which seems to me to be the best environment for my "problem".

My goal is to classify huge textual records into three different categories. I use the WordVector plugin to convert my strings into word vectors. Unfortunately, because there are thousands of documents, I am getting a lot of features. Therefore I am trying to incorporate feature selection before applying a classifier.

I tried using RemoveCorrelatedAttributes and RemoveUselessAttributes without success.

My plan is now the following:
- I would like to determine the keywords for each of the three classes (business, personal, personal but in professional context). I have hundreds of manually classified documents. By keywords I mean the set of words with the highest TFIDF values for a category. Assume I have 10 keywords for each category. My feature space should then consist of 30 keywords.

Unfortunately I failed to realize my plan. Does RapidMiner provide the ability to accomplish that? Any hints?

Thank you!

IngoRM

Hi Erich,

first of all: thanks for your kind words.

For this task, we developed the operator "CorpusBasedWeighting" which delivers attribute weights based on TFIDF-like example values. In combination with the AttributeWeightSelection and the ExampleSetJoin operator you could achieve the desired result (here on random data):


<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_of_attributes"	value="100"/>
        <parameter key="target_function"	value="sum classification"/>
    </operator>
    <operator name="IdTagging" class="IdTagging">
    </operator>
    <operator name="IOMultiplier" class="IOMultiplier">
        <parameter key="io_object"	value="ExampleSet"/>
    </operator>
    <operator name="CorpusBasedWeightingForPositive" class="CorpusBasedWeighting">
        <parameter key="class_to_characterize"	value="positive"/>
    </operator>
    <operator name="AttributeWeightSelectionForPositive" class="AttributeWeightSelection">
        <parameter key="weight_relation"	value="top k"/>
    </operator>
    <operator name="IOSelector" class="IOSelector">
        <parameter key="io_object"	value="ExampleSet"/>
        <parameter key="select_which"	value="2"/>
    </operator>
    <operator name="CorpusBasedWeightingForNegative" class="CorpusBasedWeighting">
        <parameter key="class_to_characterize"	value="negative"/>
    </operator>
    <operator name="AttributeWeightSelectionForNegative" class="AttributeWeightSelection">
        <parameter key="weight_relation"	value="top k"/>
    </operator>
    <operator name="ExampleSetJoin" class="ExampleSetJoin">
    </operator>
</operator>

Hope that gives you the basic idea.

Cheers,
Ingo

Legacy User

Hi Ingo,

thank you very much for your fast answer...

My Feature Selection Operator chain now looks like this:


  <operator name="Feature Selection" class="OperatorChain" expanded="yes">
        <operator name="IOMultiplier" class="IOMultiplier">
            <parameter key="io_object"	value="ExampleSet"/>
            <parameter key="number_of_copies"	value="2"/>
        </operator>
        <operator name="CorpusBasedWeighting (prof)" class="CorpusBasedWeighting">
            <parameter key="class_to_characterize"	value="purely professional"/>
        </operator>
        <operator name="AttributeWeightSelection (prof)" class="AttributeWeightSelection">
            <parameter key="weight_relation"	value="top k"/>
        </operator>
        <operator name="IOSelector" class="IOSelector">
            <parameter key="io_object"	value="ExampleSet"/>
            <parameter key="select_which"	value="2"/>
        </operator>
        <operator name="CorpusBasedWeighting (pers)" class="CorpusBasedWeighting">
            <parameter key="class_to_characterize"	value="purely personal"/>
        </operator>
        <operator name="AttributeWeightSelection (pers)" class="AttributeWeightSelection">
            <parameter key="weight_relation"	value="top k"/>
        </operator>
        <operator name="ExampleSetJoin" class="ExampleSetJoin">
        </operator>
        <operator name="IOSelector (2)" class="IOSelector">
            <parameter key="io_object"	value="ExampleSet"/>
            <parameter key="select_which"	value="3"/>
        </operator>
        <operator name="CorpusBasedWeighting (pers/prof)" class="CorpusBasedWeighting">
            <parameter key="class_to_characterize"	value="personal, but in professional context"/>
        </operator>
        <operator name="AttributeWeightSelection (pers/prof)" class="AttributeWeightSelection">
            <parameter key="weight_relation"	value="top k"/>
        </operator>
        <operator name="ExampleSetJoin (2)" class="ExampleSetJoin">
        </operator>
    </operator>

Unfortunately, the "IOSelector" operator outputs an IOObject, despite the fact that I've chosen ExampleSet as io_object, whereas operator "CorpusBasedWeighting (pers)" expects an ExampleSet. Do you have an idea where the problem is?

Because I have three classes, I need to join three example sets. Is it right that ExampleSetJoin is only capable of joining two ExampleSets? If so, is my "workaround" above correct?

Thank you very much!
Erich

MuehliMan

Hi there,

As you pointed out, the Genetic Feature Selection as well as the Brute Force both have an option to set the number of attributes selected. But unfortunately these always result in a memory overflow. The dataset I am processing is probably too big.

Regarding your tip with the maximal fitness: what is the fitness criterion? Is it a squared correlation coefficient (that would be over 0.8, for example), or can I find the fitness somewhere in the log?

I heard about a feature that gets around the memory problem... I don't know exactly, I think it was GridParameterOptimisation. Should I try this one?

Last one: Isn't there a feature that lists the best-correlating features (with the cumulative squared correlation), as is done in a forward-stepping regression?

One final comment on the software for all those that read through the posts here. Although I ask questions here it is probably the most comprehensive software package on this topic I have seen so far.

Greets,
Markus

IngoRM

Hi Erich, hi Markus,

@Erich

Unfortunately, the "IOSelector" operator outputs an IOObject, despite the fact that I've chosen ExampleSet as io_object, whereas operator "CorpusBasedWeighting (pers)" expects an ExampleSet. Do you have an idea where the problem is?

Yes. After the first join you have only 2 instead of 3 example sets. So the second "IOSelector (2)" is not able to deliver the third example set - it is no longer there. Simply change the parameter "select_which" of "IOSelector (2)" to "2" instead of "3" and the process will run fine.

Because I have three classes, I need to join three example sets. Is it right that ExampleSetJoin is only capable of joining two ExampleSets? If so, is my "workaround" above correct?

Both answers: yes

@Markus

As you pointed out, the Genetic Feature Selection as well as the Brute Force both have an option to set the number of attributes selected. But unfortunately these always result in a memory overflow. The dataset I am processing is probably too big.

How many attributes do you have? And what was the value of the parameter? For example, if you want to select the best 10 out of 100 features, the brute force operator will generate "100 choose 10" combinations, which is (100 * 99 * ... * 91) / (10 * 9 * ... * 1), or about 1.7 * 10^13. That is simply too much to keep in memory, even if only the views are stored and no data copies. The FeatureSubsetIterator was written for these tasks (in combination with the ProcessLog operator you can get similar results, but in an iterative manner).
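
(Editor's sketch, not RapidMiner code.) The count above can be checked with Python's standard library, and the same idea of visiting subsets one at a time instead of holding them all in memory can be illustrated with a lazy iterator. The function names `subset_count` and `best_subset` are hypothetical, and this is only an analogy to the iterative approach, not a description of how FeatureSubsetIterator is implemented:

```python
import math
from itertools import combinations


def subset_count(n_attributes: int, k: int) -> int:
    """Number of distinct k-attribute subsets of n attributes."""
    return math.comb(n_attributes, k)


def best_subset(attributes, k, evaluate):
    """Visit subsets lazily and remember only the best one seen so far."""
    best, best_score = None, float("-inf")
    for subset in combinations(attributes, k):
        score = evaluate(subset)
        if score > best_score:
            best, best_score = subset, score
    return best, best_score


print(subset_count(100, 10))  # 17310309456440, i.e. about 1.7 * 10^13
```

Iterating keeps memory use flat, but the running time still grows with the number of subsets, so this only helps for moderate values of n and k.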

For the genetic algorithm: you should try lowering the population size. This often helps with memory problems here.

Regarding your tip with the maximal fitness: what is the fitness criterion? Is it a squared correlation coefficient (that would be over 0.8, for example), or can I find the fitness somewhere in the log?

It is simply the main criterion of the performance delivered by the inner operators. If you use, for example, a cross validation inside of the genetic algorithm and the operator "ClassificationPerformance" as the fitness calculator, you can specify the main criterion as a parameter. Here is a small example:


<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_examples"	value="200"/>
        <parameter key="target_function"	value="sum classification"/>
    </operator>
    <operator name="NoiseGenerator" class="NoiseGenerator">
        <parameter key="label_noise"	value="0.0"/>
        <list key="noise">
        </list>
        <parameter key="random_attributes"	value="5"/>
    </operator>
    <operator name="GeneticAlgorithm" class="GeneticAlgorithm" expanded="yes">
        <parameter key="maximal_fitness"	value="0.95"/>
        <parameter key="maximum_number_of_generations"	value="50"/>
        <parameter key="population_size"	value="2"/>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="sampling_type"	value="shuffled sampling"/>
            <operator name="JMySVMLearner" class="JMySVMLearner">
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="ClassificationPerformance" class="ClassificationPerformance">
                    <parameter key="accuracy"	value="true"/>
                    <list key="class_weights">
                    </list>
                    <parameter key="classification_error"	value="true"/>
                    <parameter key="main_criterion"	value="accuracy"/>
                    <parameter key="spearman_rho"	value="true"/>
                </operator>
            </operator>
        </operator>
    </operator>
</operator>

The feature selection stops once 95% accuracy (defined as the main criterion in the ClassificationPerformance operator) has been reached.
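
(Editor's sketch, not RapidMiner code.) The stopping rule can be pictured as a search loop that ends as soon as the main criterion reaches the configured threshold. The function `search_until_fit` and the accuracy values below are hypothetical and only illustrate the control flow:

```python
def search_until_fit(candidates, evaluate, maximal_fitness):
    """Evaluate candidates in order; stop once the best score reaches
    `maximal_fitness`. Returns (best candidate, best score, evaluations)."""
    best, best_score = None, float("-inf")
    evaluated = 0
    for candidate in candidates:
        evaluated += 1
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= maximal_fitness:
            break
    return best, best_score, evaluated


# Made-up accuracies keyed by subset size, standing in for the main criterion.
accuracy_by_size = {1: 0.70, 2: 0.96, 3: 0.99, 4: 0.90}
candidates = [("a",), ("a", "b"), ("a", "b", "c"), ("a", "b", "c", "d")]

result = search_until_fit(candidates, lambda s: accuracy_by_size[len(s)], 0.95)
print(result)  # (('a', 'b'), 0.96, 2)
```

Note that the third candidate would have scored higher, but it is never evaluated: a fitness threshold trades the best possible result for a shorter run.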

Last one: Isn't there a feature that lists the best-correlating features (with the cumulative squared correlation), as is done in a forward-stepping regression?

You can combine a forward feature selection with a correlation based subset evaluation. Here is the basic setup:


<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_examples"	value="200"/>
        <parameter key="target_function"	value="sum classification"/>
    </operator>
    <operator name="NoiseGenerator" class="NoiseGenerator">
        <parameter key="label_noise"	value="0.0"/>
        <list key="noise">
        </list>
        <parameter key="random_attributes"	value="5"/>
    </operator>
    <operator name="FeatureSelection" class="FeatureSelection" expanded="yes">
        <operator name="CFSFeatureSetEvaluator" class="CFSFeatureSetEvaluator">
        </operator>
    </operator>
</operator>

One final comment on the software for all those that read through the posts here. Although I ask questions here it is probably the most comprehensive software package on this topic I have seen so far.

Thanks for your kind words. We highly appreciate those.

Cheers,
Ingo

Legacy User

Hi Ingo,

thank you for your patience and support. Unfortunately, I still don't have the result I am looking for. My process currently looks like this:


<operator name="Root" class="Process" expanded="yes">
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_url"	value="jdbc:mysql://localhost:3306/mail"/>
        <parameter key="id_attribute"	value="id"/>
        <parameter key="label_attribute"	value="label"/>
        <parameter key="query"	value="SELECT * FROM `temp`"/>
        <parameter key="username"	value="root"/>
    </operator>
    <operator name="IOMultiplier" class="IOMultiplier">
        <parameter key="io_object"	value="ExampleSet"/>
        <parameter key="number_of_copies"	value="2"/>
    </operator>
    <operator name="Pers/Prof Attributes" class="OperatorChain" expanded="no">
        <operator name="StringTextInput (4)" class="StringTextInput" expanded="no">
            <parameter key="filter_nominal_attributes"	value="true"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer (4)" class="StringTokenizer">
            </operator>
        </operator>
        <operator name="CorpusBasedWeightingForPersProf" class="CorpusBasedWeighting">
            <parameter key="class_to_characterize"	value="personal, but in professional context"/>
        </operator>
        <operator name="AttributeWeightSelectionForPersProf" class="AttributeWeightSelection">
            <parameter key="k"	value="2"/>
            <parameter key="weight_relation"	value="top k"/>
        </operator>
    </operator>
    <operator name="IOSelector" class="IOSelector">
        <parameter key="io_object"	value="ExampleSet"/>
        <parameter key="select_which"	value="2"/>
    </operator>
    <operator name="Professional Attributes" class="OperatorChain" expanded="no">
        <operator name="StringTextInput (3)" class="StringTextInput" expanded="no">
            <parameter key="filter_nominal_attributes"	value="true"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer (3)" class="StringTokenizer">
            </operator>
        </operator>
        <operator name="CorpusBasedWeightingForProfessional" class="CorpusBasedWeighting">
            <parameter key="class_to_characterize"	value="purely professional"/>
        </operator>
        <operator name="AttributeWeightSelectionForProfessional" class="AttributeWeightSelection">
            <parameter key="k"	value="2"/>
            <parameter key="weight_relation"	value="top k"/>
        </operator>
    </operator>
    <operator name="ExampleSetJoin" class="ExampleSetJoin">
    </operator>
    <operator name="IOSelector 2" class="IOSelector">
        <parameter key="io_object"	value="ExampleSet"/>
        <parameter key="select_which"	value="2"/>
    </operator>
    <operator name="Personal Attributes" class="OperatorChain" expanded="no">
        <operator name="StringTextInput" class="StringTextInput" expanded="no">
            <parameter key="filter_nominal_attributes"	value="true"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
        </operator>
        <operator name="CorpusBasedWeightingForPersonal" class="CorpusBasedWeighting">
            <parameter key="class_to_characterize"	value="purely personal"/>
        </operator>
        <operator name="AttributeWeightSelectionForPersonal" class="AttributeWeightSelection">
            <parameter key="k"	value="2"/>
            <parameter key="weight_relation"	value="top k"/>
        </operator>
    </operator>
    <operator name="ExampleSetJoin (3)" class="ExampleSetJoin">
    </operator>
    <operator name="ExampleVisualizer" class="ExampleVisualizer" breakpoints="after">
    </operator>
</operator>

My database table looks like this:

ID;LABEL;BODY
"1";"purely personal";"du freund schwester"
"2";"purely personal";"du"
"3";"purely personal";"hallo"
"4";"purely personal";"du freund"
"5";"purely personal";"hobby"
"6";"purely professional";"arbeit büro"
"7";"purely professional";"arbeit büro"
"8";"purely professional";"werktag"
"9";"purely professional";"meeting"
"10";"purely professional";"meeting"
"11";"purely personal";"du hobby"
"12";"purely personal";"schwester"
"15";"purely personal";"freund"
"13";"purely personal";"schwester"
"14";"purely personal";"du"
"16";"purely personal";"hobby"
"17";"purely personal";"hobby"
"18";"personal, but in professional context";"gute gemacht"
"19";"personal, but in professional context";"gut gemacht"
"20";"personal, but in professional context";"schlecht gemacht"
"21";"personal, but in professional context";"bericht erstellen"
"22";"personal, but in professional context";"bericht erstellen"

As you can see, I am using only label-specific words inside my bodies (e.g. bericht, erstellen, schlecht, gemacht, gut for the label "personal, but in professional context"). As stated above, I want to extract the two best keywords according to the TFIDF criterion for each class, in order to get 3*2=6 attributes for my classification task.

My specific testing dataset should lead to six different attributes. Sadly, I get only 4 attributes. Do you have an idea? Does the CorpusBasedWeighting operator expect an ExampleSet as input, which contains only data that is labeled with the "class_to_characterize" parameter? If so, then I would have to add an attribute-value filter to my input data but, in this case, the ExampleSetJoin wouldn't work.

Thank you very, very much!

IngoRM

Hi Erich,

I was able to reproduce the fact that only 4 attributes were selected. Then I placed breakpoints after the corpus based weighting and saw why: the weighting unfortunately assigns high weights to attributes with low TFIDF values and vice versa. So we have to select not the "top k" weights but the "bottom k" weights, and voilà: it seems to work now.

Here is the complete setup (I placed the calculation of TFIDF at the beginning and read the sample data, thanks for that, from a CSV file):


<operator name="Root" class="Process" expanded="yes">
    <operator name="CSVExampleSource" class="CSVExampleSource">
        <parameter key="filename"	value="C:\Dokumente und Einstellungen\Mierswa\Eigene Dateien\rm_workspace\keywords.txt"/>
        <parameter key="id_name"	value="ID"/>
        <parameter key="label_name"	value="LABEL"/>
    </operator>
    <operator name="StringTextInput (4)" class="StringTextInput" expanded="no">
        <parameter key="filter_nominal_attributes"	value="true"/>
        <list key="namespaces">
        </list>
        <parameter key="remove_original_attributes"	value="true"/>
        <operator name="StringTokenizer (4)" class="StringTokenizer">
        </operator>
    </operator>
    <operator name="IOMultiplier" class="IOMultiplier">
        <parameter key="io_object"	value="ExampleSet"/>
        <parameter key="number_of_copies"	value="2"/>
    </operator>
    <operator name="Pers/Prof Attributes" class="OperatorChain" expanded="no">
        <operator name="CorpusBasedWeightingForPersProf" class="CorpusBasedWeighting">
            <parameter key="class_to_characterize"	value="personal, but in professional context"/>
        </operator>
        <operator name="AttributeWeightSelectionForPersProf" class="AttributeWeightSelection">
            <parameter key="k"	value="2"/>
            <parameter key="weight_relation"	value="bottom k"/>
        </operator>
    </operator>
    <operator name="IOSelector" class="IOSelector">
        <parameter key="io_object"	value="ExampleSet"/>
        <parameter key="select_which"	value="2"/>
    </operator>
    <operator name="Professional Attributes" class="OperatorChain" expanded="no">
        <operator name="CorpusBasedWeightingForProfessional" class="CorpusBasedWeighting">
            <parameter key="class_to_characterize"	value="purely professional"/>
        </operator>
        <operator name="AttributeWeightSelectionForProfessional" class="AttributeWeightSelection">
            <parameter key="k"	value="2"/>
            <parameter key="weight_relation"	value="bottom k"/>
        </operator>
    </operator>
    <operator name="ExampleSetJoin" class="ExampleSetJoin">
    </operator>
    <operator name="IOSelector 2" class="IOSelector">
        <parameter key="io_object"	value="ExampleSet"/>
        <parameter key="select_which"	value="2"/>
    </operator>
    <operator name="Personal Attributes" class="OperatorChain" expanded="no">
        <operator name="CorpusBasedWeightingForPersonal" class="CorpusBasedWeighting">
            <parameter key="class_to_characterize"	value="purely personal"/>
        </operator>
        <operator name="AttributeWeightSelectionForPersonal" class="AttributeWeightSelection">
            <parameter key="k"	value="2"/>
            <parameter key="weight_relation"	value="bottom k"/>
        </operator>
    </operator>
    <operator name="ExampleSetJoin (3)" class="ExampleSetJoin">
    </operator>
    <operator name="ExampleVisualizer" class="ExampleVisualizer">
    </operator>
</operator>

Please be aware that in a future version this behaviour is likely to be changed (large TFIDFs --> large weights) and then you will have to change back to "top k".
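
(Editor's sketch, not RapidMiner code.) To make the goal concrete, here is a small Python illustration of picking k keywords per class by average TFIDF and taking the union as the feature space. It uses a few rows from the sample table in this thread. The function `class_keywords` and the exact TFIDF formula (relative term frequency times the log of inverse document frequency) are assumptions for illustration; this is not the CorpusBasedWeighting implementation, and here larger weights do mean more characteristic terms:

```python
import math
from collections import Counter, defaultdict


def class_keywords(docs, k):
    """docs: list of (label, text). Returns {label: top-k terms by mean TFIDF}."""
    tokenized = [(label, text.split()) for label, text in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for _, tokens in tokenized for t in set(tokens))

    totals = defaultdict(Counter)  # label -> term -> summed TFIDF
    class_sizes = Counter()
    for label, tokens in tokenized:
        class_sizes[label] += 1
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            totals[label][term] += tf * idf

    keywords = {}
    for label, term_totals in totals.items():
        mean = {t: s / class_sizes[label] for t, s in term_totals.items()}
        # Highest mean TFIDF first; ties are broken alphabetically.
        keywords[label] = sorted(mean, key=lambda t: (-mean[t], t))[:k]
    return keywords


docs = [
    ("purely personal", "du freund schwester"),
    ("purely personal", "hobby"),
    ("purely professional", "arbeit büro"),
    ("purely professional", "meeting"),
    ("personal, but in professional context", "gut gemacht"),
    ("personal, but in professional context", "bericht erstellen"),
]
per_class = class_keywords(docs, k=2)
feature_space = sorted(set().union(*per_class.values()))
print(len(feature_space))  # 6
```

Because the three classes use disjoint vocabularies in this sample, two keywords per class yield six distinct attributes, which is the result the process in this thread is meant to produce.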

Cheers,
Ingo

Legacy User

Hi Ingo,

again: thank you very much for your effort! It worked with the parameter "bottom k". I will remember to switch this parameter if the behaviour changes in a future release.

I have a feature space now, great!

I'm sorry that I have to ask you one last time for support, because I am failing to save and load the feature space (i.e. the keyword set). I use the AttributeConstructionsWriter to save the attributes determined by the process we devised before. When I load a new ExampleSet, I have to tokenize the string and afterwards select the subset of attributes that were saved before. Neither FeatureGenerationOperator nor AttributeConstructionsLoader delivered reasonable results. In fact, they discarded all attributes except ID and LABEL. The result was that all examples were classified into the same class. I think the problem is the reconstruction of my deduced attributes.

How can I subselect attributes according to the file created by AttributeConstructionsWriter?

I hope my explanation wasn't that confusing... in short: I want to save my deduced keywords and "apply" (in the sense of feature selection) them to another ExampleSet.

Kind regards,
Erich

PS: I really tried several approaches to work this out without success. I am confident that I won't annoy you any longer ;-)

IngoRM

Hi Erich,

the answer is simple: don't use the AttributeConstructionsWriter for this. This operator should only be used to save the descriptions of how to construct new attributes (for example, those found with operators like YAGGA2). For feature selection, you should create an AttributeWeights object corresponding to your feature selection, write the AttributeWeights to a file (AttributeWeightsWriter), read them back with the AttributeWeightsLoader, and re-create the feature selection with the AttributeWeightSelection.

Here is an example:


<operator name="Root" class="Process" expanded="yes">
    <operator name="WeightCreationData" class="ExampleSetGenerator">
        <parameter key="target_function"	value="sum classification"/>
    </operator>
    <operator name="ExampleSet2AttributeWeights" class="ExampleSet2AttributeWeights">
    </operator>
    <operator name="AttributeWeightsWriter" class="AttributeWeightsWriter">
        <parameter key="attribute_weights_file"	value="C:\home\ingo\rm_workspace\selection_weights.wgt"/>
    </operator>
    <operator name="DataConsumer" class="IOConsumer">
        <parameter key="io_object"	value="ExampleSet"/>
    </operator>
    <operator name="WeightsConsumer" class="IOConsumer" breakpoints="after">
        <parameter key="io_object"	value="AttributeWeights"/>
    </operator>
    <operator name="AttributeWeightsLoader" class="AttributeWeightsLoader">
        <parameter key="attribute_weights_file"	value="C:\home\ingo\rm_workspace\selection_weights.wgt"/>
    </operator>
    <operator name="WeightApplicationData" class="ExampleSetGenerator" breakpoints="after">
        <parameter key="number_of_attributes"	value="10"/>
        <parameter key="target_function"	value="sum classification"/>
    </operator>
    <operator name="AttributeWeightSelection" class="AttributeWeightSelection">
    </operator>
</operator>

Please note that the IOConsumers are only used to simulate the writing and loading in two different processes. The important operators are "AttributeWeightsWriter", "AttributeWeightsLoader", and "AttributeWeightSelection".
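
(Editor's sketch, not RapidMiner code.) The same write, load, and re-apply pattern can be illustrated in Python. The JSON file format, the function names, and the example weights are all assumptions made for this sketch; RapidMiner's own weights file format is different:

```python
import json
import tempfile
from pathlib import Path


def save_weights(weights: dict, path: Path) -> None:
    """Persist attribute weights (1.0 = selected, 0.0 = deselected)."""
    path.write_text(json.dumps(weights), encoding="utf-8")


def load_weights(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


def apply_selection(rows: list, weights: dict, keep=("id", "label")) -> list:
    """Keep special columns plus every attribute with a stored weight > 0.
    Attributes unknown to the weights file are dropped."""
    return [
        {name: value for name, value in row.items()
         if name in keep or weights.get(name, 0.0) > 0.0}
        for row in rows
    ]


# First "process": store the selection found on the training data.
with tempfile.TemporaryDirectory() as tmp:
    weights_file = Path(tmp) / "selection_weights.json"
    save_weights({"hobby": 1.0, "meeting": 1.0, "hallo": 0.0}, weights_file)
    # Second "process": load the selection and apply it to new data.
    weights = load_weights(weights_file)

new_rows = [{"id": 23, "label": "?", "hobby": 0.7, "hallo": 0.2, "wetter": 0.1}]
print(apply_selection(new_rows, weights))
# [{'id': 23, 'label': '?', 'hobby': 0.7}]
```

The point of the pattern is that the selection is stored as named weights, so it can be applied to any later data set that contains attributes with the same names.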

Cheers,
Ingo

montaqi

What are the equivalents of the operators used in this thread in the current RapidMiner?