"How to collect each performance of a backward elimination"

JohnQuest
JohnQuest New Altair Community Member
edited November 5 in Community Q&A
I am using RapidMiner for my data mining research, and I used backward elimination for feature (attribute) selection. How can I set up the process to collect the performance of each round of the backward elimination? For example: feature set one (A, B, C, D, E, F), performance one (…); feature set two (A, B, C, D, E), performance two (…); ….

I am currently processing a data table with 21 features and 157000 items. A brute-force feature selection simply overloads my computer's memory. I would like to find the best combination and also plot a graph showing which feature combinations perform poorly and which perform well.

Thanks in advance for your kind support. :)

Answers

  • land
    land New Altair Community Member
    Hi,
    you can use the logging mechanism to log the results of all rounds and then plot them with the usual plotters of RapidMiner.
    Here's a process that illustrates how this works.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="251" width="748">
          <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
            <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
          </operator>
          <operator activated="true" class="optimize_selection_backward" expanded="true" height="94" name="Backward Elimination" width="90" x="180" y="30">
            <parameter key="maximal_number_of_eliminations" value="100"/>
            <parameter key="speculative_rounds" value="100"/>
            <parameter key="stopping_behavior" value="with decrease of more than"/>
            <parameter key="maximal_relative_decrease" value="1.0"/>
            <process expanded="true" height="581" width="764">
              <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="45" y="30">
                <process expanded="true" height="581" width="357">
                  <operator activated="true" class="naive_bayes" expanded="true" height="76" name="Naive Bayes" width="90" x="112" y="30"/>
                  <connect from_port="training" to_op="Naive Bayes" to_port="training set"/>
                  <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true" height="581" width="357">
                  <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance_classification" expanded="true" height="76" name="Performance" width="90" x="179" y="30">
                    <list key="class_weights"/>
                  </operator>
                  <connect from_port="model" to_op="Apply Model" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                  <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="log" expanded="true" height="76" name="Log" width="90" x="313" y="30">
                <list key="log">
                  <parameter key="numberOfAttributes" value="operator.Backward Elimination.value.number of attributes"/>
                  <parameter key="round" value="operator.Validation.value.applycount"/>
                  <parameter key="performance" value="operator.Validation.value.performance"/>
                  <parameter key="deviation" value="operator.Validation.value.deviation"/>
                </list>
              </operator>
              <connect from_port="example set" to_op="Validation" to_port="training"/>
              <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="performance"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Backward Elimination" to_port="example set"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
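
For readers who want to see the idea outside RapidMiner, the following is a minimal Python sketch of backward elimination that records every evaluated feature set together with its performance. It is not the RapidMiner operator: the function name `backward_elimination`, the `evaluate` callback, and the toy scoring function are all assumptions made for illustration, and the stopping rule (stop as soon as the best candidate is worse than the current set) is deliberately simpler than the operator's options.

```python
from typing import Callable, List, Sequence, Tuple

# One log row: (round number, evaluated feature set, performance).
LogRow = Tuple[int, Tuple[str, ...], float]


def backward_elimination(
    features: Sequence[str],
    evaluate: Callable[[Tuple[str, ...]], float],
) -> Tuple[Tuple[str, ...], List[LogRow]]:
    """Greedy backward elimination that logs every evaluated subset.

    `evaluate` is a hypothetical callback; in practice it would run a
    cross-validation on the given features and return its performance.
    """
    current = tuple(features)
    best_score = evaluate(current)
    log: List[LogRow] = [(0, current, best_score)]
    round_no = 0
    while len(current) > 1:
        round_no += 1
        candidates = []
        for feature in current:
            subset = tuple(f for f in current if f != feature)
            score = evaluate(subset)
            log.append((round_no, subset, score))
            candidates.append((score, subset))
        score, subset = max(candidates, key=lambda c: c[0])
        if score < best_score:
            break  # simplified stopping rule: any decrease ends the search
        current, best_score = subset, score
    return current, log


def toy_score(subset: Tuple[str, ...]) -> float:
    """Made-up performance: A and B are useful, every feature costs a little."""
    useful = {"A", "B"}
    return len(useful.intersection(subset)) - 0.01 * len(subset)


selected, rounds = backward_elimination(["A", "B", "C", "D"], toy_score)
for round_no, subset, score in rounds:
    print(round_no, subset, round(score, 2))
```

The rows in `rounds` play the same role as the table filled by the Log operator: one entry per evaluated feature set, which can then be plotted against the round number or the number of attributes.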
    Greetings,
      Sebastian
  • JohnQuest
    JohnQuest New Altair Community Member
    Dear Sebastian

    Thanks for your quick reply. Is there a way to load the XML code so that RapidMiner shows the process? Thanks for your support.


    John
  • JohnQuest
    JohnQuest New Altair Community Member
    Dear Sebastian
    I cannot find the "Process" operator in my RapidMiner. I have a process control group which contains loops etc. I am not sure we have the same version; I am using 5.0.003.

    John
  • land
    land New Altair Community Member
    Hi John,
    indeed, we don't have the same version. The latest version is 5.0.006, and I would suggest updating if possible. But this isn't the reason for the missing "Process" operator: it cannot be added by users, since it represents the complete process and is added automatically when a new process is created.
    If you want to use my posted process, copy it from here and paste it into the XML View of RapidMiner. After pressing the apply button, the process will be reconstructed from this XML fragment.

    Greetings,
      Sebastian
  • JohnQuest
    JohnQuest New Altair Community Member
    Dear Sebastian
    It worked, thanks a lot. May I ask another question: how can I set up automatic sampling within an x-fold cross-validation?
    For example, a data set contains label X (6000 items) and label Y (500 items). A 10-fold cross-validation splits the data into folds of 650 items each; we use 9 folds for training and 1 fold for testing. For each fold of the training set, we want to balance label X and label Y.
    For example, fold 1 has label Y (50) and label X (600), so we sample 50 items out of label X in fold 1, giving a newly sampled fold 1 with label Y (50) and label X (50), and do the same for the other 8 folds. Then we use the 9 sampled folds for training and the 1 unbalanced fold for testing. The experiment rotates the training and testing sets through all 10 folds and collects the final performance.
    Thanks for your kind support.
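
The balancing step described above can be sketched in plain Python. This is only an illustration of the arithmetic (600 X and 50 Y become 50 X and 50 Y), not a RapidMiner operator; the function name `downsample_to_minority` and the row layout are assumptions.

```python
import random
from collections import Counter, defaultdict


def downsample_to_minority(rows, label_of, rng):
    """Randomly reduce every class to the size of the smallest class.

    `rows` is any list of examples, `label_of` extracts the label of a row,
    and `rng` is a random.Random instance so the sampling is reproducible.
    """
    if not rows:
        return []
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # sample without replacement; the minority class is kept whole
        balanced.extend(rng.sample(group, target))
    return balanced


# One training fold as in the example: 600 items of label X, 50 of label Y.
fold = [("X", i) for i in range(600)] + [("Y", i) for i in range(50)]
balanced_fold = downsample_to_minority(fold, lambda row: row[0], random.Random(0))
print(Counter(label for label, _ in balanced_fold))
```

Only the training folds would be passed through such a function; the test fold stays unbalanced, as described in the question.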


    Best Regards

    John Quest
  • land
    land New Altair Community Member
    Hi John,
    well, this seems to be rather difficult without coding. Still, it should be possible. You could build your own small XValidation just by using operators. I will outline the steps here, but it's definitely beyond the scope of this free support forum to build it for you:
    1. Generate a new attribute that distributes the examples over the folds.
    2. Loop over each value of this attribute:
      2.1 Copy the data set and filter it according to the current value of the previously generated fold attribute: one set matching the value, the other containing the non-matching examples.
      2.2 Learn the model on the non-matching set.
      2.3 Apply it to the matching set.
      2.4 Measure the performance and store it somewhere together with the fold number.
    3. Average all performance measurements.
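
The steps above can be sketched in plain Python to make the control flow concrete. This is a minimal illustration under stated assumptions, not a RapidMiner process: the names `assign_folds` and `manual_cross_validation` are hypothetical, folds are assigned round-robin without shuffling or stratification, accuracy is used as the performance measure, and the majority-class "learner" is only a placeholder for a real model.

```python
from collections import Counter
from statistics import mean


def assign_folds(n_rows, k):
    """Step 1: a fold attribute that distributes examples over k folds."""
    return [i % k for i in range(n_rows)]


def manual_cross_validation(rows, labels, k, train, predict):
    """Steps 2 and 3: loop over fold values, evaluate, then average."""
    if k < 2 or k > len(rows):
        raise ValueError("k must be between 2 and the number of rows")
    folds = assign_folds(len(rows), k)
    per_fold = {}
    for fold in range(k):
        # 2.1 split into matching (test) and non-matching (training) examples
        test_idx = [i for i, f in enumerate(folds) if f == fold]
        train_idx = [i for i, f in enumerate(folds) if f != fold]
        # A per-fold balancing step for the training part could go here.
        # 2.2 learn the model on the non-matching examples
        model = train([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        # 2.3 apply it to the matching examples, 2.4 store accuracy per fold
        hits = sum(predict(model, rows[i]) == labels[i] for i in test_idx)
        per_fold[fold] = hits / len(test_idx)
    # 3. average all performance measurements
    return per_fold, mean(per_fold.values())


def train_majority(train_rows, train_labels):
    """Placeholder learner: remember the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, row):
    """Placeholder prediction: always answer with the remembered label."""
    return model


rows = list(range(10))
labels = ["X"] * 8 + ["Y"] * 2
per_fold, average = manual_cross_validation(
    rows, labels, 5, train_majority, predict_majority
)
print(per_fold, average)
```

Keeping the per-fold values in a dictionary keyed by fold number mirrors step 2.4, so individual folds can still be inspected after the average has been computed.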

    This way it could be achieved. Alternatively, you could ask for a quote for such an extension of the XValidation and donate it to the general functionality :)

    Greetings,
      Sebastian