"Weka - Random Forest"

dragonedison
edited November 5 in Community Q&A
Dear everyone,

I have a training set and a test set, each with 130 attributes. I apply the Weka Random Forest to train on the training set with all of the attributes. The program selects 8 attributes and reaches 100% accuracy on the training set; however, its performance on the test set is rather poor: only 53.7% accuracy.

Then I train on the training set with only one attribute at a time and apply each of the resulting 130 classifiers to the test set. I find that some of these classifiers produce 80% accuracy on the test set, even though their performance on the training set is not the best among the 130 classifiers.

What I want to know is: how can I train an even better classifier for the test set using those attributes that produce 80% accuracy (of course, I can't use the test set to train the classifier)? Should I simply pick the good attributes and put them into the Random Forest training, or is there a better way to implement this?
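
To make the question concrete, here is roughly the kind of thing I have in mind, sketched with Weka's Java API rather than a RapidMiner process. The file names, the class name SelectThenForest and the "keep 10 attributes" number are only placeholders, and the information-gain ranking is just one example of a selection criterion that never touches the test set; I am not sure it is the right approach:

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class SelectThenForest {
    public static void main(String[] args) throws Exception {
        // Placeholder file names -- substitute the real training and test sets.
        Instances train = DataSource.read("train.arff");
        Instances test  = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Rank the attributes on the TRAINING set only (information gain here)
        // and keep the top 10; the test set is never used for the selection.
        AttributeSelection select = new AttributeSelection();
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10);
        select.setEvaluator(new InfoGainAttributeEval());
        select.setSearch(ranker);
        select.setInputFormat(train);
        Instances trainReduced = Filter.useFilter(train, select);
        Instances testReduced  = Filter.useFilter(test, select);

        // Train the random forest on the reduced attribute set
        // (I = number of trees, K = features per split, depth = maximum depth).
        RandomForest rf = new RandomForest();
        rf.setOptions(Utils.splitOptions("-I 100 -K 0 -depth 20"));
        rf.buildClassifier(trainReduced);

        // Accuracy on the held-out test set.
        Evaluation eval = new Evaluation(trainReduced);
        eval.evaluateModel(rf, testReduced);
        System.out.println("Test accuracy: " + eval.pctCorrect() + " %");
    }
}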

Thanks.
Gary

Answers

  • MuehliMan
    Hi dragonedison,

    Could you please post the workflow to clarify the process you are working with? For example, I am missing the number of trees you are using and the maximum tree depth. These strongly influence the accuracy.

    Is there a reason why you use the Weka Random Forest instead of the (normal RapidMiner) Random Forest? I use the second one and am totally satisfied with it.

    Cheers,
    Markus
  • dragonedison
    Dear Markus,

    [Two screenshots of the process setup - images not displayed]

    I grow 100 trees because I read in some articles that this number of trees gives the best trade-off between accuracy and computation time; the depth of the trees is unlimited.

    The process is as shown in the images.

    The reason I chose the Weka RF is that the RF provided by RapidMiner produces a memory error for my dataset, so I have to use the Weka one.

    Regards,
    Gary
  • MuehliMan
    That memory problem sounds strange to me. I have used the RF with more than 100 trees without problems. I think the problem could be due to the unlimited tree depth.
    In my mind, this is also the reason for the 100% accuracy with 8 attributes.
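
    To illustrate what I mean, here is a rough sketch with the Weka Java API (the file names, the class name DepthCheck and the depth of 10 are only placeholder assumptions; I, K and depth are the same parameters your W-RandomForest operator exposes). Limiting the depth and comparing the accuracy on the training set against the accuracy on the test set usually makes the overfitting visible:

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.Utils;
    import weka.core.converters.ConverterUtils.DataSource;

    public class DepthCheck {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff"); // placeholder
            Instances test  = DataSource.read("test.arff");  // placeholder
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Same parameters as the operator: I = trees, K = features per split,
            // depth = maximum depth (0 means unlimited, which is what overfits).
            RandomForest rf = new RandomForest();
            rf.setOptions(Utils.splitOptions("-I 100 -K 1 -depth 10"));
            rf.buildClassifier(train);

            // A large gap between these two numbers is the overfitting I mean.
            Evaluation onTrain = new Evaluation(train);
            onTrain.evaluateModel(rf, train);
            Evaluation onTest = new Evaluation(train);
            onTest.evaluateModel(rf, test);
            System.out.println("Train accuracy: " + onTrain.pctCorrect() + " %");
            System.out.println("Test accuracy:  " + onTest.pctCorrect() + " %");
        }
    }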

    BTW: the images are not working on my PC; could you try sending the process as XML?

    Best,
    Markus
  • dragonedison
    Dear Markus,

    Please refer to these links for the images.
    http://img307.ph.126.net/r_HvVZDAI1Xg_dMgHYNuNQ==/4786763453941444218.jpg
    http://img307.ph.126.net/NWAYQdbrX1kwscyTNjAKBg==/4786763453941444209.jpg

    I did not use an unlimited tree depth for the RapidMiner Random Forest. I used a depth of 20, but I generated more than 100 trees for the 10,000 examples, about 500 trees.

    I would like to know why an unlimited tree depth generates 100% accuracy, and what depth I should use instead.

    Regards,
    Gary
  • dragonedison
    Dear everyone,

    There is another problem that concerns me with the Weka Random Forest. When I use the Performance operator to obtain the classification accuracy of the model (see the post above for the process image), I get two kinds of classification results, namely "Multiclass Classification Performance" and "Binary Classification Performance". The "Binary Classification" performance is rather poor. I would like to know what the difference between these two performances is. Both my training and test data are two-class data.

    Thanks.
    Gary
  • MuehliMan
    Hi again,

    Unfortunately, the links are not working either, so I still have to assume a little bit.

    First of all, the Binary Classification Performance is for binominal attributes (which means there are only two values). The Multiclass Performance is for classification with more than two classes. Maybe there is something wrong with your label, because if you have only two classes, those two values should be the same.
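
    If it helps, a quick way to double-check the label outside of RapidMiner is to load the data with Weka's Java API and print the values the class attribute actually contains (train.arff and the class name LabelCheck are only placeholders); a clean two-class label should show exactly two values:

    import weka.core.Attribute;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LabelCheck {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("train.arff"); // placeholder
            data.setClassIndex(data.numAttributes() - 1);

            // Print the values the label actually contains; a clean two-class
            // label should list exactly two values.
            Attribute label = data.classAttribute();
            System.out.println("Label has " + label.numValues() + " values:");
            for (int i = 0; i < label.numValues(); i++) {
                System.out.println("  " + label.value(i));
            }
        }
    }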

    OK, for such a big dataset, I would recommend reducing the number of trees. My PC got stuck with about 1200 trees on 250 examples. Try starting with something like 100 trees and increasing the number stepwise towards 1000 if the performance is poor. That way you can also see how the performance changes and where the memory limit is.
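
    Sketched with the Weka Java API, that stepwise search could look roughly like this (the file names, the class name TreeCountSweep and the fixed depth of 20 are only placeholder assumptions; -K 0 lets Weka pick the default number of random features per split):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.Utils;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TreeCountSweep {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff"); // placeholder
            Instances test  = DataSource.read("test.arff");  // placeholder
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Grow forests of increasing size and watch how the test accuracy
            // (and the runtime) develops before hitting the memory limit.
            for (int trees = 100; trees <= 1000; trees += 100) {
                RandomForest rf = new RandomForest();
                rf.setOptions(Utils.splitOptions("-I " + trees + " -K 0 -depth 20"));
                rf.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(rf, test);
                System.out.println(trees + " trees: " + eval.pctCorrect() + " % test accuracy");
            }
        }
    }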

    Hope this helps!

    Best,
    Markus
  • dragonedison
    Dear Markus,

    Thank you for giving so much important advice. I have decided to paste the XML here for your reference.

    Regards,
    Gary

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.10" expanded="true" name="Process">
        <process expanded="true" height="460" width="748">
          <operator activated="true" class="retrieve" compatibility="5.0.10" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
            <parameter key="repository_entry" value="../Data/ALLAtts/SurfacePatch/65_1W_OverSam"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.0.10" expanded="true" height="76" name="Select Attributes" width="90" x="45" y="120">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="bind|cMax"/>
          </operator>
          <operator activated="true" class="loop_attribute_subsets" compatibility="5.0.10" expanded="true" height="60" name="Loop Subsets" width="90" x="197" y="217">
            <parameter key="use_exact_number" value="true"/>
            <parameter key="exact_number_of_attributes" value="1"/>
            <parameter key="parallelize_subprocess" value="true"/>
            <process expanded="true" height="460" width="614">
              <operator activated="true" class="retrieve" compatibility="5.0.10" expanded="true" height="60" name="Retrieve (3)" width="90" x="45" y="300">
                <parameter key="repository_entry" value="//Gary/Data/ALLAtts/SurfacePatch/65test"/>
              </operator>
              <operator activated="true" class="weka:W-RandomForest" compatibility="5.0.1" expanded="true" height="76" name="W-RandomForest" width="90" x="45" y="120">
                <parameter key="I" value="100.0"/>
                <parameter key="K" value="1.0"/>
                <parameter key="depth" value="0"/>
              </operator>
              <operator activated="true" class="apply_model" compatibility="5.0.10" expanded="true" height="76" name="Apply Model" width="90" x="179" y="120">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="5.0.10" expanded="true" height="76" name="Performance" width="90" x="315" y="30"/>
              <operator activated="true" class="write_as_text" compatibility="5.0.10" expanded="true" height="76" name="Write as Text" width="90" x="447" y="30">
                <parameter key="result_file" value="result(rf1WOverSamTest).dat"/>
              </operator>
              <operator activated="true" class="apply_model" compatibility="5.0.10" expanded="true" height="76" name="Apply Model (2)" width="90" x="179" y="210">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="5.0.10" expanded="true" height="76" name="Performance (2)" width="90" x="313" y="210"/>
              <operator activated="true" class="write_as_text" compatibility="5.0.10" expanded="true" height="76" name="Write as Text (2)" width="90" x="447" y="210">
                <parameter key="result_file" value="result(rfALLtest).dat"/>
              </operator>
              <connect from_port="example set" to_op="W-RandomForest" to_port="training set"/>
              <connect from_op="Retrieve (3)" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="W-RandomForest" from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_op="W-RandomForest" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Apply Model" from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_op="Performance" from_port="performance" to_op="Write as Text" to_port="input 1"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_op="Write as Text (2)" to_port="input 1"/>
              <portSpacing port="source_example set" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Loop Subsets" to_port="example set"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>