How to ensure all nominal values appear in each slice when doing XValidation?

keith New Altair Community Member
edited November 5 in Community Q&A
Hi,

I am trying to use CrossValidation with Evolutionary Weights and Nearest Neighbor learning as described by Ingo at http://rapid-i.com/rapidforum/index.php/topic,41.msg87.html#msg87 .  Specifically, I have this excerpt:

    <operator name="WrapperXValidation" class="WrapperXValidation" expanded="yes">
        <parameter key="number_of_validations" value="5"/>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <operator name="EvolutionaryWeighting" class="EvolutionaryWeighting" expanded="yes">
            <parameter key="maximum_number_of_generations" value="20"/>
            <parameter key="p_crossover" value="0.5"/>
            <parameter key="population_size" value="2"/>
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="number_of_validations" value="5"/>
                <operator name="WeightLearner" class="NearestNeighbors">
                    <parameter key="k" value="10"/>
                    <parameter key="weighted_vote" value="true"/>
                </operator>
                <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="Performance" class="Performance">
                    </operator>
                </operator>
            </operator>
        </operator>
        <operator name="WeightedModelLearner" class="NearestNeighbors">
            <parameter key="k" value="10"/>
            <parameter key="weighted_vote" value="true"/>
        </operator>
        <operator name="WeightedApplierChain" class="OperatorChain" expanded="yes">
            <operator name="WeightedModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
                <parameter key="keep_model" value="true"/>
            </operator>
            <operator name="WeightedPerformance" class="Performance">
            </operator>
        </operator>
    </operator>
This works for me as long as all the attributes are numerical.  However, I have a couple of nominal attributes I want to use, and when I include them, I get:

AttributeTypeException Process failed Message: Attribute 'myNomAttrib': Cannot map index of nominal attribute to nominal value: index -1 is out of bounds!

What I think is happening is that when the ModelApplier node inside the XValidation node executes, sometimes the holdout data contains a nominal value for the myNomAttrib attribute that did not occur in the training data, and that is causing the ModelApplier to fail.
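
The suspected failure can be reproduced in miniature. The sketch below is a toy model only, not RapidMiner's actual code: it assumes a nominal attribute is stored as integer indices into a value mapping built from the training fold, and that an unknown value is reported as -1. The function names and the example values are made up for illustration.

```python
# Illustrative only: a toy nominal mapping, not RapidMiner's implementation.

def build_mapping(values):
    """Assign each distinct nominal value an index, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def lookup_index(mapping, value):
    """Return the index of a value, or -1 if the mapping never saw it."""
    return mapping.get(value, -1)


train_values = ["red", "green", "red", "green"]
holdout_values = ["green", "blue"]  # "blue" never occurs in the training fold

train_mapping = build_mapping(train_values)
holdout_indices = [lookup_index(train_mapping, v) for v in holdout_values]
print(holdout_indices)  # [1, -1]
```

The -1 for "blue" is the kind of out-of-bounds index the error message complains about.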

If my assessment is correct, how can I avoid this situation?  My first inclination was to use stratified sampling, but that only appears to work for nominal labels, not nominal attributes.
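
One way to test this assessment outside RapidMiner would be to export the nominal column and check, fold by fold, which values land in the holdout part without appearing in the training part. The following is a minimal sketch under that assumption. The fold assignment is a plain shuffled split and will not match RapidMiner's own sampling, and `unseen_values_per_fold` and the example column are hypothetical.

```python
import random


def unseen_values_per_fold(values, n_folds, seed=0):
    """For each fold, return the nominal values that occur in the holdout
    part but not in the corresponding training part."""
    indices = list(range(len(values)))
    random.Random(seed).shuffle(indices)  # shuffled sampling
    folds = [indices[i::n_folds] for i in range(n_folds)]
    report = []
    for holdout in folds:
        holdout_set = set(holdout)
        train_seen = {values[i] for i in indices if i not in holdout_set}
        holdout_seen = {values[i] for i in holdout}
        report.append(holdout_seen - train_seen)
    return report


# Hypothetical column: "c" occurs only once, so the fold that holds it out
# cannot also contain it in its training part.
column = ["a"] * 6 + ["b"] * 5 + ["c"]
report = unseen_values_per_fold(column, n_folds=4)
for fold_number, unseen in enumerate(report):
    print(fold_number, sorted(unseen))
```

A value that occurs only once will always be unseen in exactly one fold, whatever the sampling type, so no choice of sampling alone can rule the problem out for very rare values.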

Thanks,
Keith

Answers

  • IngoRM New Altair Community Member
    Hello Keith,

    thanks for pointing this out. I needed some time to find a data set where this occurs (it is of course more likely for smaller data sets with many nominal values), and I can confirm the problem. You were right that the re-mapping between training and test set was not possible in those cases. We fixed this by simply using the internally used value from the test set if it is not known in the training set. We have added this fix to the CVS version; it will of course also be available in the next release and in the next update of the RapidMiner Enterprise Edition.

    By the way: we are currently planning a revision of the RM data core that will cover two important aspects: 1) we will provide the possibility of working on data sets of arbitrary size without the need for external databases by providing a new data access and caching mechanism, and 2) we will get rid of the internal mappings for nominal values, which often cause compatibility problems like this one and require huge development effort to get everything right. This new data core will be part of the upcoming version 5.0 of RapidMiner.

    However, for now the fixed version should solve your problem.

    Cheers,
    Ingo
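
The fallback described in the answer can be pictured roughly as follows. This is only one reading of "use the internally used value from the test set", written as a toy model and not taken from the RapidMiner source. `remap_index`, the list-based test mapping, and the dictionary-based training mapping are all assumptions made for illustration.

```python
# Illustrative only: one possible reading of the described fix.

def remap_index(test_index, test_values_by_index, train_index_of):
    """Translate a test-set index into the training-set index of the same value.

    If the training set never saw the value, keep the test-set index
    instead of failing with an out-of-bounds index.
    """
    value = test_values_by_index[test_index]
    return train_index_of.get(value, test_index)


train_index_of = {"red": 0, "green": 1}          # value -> index (training)
test_values_by_index = ["green", "red", "blue"]  # index -> value (test)

remapped = [
    remap_index(i, test_values_by_index, train_index_of)
    for i in range(len(test_values_by_index))
]
print(remapped)  # [1, 0, 2]
```

Here "blue" keeps its test-set index 2 because the training mapping has no entry for it. In this toy version nothing prevents a kept index from coinciding with an index the training mapping uses for a different value; whether the real fix guards against that is not stated in the thread.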