Following up on the earlier Model Applier problems with internal nominal mappings, I am still having problems! It seems that the ModelApplier in RapidMiner has trouble with nominal values that are not listed first for their attribute in the .aml files.
Following the workaround, in the first step I load my training data (attached) from an Excel file, write it out with ExampleSetWriter, load it back in with ExampleSource, create a model (W-J48) and then write the model with ModelWriter:
<operator name="Root" class="Process" expanded="yes">
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="D:\ADDU\Share\RapidMiner\RapidTrain.xls"/>
<parameter key="first_row_as_names" value="true"/>
<parameter key="create_label" value="true"/>
<parameter key="label_column" value="9"/>
<parameter key="create_id" value="true"/>
<parameter key="id_column" value="8"/>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="example_set_file" value="D:\ADDU\Share\Rapidminer\train.dat"/>
<parameter key="attribute_description_file" value="D:\ADDU\Share\Rapidminer\train.aml"/>
<parameter key="overwrite_mode" value="overwrite"/>
</operator>
<operator name="ExampleSource" class="ExampleSource">
<parameter key="attributes" value="D:\ADDU\Share\RapidMiner\train.aml"/>
</operator>
<operator name="W-J48" class="W-J48">
</operator>
<operator name="ModelWriter" class="ModelWriter">
<parameter key="model_file" value="D:\ADDU\Share\Rapidminer\J48.mod"/>
<parameter key="output_type" value="XML"/>
</operator>
</operator>
Next, I read in a test set consisting of a single example from an Excel file (temp.xls) and write it out with ExampleSetWriter. I guess this step isn't strictly necessary, but it helps with what is to come:
<operator name="Root" class="Process" expanded="yes">
<operator name="ExcelExampleSource" class="ExcelExampleSource" breakpoints="after">
<parameter key="excel_file" value="D:\ADDU\Share\Rapidminer\temp.xls"/>
<parameter key="first_row_as_names" value="true"/>
<parameter key="create_label" value="true"/>
<parameter key="label_column" value="9"/>
<parameter key="create_id" value="true"/>
<parameter key="id_column" value="8"/>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="example_set_file" value="D:\ADDU\Share\Rapidminer\temp.dat"/>
<parameter key="attribute_description_file" value="D:\ADDU\Share\Rapidminer\temp.aml"/>
<parameter key="overwrite_mode" value="overwrite"/>
</operator>
</operator>
THIS PART IS THE WORKAROUND: I now manually open train.aml and temp.aml and copy all of the attribute information from train.aml over the attribute information in temp.aml, so that the attribute information in both files is exactly the same.
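For anyone who wants to avoid the manual copy and paste, here is a minimal sketch of the same step in Python. It is not part of RapidMiner; the function name `sync_attribute_descriptions` is made up for illustration, and it assumes that the root attributes of temp.aml (such as `default_source`) should be kept while every attribute, id and label entry is taken from train.aml.

```python
import xml.etree.ElementTree as ET


def sync_attribute_descriptions(train_aml: str, temp_aml: str) -> str:
    """Return temp_aml with its entries replaced by those of train_aml.

    Assumption: the root attributes of temp_aml (e.g. default_source) are
    kept, so the result still refers to the test data file.
    """
    train_root = ET.fromstring(train_aml)
    temp_root = ET.fromstring(temp_aml)

    # Drop every attribute/id/label entry currently described in temp.aml.
    for child in list(temp_root):
        temp_root.remove(child)

    # Copy the entries from train.aml in their original order, so that each
    # nominal value sits at the same position in both files.
    for child in train_root:
        temp_root.append(child)

    return ET.tostring(temp_root, encoding="unicode")


# Hypothetical, shortened file contents just to show the call.
train_text = (
    '<attributeset default_source="train.dat">'
    '<attribute name="SEX" sourcecol="1" valuetype="nominal">'
    "<value>Male</value><value>Female</value></attribute>"
    "</attributeset>"
)
temp_text = (
    '<attributeset default_source="temp.dat">'
    '<attribute name="SEX" sourcecol="1" valuetype="nominal">'
    "<value>Female</value></attribute>"
    "</attributeset>"
)
print(sync_attribute_descriptions(train_text, temp_text))
```

The sketch works on strings; reading the real train.aml and temp.aml from disk and writing the result back is left out on purpose.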
In the third part I apply the model to a new instance of test data; for this run I have used the same temp.xls. This is the process I call for my real-world predictions. I load temp.xls, then use ExampleSetWriter to write out only the temp.dat file, so that the correct attribute information copied into temp.aml in the workaround above is preserved. I have stuck in an IOConsumer just as a control method for testing.
I then load the test example with ExampleSource, pointing it at temp.aml. A FeatureIterator scrubs out any missing data, which in our set is represented by 999. Finally I load the model, apply it and write out the prediction.
<operator name="Root" class="Process" expanded="yes">
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="D:\ADDU\Share\Rapidminer\temp.xls"/>
<parameter key="first_row_as_names" value="true"/>
<parameter key="create_label" value="true"/>
<parameter key="label_column" value="9"/>
<parameter key="create_id" value="true"/>
<parameter key="id_column" value="8"/>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="example_set_file" value="D:\ADDU\Share\Rapidminer\temp.dat"/>
<parameter key="overwrite_mode" value="overwrite"/>
</operator>
<operator name="IOConsumer" class="IOConsumer">
<parameter key="io_object" value="ExampleSet"/>
</operator>
<operator name="ExampleSource" class="ExampleSource" breakpoints="after">
<parameter key="attributes" value="D:\ADDU\Share\RapidMiner\temp.aml"/>
</operator>
<operator name="FeatureIterator" class="FeatureIterator" expanded="yes">
<parameter key="work_on_input" value="false"/>
<operator name="Mapping" class="Mapping">
<parameter key="attributes" value="%{loop_feature}"/>
<list key="value_mappings">
</list>
<parameter key="replace_what" value="999"/>
<parameter key="replace_by" value="?"/>
</operator>
</operator>
<operator name="ModelLoader" class="ModelLoader">
<parameter key="model_file" value="D:\ADDU\Share\Rapidminer\J48.mod"/>
</operator>
<operator name="ModelApplier" class="ModelApplier" breakpoints="after">
<list key="application_parameters">
</list>
</operator>
<operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
<parameter key="excel_file" value="D:\ADDU\Share\Rapidminer\RapidminerPrediction.xls"/>
</operator>
</operator>
Now here is the problem. The output file has only ? where there should be data!
For example,
SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP UR SUCCESS
Female Current Long-Term Senior (Yr 12) Unemployed Own Home CBT only 191.00
becomes this in the output:
UR SUCCESS SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS)
191.0 ? Current Long-Term ? ? ? CBT only Unsuccessful
confidence(Unsuccessful) confidence(Successful)
.7 .3
Now, let's take a look at the .aml files. You will notice below that the only nominal values being written out are MARSTAT (Current Long-Term) and GROUP (CBT only). They are the only values in this example that appear FIRST in their attribute's list in the aml files. So, at least for the writing out after the model applier, only the first nominal value of each attribute is working.
<?xml version="1.0" encoding="windows-1252"?>
<attributeset default_source="train.dat">
<attribute
name = "SEX"
sourcecol = "1"
valuetype = "nominal">
<value>Male</value>
<value>Female</value>
</attribute>
<attribute
name = "MARSTAT"
sourcecol = "2"
valuetype = "nominal">
<value>Current Long-Term</value>
<value>Previous Long-Term</value>
<value>Single</value>
</attribute>
<attribute
name = "EDUC"
sourcecol = "3"
valuetype = "nominal">
<value>Uni</value>
<value>Senior (Yr 12)</value>
<value>Junior (Yr 10)</value>
<value>Primary</value>
<value>Tertiary (Non-Uni)</value>
</attribute>
<attribute
name = "EMPLOY"
sourcecol = "4"
valuetype = "nominal">
<value>Employed</value>
<value>Unemployed</value>
<value>Student</value>
</attribute>
<attribute
name = "ACCOM"
sourcecol = "5"
valuetype = "nominal">
<value>Rent</value>
<value>Own Home</value>
<value>Other</value>
</attribute>
<attribute
name = "SF36PHY1"
sourcecol = "6"
valuetype = "real"/>
<attribute
name = "GROUP"
sourcecol = "7"
valuetype = "nominal">
<value>CBT only</value>
<value>Combination</value>
<value>Refuseniks</value>
<value>Acamprosate</value>
<value>St Judes</value>
<value>Naltrexone</value>
</attribute>
<id
name = "UR"
sourcecol = "8"
valuetype = "integer"/>
<label
name = "SUCCESS"
sourcecol = "9"
valuetype = "nominal">
<value>Unsuccessful</value>
<value>Successful</value>
</label>
</attributeset>
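To make the "first in the list" observation easier to check, here is a rough Python sketch that reads an attribute description like the one above and reports which values of a test example are not at position 0 of their value list. The function names are made up for illustration, and treating the list position as RapidMiner's internal index is only my assumption.

```python
import xml.etree.ElementTree as ET


def nominal_value_positions(aml_text: str) -> dict:
    """Map each nominal attribute name to {value: position in its <value> list}."""
    root = ET.fromstring(aml_text)
    positions = {}
    for entry in root:  # <attribute>, <id> and <label> entries
        if entry.get("valuetype") != "nominal":
            continue
        values = [v.text for v in entry.findall("value")]
        positions[entry.get("name")] = {value: i for i, value in enumerate(values)}
    return positions


def values_not_listed_first(aml_text: str, example: dict) -> list:
    """Return the attribute names whose value in `example` is not at position 0."""
    positions = nominal_value_positions(aml_text)
    return [
        name
        for name, value in example.items()
        if name in positions and positions[name].get(value, -1) != 0
    ]


# Shortened excerpt of the attribute description above (three attributes only).
AML_EXCERPT = """
<attributeset default_source="train.dat">
  <attribute name="SEX" sourcecol="1" valuetype="nominal">
    <value>Male</value><value>Female</value>
  </attribute>
  <attribute name="MARSTAT" sourcecol="2" valuetype="nominal">
    <value>Current Long-Term</value><value>Previous Long-Term</value><value>Single</value>
  </attribute>
  <attribute name="EDUC" sourcecol="3" valuetype="nominal">
    <value>Uni</value><value>Senior (Yr 12)</value><value>Junior (Yr 10)</value>
  </attribute>
</attributeset>
"""

example = {"SEX": "Female", "MARSTAT": "Current Long-Term", "EDUC": "Senior (Yr 12)"}
print(values_not_listed_first(AML_EXCERPT, example))
# → ['SEX', 'EDUC']
```

For the example row from my temp.xls, the attributes flagged this way are the ones that come back as ? in the output, while MARSTAT is not flagged.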
Now, let's use a test set which consists only of first nominal values (attached as tempfirst; you will have to rename it to temp to use my code above).
It works, confirming my theory!
UR SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS) confidence(Unsuccessful) confidence(Successful)
191.0 Male Current Long-Term Uni Employed Rent CBT only Unsuccessful .7 .3
Now with a file where the first nominal value is never present (attached as tempallnotfirst; rename it to temp to use it), we get, as expected:
UR SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS) confidence(Unsuccessful) confidence(Successful)
191.0 ? ? ? ? ? ? Successful .4 .6
Now, going back to our original temp file, we can take a look at the DataTable tab at the end of the experiment. It's a bit messy, but the EDUC and EMPLOY rows below are examples of data that goes missing. In both cases the mode in the statistics column is unknown, but the value is still available in the range column!
id UR integer avg = 191 +/- 0 [191.000 ; 191.000] 0.0
prediction prediction(SUCCESS) nominal mode = Unsuccessful (1), least = Successful (0) Unsuccessful (1), Successful (0) 0.0
confidence_Unsuccessful confidence(Unsuccessful) real avg = 0.666 +/- 0 [0.666 ; 0.666] 0.0
confidence_Successful confidence(Successful) real avg = 0.334 +/- 0 [0.334 ; 0.334] 0.0
regular SEX nominal mode = unknown Female (0) 0.0
regular MARSTAT nominal mode = Current Long-Term (1), least = Current Long-Term (1) Current Long-Term (1) 0.0
regular EDUC nominal mode = unknown Senior (Yr 12) (0) 0.0
regular EMPLOY nominal mode = unknown Unemployed (0) 0.0
regular ACCOM nominal mode = unknown Own Home (0) 0.0
regular SF36PHY1 real avg = ? +/- ? [∞ ; -∞] 1.0
regular GROUP nominal mode = CBT only (1), least = CBT only (1) CBT only (1) 0.0
Now, the problem could be in producing the output from the model or in the actual model applier itself.
To try and test if the data is going missing in the model applier I ran the model applier process a few times, each time changing one of the suspect variables to a missing value and found the following predictions:
Original: 1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
Female Missing: 1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
EDUC Missing: 1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
EMPLOY Missing: 1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
Well, I think you get the picture there. The data for these variables seems to be treated by the model applier as if it is missing.
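The comparison above can also be written down as a tiny Python check. This is only a sketch of the reasoning, using the prediction and confidence values copied from the four runs listed above; the function name is made up.

```python
def runs_matching_baseline(results: dict, baseline: str = "Original") -> list:
    """Return the names of the runs whose output equals the baseline run's output."""
    reference = results[baseline]
    return [
        name
        for name, row in results.items()
        if name != baseline and row == reference
    ]


# (prediction, confidence(Unsuccessful), confidence(Successful)) per run
results = {
    "Original": ("Unsuccessful", 0.6657894736842105, 0.33421052631578946),
    "Female Missing": ("Unsuccessful", 0.6657894736842105, 0.33421052631578946),
    "EDUC Missing": ("Unsuccessful", 0.6657894736842105, 0.33421052631578946),
    "EMPLOY Missing": ("Unsuccessful", 0.6657894736842105, 0.33421052631578946),
}

print(runs_matching_baseline(results))
# → ['Female Missing', 'EDUC Missing', 'EMPLOY Missing']
```

If the model applier were really using these values, I would expect at least one of the runs to drop out of that list, unless the tree simply never splits on those attributes for this example.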
Am I going mad? Have I missed something obvious?
How do I attach my data files?