"TextMining using LibSVMLearner -- does sort order of Excel input file matter?"

I am using the following process to text-mine an Excel record set of about 10,000 rows. The Excel file has three columns: (1) the label, (2) the text, and (3) the ID.

I have noticed something peculiar -- when I sort the Excel file differently, the model that is produced is dramatically different. For example, if I sort on the label column, RapidMiner produces much better results than if I sort on ID. Should I always be sorting on the label column? I would have thought that RapidMiner would produce the same results on inputs sorted in any manner. Is this a bug? Can I rely on my results after seeing this behavior?

<operator name="Root" class="Process" expanded="yes">
  <parameter key="logfile" value="C:\RapidMiner\NPS_PaymentStatus\log.log"/>
  <parameter key="resultfile" value="C:\RapidMiner\NPS_PaymentStatus\Result_file.res"/>
  <operator name="MemoryCleanUp_START" class="MemoryCleanUp">
  </operator>
  <operator name="ExcelExampleSource" class="ExcelExampleSource">
    <parameter key="excel_file" value="C:\RapidMiner\NPS_PaymentStatus\RapidMiner_PaymentStatus_MASTER_MinusNEUTRALS&amp;BLANKS.xls"/>
    <parameter key="first_row_as_names" value="true"/>
    <parameter key="create_label" value="true"/>
    <parameter key="create_id" value="true"/>
    <parameter key="id_column" value="3"/>
  </operator>
  <operator name="Nominal2String" class="Nominal2String">
  </operator>
  <operator name="StringTextInput" class="StringTextInput" expanded="yes">
    <parameter key="remove_original_attributes" value="true"/>
    <parameter key="prune_below" value="10"/>
    <list key="namespaces">
    </list>
    <operator name="StringTokenizer" class="StringTokenizer">
    </operator>
    <operator name="StopwordFilterFile" class="StopwordFilterFile">
      <parameter key="file" value="C:\RapidMiner\NPS_PaymentStatus\STOPWORDS.txt"/>
    </operator>
    <operator name="TokenLengthFilter" class="TokenLengthFilter">
    </operator>
    <operator name="LovinsStemmer" class="LovinsStemmer">
    </operator>
  </operator>
  <operator name="ExampleSetWriter" class="ExampleSetWriter">
    <parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE.dat"/>
    <parameter key="attribute_description_file" value="C:\RapidMiner\NPS_PaymentStatus\ATTRIBUTE_DESCRIPTION_FILE.aml"/>
    <parameter key="quote_nominal_values" value="false"/>
  </operator>
  <operator name="MemoryCleanUp_02" class="MemoryCleanUp">
  </operator>
  <operator name="XValidation" class="XValidation" expanded="yes">
    <parameter key="keep_example_set" value="true"/>
    <parameter key="create_complete_model" value="true"/>
    <parameter key="number_of_validations" value="2"/>
    <operator name="LibSVMLearner" class="LibSVMLearner">
      <parameter key="keep_example_set" value="true"/>
      <parameter key="kernel_type" value="linear"/>
      <parameter key="degree" value="1"/>
      <list key="class_weights">
      </list>
      <parameter key="calculate_confidences" value="true"/>
    </operator>
    <operator name="OperatorChain" class="OperatorChain" expanded="yes">
      <operator name="ModelApplier" class="ModelApplier">
        <parameter key="keep_model" value="true"/>
        <list key="application_parameters">
        </list>
        <parameter key="create_view" value="true"/>
      </operator>
      <operator name="BinominalClassificationPerformance" class="BinominalClassificationPerformance">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="main_criterion" value="AUC"/>
        <parameter key="AUC" value="true"/>
        <parameter key="precision" value="true"/>
        <parameter key="recall" value="true"/>
        <parameter key="lift" value="true"/>
        <parameter key="fallout" value="true"/>
        <parameter key="f_measure" value="true"/>
        <parameter key="false_positive" value="true"/>
        <parameter key="false_negative" value="true"/>
        <parameter key="true_positive" value="true"/>
        <parameter key="true_negative" value="true"/>
        <parameter key="sensitivity" value="true"/>
        <parameter key="specificity" value="true"/>
        <parameter key="youden" value="true"/>
        <parameter key="positive_predictive_value" value="true"/>
        <parameter key="negative_predictive_value" value="true"/>
        <parameter key="psep" value="true"/>
      </operator>
      <operator name="ECS_ModelResults" class="ExampleSetWriter">
        <parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE_MODEL.dat"/>
        <parameter key="format" value="special_format"/>
        <parameter key="special_format" value="$i $l $p $d"/>
      </operator>
      <operator name="PerformanceWriter" class="PerformanceWriter">
        <parameter key="performance_file" value="C:\RapidMiner\NPS_PaymentStatus\NPS_PaymentStatus.per"/>
      </operator>
      <operator name="ResultWriter" class="ResultWriter">
        <parameter key="result_file" value="C:\RapidMiner\NPS_PaymentStatus\Result_file.res"/>
      </operator>
    </operator>
  </operator>
  <operator name="MemoryCleanUp_END" class="MemoryCleanUp">
  </operator>
</operator>

IngoRM

Hi,

Your process looks good to me in general (at least judging from the XML code alone).

"I would have thought that RapidMiner would produce the same results on inputs sorted in any manner."

Not necessarily. This depends entirely on the learning scheme. However, with a 2-fold cross-validation alone you probably cannot make any definite statement about the performance of the models. If the dramatic change in prediction performance still holds for a 10 times 10-fold cross-validation, I would be more worried.
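
To make "10 times 10-fold cross-validation" concrete outside RapidMiner, here is a minimal Python sketch. It is not RapidMiner's implementation: the names `kfold_indices`, `repeated_cv` and `majority_accuracy` are made up for illustration, and the toy majority-class scorer only stands in for training and applying the SVM. Rows are shuffled before every split, so the folds do not systematically follow the sort order of the input file.

```python
import random
from statistics import mean, stdev


def kfold_indices(n_rows, k, rng):
    """Shuffle the row indices and deal them into k test folds."""
    idx = list(range(n_rows))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv(n_rows, k, repeats, evaluate, seed=0):
    """Run `repeats` independent k-fold cross-validations.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains
    on the training rows and returns one score for the test rows.
    Returns one averaged score per repetition.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        folds = kfold_indices(n_rows, k, rng)
        fold_scores = []
        for i, test_idx in enumerate(folds):
            # Training rows are all rows that are not in the current test fold.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            fold_scores.append(evaluate(train_idx, test_idx))
        scores.append(mean(fold_scores))
    return scores


# Toy data (assumption): 100 rows sorted by label, 30 positives then 70 negatives.
labels = [1] * 30 + [0] * 70


def majority_accuracy(train_idx, test_idx):
    """Placeholder learner: predict the majority class of the training rows."""
    positives = sum(labels[j] for j in train_idx)
    prediction = 1 if positives * 2 > len(train_idx) else 0
    return sum(labels[j] == prediction for j in test_idx) / len(test_idx)


scores = repeated_cv(len(labels), k=10, repeats=10, evaluate=majority_accuracy, seed=42)
print("mean score over 10 repetitions:", mean(scores))
print("spread between repetitions:", stdev(scores))
```

If the spread between repetitions is small compared with the difference you see between the two sort orders, the difference is less likely to be an artifact of a single unlucky split.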

Cheers,
Ingo

wotsiznamiz

THX!