"TextMining using LibSVMLearner -- does sort order of Excel input file matter?"
wotsiznamiz
New Altair Community Member
I am using the following code to text-mine a ~10,000 row Excel Record Set. The Excel file has three columns: (1) the label, (2) the text, and (3) the ID.
I have noticed something peculiar -- when I sort the Excel file differently, the model that is produced is dramatically different. For example, if I sort on the label column, RapidMiner produces much better results than if I sort on ID. Should I always be sorting on the label column? I would have thought that RapidMiner would produce the same results on inputs sorted in any manner. Is this a bug? Can I rely on my results after seeing this behavior?
<operator name="Root" class="Process" expanded="yes">
<parameter key="logfile" value="C:\RapidMiner\NPS_PaymentStatus\log.log"/>
<parameter key="resultfile" value="C:\RapidMiner\NPS_PaymentStatus\Result_file.res"/>
<operator name="MemoryCleanUp_START" class="MemoryCleanUp">
</operator>
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="C:\RapidMiner\NPS_PaymentStatus\RapidMiner_PaymentStatus_MASTER_MinusNEUTRALS&amp;BLANKS.xls"/>
<parameter key="first_row_as_names" value="true"/>
<parameter key="create_label" value="true"/>
<parameter key="create_id" value="true"/>
<parameter key="id_column" value="3"/>
</operator>
<operator name="Nominal2String" class="Nominal2String">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="remove_original_attributes" value="true"/>
<parameter key="prune_below" value="10"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="StopwordFilterFile" class="StopwordFilterFile">
<parameter key="file" value="C:\RapidMiner\NPS_PaymentStatus\STOPWORDS.txt"/>
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
</operator>
<operator name="LovinsStemmer" class="LovinsStemmer">
</operator>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE.dat"/>
<parameter key="attribute_description_file" value="C:\RapidMiner\NPS_PaymentStatus\ATTRIBUTE_DESCRIPTION_FILE.aml"/>
<parameter key="quote_nominal_values" value="false"/>
</operator>
<operator name="MemoryCleanUp_02" class="MemoryCleanUp">
</operator>
<operator name="XValidation" class="XValidation" expanded="yes">
<parameter key="keep_example_set" value="true"/>
<parameter key="create_complete_model" value="true"/>
<parameter key="number_of_validations" value="2"/>
<operator name="LibSVMLearner" class="LibSVMLearner">
<parameter key="keep_example_set" value="true"/>
<parameter key="kernel_type" value="linear"/>
<parameter key="degree" value="1"/>
<list key="class_weights">
</list>
<parameter key="calculate_confidences" value="true"/>
</operator>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
<operator name="ModelApplier" class="ModelApplier">
<parameter key="keep_model" value="true"/>
<list key="application_parameters">
</list>
<parameter key="create_view" value="true"/>
</operator>
<operator name="BinominalClassificationPerformance" class="BinominalClassificationPerformance">
<parameter key="keep_example_set" value="true"/>
<parameter key="main_criterion" value="AUC"/>
<parameter key="AUC" value="true"/>
<parameter key="precision" value="true"/>
<parameter key="recall" value="true"/>
<parameter key="lift" value="true"/>
<parameter key="fallout" value="true"/>
<parameter key="f_measure" value="true"/>
<parameter key="false_positive" value="true"/>
<parameter key="false_negative" value="true"/>
<parameter key="true_positive" value="true"/>
<parameter key="true_negative" value="true"/>
<parameter key="sensitivity" value="true"/>
<parameter key="specificity" value="true"/>
<parameter key="youden" value="true"/>
<parameter key="positive_predictive_value" value="true"/>
<parameter key="negative_predictive_value" value="true"/>
<parameter key="psep" value="true"/>
</operator>
<operator name="ECS_ModelResults" class="ExampleSetWriter">
<parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE_MODEL.dat"/>
<parameter key="format" value="special_format"/>
<parameter key="special_format" value="$i $l $p $d"/>
</operator>
<operator name="PerformanceWriter" class="PerformanceWriter">
<parameter key="performance_file" value="C:\RapidMiner\NPS_PaymentStatus\NPS_PaymentStatus.per"/>
</operator>
<operator name="ResultWriter" class="ResultWriter">
<parameter key="result_file" value="C:\RapidMiner\NPS_PaymentStatus\Result_file.res"/>
</operator>
</operator>
</operator>
<operator name="MemoryCleanUp_END" class="MemoryCleanUp">
</operator>
</operator>
Answers
Hi,
Your process looks good to me in general (at least judging from the XML code alone).
"I would have thought that RapidMiner would produce the same results on inputs sorted in any manner."
Not necessarily. This depends entirely on the learning scheme. However, with a single 2-fold cross-validation you probably cannot make any definite statement about the performance of the models. If the dramatic change in prediction performance still holds for a 10 times 10-fold cross-validation, I would be more worried.
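To illustrate why row order can matter for a k-fold split at all, here is a minimal Python sketch. It is not RapidMiner code and makes no claim about how XValidation samples internally. It assumes folds are cut positionally from the rows in the order they arrive. The toy data and the helper names (`contiguous_folds`, `label_counts`, `repeated_shuffled_folds`) are hypothetical.

```python
import random
from collections import Counter


def contiguous_folds(rows, k):
    """Split rows into k contiguous blocks, in the order given (no shuffling)."""
    n = len(rows)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [rows[bounds[i]:bounds[i + 1]] for i in range(k)]


def label_counts(folds):
    """Count how many rows of each label land in each fold."""
    return [dict(Counter(label for _id, label in fold)) for fold in folds]


def repeated_shuffled_folds(rows, k, repeats, seed=0):
    """Build k folds several times, shuffling with a fixed seed each time.

    Sorting first puts the rows in a canonical order, so the result
    no longer depends on how the input happened to be sorted.
    """
    rng = random.Random(seed)
    canonical = sorted(rows)
    runs = []
    for _ in range(repeats):
        shuffled = canonical[:]
        rng.shuffle(shuffled)
        runs.append(contiguous_folds(shuffled, k))
    return runs


# Hypothetical toy data: (id, label) pairs with labels alternating by id.
rows = [(i, "pos" if i % 2 == 0 else "neg") for i in range(8)]
by_id = sorted(rows, key=lambda r: r[0])
by_label = sorted(rows, key=lambda r: r[1])

print(label_counts(contiguous_folds(by_id, 2)))
# [{'pos': 2, 'neg': 2}, {'pos': 2, 'neg': 2}]
print(label_counts(contiguous_folds(by_label, 2)))
# [{'neg': 4}, {'pos': 4}]
```

Under this assumption the same rows produce very different folds depending on the sort order, so a single 2-fold run can swing widely. Averaging over repeated shuffled runs, as in `repeated_shuffled_folds`, is one way to get an estimate that does not depend on the input order.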
Cheers,
Ingo
THX!