"TextMining using LibSVMLearner -- does sort order of Excel input file matter?"
wotsiznamiz
I am using the following code to text-mine a ~10,000 row Excel Record Set. The Excel file has three columns: (1) the label, (2) the text, and (3) the ID.
I have noticed something peculiar -- when I sort the Excel file differently, the model that is produced is dramatically different. For example, if I sort on the label column, RapidMiner produces much better results than if I sort on ID. Should I always be sorting on the label column? I would have thought that RapidMiner would produce the same results on inputs sorted in any manner. Is this a bug? Can I rely on my results after seeing this behavior?
<operator name="Root" class="Process" expanded="yes">
<parameter key="logfile" value="C:\RapidMiner\NPS_PaymentStatus\log.log"/>
<parameter key="resultfile" value="C:\RapidMiner\NPS_PaymentStatus\Result_file.res"/>
<operator name="MemoryCleanUp_START" class="MemoryCleanUp">
</operator>
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="C:\RapidMiner\NPS_PaymentStatus\RapidMiner_PaymentStatus_MASTER_MinusNEUTRALS&amp;BLANKS.xls"/>
<parameter key="first_row_as_names" value="true"/>
<parameter key="create_label" value="true"/>
<parameter key="create_id" value="true"/>
<parameter key="id_column" value="3"/>
</operator>
<operator name="Nominal2String" class="Nominal2String">
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="remove_original_attributes" value="true"/>
<parameter key="prune_below" value="10"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
<operator name="StopwordFilterFile" class="StopwordFilterFile">
<parameter key="file" value="C:\RapidMiner\NPS_PaymentStatus\STOPWORDS.txt"/>
</operator>
<operator name="TokenLengthFilter" class="TokenLengthFilter">
</operator>
<operator name="LovinsStemmer" class="LovinsStemmer">
</operator>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE.dat"/>
<parameter key="attribute_description_file" value="C:\RapidMiner\NPS_PaymentStatus\ATTRIBUTE_DESCRIPTION_FILE.aml"/>
<parameter key="quote_nominal_values" value="false"/>
</operator>
<operator name="MemoryCleanUp_02" class="MemoryCleanUp">
</operator>
<operator name="XValidation" class="XValidation" expanded="yes">
<parameter key="keep_example_set" value="true"/>
<parameter key="create_complete_model" value="true"/>
<parameter key="number_of_validations" value="2"/>
<operator name="LibSVMLearner" class="LibSVMLearner">
<parameter key="keep_example_set" value="true"/>
<parameter key="kernel_type" value="linear"/>
<parameter key="degree" value="1"/>
<list key="class_weights">
</list>
<parameter key="calculate_confidences" value="true"/>
</operator>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
<operator name="ModelApplier" class="ModelApplier">
<parameter key="keep_model" value="true"/>
<list key="application_parameters">
</list>
<parameter key="create_view" value="true"/>
</operator>
<operator name="BinominalClassificationPerformance" class="BinominalClassificationPerformance">
<parameter key="keep_example_set" value="true"/>
<parameter key="main_criterion" value="AUC"/>
<parameter key="AUC" value="true"/>
<parameter key="precision" value="true"/>
<parameter key="recall" value="true"/>
<parameter key="lift" value="true"/>
<parameter key="fallout" value="true"/>
<parameter key="f_measure" value="true"/>
<parameter key="false_positive" value="true"/>
<parameter key="false_negative" value="true"/>
<parameter key="true_positive" value="true"/>
<parameter key="true_negative" value="true"/>
<parameter key="sensitivity" value="true"/>
<parameter key="specificity" value="true"/>
<parameter key="youden" value="true"/>
<parameter key="positive_predictive_value" value="true"/>
<parameter key="negative_predictive_value" value="true"/>
<parameter key="psep" value="true"/>
</operator>
<operator name="ECS_ModelResults" class="ExampleSetWriter">
<parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE_MODEL.dat"/>
<parameter key="format" value="special_format"/>
<parameter key="special_format" value="$i $l $p $d"/>
</operator>
<operator name="PerformanceWriter" class="PerformanceWriter">
<parameter key="performance_file" value="C:\RapidMiner\NPS_PaymentStatus\NPS_PaymentStatus.per"/>
</operator>
<operator name="ResultWriter" class="ResultWriter">
<parameter key="result_file" value="C:\RapidMiner\NPS_PaymentStatus\Result_file.res"/>
</operator>
</operator>
</operator>
<operator name="MemoryCleanUp_END" class="MemoryCleanUp">
</operator>
</operator>
IngoRM
Hi,
Your process in general looks good to me (at least from looking at the XML code alone).
I would have thought that RapidMiner would produce the same results on inputs sorted in any manner.
Not necessarily. This depends entirely on the learning scheme. However, with a single 2-fold cross validation you probably cannot make any definite statement about the performance of the models. If the dramatic change in prediction performance still holds for a 10 times 10-fold cross validation, I would be more worried.
Cheers,
Ingo
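Editorial note: the "10 times 10-fold cross validation" suggested above can be sketched outside RapidMiner as follows. This is a minimal illustration in plain Python, not the RapidMiner implementation. The function names (`kfold_indices`, `repeated_cv`, `majority_baseline`), the toy labels, and the majority-class learner are all hypothetical stand-ins; in practice the scoring function would train and evaluate the actual SVM.

```python
import random
from statistics import mean, stdev


def kfold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv(examples, labels, train_and_score, k=10, repeats=10, seed=0):
    """Run `repeats` independently shuffled k-fold cross validations.

    Returns one score per fold, i.e. k * repeats scores in total.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for test_idx in kfold_indices(len(examples), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(examples)) if i not in held_out]
            scores.append(train_and_score(examples, labels, train_idx, test_idx))
    return scores


def majority_baseline(examples, labels, train_idx, test_idx):
    """Hypothetical stand-in learner: always predict the most frequent training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    return mean(1.0 if labels[i] == majority else 0.0 for i in test_idx)


if __name__ == "__main__":
    # Toy data (an assumption for illustration): 300 rows, about two thirds "pos".
    labels = ["pos" if i % 3 else "neg" for i in range(300)]
    examples = list(range(len(labels)))  # placeholder features

    for k, repeats in [(2, 1), (10, 10)]:
        scores = repeated_cv(examples, labels, majority_baseline, k=k, repeats=repeats)
        print(f"{repeats} x {k}-fold: {len(scores)} fold scores, "
              f"mean accuracy {mean(scores):.3f}, std dev {stdev(scores):.3f}")
```

A single 2-fold run yields only two fold scores, so one unlucky split can dominate the estimate. Averaging 100 fold scores from ten reshuffled 10-fold runs gives a more stable basis for comparing the two sort orders.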
wotsiznamiz
THX!