Text Mining - Detect Outliers
Hi everybody,
I am fairly new to RapidMiner and have encountered a problem I haven't been able to solve for a couple of days now.
The data is about companies (as ID) and parts of their audit opinions.
My dataset contains about 4000 examples with several attributes, but I only select 2 of them for the further steps. The first attribute is an ID and the second attribute contains text (see attachment). I am trying to detect outliers based only on the text (e.g. weird words and so on).
I also have the XML code:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve KAM_Texts Alles" width="90" x="45" y="34">
<parameter key="repository_entry" value="../Daten/KAM_Texts Alles"/>
</operator>
<operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="82" name="Select Filter Set role" width="90" x="246" y="34">
<process expanded="true">
<operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="ID|Verkettet"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="Verkettet.is_not_missing."/>
</list>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role" width="90" x="380" y="34">
<parameter key="attribute_name" value="ID"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles">
<parameter key="Verkettet" value="regular"/>
<parameter key="ID" value="id"/>
</list>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.2.000" expanded="true" height="82" name="Nominal to Text" width="90" x="514" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Verkettet"/>
</operator>
<connect from_port="in 1" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="380" y="34">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_above_percent" value="70.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="112" y="34"/>
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="380" y="34">
<parameter key="min_chars" value="3"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="34"/>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="648" y="34">
<parameter key="file" value="E:\Studium\MASTER FACT\MASTERARBEIT\Dictonary filtering stopwords .txt"/>
</operator>
<operator activated="true" class="text:stem_porter" compatibility="8.1.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="782" y="34"/>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="detect_outlier_distances" compatibility="8.2.000" expanded="true" height="82" name="Detect Outlier (Distances)" width="90" x="648" y="34">
<parameter key="distance_function" value="cosine distance"/>
</operator>
<connect from_op="Retrieve KAM_Texts Alles" from_port="output" to_op="Select Filter Set role" to_port="in 1"/>
<connect from_op="Select Filter Set role" from_port="out 1" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Detect Outlier (Distances)" to_port="example set input"/>
<connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
<connect from_op="Detect Outlier (Distances)" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
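To show what I mean, here is the same idea sketched in Python/scikit-learn (just an illustration on dummy data, not what RapidMiner actually runs; stemming and my custom stopword dictionary are left out, and my reading of what Detect Outlier (Distances) does may be off):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Dummy stand-in for the real data: an ID plus the concatenated text attribute "Verkettet"
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Verkettet": [
        "revenue recognition was identified as a key audit matter",
        "impairment of goodwill was identified as a key audit matter",
        "valuation of inventory was identified as a key audit matter",
        "revenue recognition and related disclosures were audited",
        None,                                                        # missing text gets filtered out
        "the weather was nice and the football match went well",    # deliberately different
    ],
})

# Select Attributes + Filter Examples (Verkettet is not missing)
df = df[["ID", "Verkettet"]].dropna(subset=["Verkettet"]).reset_index(drop=True)

# Tokenize, lowercase, keep tokens with >= 3 characters, drop English stopwords,
# prune terms occurring in more than 70% of the documents
vectorizer = TfidfVectorizer(lowercase=True,
                             token_pattern=r"(?u)\b\w{3,}\b",
                             stop_words="english",
                             max_df=0.7)
tfidf = vectorizer.fit_transform(df["Verkettet"])

# Detect Outlier (Distances) with cosine distance, as I understand it:
# flag the n examples whose k-th nearest neighbour is farthest away
dist = cosine_distances(tfidf)
np.fill_diagonal(dist, np.inf)                          # ignore self-distance
k_neighbors, n_outliers = 2, 2                          # the operator defaults are 10 and 10
kth_dist = np.sort(dist, axis=1)[:, k_neighbors - 1]
df["outlier"] = False
df.loc[np.argsort(kth_dist)[::-1][:n_outliers], "outlier"] = True
print(df[["ID", "outlier"]])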
Somehow, as a result, I get the first 10 examples flagged as outliers, which I think is kind of weird or wrong.
I am not even sure whether this is the right way to tackle my problem.
Anyway, I would really appreciate any hints or solutions.
Thank you.
Best
Flo
Answers
-
Hi,
I would recommend using the LOF algorithm from the Anomaly Detection extension. The Detect Outlier (Distances) operator always flags a fixed number of examples (the top 10 by default, i.e. those with the highest distances) as outliers.
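To see the difference, here is a minimal sketch of the LOF idea in Python/scikit-learn (an illustration on dummy texts, not the extension's operator itself): every example gets a continuous anomaly score, and you pick the cutoff yourself instead of fixing a number of outliers up front.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LocalOutlierFactor

# Dummy texts standing in for the real audit opinions
texts = [
    "revenue recognition was identified as a key audit matter",
    "impairment of goodwill was identified as a key audit matter",
    "valuation of inventory was identified as a key audit matter",
    "revenue recognition and related disclosures were audited",
    "impairment testing of goodwill and intangible assets",
    "the weather was nice and the football match went well",  # deliberately off-topic
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

lof = LocalOutlierFactor(n_neighbors=3, metric="cosine")
lof.fit(tfidf)
scores = -lof.negative_outlier_factor_        # higher = more anomalous
print(np.argsort(scores)[::-1])               # examples ranked from most to least anomalous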
BR,
Martin
-
Hi @flo
Interesting problem, though it is hard to say much in general without the original text data. I am, however, not sure that 'detect outliers' would work well on the data you will presumably get out of your process.
Just a quick wild guess: have you tried using a clustering algorithm on the TF-IDF matrix instead (the simplest being k-Means, for example, with a few different k values) to see whether it suggests some clusters that are significantly smaller than the others? My intuition tells me it might work, though it needs to be tried on the real data. You could also try manually adding a few entries with really different content (for example, from a totally different domain) and see whether the clustering algorithm separates them from the others.
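To make it a bit more concrete, here is a quick sketch of that idea in Python/scikit-learn with dummy texts (just an illustration, not your RapidMiner process): cluster the TF-IDF matrix for a few values of k and look at the cluster sizes.

from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Dummy texts standing in for the real audit opinions
texts = [
    "revenue recognition was identified as a key audit matter",
    "impairment of goodwill was identified as a key audit matter",
    "valuation of inventory was identified as a key audit matter",
    "revenue recognition and related disclosures were audited",
    "impairment testing of goodwill and intangible assets",
    "the weather was nice and the football match went well",   # deliberately off-topic entry
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

for k in (2, 3):                                   # try a few different k values
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(tfidf)
    sizes = sorted(Counter(labels).values())
    print(k, sizes)                                # clusters that stay very small are outlier candidates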
-
Thank you for your answers.
@mschmitz Regarding the Detect Outlier operator, I was just wondering because when I set the "number of outliers" to 10 it marks the first 10 examples as outlier = true, and when I set it to 15 it marks the first 15 examples as outliers. Anyway, the LOF algorithm was a good hint; however, I think the k-NN Global Anomaly Score might be better suited, based on the results.
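(For reference, my understanding of the idea behind that score, sketched in Python/scikit-learn with dummy texts rather than the extension's actual implementation: each example is scored by its mean cosine distance to its k nearest neighbours, and I can then sort or threshold the scores myself.)

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Dummy texts standing in for the real audit opinions
texts = [
    "revenue recognition was identified as a key audit matter",
    "impairment of goodwill was identified as a key audit matter",
    "valuation of inventory was identified as a key audit matter",
    "revenue recognition and related disclosures were audited",
    "the weather was nice and the football match went well",  # deliberately off-topic
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

k = 2
nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(tfidf)  # +1: the nearest point is the example itself
distances, _ = nn.kneighbors(tfidf)
scores = distances[:, 1:].mean(axis=1)        # mean distance to the k nearest neighbours
print(np.argsort(scores)[::-1])               # ranked from most to least anomalous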
@kypexin The clustering did help; however, I feel there is always one example in each cluster that doesn't really fit with the other examples in that cluster, so I guess I have to play around more with the k values.
It helped me get further with my problem, so thank you very much.