"Clustering and similarity of the text documents"

zacev New Altair Community Member
edited November 5 in Community Q&A

Hello,

I have recently been working with methods for extracting key phrases from text. Now I would like to solve another problem: clustering documents and measuring the similarity between them.

It goes like this: suppose we have some security documents from various sources. I would like to examine these documents and cluster them. Sometimes documents about the same topic, device, or problem are published by several sources. The goal is to find these 'overlapping' documents and put them in one cluster. Such documents have the following features: the structure may change and some words may be added, but the key phrases stay the same, mainly a number that identifies a report, or other key phrases that appear repeatedly. Any suggestions about the model? I have tried several clustering parameters and metrics, but the results are not good. An approach based on the frequency of common words would fail because of the specific structure of the documents. Thanks in advance for any suggestions.
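To make the notion of 'overlapping' concrete, here is a minimal sketch in Python (outside RapidMiner) that scores two documents by the Jaccard similarity of their key-phrase sets. It assumes the key phrases have already been extracted for each document; the file names, phrases, and the `jaccard` helper are hypothetical and only for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Share of key phrases two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical key-phrase sets, one per document (not real data).
docs = {
    "source_a.txt": {"report-2016-0042", "router x100", "buffer overflow"},
    "source_b.txt": {"report-2016-0042", "router x100", "firmware update"},
    "source_c.txt": {"report-2016-0107", "camera z5"},
}

# Compare every pair of documents once.
names = sorted(docs)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, round(jaccard(docs[first], docs[second]), 2))
```

Because the score depends only on which key phrases are shared, added words or a changed structure do not affect it, as long as the extraction step still finds the same phrases.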

Answers

  • MartinLiebig
    Altair Employee

    Dear Zacev,

    As a first question: is it possible to make this a supervised problem by using annotated data? That would make life way easier.


    ~Martin

  • zacev
    zacev New Altair Community Member

    Would you like me to provide samples of the documents I am working with, or the process? I'm not sure if I understood correctly.

  • zacev
    zacev New Altair Community Member
    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:process_document_from_file" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="380" y="136">
    <list key="text_directories">
    <parameter key="Dokumenty1" value="C:\Users\John\Desktop\experyment1"/>
    </list>
    <parameter key="vector_creation" value="Term Frequency"/>
    <parameter key="prune_method" value="percentual"/>
    <parameter key="prune_below_percent" value="20.0"/>
    <parameter key="prune_above_percent" value="100.0"/>
    <parameter key="datamanagement" value="float_sparse_array"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="187">
    <parameter key="mode" value="linguistic sentences"/>
    </operator>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.2.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="246" y="187"/>
    <operator activated="true" class="text:transform_cases" compatibility="7.2.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="187"/>
    <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.2.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="514" y="187"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
    <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="340">
    <parameter key="attribute_filter_type" value="value_type"/>
    <parameter key="value_type" value="numeric"/>
    </operator>
    <operator activated="true" class="fast_k_means" compatibility="7.2.000" expanded="true" height="82" name="Clustering (2)" width="90" x="581" y="442">
    <parameter key="k" value="3"/>
    <parameter key="max_optimization_steps" value="10"/>
    </operator>
    <operator activated="false" class="k_means" compatibility="7.2.000" expanded="true" height="82" name="Clustering" width="90" x="581" y="595"/>
    <connect from_op="Process Documents from Files" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Clustering (2)" to_port="example set"/>
    <connect from_op="Select Attributes" from_port="original" to_port="result 1"/>
    <connect from_op="Clustering (2)" from_port="cluster model" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>

    I have uploaded the full process. So far I have taken 6 documents from three different sources. The clustering successfully put these documents into 3 different clusters, so that all documents from one source belong to the same cluster. Now, as I wrote, I would like the documents to be clustered by keywords or ID numbers instead: if two documents concern the same device name, they should be put in the same cluster, regardless of which source they come from.
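The grouping described above (same ID number or device name, regardless of source) could be sketched roughly as follows in Python, outside RapidMiner. This is only an illustration under assumptions: `ID_PATTERN` is a hypothetical format for report identifiers and would need to be adapted to the real documents, and `extract_ids` and `group_by_shared_ids` are made-up helper names. Documents that share at least one identifier end up in the same group.

```python
import re
from collections import defaultdict

# Assumed identifier format (hypothetical): PREFIX-YYYY-NNNN, e.g. ABC-2016-0042.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d{4}-\d{3,}\b")


def extract_ids(text):
    """Return the set of identifier-like tokens found in the text."""
    return set(ID_PATTERN.findall(text))


def group_by_shared_ids(docs):
    """Group document names that share at least one identifier (transitively)."""
    parent = {name: name for name in docs}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # identifier -> first document that mentioned it
    for name, text in docs.items():
        for ident in extract_ids(text):
            if ident in first_seen:
                # Merge this document's group with the earlier document's group.
                parent[find(name)] = find(first_seen[ident])
            else:
                first_seen[ident] = name

    groups = defaultdict(set)
    for name in docs:
        groups[find(name)].add(name)
    return list(groups.values())


# Hypothetical example texts (not real data).
example = {
    "a": "Advisory ABC-2016-0042 affects the X100 router.",
    "b": "Vendor note regarding ABC-2016-0042 and a firmware fix.",
    "c": "XYZ-2016-0107: issue in the Z5 camera.",
}
print(group_by_shared_ids(example))
```

Documents without any recognizable identifier stay in a group of their own, so a similarity-based step would still be needed for those.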