How to compare similarity of large number of documents

Hello,

I'm looking for a way to compute the similarity of every document to every other document in a large collection, i.e., similarity of A to B, A to C, B to C, etc. I have been using the Text Mining extension.

The process I have been using is:
Retrieve > Nominal to Text > Data to Documents > Process Documents (TF-IDF, with Tokenize inside) > Data to Similarity (CosineSimilarity)

The documents are short, under 30 words.
There are about 1200 documents.

This works for a small number of documents, normally in 2-3 seconds. However, when I try to use it for all of the 1200 documents, RapidMiner says the process completed in 0 seconds and then doesn't show any results. The bar on the bottom right remains frozen on "Creating Displays" and the program stops responding.

Does this happen because there are too many results for the operation? If so, what is the correct approach?

Help would be very much appreciated.

This is the full process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.014">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="521" width="748">
<operator activated="true" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Repository1/Martyrs/Data/document similarity test data"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.1.014" expanded="true" height="76" name="Nominal to Text" width="90" x="112" y="120">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="C"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="5.1.004" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="210">
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.004" expanded="true" height="94" name="Process Documents" width="90" x="246" y="300">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="none"/>
<parameter key="prunde_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_rank" value="5.0"/>
<parameter key="prune_above_rank" value="5.0"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="parallelize_vector_creation" value="false"/>
<process expanded="true" height="610" width="980">
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="181" y="42">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="5.1.014" expanded="true" height="76" name="Data to Similarity" width="90" x="313" y="435">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

etharpe

Any ideas on this, anyone? I imagine the solution would be to create some kind of loop:
First, RapidMiner creates a compiled list of the tokens in all the documents
Second, based on that list, RapidMiner compares the similarity of document A to document B, then C, then D, ...
Third, RapidMiner compares the similarity of document B to document C, then D, ...
Fourth, RapidMiner compares the similarity of document C to document D, then E, ...
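
For readers who want to see the shape of that loop, here is a minimal sketch in plain Python. It is not how RapidMiner implements it, and the names `tokenize`, `tfidf_vectors`, `cosine` and `pairwise_similarities` are made up for this illustration. The TF-IDF weighting used is a simple tf * log(n / df) and may differ from what the Process Documents operator computes.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Split on non-letters, similar in spirit to the "non letters" Tokenize mode.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def tfidf_vectors(docs):
    # First step: one pass over all documents to count document frequencies.
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: tf * log(n / df). Stored sparsely as a dict.
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    # Cosine similarity of two sparse vectors; 0.0 if either has zero length.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(docs):
    vectors = tfidf_vectors(docs)
    # Remaining steps: A-B, A-C, ..., then B-C, ... (each unordered pair once).
    return [(i, j, cosine(vectors[i], vectors[j]))
            for i, j in combinations(range(len(docs)), 2)]


# Made-up example documents.
docs = ["the cat sat on the mat",
        "the cat lay on the mat",
        "stock prices fell sharply"]
for i, j, sim in pairwise_similarities(docs):
    print(i, j, round(sim, 3))
```

The number of comparisons grows with the square of the number of documents, which is why the approach that is instant for a handful of documents becomes heavy for 1200.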

Problem is, I have no idea how to do this!

Eagerly awaiting your thoughts, and thank you.

Andrew2

Hello

If you input 1200 examples to the Data to Similarity operator you will get 1200*1199 pairs, roughly 1.4 million rows, so you're probably running into memory issues. My suggestion is to use the Similarity to Data operator to turn the similarity result back into an example set and see if this displays more efficiently. If not, I would write the result to the repository, a database or a file, and disconnect the result from the output so that it is not displayed at all.

You can then read the result later and use the filter or sample operators to extract the bits you're interested in.
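
To make the arithmetic concrete, here is a small Python sketch, separate from RapidMiner, that counts the pairs and streams similarity rows to a file instead of holding them for display. The helper names are hypothetical; the column names follow the first, second, similarity layout mentioned later in this thread.

```python
import csv
import io


def ordered_pairs(n):
    # Every document against every other document, in both directions.
    return n * (n - 1)


def unordered_pairs(n):
    # Each pair counted only once (A-B but not B-A).
    return n * (n - 1) // 2


def write_similarity_rows(rows, handle):
    # rows: iterable of (first, second, similarity) tuples, written one by one
    # so the full result never has to be rendered in a results view.
    writer = csv.writer(handle)
    writer.writerow(["first", "second", "similarity"])
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


print(ordered_pairs(1200))    # 1438800
print(unordered_pairs(1200))  # 719400

# Made-up rows written to an in-memory buffer; a real run would open a file.
buffer = io.StringIO()
write_similarity_rows([(0, 1, 0.5), (0, 2, 0.0)], buffer)
print(buffer.getvalue())
```

The same formula gives 16,805,900 ordered pairs for 4100 documents and about 2.5 billion for 50000, so the row count quickly becomes the limiting factor.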

regards

Andrew

etharpe

Yes, that's done it. Thank you very much.

iinnaanncc

Dear Andrew,

I am able to get similarity results (three columns: first, second, similarity) for a small number of rows in RapidMiner. But when I try to get a larger number of rows, I run into the same problem: it shows "Creating Displays" and waits forever.

Following your suggestion, I want to store the similarity results in an Excel file or in a database. But if I add a Write Excel operator, for example, it does not accept the similarity result as input. How can I export these similarity results to an Excel file?

Andrew2

Hello

Use the "Similarity to Data" operator to convert the similarity result to an example set.

regards

Andrew

iinnaanncc

Thanks!

maxfax

Even though this is a fairly old topic, my question fits here pretty well.

I would like to compare around 50000 different text cells from a CSV file. I would like to find out which 5 items are most similar to the first text item.

As I understand it, the Data to Similarity operator compares everything with everything, but I would like to compare only the first item to the rest.

Which other operator can I use?

Thank you very much for your help!

Andrew2

Hello

You could use the "Cross Distances" operator. It takes two example sets. The first would be the single item, the second would be the examples to match against it. The result would be the distances between the single example and all the others.
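
As an illustration of the one-against-the-rest idea, here is a minimal Python sketch outside RapidMiner. It assumes the documents have already been turned into sparse word vectors (dicts of token to weight), ranks by cosine similarity rather than by distance, and uses hypothetical names (`cosine`, `top_k_similar`) and made-up vectors.

```python
import heapq
import math


def cosine(a, b):
    # Cosine similarity of two sparse vectors; 0.0 if either has zero length.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, others, k=5):
    # One pass over the other vectors: len(others) comparisons,
    # instead of comparing everything with everything.
    scored = ((cosine(query, vec), idx) for idx, vec in enumerate(others))
    return [(idx, sim) for sim, idx in heapq.nlargest(k, scored)]


# Made-up word vectors; the first one is the item to match.
vectors = [
    {"cat": 1.0, "mat": 1.0},
    {"cat": 1.0, "mat": 0.5},
    {"dog": 1.0},
    {"cat": 1.0},
]
query, others = vectors[0], vectors[1:]
for idx, sim in top_k_similar(query, others, k=2):
    print(idx, round(sim, 3))
```

For 50000 items this is 49999 comparisons, and `heapq.nlargest` keeps only the best k candidates in memory rather than sorting the full list.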

regards

Andrew

roberto_r_herma

Hi, I found this thread because I faced the same issue. It takes forever to get the output of a cosine similarity analysis of 4100 documents. I followed some of the suggestions above and my flow is:

Read CSV --> Process Documents from Data --> Data to Similarity --> Similarity to Data --> Write Excel

After 24 hours it is still in the "Similarity to Data" step.

Does anyone have an idea how much time this will take? My PC specifications are as follows:

Windows 10 Enterprise Version 1607, 64-bit

Processor Intel Core i5-4310U

CPU 2.60 GHz

RAM 8 GB

Thanks for any tip

sgenzer

Hello @roberto_r_herma - so process time varies a lot depending on many factors including your machine, the size and scope of the documents, etc... One thing that I can definitely tell you is that RapidMiner loves RAM and multiple core processors. FWIW, I just upgraded to 64GB of RAM with my 6-core Intel Xeon E5 to keep things humming along.

If I were you, I'd use the Sample operator and grab a small sample of your documents first. Benchmark the sample and then gradually increase the sample size so you can get a sense of whether the full 4100 docs are going to take 2 days or 2 years.
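
A rough way to turn a sample benchmark into an estimate is to scale the measured time by the ratio of pair counts. The Python sketch below assumes the run time is dominated by the pairwise comparisons and that the machine does not run out of RAM; once it starts swapping, the real time can be far worse. The function name and the 60-second figure are made up for illustration.

```python
def extrapolate_quadratic(sample_size, sample_seconds, full_size):
    # The number of pairs grows with n * (n - 1), so scale the measured
    # time by full_pairs / sample_pairs.
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    sample_pairs = sample_size * (sample_size - 1)
    full_pairs = full_size * (full_size - 1)
    return sample_seconds * full_pairs / sample_pairs


# Hypothetical measurement: a 410-document sample took 60 seconds.
estimate = extrapolate_quadratic(410, 60.0, 4100)
print(f"estimated seconds for 4100 documents: {estimate:.0f}")
```

Timing two or three sample sizes and checking that the estimates agree is a cheap sanity check on the quadratic assumption.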

Scott

roberto_r_herma

Thanks for the tip!