Comparing a Document with Multiple Sets of Example Data
Hello,
I would like to first process one or more documents (tokenize, n-grams, etc. -> done) and then compare each document with several lists of sample data. If there is a match/similarity, the name of the respective list should be mapped to the original document. If a document contains common tokens but does not match any list, then "Others" should be mapped in addition. It should later be possible to trace which lists match a document. I imagine this being similar to sentiment analysis with a training model, except that instead of just positive and negative there are many possible assignments. Unfortunately, I can't find an approach for how to proceed.
I would appreciate your help :smileyhappy:
Hey @Nicson,
I think what you want to do is tokenize / n-gram the reference data set and the normal data set the same way, and afterwards use the Cross Distances operator with cosine similarity to find similar items.
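A minimal sketch of that idea in plain Python (the document and list contents are made up; in RapidMiner the same thing is done with Process Documents and Cross Distances): both sides are tokenized the same way, turned into term-occurrence vectors, and compared with cosine similarity.

from collections import Counter
from math import sqrt

def tokenize(text):
    # keep the preprocessing identical for documents and reference lists
    return text.lower().split()

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = {"doc_1": "lorem ipsum dolor sit amet",
             "doc_2": "at vero eos et accusam"}
reference_lists = {"List_A": "lorem ipsum dolor",
                   "List_B": "vero eos accusam"}

# "cross distances": similarity of every document to every reference list
for doc_name, doc_text in documents.items():
    doc_vector = Counter(tokenize(doc_text))
    for list_name, list_text in reference_lists.items():
        similarity = cosine(doc_vector, Counter(tokenize(list_text)))
        print(doc_name, list_name, round(similarity, 3))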
Best,
Martin
Thank you for your answers.
Yes, each list is a reference dataset, and I want to compare it with every actual document. I created a little visualization to illustrate my project.
The list "Documents" contains all documents; List_A to List_C are the reference lists, which should be checked for their similarity to the contents of the documents. It is also important that the reference data consists not only of single words but also of word pairs (n-grams).
The second picture shows how I imagine the output should look.
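To make the word-pair requirement concrete, a tiny sketch of how both the documents and the reference lists could be turned into the same unigrams plus bigrams (plain Python; the example tokens are made up):

def ngrams(tokens, n=2):
    # contiguous n-grams joined with "_", e.g. "customer_service"
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "customer service response time".split()
print(tokens + ngrams(tokens))
# ['customer', 'service', 'response', 'time',
#  'customer_service', 'service_response', 'response_time']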
kind regards
hello @Nicson - welcome to the community. Helpful hint from moderator: attach your csv/xls files to your posts so the kind people helping you don't have to recreate them.
Scott
@sgenzer Thanks for your advice, I'll take it into account for future postings.
I have just been looking at the Cross Distances operator and its tutorial process. I understand what this operator does, but I have trouble applying it to my project. Assuming I have a single document that I want to compare with a word list, what should this process look like?
Hi,
Have a look at the attached process. This would be my first try. Another way could be to use the Dictionary Based Sentiment Learner and misuse it to check how many tokens of your list occur in the text.
Cheers,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="7.5.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="340">
<parameter key="text" value="Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."/>
<description align="center" color="transparent" colored="false" width="126">Document to Test</description>
</operator>
<operator activated="true" class="text:create_document" compatibility="7.5.000" expanded="true" height="68" name="Create Document (2)" width="90" x="45" y="85">
<parameter key="text" value="Lorem Ipsum Dolor AnotherTerm"/>
<description align="center" color="transparent" colored="false" width="126">List of Words</description>
</operator>
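<!-- Process Documents (2) turns the word list document into a term-occurrence vector and delivers its word list, so the test document below is vectorized over the same attributes. -->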
<operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents (2)" width="90" x="246" y="85">
<parameter key="vector_creation" value="Term Occurrences"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="179" y="34"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
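<!-- Process Documents vectorizes the test document using the word list delivered above, so both example sets share the same attribute space. -->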
<operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="238">
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
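<!-- Cross Distances compares the test document (request set) with the word list vector (reference set) using cosine similarity. -->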
<operator activated="true" class="cross_distances" compatibility="8.0.001" expanded="true" height="103" name="Cross Distances" width="90" x="581" y="85">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Create Document (2)" from_port="output" to_op="Process Documents (2)" to_port="documents 1"/>
<connect from_op="Process Documents (2)" from_port="example set" to_op="Cross Distances" to_port="reference set"/>
<connect from_op="Process Documents (2)" from_port="word list" to_op="Process Documents" to_port="word list"/>
<connect from_op="Process Documents" from_port="example set" to_op="Cross Distances" to_port="request set"/>
<connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Hi @Nicson,
If I understand correctly what you want, here is a starting point: a process with one word list and one document.
I create an attribute with the value:
- "wordlistname_documentname" if all the words of the word list are present in the document
- "wordlistname_documentname (others)" if only some of the words of the word list are present in the document.
Here is the process:
I think this process can be improved, maybe with a Loop operator and/or a Select Subprocess operator, to generalize it to N documents and N word lists.
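The same labeling rule written out in plain Python, as a rough sketch only (the names are illustrative and not taken from the attached process):

def label(doc_name, doc_tokens, wordlist_name, wordlist):
    # which words of the list actually occur in the document
    present = set(wordlist) & set(doc_tokens)
    if present == set(wordlist):
        return wordlist_name + "_" + doc_name                 # all words of the list are present
    if present:
        return wordlist_name + "_" + doc_name + " (others)"   # only a part of the words are present
    return None                                               # no overlap (case not covered above)

doc_tokens = "lorem ipsum dolor sit amet".split()
print(label("doc_1", doc_tokens, "List_A", ["lorem", "ipsum", "dolor"]))
# List_A_doc_1
print(label("doc_1", doc_tokens, "List_B", ["lorem", "anotherterm"]))
# List_B_doc_1 (others)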
I hope it will be helpful.
Regards,
Lionel