"Text Processing - How to track exactly which documents contain a word?"
Tan_Koon_Chin
New Altair Community Member
Hi all,
I have run the text mining operators and obtained an ExampleSet (WordList to Data) and a WordList (Process Documents from Files). The results also show the number of occurrences of each word. How can I determine which documents each word in the results belongs to?
Example: The word "apple" appears 100 times in 80 documents. How can I track down exactly which documents contain the word "apple"? What am I missing here? Is there a solution?
Thanks in advance.
Regards.
Answers
Hello
Take a look at the following process. The example set output contains a label identifying each document, and because the documents are processed with term occurrences as the vector creation method, you can see the word counts for each document.
Regards,
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="112" y="165">
        <parameter key="text" value="apple banana lemon peach strawberry raspberry apple cherry melon"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="doc1"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document (2)" width="90" x="112" y="255">
        <parameter key="text" value=" banana lemon peach strawberry raspberry cherry melon"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="doc2"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document (3)" width="90" x="112" y="390">
        <parameter key="text" value="apple banana lemon peach strawberry raspberry apple cherry melon apple banana lemon peach strawberry raspberry apple cherry melon"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="doc3"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="130" name="Process Documents" width="90" x="380" y="165">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="Create Document (3)" from_port="output" to_op="Process Documents" to_port="documents 3"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Andrew
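For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of what the process computes: one row of term occurrences per labelled document, which can then be filtered by word. It reuses the three texts and labels from the process; the helper names `term_occurrences` and `documents_containing` are made up for illustration, and splitting on non-letter characters is only an approximation of the Tokenize operator's behaviour.

```python
import re
from collections import Counter

# The same three texts and labels as the Create Document operators in the process.
documents = {
    "doc1": "apple banana lemon peach strawberry raspberry apple cherry melon",
    "doc2": " banana lemon peach strawberry raspberry cherry melon",
    "doc3": "apple banana lemon peach strawberry raspberry apple cherry melon "
            "apple banana lemon peach strawberry raspberry apple cherry melon",
}


def term_occurrences(text):
    """Split on non-letter characters (an assumed stand-in for Tokenize) and count tokens."""
    tokens = [token for token in re.split(r"[^A-Za-z]+", text) if token]
    return Counter(tokens)


# One row of word counts per document, keyed by the document label.
rows = {label: term_occurrences(text) for label, text in documents.items()}


def documents_containing(word):
    """Return {label: count} for every document in which the word occurs at least once."""
    return {label: counts[word] for label, counts in rows.items() if counts[word] > 0}


print(documents_containing("apple"))  # {'doc1': 2, 'doc3': 4}
```

Because each row keeps its label, the question "which documents contain apple?" becomes a simple filter on the apple column.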
Thank you for your reply.
What if a large number of documents have been processed?
(With just a few documents, I could use the "Create Document" operator and label each of them.)
For example, the WordList result looks like this:
Word        Total Occurrences  In Documents
Apple       200                180
Orange      150                130
Strawberry  90                 50
The result shows that "Apple" appears 200 times in 180 documents.
Is there a way to find out from the analysis results which 180 documents those are (e.g. Doc. 10, Doc. 16, Doc. 45)?
Regards,
Tan
If you are using the "Process Documents from Files" operator, the file name of each document will appear in the output example set if the option "add meta information" is set to true. The attribute name is metadata_file.
Andrew
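Once the example set carries the metadata_file attribute, listing the files that contain a given word is a matter of filtering rows where that word's count is greater than zero. The Python sketch below illustrates this on a small CSV export; the export step, the file names, and the counts are all assumptions made up for illustration, and `files_containing` is a hypothetical helper rather than anything provided by RapidMiner.

```python
import csv
import io

# Hypothetical CSV export of the example set: one row per file, one column per word.
# File names and counts are invented for illustration only.
exported = """metadata_file,apple,orange,strawberry
report_10.txt,3,0,1
report_16.txt,1,2,0
report_45.txt,0,4,0
"""


def files_containing(csv_text, word, file_column="metadata_file"):
    """Return the file names of all rows whose count for `word` is greater than zero."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # float() tolerates counts exported as "3" as well as "3.0".
    return [row[file_column] for row in reader if float(row[word]) > 0]


print(files_containing(exported, "apple"))  # ['report_10.txt', 'report_16.txt']
```

The same filter can presumably be applied inside RapidMiner on the example set itself; the sketch only shows the logic.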
Thanks, Andrew, for the solution!
Best regards.