How to calculate tf/idf for a corpus of text files
Hi folks,
I want to calculate tf/idf for a corpus that contains several text files. I used "Process Documents from Files", but I couldn't get any output. I expect a long list of terms in one column and the tf/idf weights in a second column.
Does anyone know what is wrong with this process?
Thank you in advance.
You need to tokenize to retrieve meaningful results:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.009">
  <operator activated="true" class="process" compatibility="5.2.009" expanded="true" name="Process">
    <process expanded="true" height="509" width="960">
      <operator activated="true" class="text:create_document" compatibility="5.2.005" expanded="true" height="60" name="Create Document" width="90" x="78" y="171">
        <parameter key="text" value="a b a b c d e f g"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.2.005" expanded="true" height="94" name="Process Documents" width="90" x="313" y="165">
        <process expanded="true" height="509" width="960">
          <operator activated="true" class="text:tokenize" compatibility="5.2.005" expanded="true" height="60" name="Tokenize" width="90" x="248" y="28"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Nils
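To check what the word list for the sample text above should roughly contain, here is a small sketch in plain Python (not RapidMiner). It assumes that splitting on non-letter characters is a reasonable approximation of what the Tokenize operator does with its default settings; the function name `term_occurrences` is made up for illustration.

```python
import re
from collections import Counter


def term_occurrences(text):
    """Split text on runs of non-letter characters and count each token."""
    # Assumption: splitting on non-letters approximates the Tokenize default.
    tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]
    return Counter(tokens)


counts = term_occurrences("a b a b c d e f g")
for term, n in sorted(counts.items()):
    print(term, n)
```

Without a tokenization step the whole text would be treated as a single unit, which is why the word list does not show the individual terms.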
Hi Nils
I have already tried your approach, as follows:
Read Documents from file->Tokenize->Stem Filter->Stop Word Filter->Process Document
Unfortunately, the output is empty. What is wrong with this setup?
Best
Can you please post your process setup as described here:,5226.0.html
This helps to find errors in your setup without too much guessing.
Nils
Dear Nils,
Actually, I tried two schemes:
Scheme 1, which uses only "Process Documents from Files":
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="371" width="772">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="75">
        <list key="text_directories"/>
        <process expanded="true">
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Scheme 2, step-by-step term extraction:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="371" width="772">
      <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="77" y="89">
        <parameter key="file" value="C:\Users\User\Documents\11\Shansuddin_NNW2007.txt"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="211" y="86"/>
      <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="75"/>
      <operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="514" y="75"/>
      <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="648" y="210">
        <process expanded="true">
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
      <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
      <connect from_op="Stem (Snowball)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Two things.
First, inside the "Process Documents" operator, connect the input port to the output port.
Second, by definition tf-idf is 0 for every term when there is only one document to process, so change the vector creation to term occurrences to count the number of tokens.
Here is your process, modified:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="371" width="772">
      <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="77" y="89">
        <parameter key="file" value="C:\Users\User\Documents\11\Shansuddin_NNW2007.txt"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="211" y="86"/>
      <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="75"/>
      <operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="514" y="75"/>
      <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="648" y="210">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <process expanded="true" height="959" width="1169">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
      <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
      <connect from_op="Stem (Snowball)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
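To see why tf-idf comes out as 0 for a single document, here is a minimal sketch in plain Python. It assumes the textbook definition, tf = count / document length and idf = log(N / df); RapidMiner's exact weighting and normalization may differ, and the function name `tf_idf` and the toy token lists are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumes tf = count / document length and idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


single = tf_idf([["a", "b", "a", "b", "c"]])
two = tf_idf([["a", "b", "a", "b", "c"], ["a", "d", "e"]])
print(single[0])
print(two[0])
```

With one document, N = df = 1 for every term, so idf = log(1) = 0 and every weight is zero. With two documents, only terms that do not occur in every document (here "b" and "c" in the first one) get a positive weight.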