How to calculate tf/idf for a corpus of text files

bahamini · August 2012

Hi folks,

I want to calculate tf/idf for a corpus that contains several text files. I used "Process Documents from Files", but I couldn't get any output. I expect a long list of terms in one column and their tf/idf weights in a second column.
Does anyone know what is wrong with this process?

Thank you in advance.
Best

Nils_Woehler · August 2012

Hi,

you need to tokenize to retrieve meaningful results:



<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.009">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.009" expanded="true" name="Process">
    <process expanded="true" height="509" width="960">
      <operator activated="true" class="text:create_document" compatibility="5.2.005" expanded="true" height="60" name="Create Document" width="90" x="78" y="171">
        <parameter key="text" value="a b a b c d e f g"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.2.005" expanded="true" height="94" name="Process Documents" width="90" x="313" y="165">
        <process expanded="true" height="509" width="960">
          <operator activated="true" class="text:tokenize" compatibility="5.2.005" expanded="true" height="60" name="Tokenize" width="90" x="248" y="28"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Best,
Nils

bahamini · August 2012

Hi Nils

I have already tried your approach, as follows:
Read Documents from file -> Tokenize -> Stem Filter -> Stop Word Filter -> Process Document
Unfortunately, the output is completely empty. What's wrong with this setup?

Best

Nils_Woehler · August 2012

Hi,

can you please post your process setup as it is described here: http://rapid-i.com/rapidforum/index.php/topic,5226.0.html
This helps to find errors in your setup without too much guessing.

Best,
Nils

bahamini · August 2012

Dear Nils,

Actually, I tried two schemes:
Scheme 1, which uses only "Process Documents from Files":

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="371" width="772">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="75">
        <list key="text_directories"/>
        <process expanded="true">
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Scheme 2, step-by-step term extraction:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="371" width="772">
      <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="77" y="89">
        <parameter key="file" value="C:\Users\User\Documents\11\Shansuddin_NNW2007.txt"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="211" y="86"/>
      <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="75"/>
      <operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="514" y="75"/>
      <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="648" y="210">
        <process expanded="true">
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
      <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
      <connect from_op="Stem (Snowball)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Andrew2 · August 2012

Hello,

Two things:

Firstly, inside the "Process Documents" operator, connect the input port to the output port so that the documents pass through it.

Secondly, tf-idf is by definition 0 for every term when there is only one document to process (with the usual idf = log(N/df), a single document gives log(1/1) = 0), so change the vector creation to term occurrences to count the tokens instead.

Here is your process, modified:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="371" width="772">
      <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="77" y="89">
        <parameter key="file" value="C:\Users\User\Documents\11\Shansuddin_NNW2007.txt"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="211" y="86"/>
      <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="75"/>
      <operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="514" y="75"/>
      <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="648" y="210">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <process expanded="true" height="959" width="1169">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
      <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
      <connect from_op="Stem (Snowball)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

regards

Andrew
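
For the original question (tf-idf over a whole corpus rather than one file), the two pieces of advice above can be combined with Scheme 1. The following is only a sketch, not a tested process: it assumes all text files sit in one folder, the class name "corpus" and the path "C:\path\to\corpus" are hypothetical placeholders, and "TF-IDF" is assumed to be the accepted value of the vector_creation parameter. Tokenize sits inside "Process Documents from Files" and is wired from the inner input port to the inner output port.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="371" width="772">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="75">
        <!-- Hypothetical class name and directory: replace with the folder that holds your text files -->
        <list key="text_directories">
          <parameter key="corpus" value="C:\path\to\corpus"/>
        </list>
        <!-- Assumed parameter value; with several documents the tf-idf weights are no longer all zero -->
        <parameter key="vector_creation" value="TF-IDF"/>
        <process expanded="true">
          <!-- Tokenize each document and connect the inner input to the inner output -->
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
```

If stop word filtering and stemming are wanted, the "Filter Stopwords (English)" and "Stem (Snowball)" operators from Scheme 2 would presumably go inside the same subprocess, chained after Tokenize. Note that a term occurring in every file of the corpus would still get a tf-idf weight of 0, for the same reason as in the single-document case.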
