How to calculate TF-IDF for a corpus of text files

bahamini
edited November 2024 in Community Q&A
Hi folks,

I want to calculate TF-IDF for a corpus that contains several text files. I used "Process Documents from Files", but I couldn't get any output. I expect a long list of terms in one column plus their TF-IDF weights in a second column.
Does anyone know what is wrong with this process?
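
For illustration, here is the kind of term/weight listing I am after, sketched in plain Python (the "corpus" folder name is only a placeholder, and RapidMiner's TF-IDF normalization may differ from this simple variant):

import math
import re
from collections import Counter
from pathlib import Path

# Tokenize every .txt file in the (placeholder) corpus folder.
docs = {}
for path in Path("corpus").glob("*.txt"):
    docs[path.name] = re.findall(r"[a-z]+", path.read_text(encoding="utf-8").lower())

n_docs = len(docs)

# Document frequency: the number of files each term occurs in.
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))

# TF-IDF per document: relative term frequency times log(N / df).
for name, tokens in docs.items():
    tf = Counter(tokens)
    print(f"--- {name} ---")
    for term, count in tf.most_common():
        weight = (count / len(tokens)) * math.log(n_docs / df[term])
        print(f"{term}\t{weight:.4f}")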


Thank you in advance.
Best

Answers

  • Nils_Woehler
    Hi,

    You need to tokenize the documents to get meaningful results:


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.009">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.009" expanded="true" name="Process">
        <process expanded="true" height="509" width="960">
          <operator activated="true" class="text:create_document" compatibility="5.2.005" expanded="true" height="60" name="Create Document" width="90" x="78" y="171">
            <parameter key="text" value="a b a b c d e f g"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.2.005" expanded="true" height="94" name="Process Documents" width="90" x="313" y="165">
            <process expanded="true" height="509" width="960">
              <operator activated="true" class="text:tokenize" compatibility="5.2.005" expanded="true" height="60" name="Tokenize" width="90" x="248" y="28"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Best,
    Nils
  • bahamini
    Hi Nils

    I have already tried your approach, as follows:
    Read Document -> Tokenize -> Stem Filter -> Stop Word Filter -> Process Documents
    Unfortunately, the output is completely empty. What is wrong with this setup?

    Best
  • Nils_Woehler
    Hi,

    Can you please post your process setup as described here: http://rapid-i.com/rapidforum/index.php/topic,5226.0.html
    That makes it easier to find errors in your setup without too much guessing.

    Best,
    Nils
  • bahamini
    Hi Nils,

    Actually, I followed two schemes, as follows:
    Scheme 1, which uses only "Process Documents from Files":

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="371" width="772">
          <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="75">
            <list key="text_directories"/>
            <process expanded="true">
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>


    Scheme 2: step-by-step term extraction:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="371" width="772">
          <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="77" y="89">
            <parameter key="file" value="C:\Users\User\Documents\11\Shansuddin_NNW2007.txt"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="211" y="86"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="75"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="514" y="75"/>
          <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="648" y="210">
            <process expanded="true">
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
  • Andrew2
    Hello

    Two things:

    Firstly, inside the "Process Documents" operator, connect the input and output ports together.

    Secondly, by definition TF-IDF produces 0 for every term when there is only one document to process, so change the vector creation to term occurrences to count the tokens instead.
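
    To see why: with $N$ documents in total and $\mathrm{df}(t)$ of them containing term $t$, the usual weight is

    $$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}$$

    With a single document, $N = \mathrm{df}(t) = 1$ for every term that occurs at all, so the logarithm is $\log 1 = 0$ and every weight comes out zero.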

    Here is your modified process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="371" width="772">
          <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="77" y="89">
            <parameter key="file" value="C:\Users\User\Documents\11\Shansuddin_NNW2007.txt"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="211" y="86"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="75"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="514" y="75"/>
          <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="648" y="210">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <process expanded="true" height="959" width="1169">
              <connect from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
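
    Once you process a whole directory of files in one go (for example with "Process Documents from Files" pointing at your corpus folder), there is more than one document, and you can switch the vector creation back to TF-IDF to get meaningful non-zero weights.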

    regards

    Andrew
