How to calculate tf/idf for a corpus of text files

bahamini
bahamini New Altair Community Member
edited November 5 in Community Q&A
Hi folk

I want to calculate tf/idf for a corpus which containes several text files. I employed "Process Document from Files" but I couldn't get any output. I expect to have a long list of terms in one column plus the t/idf weghts in the second column.
Anyone knows what is wrong with this process?


Thank you in advance.
Best

Answers

  • Nils_Woehler
    Nils_Woehler New Altair Community Member
    Hi,

    you need to tokenize to retrieve meaningful results:


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.009">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.009" expanded="true" name="Process">
        <process expanded="true" height="509" width="960">
          <operator activated="true" class="text:create_document" compatibility="5.2.005" expanded="true" height="60" name="Create Document" width="90" x="78" y="171">
            <parameter key="text" value="a b a b c d e f g"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.2.005" expanded="true" height="94" name="Process Documents" width="90" x="313" y="165">
            <process expanded="true" height="509" width="960">
              <operator activated="true" class="text:tokenize" compatibility="5.2.005" expanded="true" height="60" name="Tokenize" width="90" x="248" y="28"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Best,
    Nils
  • bahamini
    bahamini New Altair Community Member
    Hi Nils

    I have already done your approach as follows:
    Read Documents from file->Tokenize->Stem Filter->Stop Word Filter->Process Document
    Unfortunately, the output is quite empty. What's wrong with this plan?

    Best
  • Nils_Woehler
    Nils_Woehler New Altair Community Member
    Hi,

    can you please post your process setup as it is described here: http://rapid-i.com/rapidforum/index.php/topic,5226.0.html
    This helps to find errors in your setup without too much guessing.

    Best,
    Nils
  • bahamini
    bahamini New Altair Community Member
    Hi Dear Nils

    Actually, I followed two schemes as follows:
    Scheme 1, which uses only "Process Document from Files":

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="371" width="772">
          <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="75">
            <list key="text_directories"/>
            <process expanded="true">
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>


    Scheme2: step by step term extraction:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="371" width="772">
          <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="77" y="89">
            <parameter key="file" value="C:\Users\User\Documents\11\Shansuddin_NNW2007.txt"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="211" y="86"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="75"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="514" y="75"/>
          <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="648" y="210">
            <process expanded="true">
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
  • Hello

    Two things

    Firstly, inside the "Process Documents" operator, connect the input and output together.

    Secondly, tf-idf will produce 0 if there is only one document to process by definition so change it to term occurrences to count the number of tokens.

    Here is your process modified.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="371" width="772">
          <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="77" y="89">
            <parameter key="file" value="C:\Users\User\Documents\11\Shansuddin_NNW2007.txt"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="211" y="86"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="75"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="514" y="75"/>
          <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="648" y="210">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <process expanded="true" height="959" width="1169">
              <connect from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

    regards

    Andrew