Can Process Documents calculate term occurrences for all words without being given a word list?

JeremyMTMD
JeremyMTMD New Altair Community Member
edited November 5 in Community Q&A
I want Process Documents to calculate term occurrences for ALL the words in the document I send it, but I don't want to have to write them all out manually. If someone has a solution, I would gladly take it!

Answers

  • Caperez
    Caperez Altair Community Member
    Hi @JeremyMTMD

    There is a text mining extension in the Marketplace for that, named Text Processing.

    The RapidMiner Academy has good learning materials on it.

    https://academy.rapidminer.com/learn/course/text-and-web-mining-with-rapidminer/text-and-web-mining/lets-get-started

    Best, 


    Cesar
  • JeremyMTMD
    JeremyMTMD New Altair Community Member
    Hi @ceaperez,

    I'm already using this extension. My problem is that I can't get the Process Documents operator to calculate term occurrences. It tells me I need a word list, but I don't know how to create one or where to find one.

    I would like an operator that simply calculates the term frequency of a tokenized, stemmed, and filtered text so I can see which words occur most often. If someone knows a way to do this, I would like to learn about it!

    Thanks! 

    Jérémy


  • BalazsBarany
    BalazsBarany New Altair Community Member
    Hi!

    You get a wordlist as an output of the Process Documents ... operators.

    See the attached example:
    <?xml version="1.0" encoding="UTF-8"?>
    <process version="9.10.013">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.10.013" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.10.013" expanded="true" height="68" name="Retrieve JobPosts" width="90" x="112" y="85">
            <parameter key="repository_entry" value="//Training Resources/Utilities/Text Mining/data/JobPosts"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="9.10.013" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="85">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="JobText"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="9.4.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="select_attributes_and_weights" value="false"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="9.4.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve JobPosts" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>


    The "wor" output of Process Documents is a wordlist. It is a special data structure; for example, you can store it and apply it in later Process Documents operations. If you just need the data (word occurrences), use WordList to Data.
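
    As a rough sketch of that last step (not part of the attached process), the fragment below puts WordList to Data and a Sort operator behind the word list port, so the most frequent words come first. The operator class `text:wordlist_to_data`, the port names, the layout coordinates, and the `total` column name are assumptions about how the Text Processing extension serializes this; compare them with the XML your own Studio version generates before relying on them.

    ```xml
    <!-- Sketch only: place these elements inside the top-level <process> of the example,
         next to the existing operators. Class, port, and column names are assumptions. -->
    <operator activated="true" class="text:wordlist_to_data" compatibility="9.4.000" expanded="true" height="82" name="WordList to Data" width="90" x="581" y="136"/>
    <operator activated="true" class="sort" compatibility="9.10.013" expanded="true" height="82" name="Sort" width="90" x="715" y="136">
      <!-- "total" is assumed to be the column holding the overall occurrence count per word -->
      <parameter key="attribute_name" value="total"/>
      <parameter key="sorting_direction" value="decreasing"/>
    </operator>
    <!-- These connections take the place of the existing "word list" to "result 2" connection -->
    <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
    <connect from_op="WordList to Data" from_port="example set" to_op="Sort" to_port="example set input"/>
    <connect from_op="Sort" from_port="example set output" to_port="result 2"/>
    ```

    With this wiring, the second result should be a regular example set with one row per word instead of a wordlist object, so it can be filtered, sorted, or exported like any other data.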

    Regards,
    Balázs