Data Extraction

kaalgota New Altair Community Member
edited November 5 in Community Q&A
Hi,

I am reading tweets and want to extract the 10 most frequently occurring words from them. Can you let me know how to extract the top k words?

Thanks,
Kalpana

Answers

  • MariusHelf New Altair Community Member
    Hey Kalpana,

    do you know our video tutorials? There are some good ones about text mining, too. You'll find the link on our website at http://rapid-i.com, or read the post linked in my signature and click the link in the first item there.

    I'll give you some keywords which you'll understand after watching the videos: use the Process Documents operator to process your documents, convert the output Wordlist to an Example Set, then Sort and Filter it. See the attached process for an example, and please be sure to adapt the subprocess of Process Documents to your needs.

    Best, Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.007">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.007" expanded="true" name="Process">
        <process expanded="true" height="558" width="634">
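          <!-- Placeholder input: generates random nominal values. Replace this with your own tweet data. -->
          <operator activated="true" class="generate_nominal_data" compatibility="5.2.007" expanded="true" height="60" name="Generate Nominal Data" width="90" x="45" y="30">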
            <parameter key="number_examples" value="1000"/>
            <parameter key="number_of_attributes" value="1"/>
            <parameter key="number_of_values" value="50"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.2.007" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30"/>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="30">
            <list key="specify_weights"/>
            <process expanded="true" height="576" width="806">
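              <!-- Empty subprocess: each document passes through unchanged. Add your text-processing operators (e.g. Tokenize) here. -->
              <connect from_port="document" to_port="document 1"/>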
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:wordlist_to_data" compatibility="5.2.001" expanded="true" height="76" name="WordList to Data" width="90" x="514" y="30"/>
          <!-- Sort the word list by total occurrences, highest first. -->
          <operator activated="true" class="sort" compatibility="5.2.007" expanded="true" height="76" name="Sort" width="90" x="112" y="210">
            <parameter key="attribute_name" value="total"/>
            <parameter key="sorting_direction" value="decreasing"/>
          </operator>
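          <!-- Keep only rows 1 to 10, i.e. the 10 most frequent words; change last_example for a different k. -->
          <operator activated="true" class="filter_example_range" compatibility="5.2.007" expanded="true" height="76" name="Filter Example Range" width="90" x="246" y="210">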
            <parameter key="first_example" value="1"/>
            <parameter key="last_example" value="10"/>
          </operator>
          <connect from_op="Generate Nominal Data" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
          <connect from_op="WordList to Data" from_port="example set" to_op="Sort" to_port="example set input"/>
          <connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="162"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
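
For readers who want to see the logic of the process above without opening RapidMiner, here is a minimal Python sketch of the same steps: count word occurrences, sort by total in decreasing order, and keep the first k rows. It is not part of the attached process. The function name `top_k_words`, the sample tweets, and the simple regex tokenizer are all assumptions made for illustration, and the tokenizer is only a rough stand-in for whatever you put into the Process Documents subprocess.

```python
import re
from collections import Counter


def top_k_words(texts, k=10):
    """Return the k most frequent words across all texts as (word, total) pairs.

    Hypothetical helper for illustration only; not a RapidMiner API.
    """
    counts = Counter()
    for text in texts:
        # Crude tokenization (assumption): lowercase, keep runs of letters/apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(tokens)
    # most_common(k) sorts by count, highest first, and keeps the first k entries,
    # which corresponds to the Sort and Filter Example Range steps.
    return counts.most_common(k)


# Made-up sample data standing in for real tweets.
tweets = [
    "RapidMiner makes text mining easy",
    "Text mining with RapidMiner",
    "Mining tweets for frequent words",
]

top = top_k_words(tweets, k=3)
print(top)
```

With real tweets you would probably also want to drop stopwords, URLs and @mentions before counting, just as you would adapt the subprocess of Process Documents in the process above.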