"Text Mining term occurrences per label value"

arno New Altair Community Member
edited November 5 in Community Q&A
Hi everyone!

I recently started using RapidMiner for text mining, as it seems to be a pretty powerful tool for the job.

When I use the "Process Documents from Data" operator, I get an output called WordList, which lists the distinct words
in the documents together with their frequencies. I also set a label on the data set, and the table shows each value
of this label as a separate column that should contain the term occurrences for that category. However, while
"Document Occurrences" and "Total Occurrences" seem to be calculated correctly for every word, the category columns show 0 for every word.

I would expect the occurrences of a word like, say, "sponsor", which occurs in 10 documents, to be distributed over the different categories, since every document
was assigned to a category.
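
To make that expectation concrete, here is a minimal Python sketch of the per-label count I have in mind. It is not RapidMiner code and does not claim to reproduce how the WordList is computed internally; the toy documents and the function name `occurrences_per_label` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (text, label) pairs standing in for the real data set.
docs = [
    ("sponsor deal announced", "positive"),
    ("no sponsor this year", "negative"),
    ("sponsor list published", "neutral"),
    ("weather update", "neutral"),
]


def occurrences_per_label(docs):
    """Count, for each term, in how many documents of each label it appears."""
    counts = defaultdict(Counter)
    for text, label in docs:
        # set() ensures a term is counted at most once per document.
        for term in set(text.lower().split()):
            counts[term][label] += 1
    return counts


counts = occurrences_per_label(docs)
print(dict(counts["sponsor"]))          # {'positive': 1, 'negative': 1, 'neutral': 1}
print(sum(counts["sponsor"].values()))  # 3, the overall document occurrences
```

In this sketch the per-label counts of a term always add up to its overall document occurrences, which is why a column of zeros next to a non-zero total looks wrong to me.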

Did I do something wrong in the data import process? Are there prerequisites I do not know about that must be met before the word occurrences are split correctly over the values of the label
attribute?

Thanks in advance,

Arno

Answers

  • Skirzynski New Altair Community Member
    Hey Arno,

    Either I do not understand the problem or I simply cannot reproduce it. Could you please provide a minimal reproducible example, i.e. the process and a small data set for it to load? You can use the code tags to paste the XML of the process and the data, for instance in CSV format.

    Cheers
      Marcin
  • arno New Altair Community Member
    This is the process flow in XML. It is actually nothing more than an import and some data cleaning (tokenization, stemming, stopword filtering and n-gram creation) inside a "Process Documents from Data" operator, which generates the word list.

    How do I add data and a screenshot to a forum post? I get the image tags but can't upload an image.  :)

    The data is pretty simple though: one attribute holds the text and the other holds a sentiment label (neutral, positive, negative).

    The data looks like this:

    content sentiment
    Limburg: Centrum Helchteren op de parking van de Carrefour zijn ze zich aan het opstellen neutral
    Limburg: Centrum Helchteren op de parking van de Carrefour zijn ze zich aan het opstellen neutral
    Ging goed het inwerken bij de Albert Heijn! Wel beetje chaotisch in het begin neutral
    <div class="socialmention">STABROEK: DAAR MAG JE GOOIEN MET EIEREN! :p</div> positive
    Limburg: Centrum Helchteren op de parking van de Carrefour zijn ze zich aan het...  - @FlitscontroleBE neutral
    Al carrefour. Mi nenea te amo neutral
    @Beauux1 moet je nog lidl? neutral
    RT @X_xAE: HahahahahahahhhH ik kom @Y0UKN0WITBR0. Altijd in van die coole winkels tegen hahahaha zeeman lidl allussss :) positive
    @sbfotos @hema Albert Heijn heeft een vergelijkbaar product! neutral
    whaha , jochem viel net in de lidl qqqq. hij lag zo mooi op de grondd, gehehe. positive
    @NeaKurt gewoon bij albert heijn..... en wat denk je cvandaag met 35 procent korting.... neutral
    @CRAZYKiiiD__ @catilin99 ik ga gewoon naar appie (albert Heijn) in de stad of gewoon Deka of c1000 bij mij in de buurt neutral
    I'm at Delhaize (Schaarbeek / Schaerbeek, Brussels) <a target="_blank" class="readability" href="http://4sq.com/11KYg1c" title="http://4sq.com/11KYg1c">4sq.com/11KYg1c</a> neutral
    Redactie Foodlog: Is de bakker te dom of de Lidl te groot? - @foodlog_nl - <a target="_blank" class="readability" href="http://bit.ly/UnUarr" title="http://bit.ly/UnUarr">bit.ly/UnUarr</a> neutral
    Naar albert heijn daarna naar huis en eindelijk eten. neutral
    code:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
       <process expanded="true" height="521" width="748">
         <operator activated="true" class="read_excel" compatibility="5.2.008" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
           <parameter key="excel_file" value="H:\text data\engagor\concurrentie\textsample.xls"/>
           <parameter key="imported_cell_range" value="A1:B195"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
           </list>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="content.true.text.attribute"/>
             <parameter key="1" value="sentiment.true.polynominal.label"/>
           </list>
         </operator>
         <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="75">
           <parameter key="keep_text" value="true"/>
           <list key="specify_weights"/>
           <process expanded="true" height="644" width="785">
             <operator activated="true" class="web:extract_html_text_content" compatibility="5.2.001" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
             <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="179" y="30">
               <parameter key="characters" value="' '"/>
             </operator>
             <operator activated="true" class="text:filter_by_length" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="313" y="30">
               <parameter key="min_chars" value="2"/>
               <parameter key="max_chars" value="50"/>
             </operator>
             <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="210"/>
             <operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="179" y="210">
               <parameter key="language" value="Dutch"/>
             </operator>
             <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.2.004" expanded="true" height="76" name="Filter Stopwords (Dictionary)" width="90" x="313" y="210">
               <parameter key="file" value="H:\text data\Vlaamse stopwoorden.txt"/>
             </operator>
             <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="451" y="211"/>
             <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.2.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="581" y="210">
               <parameter key="max_length" value="6"/>
             </operator>
             <connect from_port="document" to_op="Extract Content" to_port="document"/>
             <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
             <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
             <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Transform Cases" to_port="document"/>
             <connect from_op="Transform Cases" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
             <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
             <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
             <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
             <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Read Excel" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
         <connect from_op="Process Documents from Data" from_port="word list" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="18"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
  • Skirzynski New Altair Community Member
    Data in code tags is just fine.  :)

    The good news is that I could reproduce your issue. Even better, I know which operator causes it: if you remove the "Extract Content" operator, you will see correct values for the label. The bad news is that I am not sure why this happens, and I am still investigating. It seems to have something to do with metadata that this operator adds to the document.

    As a workaround, you can remove the HTML tags without using the "Extract Content" operator. To do so, insert a "Replace" operator before the "Process Documents from Data" operator. Use
    <[/]?[^>]*>
    as the regular expression in the parameter "replace what" and leave "replace by" empty. This should remove all tags, just as the "Extract Content" operator did.

    I hope this helps
      Marcin
  • arno New Altair Community Member
    Thanks a lot for the feedback,

    I will test your solution soon and report back once I have results ;).

    Thanks again,

    Arno
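
For anyone who wants to check the regular expression from the workaround above before changing their process, here is a minimal Python sketch. It only exercises the pattern itself with Python's `re` module, on one row taken from the sample data; it does not involve RapidMiner's "Replace" operator, and the names `TAG_PATTERN` and `strip_tags` are made up for illustration.

```python
import re

# The pattern from the workaround: "<", an optional "/",
# any characters up to the next ">", and the closing ">".
TAG_PATTERN = re.compile(r"<[/]?[^>]*>")


def strip_tags(text):
    """Remove HTML tags and keep the text between them."""
    return TAG_PATTERN.sub("", text)


sample = '<div class="socialmention">STABROEK: DAAR MAG JE GOOIEN MET EIEREN! :p</div>'
print(strip_tags(sample))  # STABROEK: DAAR MAG JE GOOIEN MET EIEREN! :p
```

Note that the pattern only strips tags. HTML entities such as `&amp;` are left untouched, so they would need a separate replacement if they occur in the data.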