Text analysis on a document collection loaded from a CSV

gustavo_velho
New Altair Community Member
edited November 5 in Community Q&A

Hello!

 

I'm new to RapidMiner, and my main goal is to use it for text analysis of social media posts. I have a CSV file with several columns, where each row is a post/document and one of the columns holds the text/body of the document. How can I select only that column for text analysis while keeping all the other columns for further analysis, since they are still relevant?

 

Right now I have a process like:

 

Read CSV -> Select Attributes (to select only the body column) -> Data to Documents -> Process Documents (Tokenize, Transform Cases, N-Grams, etc.) -> WordList to Data

 

This works for seeing the list of the most common words/n-grams, but I have now lost all the related data for each document. I would like to, for example, filter the documents that contain a specific n-gram or word. Any tips would be helpful.
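
For readers who want to see what the process above computes, here is a minimal plain-Python sketch. It is not RapidMiner code and not RapidMiner's implementation. The sample data, the column names (`author`, `date`, `body`), and the helper names are all assumptions made for illustration. It also shows why the related data disappears: the resulting word list is aggregated over all documents, so it no longer has one row per post.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample standing in for the CSV file; column names are assumptions.
SAMPLE_CSV = """author,date,body
A,10/26,Great phone and great battery
B,10/27,Battery life is great
"""

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    """Return the n-grams of a token list, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def word_list(rows, text_column="body", max_n=2):
    """Count every 1..max_n-gram over all rows, using only the text column."""
    counts = Counter()
    for row in rows:
        tokens = tokenize(row[text_column])
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = word_list(rows)

# The word list is global: author and date are no longer attached to any entry.
print(counts.most_common(3))
```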

 

Thanks!

 

Gustavo Velho

Answers

  • MartinLiebig
    Altair Employee

    Gustavo,

     

    Simply enable "Keep text" in the Process Documents operator. That way, the example set at the upper port of the operator should contain an additional attribute with the text, together with your bag of words.

     

    ~Martin
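
To picture the result Martin describes, here is a rough plain-Python sketch of a bag-of-words table that carries the original text along in one extra column. It is only an analogy for the "Keep text" option, not RapidMiner's implementation. The function name `bag_of_words`, the `keep_text` flag, and the `TEXT` column name are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical documents used only for illustration.
documents = ["Great phone and great battery", "Battery life is great"]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(texts, keep_text=True):
    """Return one row (dict) per document with a count for every vocabulary term.

    When keep_text is True, the raw text is stored under "TEXT". The key is
    uppercase so it cannot collide with the lowercase term columns.
    """
    vocabulary = sorted({token for text in texts for token in tokenize(text)})
    table = []
    for text in texts:
        counts = Counter(tokenize(text))
        row = {term: counts[term] for term in vocabulary}
        if keep_text:
            row["TEXT"] = text
        table.append(row)
    return table

table = bag_of_words(documents)
# Each row now holds the term counts and the text they were computed from.
print(table[0]["great"], table[0]["TEXT"])
```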

  • gustavo_velho
    New Altair Community Member

    Thanks Martin! That seems to make sense, I'll test it. But let me add this: what about other data from a document? I have a file like:

    AUTHOR | DATE  | CONTENT        | SOURCE
    A      | 10/26 | Lorem Ipsum... | http://source.com
    B      | 10/27 | Lorem Ipsum... | http://source.com

     

    I see that RapidMiner offers several other statistics, so I would also like to make use of them after the text analysis.

     

    Thanks again!

    Gustavo Velho

  • MartinLiebig
    Altair Employee

    Hi,

    Process Documents should preserve the ID attribute as well. That way you can simply join the resulting bag-of-words example set with the original example set on the ID. Process Documents may also preserve all attributes with special roles, but I would need to check this.

     

    ~Martin
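
The join idea can be sketched in plain Python as follows. This is only an illustration of joining on an ID and then filtering by a term, not RapidMiner's Join operator. The `id` values, the sample content, and the helper names `join_by_id` and `filter_by_term` are assumptions. The column layout follows the AUTHOR | DATE | CONTENT | SOURCE example from the thread.

```python
import re
from collections import Counter

# Hypothetical posts; the ids and the content strings are made up for the example.
posts = [
    {"id": 1, "author": "A", "date": "10/26",
     "content": "Great phone and great battery", "source": "http://source.com"},
    {"id": 2, "author": "B", "date": "10/27",
     "content": "Battery life is great", "source": "http://source.com"},
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Bag of words keyed by the same id that identifies each post.
bag = {post["id"]: Counter(tokenize(post["content"])) for post in posts}

def join_by_id(posts, bag):
    """Inner join: attach the term counts to every post whose id is in the bag."""
    return [{**post, "terms": bag[post["id"]]} for post in posts if post["id"] in bag]

def filter_by_term(joined, term):
    """Keep only the joined rows whose text contains the given term."""
    return [row for row in joined if row["terms"][term] > 0]

joined = join_by_id(posts, bag)
hits = filter_by_term(joined, "phone")
# Author, date, and source are still available for every matching document.
print([(row["author"], row["date"], row["source"]) for row in hits])
```

Because the ID survives the text processing step, every other column can be brought back with a single join, and any filter on a word or n-gram then applies to complete rows.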

  • gustavo_velho
    New Altair Community Member

    Thanks Martin! That makes sense. I had started to figure out that I would need to join the documents table with the word list or something similar. :)

     

    I've been using other tools for text analysis, and now I'm starting to test RapidMiner. So far RapidMiner seems to have a better tokenization process, so let's see how the rest goes.

     

    Appreciate your help!

     

    Gustavo