Text filtering

caknobla
caknobla New Altair Community Member
edited November 5 in Community Q&A

Dear All,

I am new to RapidMiner and have a problem that I do not really know how to get started on:

I have the following data:
    - One file (pdf, txt or html) with a collection of 1000 different news articles.
    - A list with about 30 keywords.
I want to extract all articles that match at least one of the keywords.

My questions are:
1. What do I have to do so that RapidMiner can tell where an article starts and ends? When I import my news articles with the operator „Read Data“, the whole file seems to be treated as „one article“.

2. What kind of process do I need to set up to extract only those articles that contain at least one of the keywords? Specifically, which operator would work best? I tried „Filter Documents (by content)“ but I don’t understand where I should enter my keywords.


Thank you so much!

Best,
Carl

Answers

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Hi Carl,


    Did you get a chance to read through this part of the Community: http://community.rapidminer.com/t5/Text-Analytics-in-RapidMiner/tkb-p/Text

     

     

    If all your documents are in one file and you want to separate them, you will need to use the Cut Document operator to slice them into separate entities.
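
To illustrate the idea behind cutting one file into separate documents, here is a minimal Python sketch. It is not RapidMiner code and does not reproduce the Cut Document operator; it only shows the logic of splitting on a boundary pattern. The delimiter line `=== ARTICLE ===` and the names `ARTICLE_DELIMITER` and `split_articles` are assumptions made for this example, so you would need to replace the pattern with whatever actually marks the start of an article in your file.

```python
import re

# Hypothetical boundary: assumes every article begins with a line that
# reads exactly "=== ARTICLE ===". Adjust this to your real file layout.
ARTICLE_DELIMITER = re.compile(r"^=== ARTICLE ===[ \t]*$", re.MULTILINE)


def split_articles(raw_text):
    """Split one large text into individual articles at each delimiter line."""
    parts = ARTICLE_DELIMITER.split(raw_text)
    # Drop empty pieces (e.g. the text before the first delimiter).
    return [part.strip() for part in parts if part.strip()]


sample = """=== ARTICLE ===
Oil prices rose on Monday.
=== ARTICLE ===
The central bank held rates steady.
"""

articles = split_articles(sample)
for number, article in enumerate(articles, start=1):
    print(number, article)
```

If your articles have no explicit separator, a recurring element such as a dateline or headline format could serve as the boundary instead, provided it appears reliably at the start of every article.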

  • Telcontar120
    Telcontar120 New Altair Community Member

    After you have dealt with the separation of the documents as @Thomas_Ott describes, you will next probably want to process the documents and create a word vector.  In your case, binary term occurrences may be helpful, since they give you a simple 0/1 indicator for each token (in your case probably individual words, although you can also use n-grams for phrases of more than one word).  You can then cross-reference those indicators against your keyword list to identify which documents contain any of the key terms.  You may also need to do some token replacement or stemming if you have synonymous terms or variations, but it should be fairly straightforward.
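
The binary term occurrence approach described above can be sketched in plain Python as follows. This is an illustration of the logic only, not RapidMiner's implementation. The function names, the sample articles, and the keyword list are all made up for the example, and the sketch assumes single-word keywords with no stemming or synonym handling.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def binary_term_occurrences(articles, keywords):
    """Return one row per article mapping each keyword to 1 (present) or 0."""
    rows = []
    for article in articles:
        tokens = set(tokenize(article))
        rows.append({kw: int(kw.lower() in tokens) for kw in keywords})
    return rows


def filter_by_keywords(articles, keywords):
    """Keep only the articles that contain at least one keyword."""
    rows = binary_term_occurrences(articles, keywords)
    return [a for a, row in zip(articles, rows) if any(row.values())]


# Hypothetical sample data for illustration only.
articles = [
    "Oil prices rose on Monday.",
    "The central bank held rates steady.",
    "Inflation cooled in March.",
]
keywords = ["oil", "inflation"]

for article in filter_by_keywords(articles, keywords):
    print(article)
```

Matching on whole tokens rather than substrings means a keyword such as "oil" does not accidentally match "boil". Multi-word keywords would need n-gram tokens, and variants such as plurals would need stemming, as noted in the answer above.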

     

  • jana_janarthani
    jana_janarthani New Altair Community Member

    Hi all,

    I'm new to RapidMiner and need your help. I need to take the keywords from one news article and then compare them with other news articles to pick the best one.

    Can you help me?

     

    thank you.

    prasanth

     

  • sgenzer
    sgenzer
    Altair Employee

    Hello @jana_janarthani - welcome to the community.  That question is not defined well enough for us to answer here.  May I suggest you begin by looking at the library of support materials that we have?

     

    https://community.rapidminer.com/t5/Getting-Started-Forum/Essential-RapidMiner-Resources-for-New-Users/m-p/41212#M825

     

    Scott