Text filtering

Dear All,

I am new to RapidMiner and have a problem that I do not really know how to approach:

I have the following data:
- One file (pdf, txt or html) with a collection of 1000 different news articles.
- A list with about 30 keywords.
I want to extract all articles that match at least one of the keywords.

My questions are:
1. What do I have to do so that RapidMiner can distinguish where an article starts and ends? When I import my news articles with the operator „Read Data“, it seems that the whole file is treated as „one article“.

2. What kind of process do I need to set up to extract only those articles that contain at least one of the keywords? Specifically, which operator would work best? I tried „Filter Documents (by content)“ but I don’t understand where I should enter my keywords.

Thank you so much!

Best,
Carl

Thomas_Ott

Hi Carl,

Did you get a chance to read through this part of the Community: http://community.rapidminer.com/t5/Text-Analytics-in-RapidMiner/tkb-p/Text

If all your documents are in one file and you want to separate them, you will need to use the Cut Document operator to slice them into separate entities.
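
To illustrate the idea of cutting one file into separate documents, here is a minimal sketch in plain Python (not a RapidMiner process). It assumes the articles are separated by a line of five or more `=` characters. That delimiter, the function name `split_articles`, and the sample text are all made up for the example, so you would need to adapt the pattern to whatever actually marks the start or end of an article in your file.

```python
import re

def split_articles(text, delimiter_pattern=r"^={5,}\s*$"):
    """Split one long text into articles at each delimiter line.

    The default pattern (a line of five or more '=') is an assumption
    for this example, not something taken from the original data.
    """
    parts = re.split(delimiter_pattern, text, flags=re.MULTILINE)
    # Drop surrounding whitespace and any empty chunks.
    return [part.strip() for part in parts if part.strip()]

# Hypothetical sample data standing in for the real news file.
sample = (
    "First article about oil prices.\n"
    "=====\n"
    "Second article about football.\n"
    "=====\n"
    "Third article about interest rates.\n"
)

articles = split_articles(sample)
for number, article in enumerate(articles, start=1):
    print(number, article)
```

If your file has no explicit separator, the same approach may still work with a pattern that matches something every article begins with, such as a dateline or headline format.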

Telcontar120

After you have dealt with the separation of the documents as @Thomas_Ott describes, you will probably next want to process the documents and create a word vector. In your case, binary term occurrences may be helpful, since that creates a simple 0/1 indicator for each token (in your case probably individual words, although you can also generate n-grams for phrases of more than one word). You can then cross-reference those indicators against your keyword list to identify which documents contain any of the key terms. You may also need to do some token replacement or stemming if you have synonymous terms or variations, but it should be fairly straightforward.
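
As a rough illustration of the logic (again plain Python rather than a RapidMiner process), the sketch below builds a 0/1 indicator per keyword for each article and keeps the articles where at least one indicator is 1. The helper names, the sample articles, and the keyword list are hypothetical. It only handles single-word keywords and does no stemming, so phrases or word variations would need the extra steps mentioned above.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def binary_occurrences(articles, keywords):
    """Return one dict per article mapping each keyword to 0 or 1."""
    keywords = [keyword.lower() for keyword in keywords]
    rows = []
    for article in articles:
        tokens = tokenize(article)
        rows.append({keyword: int(keyword in tokens) for keyword in keywords})
    return rows

def filter_articles(articles, keywords):
    """Keep only articles that contain at least one keyword."""
    rows = binary_occurrences(articles, keywords)
    return [article for article, row in zip(articles, rows) if any(row.values())]

# Hypothetical sample data; replace with your own articles and keywords.
articles = [
    "Oil prices rose sharply.",
    "The football match ended in a draw.",
    "Interest rates were left unchanged.",
]
keywords = ["oil", "rates"]

rows = binary_occurrences(articles, keywords)
matches = filter_articles(articles, keywords)
for article in matches:
    print(article)
```

Matching on whole tokens rather than substrings avoids false hits such as "oil" inside "boiling", at the cost of missing variations like "oils" unless you add stemming.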

jana_janarthani

Hi all,

I'm new to RapidMiner and need your help. I need to extract keywords from one news article and then compare them with other news articles to find the best one.

Can you help me?

Thank you.

prasanth

sgenzer

Hello @jana_janarthani - welcome to the community. That question isn't quite well defined enough for us to answer here. May I suggest you begin by looking at the library of support materials that we have?

https://community.rapidminer.com/t5/Getting-Started-Forum/Essential-RapidMiner-Resources-for-New-Users/m-p/41212#M825

Scott