Text filtering
Dear All,
I am new to RapidMiner and have a problem that I do not really know how to approach:
I have the following data:
- One file (pdf, txt or html) with a collection of 1000 different news articles.
- A list with about 30 keywords.
I want to extract all articles that match at least one of the keywords.
My questions are:
1. What do I have to do so that RapidMiner can distinguish where an article starts and ends? When I import my news articles with the operator „Read Data“, the whole file seems to be treated as „one article“.
2. What kind of process do I need to set up to extract only those articles that contain at least one of the keywords? Specifically, which operator would work best? I tried „Filter Documents (by content)“ but I don’t understand where I should enter my keywords.
Thank you so much!
Best,
Carl
Answers
-
Hi Carl,
Did you get a chance to read through this part of the Community? http://community.rapidminer.com/t5/Text-Analytics-in-RapidMiner/tkb-p/Text
If all your documents are in one file and you want to separate them, you will need to use the Cut Document operator to slice them into separate entities.
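To make the idea of cutting one file into separate articles concrete, here is a minimal Python sketch of the same principle outside RapidMiner. It is not how the Cut Document operator is configured. It assumes, purely for illustration, that each article is separated by a line of five or more "=" characters; the delimiter pattern and the function name split_articles are hypothetical and would have to match whatever actually marks article boundaries in your file.

```python
import re

# Assumption: articles are separated by a line made of five or more '=' characters.
# Replace this pattern with whatever really marks an article boundary in your file.
DELIMITER = re.compile(r"^={5,}[ \t]*$", re.MULTILINE)


def split_articles(text):
    """Split one long string into a list of non-empty article strings."""
    parts = DELIMITER.split(text)
    return [part.strip() for part in parts if part.strip()]


sample = (
    "First article about oil prices.\n"
    "=====\n"
    "Second article about football.\n"
    "=====\n"
    "Third article."
)
articles = split_articles(sample)
print(len(articles))
```

If the articles have no explicit separator, the boundary pattern could instead be something that recurs at the start of every article, such as a dateline or a headline format.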
0 -
After you have dealt with the separation of the documents as @Thomas_Ott describes, you will next probably want to process the documents and create a word vector. In your case, binary term occurrences may be helpful, since that creates a simple 0/1 indicator for each token (in your case probably individual words, although you can also use n-grams for phrases of more than one word). You can then cross-reference those indicators against your keyword list to identify which documents contain any of the key terms. You may also need to do some token replacement or stemming if you have synonymous terms or variations, but it should be fairly straightforward.
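The binary term occurrence idea can be sketched in a few lines of Python. This is only an illustration of the logic, not a RapidMiner process: the function names, the sample articles, and the keywords are all made up, the tokenizer is a simple lowercase split on letters and digits, and it handles single-word keywords only, with no stemming or n-grams.

```python
import re


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def binary_occurrences(articles, keywords):
    """Build one 0/1 row per article, with one column per keyword."""
    keys = [k.lower() for k in keywords]
    rows = []
    for article in articles:
        tokens = tokenize(article)
        rows.append([1 if k in tokens else 0 for k in keys])
    return rows


def filter_articles(articles, keywords):
    """Keep only the articles that contain at least one keyword."""
    rows = binary_occurrences(articles, keywords)
    return [a for a, row in zip(articles, rows) if any(row)]


# Hypothetical sample data for illustration only.
articles = [
    "Oil prices rose sharply today.",
    "The football match ended in a draw.",
    "Central banks discuss inflation.",
]
keywords = ["oil", "inflation"]

matrix = binary_occurrences(articles, keywords)
matched = filter_articles(articles, keywords)
print(matrix)
print(matched)
```

Because matching is on exact tokens, a keyword such as "price" would not match "prices" here; that is the kind of variation stemming or token replacement is meant to handle.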
0 -
Hi all,
I'm new to RapidMiner and need your help. I need to take keywords from one news article and then compare them with other news articles to pick the best one.
Can you help me?
thank you.
prasanth
0 -
Hello @jana_janarthani - welcome to the community. That question is not quite specific enough for us to answer here. May I suggest you begin by looking at the library of support materials that we have?
Scott
0