Text Pre-processing
Hyram
Hi there
I am trying to do some preprocessing on text and looking for the relevant operators in RapidMiner, if they are indeed available.
I am extracting features from a sentence using the Information Gain operator, which seems to be possible. From there, I need to construct a feature vector using Bag of Words (BOW) and Term Frequency (TF), so that I end up with a vector of unigrams. I want this vector of unigrams to be based on the Part of Speech (POS) of each term in the sentence.
The operators I am looking for are:
1. BOW;
2. TF;
3. PoS tagging.
Are these available in RapidMiner or am I looking in the wrong operator directories?
Thanks
All comments
Pavithra_Rao
Hi @Hyram,
The Text Processing extension could help you with text mining feature extraction and processing.
Here are some good resources to help you get started with RapidMiner text processing:
https://academy.rapidminer.com/learn/course/text-and-web-mining-with-rapidminer/text-and-web-mining/lets-get-started
https://rapidminer.com/resource/text-mining-document-classification/
Cheers,
Pavithra
Telcontar120
You need to download and install the free text mining extension from the marketplace.
The operator "Process Documents" will generate a word vector using term frequency if you set that as the option in the parameters (TF-IDF is the default), and it will also automatically generate the bag of words for you if you use the Tokenize operator inside and then output the wordlist and the exampleset (depending on the format you want it in).
There is also an operator for "Filter Tokens (by POS Tags)" but I am not sure if you can get it to actually output the POS tag, or whether you can only filter by the tags (in which case I guess you could add them manually based on the filtered results? but that seems inefficient).
@mschmitz is there any way to output the POS tag directly?
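To make the bag-of-words and term-frequency idea above concrete outside of RapidMiner, here is a minimal Python sketch. It is not what "Process Documents" does internally; the tokenizer (split on non-letters), the relative-frequency definition of TF, and the function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on runs of non-letter characters.

    This is a simplification and only an assumed stand-in for a real tokenizer.
    """
    return [token for token in re.split(r"[^a-z]+", text.lower()) if token]


def term_frequency_vectors(documents):
    """Build a shared vocabulary (the bag of words) and one TF vector per document.

    TF here is count / number of tokens in the document (an assumed definition).
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat on the mat", "the dog sat"]
vocab, vecs = term_frequency_vectors(docs)
for term, value in zip(vocab, vecs[0]):
    print(term, round(value, 3))
```

Each column of the resulting vectors corresponds to one unigram in the vocabulary, which is roughly the shape of the word vector described above.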
Hyram
Thanks for your assistance Telcontar120 and Pavithra_Rao!
Hyram
Hi @mschmitz,
Brian has answered my question perfectly. The only issue outstanding is how I use PoS tagging. Can I reflect the tags or only filter by them?
Thanks
MartinLiebig
@Hyram I think this can only filter, but I haven't used it in a while. Maybe the WordNet extension can help?
BR,
Martin
Hyram
Thank you.
I have one more question, @mschmitz. How would I remove hashtags and URLs from my text? What operator would I use? Replace? I've looked at the previous posts on this, and the tutorial that a community member suggested no longer exists.
Thanks
Telcontar120
Yes, you should be able to use the Replace operator to get rid of hashtags and URLs with some creative regular expressions. I am not a regex expert, so other community members could probably help with the specifics more than I can. In both cases you will probably want to look for some pattern (such as the # sign or https://) followed by an arbitrary number of characters and then a space, and you want to get rid of everything up to the space.
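The pattern described above ("a marker, then everything up to the next space") can be sketched with regular expressions. This example uses Python's `re` module rather than the Replace operator, and the two patterns and the function name are assumptions for illustration. They are simple enough that they should carry over to other regex dialects, but that has not been checked against RapidMiner here.

```python
import re

# "http://" or "https://" followed by everything up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")
# "#" followed by letters, digits, or underscores (the hashtag text itself).
HASHTAG_PATTERN = re.compile(r"#\w+")


def strip_urls_and_hashtags(text):
    """Remove URLs and hashtags, then collapse leftover runs of whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = HASHTAG_PATTERN.sub(" ", text)
    return " ".join(text.split())


sample = "Loving #RapidMiner see https://example.com/x?y=1 for more #nlp"
cleaned = strip_urls_and_hashtags(sample)
print(cleaned)
```

URLs are removed first so that a `#fragment` inside a URL is not treated as a separate hashtag.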
Hyram
@Telcontar120 Thank you so much for your help. I have managed to produce the word vector and the example set using TF-IDF. I assume the fractional values in the example set represent the TF-IDF numbers?
I just need to sort out URLs now. Filter Tokens using 'non-letter' seems to remove the # but not the text immediately after it, as you suggested it might. At least now I know that I need to look at regular expressions.
Thanks again!
Telcontar120
Yes, the values in the word vector correspond to the TF-IDF values calculated across the exampleset.
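For intuition about why those values are fractions, here is a minimal sketch of one common textbook TF-IDF weighting: relative term frequency multiplied by log(N / document frequency). The thread does not say which exact formula or normalization RapidMiner applies, so the numbers it produces may differ; the formula and function name below are assumptions.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    Uses tf = count / document length and idf = log(N / df),
    a textbook variant that is assumed here, not confirmed for RapidMiner.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
weights = tf_idf(docs)
print(weights[1])
```

A term that occurs in every document (here "the") gets a weight of zero under this variant, while a term unique to one document (here "dog") gets the largest weight in that document.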
Hyram
@Telcontar120 Awesome! Thank you.