"filter by upper case letter?"

Question

Hey there,

I just recently installed RapidMiner for a university project. I have only worked with R so far, so this is quite new and challenging for me.
I want to extract text from newspaper frontpages as part of analyzing agenda setting in German politics.

My question is whether it is possible to filter tokens by an upper-case first letter. German nouns start with an upper-case letter, and I would like to keep only those. Unfortunately, I have no idea how to do that. Any help is appreciated :)

erocoar · Answer

Oh amazing! Thank you so much :)  This really helps a lot. JEdward, how did you manage to turn filter tokens from string to regular expression?

MartinLiebig · Answer

Hi erocoar, if you are interested in German nouns, you can use Filter POS as well. There you can specifically search for nouns, adjectives, etc. German and English are supported. The process below uses it to get the nouns out of the document. Of course you can use this in Process Documents. Further details on the tag syntax are available at: http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html ~Martin
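To illustrate the idea of filtering by part-of-speech tag outside RapidMiner, here is a minimal Python sketch. It is not the RapidMiner process itself: the (token, tag) pairs are hand-labelled for illustration, and the function name `keep_nouns` is made up for this example. It assumes the STTS tag set linked above, where NN marks common nouns and NE marks proper nouns.

```python
import re

# Hand-labelled (token, STTS tag) pairs, for illustration only.
# A real pipeline would get these tags from a POS tagger.
tagged = [
    ("Die", "ART"), ("Regierung", "NN"), ("plant", "VVFIN"),
    ("eine", "ART"), ("neue", "ADJA"), ("Reform", "NN"),
    ("in", "APPR"), ("Berlin", "NE"),
]

# NN = common noun, NE = proper noun in the STTS tag set.
NOUN_TAG = re.compile(r"NN|NE")

def keep_nouns(pairs):
    """Return the tokens whose tag is exactly NN or NE."""
    return [token for token, tag in pairs if NOUN_TAG.fullmatch(tag)]

nouns = keep_nouns(tagged)
print(nouns)
```

Unlike a capital-letter filter, this drops "Die" at the start of the sentence, because the decision is based on the tag rather than on the spelling.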

JEdward · Answer

It's a bit early for me today, but you should be able to do it with Filter Tokens and a regular expression. Don't be scared of regular expressions; this one is especially straightforward.

- ^ means start at the beginning of the text. As you are filtering within the tokens, the start is the beginning of each token.
- [A-Z] means any uppercase letter between A and Z.
- . (dot) means any character at all.
- * (asterisk) means any number of the preceding element (in this case the dot).

Have a play with the example below: simply copy and paste the XML into the XML view of RapidMiner and press the green tick to load it.
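For anyone who wants to try the pattern outside RapidMiner first, here is a small Python sketch. It assumes the four pieces above combine into `^[A-Z].*`; the function name `keep_capitalized` and the sample sentence are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
import re

# Start of token, one uppercase letter A-Z, then any number of any characters.
UPPERCASE_START = re.compile(r"^[A-Z].*")

def keep_capitalized(tokens):
    """Return only the tokens whose first character is an uppercase A-Z letter."""
    return [t for t in tokens if UPPERCASE_START.match(t)]

# Simple whitespace tokenization of a sample sentence.
tokens = "Die Regierung plant eine neue Reform der Rente".split()
capitalized = keep_capitalized(tokens)
print(capitalized)
```

Two things to keep in mind: the first word of a sentence passes the filter even when it is not a noun ("Die" here), and [A-Z] does not cover Ä, Ö, or Ü, so a noun like "Ärzte" would need a character class such as [A-ZÄÖÜ].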