"Problems with filtering attributes with regex"

Question

Hi experts, I have to create a co-occurrence graph, so I created a corpus and an occurrence matrix. I have a problem with the occurrence matrix: I can't get it to keep only words with 3 or more letters for my analysis. When I use, for example, [(0-9)+][-!"#$%&'()*+,./:;<=>?@$$\$$_`{|}~][(0-9)+] [(a-z){3,}] all columns are deleted. Does anyone have an idea how to fix this problem?

Telcontar120 · Accepted Answer

Inside your "Process Documents" operator, after you have tokenized your words, simply use the "Filter Tokens (by Length)" operator and set it to the minimum length you want. I think that's a much easier way to get what you are trying to accomplish.
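To make clear what that operator does to the token list, here is a minimal sketch in plain Python. It is not RapidMiner code: the tokenizer is a crude stand-in, and `filter_tokens_by_length` and `min_chars` are hypothetical names chosen for this illustration.

```python
import re

def filter_tokens_by_length(tokens, min_chars=3):
    """Keep only tokens that have at least min_chars characters."""
    return [t for t in tokens if len(t) >= min_chars]

# Crude tokenizer (an assumption for this sketch): runs of letters only.
text = "We go to the big old market at noon"
tokens = re.findall(r"[A-Za-z]+", text.lower())

kept = filter_tokens_by_length(tokens, min_chars=3)
print(kept)
```

Because the short tokens are dropped before the occurrence matrix is built, no regex on the attribute names is needed afterwards.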

BalazsBaranyRM · Accepted Answer

Hi!

You have a highly complex and very specific regex. I wasn't even able to find a text that it matches.

The use of character classes [] and parentheses () the way you're doing it is not very common. This would be more standard usage: [a-z()] (if you're really matching lower case characters and the opening and closing parentheses).
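A short sketch of the difference, written in Python rather than in RapidMiner (which uses Java regular expressions; as far as I know these constructs behave the same in both). Inside square brackets, the parentheses, braces, digit and comma are ordinary characters, so the whole bracket matches exactly one character.

```python
import re

# One single character out of the set: ( ) { } , 3 and a-z
odd = re.compile(r"[(a-z){3,}]")
# Three or more lowercase letters in a row
usual = re.compile(r"[a-z]{3,}")

odd_word = odd.fullmatch("word") is not None      # four characters, class allows one
odd_brace = odd.fullmatch("{") is not None        # "{" is a member of the set
usual_word = usual.fullmatch("word") is not None  # four lowercase letters
usual_short = usual.fullmatch("ab") is not None   # only two letters

print(odd_word, odd_brace, usual_word, usual_short)
```

So the quantifier has to sit outside the brackets to mean "three or more letters".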

The regexp also has a space at the end.

In Select Attributes, the regexp must match the whole attribute name. (Usually a regex only needs to match a part of the target; Select Attributes is different in this regard.)
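The whole-name rule can be sketched in Python, where `fullmatch` stands in for the whole-name behaviour and `search` for the usual partial match. The attribute names below are made up for illustration, and `[a-z]{3,}` is only an assumed pattern for "lowercase words with three or more letters".

```python
import re

pattern = re.compile(r"[a-z]{3,}")

# Hypothetical attribute names from an occurrence matrix
names = ["to", "cat", "graph", "x1", "co-occurrence"]

# Partial match: the pattern may hit anywhere inside the name
partial = [n for n in names if pattern.search(n)]
# Whole match: the pattern must cover the name from start to end
whole = [n for n in names if pattern.fullmatch(n)]

print(partial)
print(whole)
```

A name such as "co-occurrence" passes the partial test because "occurrence" is inside it, but it fails the whole-name test because of the hyphen.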

When developing regexes, it's best to start from a simple state and then build up on that, using RapidMiner's testing methods.

If I understand your problem, the regex (\w+-){2}\w+ would be a simple representation of "word-word-word". You can start from this and build upon it.
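Here is a quick way to check that pattern against a few sample names before building on it. This is a Python sketch with invented names; `fullmatch` is used because the whole name has to match.

```python
import re

pattern = re.compile(r"(\w+-){2}\w+")

# Invented sample names for testing the pattern
samples = ["one-two-three", "word-word", "state-of-the-art", "plain"]

# Map each name to whether the pattern covers it completely
results = {name: pattern.fullmatch(name) is not None for name in samples}

for name, matched in results.items():
    print(name, matched)
```

Only the name with exactly two hyphens matches, since `\w` does not include the hyphen itself. Changing `{2}` to `{2,}` would be one way to also accept longer chains.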

Regards,

Balázs