"Text Plugin - StringTokenizer doesn't tokenize alphanumeric strings"

New Altair Community Member

Sep 30, 2009

Updated Nov 5, 2024 by Jocelyn

I'm using the StringTokenizer, part of the Text plugin (v. 4.5), to create tokens from text. My text documents contain alphanumeric codes that I would like to tokenize, but the StringTokenizer appears to tokenize only words and to remove any numeric or special characters.

For example, my text document may contain three alphanumeric codes, two of them distinct, as shown below. Running the StringTokenizer on it yields a single token called "C" (all numbers and special characters are removed). What I would like is for the StringTokenizer to find two distinct tokens ("C847_0" and "C372_-1") in this document.

C847_0  C372_-1  C847_0

Are there any options in the StringTokenizer that I can set to allow alphanumeric tokens? Or is there an operator that simply creates attributes by splitting on spaces and then I could simply filter out whatever type of attributes I don't need (e.g., purely numeric tokens)?
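
To make the goal concrete, here is a minimal Python sketch of the two behaviors. It is not RapidMiner code: `letters_only_tokens` only approximates what I observe from the StringTokenizer (it is an assumption, not the plugin's actual implementation), and `whitespace_tokens` is a hypothetical helper showing the split-on-spaces-then-filter approach I have in mind.

```python
import re

def letters_only_tokens(text):
    # Approximates the observed behavior: keep only runs of letters,
    # so digits, underscores, and hyphens are discarded.
    return re.findall(r"[A-Za-z]+", text)

def whitespace_tokens(text, drop_numeric=True):
    # Split on any run of whitespace, keeping each code intact.
    tokens = text.split()
    if drop_numeric:
        # Filter out tokens made up of digits only.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

doc = "C847_0  C372_-1  C847_0"
print(sorted(set(letters_only_tokens(doc))))  # ['C']
print(sorted(set(whitespace_tokens(doc))))    # ['C372_-1', 'C847_0']
```

Taking the set of tokens mirrors how each distinct token would become one attribute.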

Any help would be appreciated.

Thanks.


Ryujakk

New Altair Community Member

Oct 1, 2009

Hi there,

I had the exact same problem. One solution might be to download RapidMiner 5 beta, which should be available quite soon according to Tobias Malbrecht. Or you can try the text plugin extension I wrote to solve this. It might still be buggy though...

Here is the link if you want to try it out: http://www.filedropper.com/rapidminer-advancedstringtokenizer-46

Just download the jar and copy it to "RapidMiner\lib\plugins". Remove the previous text plugin first, though!

- R

land

New Altair Community Member

Oct 1, 2009

Hi,
although not official yet, the beta is already downloadable on SourceForge.

Greetings,
Sebastian

James

New Altair Community Member

Oct 1, 2009

Hi,

Ryujakk - thanks for the link. I'm unable to download the plugin right now because my work blocks access to filedropper.com, but I will try it out later.

Sebastian - thank you for the information about the beta; I went ahead and installed the software (it looks great!). Are the text operators part of the core in version 5.0? (The link below mentions the operators may be part of 5.0.) If so, I'm having trouble locating the operators.

http://rapid-i.com/rapidforum/index.php/topic,1183.0.html

If not, do you know if a version 5.0 of the text plugin will be released soon? I ask because when I "install" version 4.5 (or 4.6) of the text plugin in RM5.0B, I can't find the operators anywhere.

Thanks so much for your help.

land

New Altair Community Member

Oct 2, 2009

Hi,
the RapidMiner Beta does not support plugins. I'm currently working on adapting (and extending) the mechanism, so that it works with 5.0.

It's true that the new text processing operators (which are NOT just integrated from the TextPlugin, but have been completely redesigned) have been removed from the core again in the beta and will be published separately in the near future.

Greetings,
Sebastian

James

New Altair Community Member

Oct 2, 2009

Sebastian,

Thanks for the info about plugins not being supported in the beta. I look forward to seeing what the extended text operators will be capable of when they are released.

Until the new text processing operators are released, I will use the plugin from Ryujakk (thanks again!). Ryujakk's AdvancedStringTokenizer did just what I needed; it was able to tokenize my text using spaces while retaining numeric and special characters.

Thanks for your help.

James