"Regular expressions?"

User: "sgtrock"
Updated by Jocelyn
I've been messing around with text processing for log analysis for almost 30 years.  I've used a variety of languages in that time.

One of the more painful facts that I've had to learn is that nobody implements regular expressions in quite the same way.  I'm tripping over this yet again with RapidMiner and it's becoming a real source of frustration for me.  The manual is silent on the subject.  (Given how critical this function is to data analysis, I found that lack to be disturbing to say the least!).

Does anyone out there know of a good resource for creating regexes in RapidMiner?  Searching the forum archives repeatedly turns up references to Java's regular expression documentation.  That in turn refers back to Perl's regex documentation, with exceptions noted.

None of that documentation tells me how RapidMiner 5.0 actually interprets regexes.  Attempting to find the right syntax is chewing up a great deal of my time.  Is there a broad set of examples out there?  Or a truly thorough discussion of how to properly define regexes within the GUI?  I'd love to read something like that.

For example, take my current struggle.  I'm wading through a long list of software that was entered in by several different people over the years.  In addition, the rules for what was entered where and when changed as the database grew. 

The first thing that I want to do is simply count the number of versions of software that's out there.  Unfortunately, a lot of the old entries include the version number as part of the asset name, so I have to strip that out.  With me so far?

Here's a typical example (all examples from the same attribute, Asset):
Illustrator
Illustrator CS
Illustrator CS2
Illustrator CS3
Illustrator CS4
Dreamweaver
Dreamweaver CS3
Dreamweaver CS4
Photoshop
Photoshop CS
Photoshop CS4
In this instance, I'd love to just look for CS and CS[2-4] and strip them off the entries.
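
A minimal sketch of that idea in plain Java, assuming RapidMiner hands the pattern to java.util.regex (the forum references to Java's documentation suggest this, but it is not confirmed here). The class and method names are hypothetical and only for illustration; the pattern is the part that matters.

```java
import java.util.regex.Pattern;

public class StripSuiteSuffix {
    // Assumption: standard java.util.regex syntax applies.
    // Matches trailing whitespace + "CS", optionally followed by one digit 2-4,
    // anchored at the end of the string: " CS", " CS2", " CS3", " CS4".
    private static final Pattern CS_SUFFIX = Pattern.compile("\\s+CS[2-4]?$");

    // Returns the asset name with any trailing CS / CS2-CS4 suffix removed.
    static String strip(String asset) {
        return CS_SUFFIX.matcher(asset).replaceAll("");
    }

    public static void main(String[] args) {
        String[] assets = {
            "Illustrator", "Illustrator CS", "Illustrator CS3",
            "Dreamweaver CS4", "Photoshop CS"
        };
        for (String asset : assets) {
            System.out.println(asset + " -> " + strip(asset));
        }
    }
}
```

Because the product name is not part of the pattern, one expression covers every product in the list rather than one entry per product.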

And another:
Extra! v6.7
Extra! v9
And another:

Netware Client v4.9 SP1 (IP)
Netware Client v4.9 SP1 (IP/IPX)
Netware Client 4
There's a lot more where these came from.  At the moment, I'm tackling this by going through the list, adding one entry at a time to a Map operator.  It's a painful, trial and error process at this point.

The issue that I'm struggling with is that I can't predict what end result I'm going to get.  For example, this works:
Illustrator CS[2-4]   :  Illustrator
but naturally doesn't eliminate the Illustrator CS entry.  I had to create a separate Map entry for it.  However, now I need to do the same for all the other programs in the Adobe Creative Suite.

This doesn't:
Extra!\sv*  :  Extra!
Neither does this:
Extra! v*  :  Extra!
Nor does this:
Netware Client *   :  Netware Client
Clearly, I'm doing something wrong.  (Maybe I'm not holding my lower lip right?  :D)  I would love any guidance that anyone might have.
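
One possible explanation, sketched in plain Java under the assumption that RapidMiner uses java.util.regex and requires the pattern to match the whole value (neither is confirmed here): in regex syntax `*` is not a shell-style wildcard. It repeats the preceding element, so `v*` means "zero or more v" and ` *` means "zero or more spaces"; "any run of characters" is written `.*`. The class name and the `stripVersion` helper below are hypothetical.

```java
import java.util.regex.Pattern;

public class GlobVersusRegex {
    // Whitespace, an optional "v", a digit, then everything up to the end:
    // " v6.7", " v9", " v4.9 SP1 (IP/IPX)", " 4".
    private static final Pattern VERSION_TAIL = Pattern.compile("\\s+v?\\d.*$");

    // Removes the first version-like token and everything after it.
    static String stripVersion(String asset) {
        return VERSION_TAIL.matcher(asset).replaceFirst("");
    }

    public static void main(String[] args) {
        // String.matches requires the pattern to cover the entire string.
        System.out.println("Extra! v6.7".matches("Extra! v*"));   // false: "6.7" is left over
        System.out.println("Extra! v6.7".matches("Extra! v.*"));  // true: ".*" consumes "6.7"
        System.out.println(
            "Netware Client v4.9 SP1 (IP)".matches("Netware Client *"));  // false

        String[] assets = {
            "Extra! v6.7", "Extra! v9",
            "Netware Client v4.9 SP1 (IP/IPX)", "Netware Client 4"
        };
        for (String asset : assets) {
            System.out.println(asset + " -> " + stripVersion(asset));
        }
    }
}
```

Note that this version pattern would also cut off any product whose name itself contains a standalone number, so the results need checking against the real list.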
