"Regular expressions?"
I've been messing around with text processing for log analysis for almost 30 years. I've used a variety of languages in that time.
One of the more painful facts that I've had to learn is that nobody implements regular expressions in quite the same way. I'm tripping over this yet again with RapidMiner and it's becoming a real source of frustration for me. The manual is silent on the subject. (Given how critical this function is to data analysis, I found that lack disturbing, to say the least!)
Does anyone out there know of a good resource for creating regexes in RapidMiner? Searching the forum archives repeatedly turns up references to Java's regular expression documentation. That in turn refers back to Perl's regex documentation, with exceptions noted.
None of that documentation tells me how RapidMiner 5.0 actually interprets regexes. Attempting to find the right syntax is chewing up a great deal of my time. Is there a broad set of examples out there? Or a truly thorough discussion of how to properly define regexes within the GUI? I'd love to read something like that.
For example, take my current struggle. I'm wading through a long list of software that was entered by several different people over the years. In addition, the rules for what was entered where, and when, changed as the database grew.
The first thing that I want to do is simply count the number of versions of software that's out there. Unfortunately, a lot of the old entries include the version number as part of the asset name, so I have to strip that out. With me so far?
Here's a typical example (all examples from the same attribute, Asset):
Illustrator
Illustrator CS
Illustrator CS2
Illustrator CS3
Illustrator CS4
Dreamweaver
Dreamweaver CS3
Dreamweaver CS4
Photoshop
Photoshop CS
Photoshop CS4
In this instance, I'd love to just look for CS and CS[2-4] and strip them off the entries.
And another:
Extra! v6.7
Extra! v9
And another:
Netware Client v4.9 SP1 (IP)
Netware Client v4.9 SP1 (IP/IPX)
Netware Client 4
There's a lot more where these came from. At the moment, I'm tackling this by going through the list, adding one entry at a time to a Map operator. It's a painful, trial-and-error process at this point.
The issue that I'm struggling with is that I can't predict what end result I'm going to get. For example, this works:
Illustrator CS[2-4] : Illustrator
but naturally doesn't eliminate the Illustrator CS entry. I had to create a separate Map entry for it. However, now I need to do the same for all the other programs in the Adobe Creative Suite.
This doesn't:
Extra!\sv* : Extra!
Neither does this:
Extra! v* : Extra!
Nor does this:
Netware Client * : Netware Client
Clearly, I'm doing something wrong. (Maybe I'm not holding my lower lip right?) I would love any guidance that anyone might have.
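For reference, here is how those patterns read under ordinary Perl-style regex rules. This is a minimal sketch in Python, whose `re` module treats these particular constructs (`*`, `?`, `\s`, `[2-4]`, `.*`) the same way Java's regex engine does. Whether RapidMiner 5.0's Map operator applies them identically, or whether it matches the whole value or only part of it, is an assumption that has not been checked here. The `strip_version` helper and its `VERSION_SUFFIX` pattern are hypothetical, made up for illustration.

```python
import re
from collections import Counter

# "*" repeats the item before it; it is not a shell-style "anything" wildcard.
# So "v*" is "zero or more v" and " *" is "zero or more spaces".

# If the pattern must cover the whole value, the version text is left unmatched:
print(re.fullmatch(r"Extra!\sv*", "Extra! v6.7"))             # None
print(re.fullmatch(r"Netware Client *", "Netware Client 4"))  # None

# If the pattern is replaced wherever it is found, the version text is left behind:
print(re.sub(r"Extra!\sv*", "Extra!", "Extra! v6.7"))                     # Extra!6.7
print(re.sub(r"Netware Client *", "Netware Client", "Netware Client 4"))  # Netware Client4

# ".*" is the regex spelling of "anything"; "?" makes the item before it optional.
print(bool(re.fullmatch(r"Extra!\sv.*", "Extra! v6.7")))              # True
print(bool(re.fullmatch(r"Illustrator CS[2-4]?", "Illustrator CS")))  # True

# Hypothetical catch-all (an assumption, not a RapidMiner setting): a trailing
# "CS", "CS2".."CS4", or anything that starts with an optional "v" and a digit.
VERSION_SUFFIX = re.compile(r"\s+(?:CS[2-4]?|v?\d.*)$")


def strip_version(asset: str) -> str:
    """Return the asset name with a trailing version suffix removed."""
    return VERSION_SUFFIX.sub("", asset)


# The sample Asset values from the post.
assets = [
    "Illustrator", "Illustrator CS", "Illustrator CS2", "Illustrator CS3",
    "Illustrator CS4", "Dreamweaver", "Dreamweaver CS3", "Dreamweaver CS4",
    "Photoshop", "Photoshop CS", "Photoshop CS4",
    "Extra! v6.7", "Extra! v9",
    "Netware Client v4.9 SP1 (IP)", "Netware Client v4.9 SP1 (IP/IPX)",
    "Netware Client 4",
]

# Count entries per product once the version suffix is gone.
counts = Counter(strip_version(a) for a in assets)
print(counts)
```

If the Map operator does follow these rules, the equivalent entries would look something like `Illustrator CS[2-4]? : Illustrator`, `Extra!\sv.* : Extra!` and `Netware Client.* : Netware Client`. Treat these as untested guesses for RapidMiner itself.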
