Filter: 1) extract numeric information from text column 2) select attributes subset based on a table

st0nyde New Altair Community Member
edited November 5 in Community Q&A
Dear all,

I'm fairly new to RapidMiner and hope someone in the community can help me with my problem. I already have experience with other ETL and data management tools, but I have not found a way to tackle this correctly in RapidMiner.

I have two questions. The first is more important; the second is nice to know.

1) Extract numeric information matching a pattern from a text column

I am trying to solve a data preparation task in which I have an attribute of type text containing sentences and descriptions in German, along with numeric values that are also of interest. I have already worked through some tutorials, searched the community for possible solutions, and experimented with operators and their parameters, including RegEx logic.

The use case requires duplicating the dataset: one copy is used to extract the text information, the other to extract the values contained in the text column, so that both can be matched together at the end. Applying text processing is no problem with the Text Processing Extension (Transform Cases, Tokenize, Filter Stopwords, Stem), but extracting the numeric data from the same column in a separate process is frustrating me.
In RegEx terms, the pattern I am looking for can be described as ([€0-9.;,\- ]+[-€]). All other information can be removed from the text column in the numeric extraction stream.

What I have tried so far:

- Process Documents from Data: I had a problem with the "€" character, so I tried replacing it with "E" after transforming the cases from [A-Z] to [a-z], so that the only uppercase E left in the data is the currency marker. I tried this both inside and outside Process Documents, but I did not get the expected results: via Process Documents the output was empty in most cases (no results).
 
- replaceAll function with RegEx (Replace, Generate Attributes, Tokenize): I tried to exclude many character combinations, but in the end that became too complex, so I focused on the inverse, which can be described as ([€0-9.;,\- ]+[-€]).

Question:
Is it possible to get the inverse of a RegEx replacement, so that instead of being replaced, the matched pattern values are kept as the result set, preferably in a new column so that I can compare/qualify them against the text? Or do you know a way to easily extract numeric values that follow a pattern (€ sign before or after, with some special characters, e.g. 50.000,-- €, €500, 500€)? I want to create some new metrics from this information, so I am currently stuck in the process.
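
To make the intended result concrete, here is a minimal sketch in plain Python (not a RapidMiner operator) of "keep the matches instead of replacing them". The pattern and the helper names `extract_amounts` and `to_float` are assumptions for illustration; the pattern is a stricter variant of the one above and assumes German number formatting (dot as thousands separator, comma as decimal separator).

```python
import re

# A number such as 500, 50.000 or 1.234,56, optionally followed by ",--" or "-".
NUMBER = r"\d+(?:[.,]\d+)*(?:,?-{1,2})?"
# The Euro sign may stand before or after the number.
EURO_AMOUNT = re.compile(rf"€\s*{NUMBER}|{NUMBER}\s*€")


def extract_amounts(text):
    """Return every Euro amount found in the text, as written."""
    return EURO_AMOUNT.findall(text)


def to_float(amount):
    """Convert a matched amount to a float, assuming German formatting."""
    digits = re.sub(r"[€\s-]", "", amount).rstrip(",")
    return float(digits.replace(".", "").replace(",", "."))


rows = [
    "Kaufpreis 50.000,-- € zzgl. Nebenkosten",
    "Kaution €500 und Gebühr 500€",
    "kein Betrag",
]
for text in rows:
    amounts = extract_amounts(text)
    print(text, "|", amounts, "|", [to_float(a) for a in amounts])
```

The matches could be stored in a new column next to the original text, which would allow comparing them row by row. Amounts written in English notation (e.g. 500.50) would be misread by `to_float` as written here.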

2) Select attributes based on another table (e.g. by extending a join condition)

Idea:
I have a data table with 82 attributes that I want to reduce, but without changing the imported table itself, because otherwise I would have to manipulate every sample set whenever I reload different samples.
Instead of selecting a subset manually (Retrieve -> Select Attributes), it would be great to be able to script the relevant features for future cases. I therefore thought about creating a second table that lists only the features (master data, no observations) together with a flag, so that only the attributes with flag value 1 are relevant.

Question:
Is it possible to define a subset based on a table input? I tried it with the weighted operator but was not really successful.
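
As an illustration of the idea only (plain Python, not a RapidMiner process), the flag-table logic could look like the sketch below. The column names `attribute` and `flag`, the sample data, and the helper `select_flagged` are hypothetical.

```python
def select_flagged(rows, flag_table):
    """Keep only the attributes whose flag is 1 in the master-data table."""
    keep = [entry["attribute"] for entry in flag_table if entry["flag"] == 1]
    return [{name: row[name] for name in keep} for row in rows]


# Master-data table: one line per attribute, no observations.
flag_table = [
    {"attribute": "id", "flag": 1},
    {"attribute": "description", "flag": 1},
    {"attribute": "internal_note", "flag": 0},
]

# Any sample set with the same attributes can be reduced the same way.
sample = [
    {"id": 1, "description": "Kaution 500€", "internal_note": "x"},
    {"id": 2, "description": "kein Betrag", "internal_note": "y"},
]

reduced = select_flagged(sample, flag_table)
print(reduced)
```

The selection lives in the flag table, so reloading a different sample would not require touching the selection step itself.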

I hope you can understand my challenges and give me some hints based on your RapidMiner experience :smile:

Regards,

Answers

  • Telcontar120
    Telcontar120 New Altair Community Member
    For the first question, I believe you should be able to use the "Replace" operator and simply use capture group notation (E\d+) to keep the part that you are interested in (using E as the Euro sign here).
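
    The capture-group idea can be sketched in plain Python as follows. This is an assumption about how the replacement might be set up, not the operator's documented behaviour: the pattern is wrapped so that it matches the whole value, and the replacement is only the captured group. The helper name `keep_only` is hypothetical.

```python
import re

# Match the whole text, but capture only the first E-prefixed number.
WHOLE_WITH_GROUP = re.compile(r"^.*?(E\d+).*$", re.DOTALL)


def keep_only(text):
    """Replace the entire text with the captured group, if there is a match."""
    return WHOLE_WITH_GROUP.sub(r"\1", text, count=1)


print(keep_only("preis E500 inkl. mwst"))
print(keep_only("kein betrag"))
```

    Note that a value without any match is left unchanged by the replacement, so such rows would still need to be filtered or blanked separately.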

    For the second question, I am not totally sure what you are asking or what you are trying to accomplish with the scripting. What is it about the ordinary "Select Attributes" setup that doesn't work for you? If you want the same set of attributes no matter what the input set is, then you should be able to re-use the same Select Attributes operator.
  • st0nyde
    st0nyde New Altair Community Member
    edited May 2020
    Hi Telcontar120,

    First of all, thanks for your support.

    Regarding 1): in general this should identify the part I am interested in, but can I store it in an easy way via "Generate Attributes" or another operator? I only found a way to perform a replace for RegEx patterns, and negation via ^ or ! did not really work. So I end up with the same result I already had, and I am looking for a way to keep/extract only the matched pattern values instead of replacing them.

    Regarding 2): you are absolutely right. I was just coming from another perspective, and it is not necessary to do it this way. I wanted to know whether it is possible to drive an attribute selection from flagged input variables coming from another data source.

    Regards,