"Generate Attributes - function expression OR regex for
sirhc
New Altair Community Member
Hello everyone,
I have a nominal attribute, title, which contains a text description with a year (4 digits) somewhere inside it. Sometimes the text also contains other digits. So I need to search for "4 digits within the text" and generate a new attribute for the year.
Example:
title = "that is the 1st test attribute 2019 but not the last one."
Now I want to extract the year from the title attribute.
Year = 2019
I first tried the Replace operator with the regex "\d{4}", but I could only replace the digits, not extract them into a new attribute.
Can someone please help me or give me an idea of how to solve this?
Thank you in advance; I am new to RapidMiner.
Best,
Chris
Answers
Hi @sirhc,
For the example text, there are at least three operators you could use:
Extract Information
Keep Document Parts
Cut Document
Can you try these operators with a regex?
My example process uses two of them.
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="179" y="34">
        <parameter key="text" value="that is the 1st test attribute 2019 but not the last one"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply" width="90" x="313" y="34"/>
      <operator activated="true" class="text:keep_document_parts" compatibility="8.1.000" expanded="true" height="68" name="Keep Document Parts" width="90" x="447" y="34">
        <parameter key="extraction_regex" value="\ \d{4}\ "/>
      </operator>
      <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="187">
        <parameter key="query_type" value="Regular Expression"/>
        <list key="string_machting_queries"/>
        <parameter key="attribute_type" value="Nominal"/>
        <list key="regular_expression_queries">
          <parameter key="year" value="\ \d{4}\ "/>
        </list>
        <list key="regular_region_queries"/>
        <list key="xpath_queries"/>
        <list key="namespaces"/>
        <parameter key="ignore_CDATA" value="true"/>
        <parameter key="assume_html" value="true"/>
        <list key="index_queries"/>
        <list key="jsonpath_queries"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Keep Document Parts" to_port="document"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Extract Information" to_port="document"/>
      <connect from_op="Keep Document Parts" from_port="document" to_port="result 1"/>
      <connect from_op="Extract Information" from_port="document" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
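To see what the regex in this process matches without opening RapidMiner, here is a rough Python sketch. It is only an illustration: the function name `find_spaced_year` is made up, the pattern is written with plain spaces (equivalent to the escaped spaces `\ \d{4}\ ` above), and stripping the surrounding spaces from the match is my own assumption, not something the operators are confirmed to do.

```python
import re
from typing import Optional

# A space, exactly four digits, and another space,
# equivalent to the "\ \d{4}\ " pattern used in the process.
SPACED_YEAR = re.compile(r" \d{4} ")


def find_spaced_year(text: str) -> Optional[str]:
    """Return the first space-delimited four-digit run, or None if there is none."""
    match = SPACED_YEAR.search(text)
    if match is None:
        return None
    # The raw match includes the two spaces; strip them for readability.
    return match.group(0).strip()


if __name__ == "__main__":
    print(find_spaced_year("that is the 1st test attribute 2019 but not the last one"))
    print(find_spaced_year("the year was 2019"))
```

Because the pattern requires a space on both sides, a year at the very start or end of the text (as in the second call) is not matched by this sketch.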
YY
Hi Chris @sirhc,
You almost had it. The missing piece was to use a so-called capturing group in your replace-what parameter (round brackets) and use the number of the capturing group in the replace-by parameter with a dollar sign.
Replace-what: .*(\d{4}).*
Replace-by: $1
The process below shows a simple example. Please note that it uses the last occurrence of a year in case there are multiple years. If there is no year in the title, it returns the complete title instead.
Hope this helps,
Ingo
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="Title&#10;That is the 1st test attribute 2019 but not the last one&#10;Another title - this time containing the year 2015&#10;2001 was a fantastic year and a strange movie&#10;And this here contains the year 1977 but also the number 42&#10;Finally here we have two years - 2008 and 2009"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="replace" compatibility="9.2.000" expanded="true" height="82" name="Replace" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="replace_what" value=".*(\d{4}).*"/>
        <parameter key="replace_by" value="$1"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
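For readers who want to try the capturing-group replacement outside RapidMiner, the following is a minimal Python sketch of the same idea. It is an approximation, not the operator's implementation: the function name `replace_with_year` is hypothetical, and Python writes the group reference as `\1` where the Replace operator uses `$1`.

```python
import re

# Greedy ".*" on both sides of one capturing group, as in replace-what.
YEAR_PATTERN = re.compile(r".*(\d{4}).*")


def replace_with_year(title: str) -> str:
    """Replace the whole title by the captured four digits, if the pattern matches."""
    # "\1" refers to capturing group 1 (written "$1" in the Replace operator).
    return YEAR_PATTERN.sub(r"\1", title)


if __name__ == "__main__":
    titles = [
        "That is the 1st test attribute 2019 but not the last one",
        "Finally here we have two years - 2008 and 2009",
        "This is a title without any year",
    ]
    for title in titles:
        print(repr(replace_with_year(title)))
```

The leading `.*` is greedy, so it consumes as much text as it can before the group matches; that is why the last four-digit run wins when a title contains two years. A title with no four-digit run does not match at all and is returned unchanged.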
Hi @yyhuang,
thanks, this worked, but not quite as I expected, since the data needed to be converted to documents first.
Hi @IngoRM,
thank you very much. This helped me a lot.
Is it possible to take only the year and, if there is no year in the title attribute, just leave the value empty?
I would probably have to run another Replace operator and filter for something like [a-zA-Z], right?
In the next step I have to generate a new attribute, age, calculated as today minus the year attribute extracted above. Is there a simple way to do that? Or just an idea?
Thank you very much - you guys helped me a lot.
Best,
Chris
Hi Chris,
The process below takes care of cases without a year (the extracted value is then missing) and calculates the age.
Hope this helps,
Ingo
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="Title&#10;That is the 1st test attribute 2019 but not the last one&#10;Another title - this time containing the year 2015&#10;2001 was a fantastic year and a strange movie&#10;This is a title without any year&#10;And this here contains the year 1977 but also the number 42&#10;Finally here we have two years - 2008 and 2009"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="generate_copy" compatibility="9.2.000" expanded="true" height="82" name="Generate Copy" width="90" x="179" y="34">
        <parameter key="attribute_name" value="Title"/>
        <parameter key="new_name" value="Year"/>
      </operator>
      <operator activated="true" class="replace" compatibility="9.2.000" expanded="true" height="82" name="Replace" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Year"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="replace_what" value=".*(\d{4}).*"/>
        <parameter key="replace_by" value="$1"/>
      </operator>
      <operator activated="true" class="replace" compatibility="9.2.000" expanded="true" height="82" name="Replace (2)" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Year"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="replace_what" value="[^\d]"/>
      </operator>
      <operator activated="true" class="parse_numbers" compatibility="9.2.000" expanded="true" height="82" name="Parse Numbers" width="90" x="581" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Year"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="decimal_character" value="."/>
        <parameter key="grouped_digits" value="false"/>
        <parameter key="grouping_character" value=","/>
        <parameter key="infinity_representation" value=""/>
        <parameter key="unparsable_value_handling" value="fail"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="34">
        <list key="function_descriptions">
          <parameter key="Age" value="date_get(date_now(),DATE_UNIT_YEAR) - [Year]"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Generate Copy" to_port="example set input"/>
      <connect from_op="Generate Copy" from_port="example set output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
      <connect from_op="Replace (2)" from_port="example set output" to_op="Parse Numbers" to_port="example set input"/>
      <connect from_op="Parse Numbers" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
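The chain of steps in this process (copy the title, keep the last four-digit run, strip all remaining non-digits, parse to a number, subtract from the current year) can be approximated in Python as below. This is a sketch under assumptions: the function names `extract_year` and `age_in_years` are hypothetical, `None` stands in for RapidMiner's missing value, and the optional `today` argument exists only to make the example reproducible.

```python
import re
from datetime import date
from typing import Optional


def extract_year(title: str) -> Optional[int]:
    """Approximate Replace -> Replace (2) -> Parse Numbers on a copy of the title."""
    # Replace: keep only the last four-digit run; unchanged if there is none.
    year_text = re.sub(r".*(\d{4}).*", r"\1", title)
    # Replace (2): delete every non-digit character.
    digits = re.sub(r"[^\d]", "", year_text)
    # Parse Numbers: an empty string is treated as a missing value here.
    return int(digits) if digits else None


def age_in_years(year: Optional[int], today: Optional[date] = None) -> Optional[int]:
    """Approximate the Age expression: current year minus Year."""
    if year is None:
        return None
    current_year = (today or date.today()).year
    return current_year - year


if __name__ == "__main__":
    for title in ["Finally here we have two years - 2008 and 2009",
                  "This is a title without any year"]:
        year = extract_year(title)
        print(title, "->", year, age_in_years(year))
```

One thing this sketch makes visible: a title that has no four-digit run but does contain other digits (for example only "42") is not left empty, because the second replacement keeps whatever digits remain. If such titles occur in the data, an extra check on the number of digits may be worth adding.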
Hi Ingo,
thank you very much, this worked perfectly.
Have a nice weekend,
Chris