How to create new examples by splitting at punctuation marks?
Hi all!
I wonder if it is possible to split an example containing text by punctuation marks. I have an exampleset containing some metadata for a text attribute. The text attribute contains many sentences. Here are 2 examples as demonstration:
2012-05-04 Source1 Speaker1 Context1 "The unsettling prospects come at a time of growing uncertainty for the country’s economy. With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06 Source2 Speaker2 Context2 "Already some farmers are watching their cash crops burn to the point of no return. Others have been cutting their corn early to use for feed, a much less profitable venture."
What I want to do is to split the text attribute by e.g. "." while keeping the metadata for every sentence. The result would be 4 examples:
2012-05-04 Source1 Speaker1 Context1 "The unsettling prospects come at a time of growing uncertainty for the country’s economy."
2012-05-04 Source1 Speaker1 Context1 "With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06 Source2 Speaker2 Context2 "Already some farmers are watching their cash crops burn to the point of no return."
2012-05-06 Source2 Speaker2 Context2 "Others have been cutting their corn early to use for feed, a much less profitable venture."
Is there any way to do this? I tried to use tokenization, but it only delivers vectors (i.e. new attributes), not new examples. If I switch off vectorization, I cannot see any difference in the result set apart from "." being deleted in the text attribute.
Any help is much appreciated!
Hi Chris,
You can use e.g. the Cut Document operator for this. You may have to tune the regular expression a bit, but the process below shows the general idea.
~Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="505" width="721">
      <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="120">
        <list key="attribute_values">
          <parameter key="meta" value="false"/>
          <parameter key="text" value="&quot;This is also a test. With two sentences.&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="45" y="30">
        <list key="attribute_values">
          <parameter key="meta" value="true"/>
          <parameter key="text" value="&quot;Test. Sentence. Blubb.&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append" width="90" x="179" y="30"/>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
        <parameter key="keep_text" value="true"/>
        <list key="specify_weights"/>
        <process expanded="true" height="505" width="658">
          <operator activated="true" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="112" y="30">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries">
              <parameter key="t" value="\..\."/>
            </list>
            <list key="regular_expression_queries">
              <parameter key="t" value="([^\.]+)"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="523" width="658">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|meta|text"/>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
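For anyone who wants to check what the regular expression ([^\.]+) does before tuning it, here is a rough Python sketch of the same idea outside RapidMiner. It is only an illustration under a few assumptions: it uses Python's re module rather than the Java regex engine that RapidMiner runs on, the helper split_examples and the KEEP_PUNCT pattern are made-up names that do not appear in the process, and the whitespace trimming is an addition of this sketch, not something the process is known to do.

```python
import re

# Same pattern as in the regular_expression_queries list of the process:
# each match is a run of characters that contains no period.
SEGMENT = re.compile(r"([^\.]+)")

# Hypothetical "tuned" variant (not part of the original process):
# also splits at "!" and "?" and keeps the closing punctuation.
KEEP_PUNCT = re.compile(r"([^.!?]+[.!?]*)")


def split_examples(rows, pattern=SEGMENT, text_key="text"):
    """Return one row per matched segment, copying all other fields."""
    out = []
    for row in rows:
        for match in pattern.finditer(row[text_key]):
            sentence = match.group(1).strip()  # trimming is this sketch's choice
            if sentence:  # skip whitespace-only segments
                new_row = dict(row)  # metadata is duplicated for every segment
                new_row[text_key] = sentence
                out.append(new_row)
    return out


rows = [
    {"meta": "true", "text": "Test. Sentence. Blubb."},
    {"meta": "false", "text": "This is also a test. With two sentences."},
]

for r in split_examples(rows):
    print(r["meta"], "|", r["text"])
# true | Test
# true | Sentence
# true | Blubb
# false | This is also a test
# false | With two sentences
```

With the original pattern the period itself is dropped from every segment. Passing pattern=KEEP_PUNCT instead gives "Test.", "Sentence.", "Blubb." and so on. Neither pattern knows about abbreviations such as "e.g.", so real text will probably need a more careful expression.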
Hi Marius,
great, that will do it!
Thanks a lot!