How to create new examples by splitting at punctuation marks?
chrisniem
New Altair Community Member
Hi all!
I wonder if it is possible to split an example containing text by punctuation marks. I have an exampleset containing some metadata for a text attribute. The text attribute contains many sentences. Here are 2 examples as demonstration:
2012-05-04 Source1 Speaker1 Context1 "The unsettling prospects come at a time of growing uncertainty for the country’s economy. With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06 Source2 Speaker2 Context2 "Already some farmers are watching their cash crops burn to the point of no return. Others have been cutting their corn early to use for feed, a much less profitable venture."
What I want to do is to split the text attribute by e.g. "." while keeping the metadata for every sentence. The result would be 4 examples:
2012-05-04 Source1 Speaker1 Context1 "The unsettling prospects come at a time of growing uncertainty for the country’s economy."
2012-05-04 Source1 Speaker1 Context1 "With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06 Source2 Speaker2 Context2 "Already some farmers are watching their cash crops burn to the point of no return."
2012-05-06 Source2 Speaker2 Context2 "Others have been cutting their corn early to use for feed, a much less profitable venture."
Is there any way to do this? I tried tokenization, but it only delivers vectors (i.e. new attributes), not new examples. If I switch off vectorization, I cannot see any difference in the result set apart from "." being deleted in the text attribute.
Any help is much appreciated!
Thanks
Chris
Answers
Hi Chris,
You can use, for example, the Cut Document operator for this. You may have to tune the regular expression a bit, but the process below shows the general idea.
Best,
~Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="505" width="721">
<operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="120">
<list key="attribute_values">
<parameter key="meta" value="false"/>
<parameter key="text" value="&quot;This is also a test. With two sentences.&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="45" y="30">
<list key="attribute_values">
<parameter key="meta" value="true"/>
<parameter key="text" value="&quot;Test. Sentence. Blubb.&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append" width="90" x="179" y="30"/>
<operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="text"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
<parameter key="keep_text" value="true"/>
<list key="specify_weights"/>
<process expanded="true" height="505" width="658">
<operator activated="true" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="112" y="30">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries">
<parameter key="t" value="\..\."/>
</list>
<list key="regular_expression_queries">
<parameter key="t" value="([^\.]+)"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<process expanded="true" height="523" width="658">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="|meta|text"/>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 2"/>
<connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
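To see what the regular expression in the process does outside of RapidMiner, here is a minimal Python sketch. It is not the operator's implementation, only an illustration of the idea: every match of the pattern becomes a new row, and the other fields of the original row are copied along. `SEGMENT` is the pattern from the process; `SEGMENT_TUNED`, the helper `split_examples`, and the whitespace trimming are assumptions added for illustration.

```python
import re

# Pattern copied from the regular_expression_queries entry of the process:
# one or more characters that are not a period.
SEGMENT = re.compile(r"([^\.]+)")

# Hypothetical tuned variant (not from the thread): also stops at "!" and "?"
# and keeps the closing punctuation mark with the sentence.
SEGMENT_TUNED = re.compile(r"[^.!?]+[.!?]?")


def split_examples(rows, pattern, text_key="text"):
    """Return one row per matched segment, copying all other fields."""
    out = []
    for row in rows:
        for match in pattern.finditer(row[text_key]):
            # Trimming is an addition here; the match itself keeps the
            # space that follows the previous period.
            segment = match.group(0).strip()
            if segment:
                out.append({**row, text_key: segment})
    return out


# Same toy data as the two Generate Data operators in the process.
rows = [
    {"meta": "true", "text": "Test. Sentence. Blubb."},
    {"meta": "false", "text": "This is also a test. With two sentences."},
]

if __name__ == "__main__":
    for r in split_examples(rows, SEGMENT):
        print(r)
    for r in split_examples(rows, SEGMENT_TUNED):
        print(r)
```

With the original pattern the periods are dropped from the segments; the tuned variant keeps them. Both will also split at abbreviations such as "e.g.", so real text probably needs a more careful expression.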
Hi Marius,
great, that will do it!
Thanks a lot!
Chris