[SOLVED] Text Processing - Tokenize: keep word order
CharlieFirpo
New Altair Community Member
Dear All!
Can anybody help me tokenize text in a way that preserves the original word order?
I have a sample text like "delta gamma alpha beta". I use a Process Documents operator with a Tokenize operator inside it, which creates a word vector that becomes an example set after a WordList to Data operator. Unfortunately, the result is an alphabetically ordered list, 'alpha; beta; delta; gamma' [first, second, third, fourth rows]. I want the original word order instead: an example set where the first example is 'delta', the second is 'gamma', the third is 'alpha', and the fourth is 'beta'. Without the WordList to Data operator, I get a WordList that is also alphabetically ordered.
Of course this could be solved with a Loop operator in a cumbersome way, but that is not an elegant solution.
So how can I tokenize in a way that preserves the original word order?
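For illustration, here is what I mean in plain Python (just a sketch of the idea, not RapidMiner code):

text = "delta gamma alpha beta"
tokens = text.split()            # original order: ['delta', 'gamma', 'alpha', 'beta'] - the rows I want
word_list = sorted(set(tokens))  # alphabetical order: ['alpha', 'beta', 'delta', 'gamma'] - what I currently get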
Thank you!!
Answers
-
Hello CharlieFirpo
How about the following?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="112" y="75">
        <parameter key="text" value="delta gamma beta alpha delta eta alpha "/>
      </operator>
      <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document" width="90" x="112" y="165">
        <parameter key="query_type" value="Regular Expression"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries">
          <parameter key="text" value="(\S+)"/>
        </list>
        <list key="regular_region_queries"/>
        <list key="xpath_queries"/>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30"/>
          <connect from_port="segment" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data" width="90" x="246" y="75">
        <parameter key="text_attribute" value="text"/>
        <parameter key="add_meta_information" value="false"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Cut Document" to_port="document"/>
      <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
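The trick is that Cut Document with the regular expression (\S+) creates one segment per whitespace-separated token, in the order in which the tokens appear in the text, and Documents to Data then turns each segment into one example row, so the original order is preserved. Roughly the same idea in plain Python (only a sketch of the principle, not part of the process above):

import re

text = "delta gamma beta alpha delta eta alpha"
# re.findall returns the matches in the order they occur in the text,
# analogous to Cut Document producing one segment per (\S+) match
tokens = re.findall(r"\S+", text)
print(tokens)  # ['delta', 'gamma', 'beta', 'alpha', 'delta', 'eta', 'alpha']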
regards
Andrew
-
Thank you!
It works perfectly! I changed the 'mode' parameter of the Tokenize operator inside Cut Document to 'specify characters' with the characters '. ,;:' so that numbers in the input text are handled as well.
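Roughly what that change does, sketched in plain Python (an illustration only, not the RapidMiner operator itself):

import re

text = "delta 42, gamma: 7; alpha 100"
# Splitting only on the specified characters ' .,;:' keeps digits inside the tokens,
# whereas a 'non letters' tokenization would split on the digits as well.
tokens = [t for t in re.split(r"[ .,;:]+", text) if t]
print(tokens)  # ['delta', '42', 'gamma', '7', 'alpha', '100']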
Nice day!