Hello all,
I have a workflow containing a Create Document operator and a Process Documents operator.
The Process Documents operator contains a Tokenize operator and a Replace Tokens operator.
The Replace Tokens operator has the following rules:
replace est with Eastern_Time
replace dup with duplicate
replace hello with hallo
The vector creation of the Process Documents operator is set to Term Occurrences.
The text in the Create Document operator is:
est
dup
hello
The resulting word vector now contains:
Eastern_Time
duplicate
hallo
And now comes the strange thing:
Eastern_Time and duplicate have occurrence 0, and hallo has occurrence 1.
I expected a vector in which every term has occurrence 1.
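To make the expectation concrete, here is a rough Python sketch of what I would expect the pipeline to compute. This is not RapidMiner code: the helper name `expected_term_occurrences`, the whitespace tokenization, and the whole-token dictionary lookup are my own assumptions for illustration, and the real operators may work differently internally.

```python
from collections import Counter

# Replacement rules, copied from the Replace Tokens operator in my process.
REPLACEMENTS = {
    "est": "Eastern_Time",
    "dup": "duplicate",
    "hello": "hallo",
}


def expected_term_occurrences(text, replacements=REPLACEMENTS):
    """Hypothetical helper: tokenize, replace whole tokens, count occurrences."""
    # Assumption: plain whitespace splitting stands in for the Tokenize operator.
    tokens = text.split()
    # Assumption: each token is looked up as a whole; unknown tokens are kept.
    replaced = [replacements.get(token, token) for token in tokens]
    # Term occurrences: how often each (replaced) term appears in the document.
    return Counter(replaced)


if __name__ == "__main__":
    for term, count in expected_term_occurrences("est dup hello").items():
        print(term, count)
```

Under these assumptions each of the three replaced terms is counted once, which is the vector I expected from Process Documents as well.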
If I replace the Process Documents operator with the Process Documents from Files operator and write the words
est
dup
hello
in a text file, I get the expected behavior, with a vector containing
Eastern_Time
duplicate
hallo
and every term has an occurrence of 1.
Is this a bug?
Am I doing something wrong?
All the best,
Simon
PS: Here is the workflow with the Create Document operator:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
<process expanded="true" height="811" width="435">
<operator activated="true" class="text:create_document" compatibility="5.0.6" expanded="true" height="60" name="Create Document (8)" width="90" x="45" y="30">
<parameter key="text" value="est dup hello"/>
<parameter key="label_value" value="jmol"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.0.6" expanded="true" height="94" name="Process Documents (3)" width="90" x="315" y="30">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="datamanagement" value="double_array"/>
<process expanded="true" height="811" width="1068">
<operator activated="true" class="text:tokenize" compatibility="5.0.7" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
<operator activated="true" class="text:replace_tokens" compatibility="5.0.6" expanded="true" height="60" name="Replace Tokens" width="90" x="514" y="30">
<list key="replace_dictionary">
<parameter key="est" value="Eastern_Time"/>
<parameter key="dup" value="duplicate"/>
<parameter key="hello" value="hallo"/>
</list>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Replace Tokens" to_port="document"/>
<connect from_op="Replace Tokens" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Document (8)" from_port="output" to_op="Process Documents (3)" to_port="documents 1"/>
<connect from_op="Process Documents (3)" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="90"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>