Strange behavior of the Replace Tokens operator
simon_knoll
New Altair Community Member
Hello all,

I have a workflow containing a Create Document operator and a Process Documents operator. The Process Documents operator contains a Tokenize and a Replace Tokens operator. The Replace Tokens operator has the following rules:

replace est with Eastern_Time
replace dup with duplicate
replace hello with hallo

The Process Documents vector creation is set to Term Occurrences. The Create Document text is:

est
dup
hello

The created word vector now contains:

Eastern_Time
duplicate
hallo

And now comes the strange thing: Eastern_Time and duplicate have occurrence 0, and hallo has occurrence 1. I expected a vector where every term has occurrence 1.
If I exchange the Process Documents operator for the Process Documents from Files operator and write the words

est
dup
hello

into a text file, I get the expected behavior: a vector containing

Eastern_Time
duplicate
hallo

where every term has an occurrence of 1.

Is this a bug? Am I doing something wrong?

All the best,
Simon
PS: here is the workflow:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
    <process expanded="true" height="811" width="435">
      <operator activated="true" class="text:create_document" compatibility="5.0.6" expanded="true" height="60" name="Create Document (8)" width="90" x="45" y="30">
        <parameter key="text" value="est dup hello"/>
        <parameter key="label_value" value="jmol"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.0.6" expanded="true" height="94" name="Process Documents (3)" width="90" x="315" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="datamanagement" value="double_array"/>
        <process expanded="true" height="811" width="1068">
          <operator activated="true" class="text:tokenize" compatibility="5.0.7" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <operator activated="true" class="text:replace_tokens" compatibility="5.0.6" expanded="true" height="60" name="Replace Tokens" width="90" x="514" y="30">
            <list key="replace_dictionary">
              <parameter key="est" value="Eastern_Time"/>
              <parameter key="dup" value="duplicate"/>
              <parameter key="hello" value="hallo"/>
            </list>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Replace Tokens" to_port="document"/>
          <connect from_op="Replace Tokens" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document (8)" from_port="output" to_op="Process Documents (3)" to_port="documents 1"/>
      <connect from_op="Process Documents (3)" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Answers
Hi,

thanks for this detailed report. I have found the problem: the documents delivered to the input ports were passed directly to the inner process. Since each document passes through the inner process twice, it was tokenized and replaced two times. Set a breakpoint before the Tokenize operator to see this effect.
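For illustration, here is a minimal plain-Java sketch (not the actual operator code) of what the second pass presumably does to the already-replaced tokens. It assumes the dictionary rules are applied as substring replacements and that the second Tokenize pass splits Eastern_Time at the underscore; the names DoubleReplaceDemo and replaceAll are made up for the example:

import java.util.LinkedHashMap;
import java.util.Map;

public class DoubleReplaceDemo {

    // apply every dictionary rule as a plain substring replacement
    private static String replaceAll(String token, Map<String, String> dictionary) {
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            token = token.replace(entry.getKey(), entry.getValue());
        }
        return token;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<String, String>();
        dictionary.put("est", "Eastern_Time");
        dictionary.put("dup", "duplicate");
        dictionary.put("hello", "hallo");

        for (String token : new String[] { "est", "dup", "hello" }) {
            String once = replaceAll(token, dictionary);
            String twice = replaceAll(once, dictionary);
            System.out.println(token + " -> " + once + " -> " + twice);
        }
        // est   -> Eastern_Time -> Eastern_Time    (the second Tokenize pass then
        //                                           splits it at the underscore)
        // dup   -> duplicate    -> duplicatelicate ("dup" matches inside "duplicate")
        // hello -> hallo        -> hallo           (unchanged, hence occurrence 1)
    }
}

If that reading is right, after the second pass only hallo still matches an entry in the word list, which would explain the occurrence counts of 0 for the other two terms.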
I have corrected this; the fix will be delivered with the next regular update.
Greetings,
Sebastian
Hi Sebastian,

I was searching for a workaround today and tried this within the DocumentTextInputOperator. At first sight it works, but I think I might be messing something up by doing it like that. Do you have advice for a hotfix? I need this feature really urgently.
@Override
protected Iterator<Document> getTextObjects() {
    List<Document> documents = documentInput.getData(true);
    List<Document> theDocuments = new ArrayList<Document>();
    for (Document document : documents) {
        // rebuild each document from its raw text so that the inner
        // process always receives a fresh, untokenized copy
        Document myDocument = new Document(document.getText());
        myDocument.addMetaData(document);
        theDocuments.add(myDocument);
    }
    return theDocuments.iterator();
}
All the best,
Simon
Hi,

try using this (and think about becoming an enterprise customer; then you would already have a new release):
@Override
protected Iterator<Document> getTextObjects() {
    List<Document> documents = documentInput.getData(true);
    List<Document> clonedDocuments = new ArrayList<Document>(documents.size());
    for (Document document : documents) {
        // clone the document instead of passing the original through,
        // keeping its token sequence and metadata intact
        clonedDocuments.add(new Document(document.getTokenSequence(), document));
    }
    return clonedDocuments.iterator();
}
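Compared with the earlier workaround, the difference (as far as I can tell) is that new Document(document.getText()) rebuilds the document from its raw text and so discards the token sequence, whereas this clone keeps the token sequence and copies the rest from the original document.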
Greetings,
Sebastian
Hi Sebastian,

thank you!!! I'll give it a try.

All the best,
Simon