Strange behavior of the Replace Tokens operator
simon_knoll
New Altair Community Member
Hello all,

I have a workflow containing a Create Document operator and a Process Documents operator. The Process Documents operator contains a Tokenize and a Replace Tokens operator. The Replace Tokens operator has the following rules:

replace est with Eastern_Time
replace dup with duplicate
replace hello with hallo

The Process Documents vector creation is set to Term Occurrences. The Create Document text is:

est
dup
hello

The created word vector now contains:

Eastern_Time
duplicate
hallo

And now comes the strange thing: Eastern_Time and duplicate have occurrence 0, and hallo has occurrence 1. I expected a vector where every term has occurrence 1.
If I exchange the Process Documents operator for the Process Documents from Files operator and write the words

est
dup
hello

into a text file, I get the expected behavior: a vector containing

Eastern_Time
duplicate
hallo

where every term has an occurrence of 1.

Is this a bug? Am I doing something wrong?

All the best,
Simon
PS: here is the workflow:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
    <process expanded="true" height="811" width="435">
      <operator activated="true" class="text:create_document" compatibility="5.0.6" expanded="true" height="60" name="Create Document (8)" width="90" x="45" y="30">
        <parameter key="text" value="est dup hello"/>
        <parameter key="label_value" value="jmol"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.0.6" expanded="true" height="94" name="Process Documents (3)" width="90" x="315" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="datamanagement" value="double_array"/>
        <process expanded="true" height="811" width="1068">
          <operator activated="true" class="text:tokenize" compatibility="5.0.7" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <operator activated="true" class="text:replace_tokens" compatibility="5.0.6" expanded="true" height="60" name="Replace Tokens" width="90" x="514" y="30">
            <list key="replace_dictionary">
              <parameter key="est" value="Eastern_Time"/>
              <parameter key="dup" value="duplicate"/>
              <parameter key="hello" value="hallo"/>
            </list>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Replace Tokens" to_port="document"/>
          <connect from_op="Replace Tokens" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document (8)" from_port="output" to_op="Process Documents (3)" to_port="documents 1"/>
      <connect from_op="Process Documents (3)" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Answers
Hi,

thanks for this detailed report. I have found the problem: the documents delivered to the input ports were passed directly to the inner process. Since each document passes through the inner process twice, it was tokenized and replaced two times. Set a breakpoint before the Tokenize operator to see this effect.
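For illustration, here is a minimal plain-Java sketch (not the actual operator code) of what the second pass presumably does to the already-replaced tokens. It assumes the dictionary rules are applied as substring replacements and that the second Tokenize pass splits Eastern_Time at the underscore; the names DoubleReplaceDemo and replaceAll are made up for the example:

import java.util.LinkedHashMap;
import java.util.Map;

public class DoubleReplaceDemo {

    // apply every dictionary rule as a plain substring replacement
    private static String replaceAll(String token, Map<String, String> dictionary) {
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            token = token.replace(entry.getKey(), entry.getValue());
        }
        return token;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<String, String>();
        dictionary.put("est", "Eastern_Time");
        dictionary.put("dup", "duplicate");
        dictionary.put("hello", "hallo");

        for (String token : new String[] { "est", "dup", "hello" }) {
            String once = replaceAll(token, dictionary);
            String twice = replaceAll(once, dictionary);
            System.out.println(token + " -> " + once + " -> " + twice);
        }
        // est   -> Eastern_Time -> Eastern_Time    (the second Tokenize pass then
        //                                           splits it at the underscore)
        // dup   -> duplicate    -> duplicatelicate ("dup" matches inside "duplicate")
        // hello -> hallo        -> hallo           (unchanged, hence occurrence 1)
    }
}

If that reading is right, after the second pass only hallo still matches an entry in the word list, which would explain the occurrence counts of 0 for the other two terms.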
I have corrected this; the fix will be delivered with the next regular update.
Greetings,
Sebastian
Hi Sebastian,

I was searching for a workaround today and tried this within the DocumentTextInputOperator. At first sight it works, but I think I might be messing something up by doing it like that. Do you have advice for a hotfix? I need this feature really urgently.
@Override
protected Iterator<Document> getTextObjects() {
    List<Document> documents = documentInput.getData(true);
    List<Document> theDocuments = new ArrayList<Document>();
    for (Document document : documents) {
        // rebuild each document from its raw text so that the inner
        // process always receives a fresh, untokenized copy
        Document myDocument = new Document(document.getText());
        myDocument.addMetaData(document);
        theDocuments.add(myDocument);
    }
    return theDocuments.iterator();
}
All the best,
Simon
Hi,

try using this (and think about becoming an enterprise customer; then you would already have a new release):
@Override
protected Iterator<Document> getTextObjects() {
    List<Document> documents = documentInput.getData(true);
    List<Document> clonedDocuments = new ArrayList<Document>(documents.size());
    for (Document document : documents) {
        // clone the document instead of passing the original through,
        // keeping its token sequence and metadata intact
        clonedDocuments.add(new Document(document.getTokenSequence(), document));
    }
    return clonedDocuments.iterator();
}
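Compared with the earlier workaround, the difference (as far as I can tell) is that new Document(document.getText()) rebuilds the document from its raw text and so discards the token sequence, whereas this clone keeps the token sequence and copies the rest from the original document.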
Greetings,
Sebastian
Hi Sebastian,

thank you!!! I'll give it a try.

All the best,
Simon