How to tokenize documents?

User: "kayman"
New Altair Community Member
Updated by Jocelyn
I'm wondering whether it's feasible to tokenize documents, as the Tokenize operator itself only offers options to tokenize on expressions and the like.

As an example, consider the following scenario: a collection of similar documents in a folder is loaded and combined into a single document using the Combine Documents operator. The Extract Token Number operator shows that there are indeed n tokens in the document (where each token represents a loaded document), but there seems to be no option to loop through these tokens afterwards, or to split by token again later in the process.

Is this indeed not possible, or is there some cool but not very visible option that would allow me to tokenize on combined documents?

    User: "MartinLiebig"
    Altair Employee
    Accepted Answer
    Hi @kayman,
    Now I understand. There is nothing to split on; you want to export the tokens themselves. Something like this Groovy script, I guess?

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="9.1.000-SNAPSHOT" expanded="true" height="68" name="Create Document" width="90" x="179" y="34">
            <parameter key="text" value="This is document 1&#10;contains some doc 1 blablabla"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="9.1.000-SNAPSHOT" expanded="true" height="68" name="Create Document (2)" width="90" x="179" y="136">
            <parameter key="text" value="This is document 2&#10;contains some doc 2 blablabla"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:combine_documents" compatibility="9.1.000-SNAPSHOT" expanded="true" height="103" name="Combine Documents" width="90" x="380" y="34"/>
          <operator activated="true" class="execute_script" compatibility="9.3.001" expanded="true" height="82" name="Execute Script" width="90" x="648" y="34">
            <parameter key="script" value="import java.util.ArrayList;&#10;import java.util.List;&#10;&#10;import com.rapidminer.operator.Operator;&#10;import com.rapidminer.operator.OperatorDescription;&#10;import com.rapidminer.operator.OperatorException;&#10;import com.rapidminer.operator.ports.InputPortExtender;&#10;import com.rapidminer.operator.ports.OutputPort;&#10;import com.rapidminer.operator.text.Document;&#10;import com.rapidminer.operator.text.Token;&#10;&#10;Document d = input[0];&#10;&#10;IOObjectCollection&lt;Document&gt; result = new IOObjectCollection&lt;&gt;();&#10;for( Token t : d.getTokenSequence()){&#10;&#9;result.add(new Document( t.getToken()));&#10;}&#10;&#10;// You can add any code here&#10;&#10;&#10;// This line returns the first input as the first output&#10;return result;"/>
            <parameter key="standard_imports" value="true"/>
          </operator>
          <operator activated="false" class="operator_toolbox:split_document_into_collection" compatibility="2.3.000-SNAPSHOT" expanded="true" height="82" name="Split Document into Collection" width="90" x="648" y="340">
            <parameter key="split_string" value="\n"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Combine Documents" to_port="documents 1"/>
          <connect from_op="Create Document (2)" from_port="output" to_op="Combine Documents" to_port="documents 2"/>
          <connect from_op="Combine Documents" from_port="document" to_op="Execute Script" to_port="input 1"/>
          <connect from_op="Execute Script" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
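
    Since the script is hard to read inside the escaped XML attribute, here is a rough standalone sketch of the same idea in plain Java: walk over the tokens of the combined document and wrap each one in a document of its own. Note that SketchDocument, TokenSplitSketch and splitByToken are made-up stand-ins for illustration only, not classes or methods of the Text Processing extension, so this runs without RapidMiner but is not a drop-in replacement for the script.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the extension's Document class.
// It only stores token texts; the real class carries more state.
final class SketchDocument {
    private final List<String> tokens;

    SketchDocument(List<String> tokens) {
        this.tokens = new ArrayList<>(tokens);
    }

    List<String> getTokens() {
        return tokens;
    }
}

final class TokenSplitSketch {

    // Mirrors the loop in the script: one new document per token.
    static List<SketchDocument> splitByToken(SketchDocument combined) {
        List<SketchDocument> result = new ArrayList<>();
        for (String token : combined.getTokens()) {
            result.add(new SketchDocument(Collections.singletonList(token)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption taken from the question: after combining,
        // each token holds the text of one original document.
        SketchDocument combined = new SketchDocument(Arrays.asList(
                "This is document 1\ncontains some doc 1 blablabla",
                "This is document 2\ncontains some doc 2 blablabla"));

        List<SketchDocument> parts = splitByToken(combined);
        System.out.println("documents: " + parts.size());
        for (SketchDocument part : parts) {
            System.out.println(part.getTokens().get(0));
        }
    }
}
```

    In the actual process the same loop runs inside Execute Script on the output of Combine Documents, and the resulting collection is what gets delivered at the result port.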