How to tokenize documents?
I'm wondering if it's feasible to tokenize documents, as the Tokenize operator itself only offers options to tokenize on expressions and the like.
As an example, consider the following scenario: a collection of similar documents in a folder is loaded and combined into a single document using the Combine Documents operator. The Extract Token Number operator shows there are indeed n tokens in the document (where each token represents a loaded document), but there seems to be no way to loop through these tokens afterwards, or to split by token again later in the process.
Is this indeed not possible, or is there some cool but not very visible option that would allow me to tokenize combined documents?
Hi,
If I understand this correctly, you may have something like
"this is one text ; this is another text ; and here is yet another ;" and you want to break this up into separate documents, split on ";"?
If yes, an easy way to do this would be the Split into Collection operator in Operator Toolbox. @tftemme's original use case was JSON parsing, so you get each item of an array as a document.
I think there are some ways of doing this with regular regions in Cut Document as well.
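For illustration only, here is the same split idea as a plain Groovy sketch (just the logic, not the operator itself):

// Combined text where ";" separates the individual texts
String combined = "this is one text ; this is another text ; and here is yet another"

// Split on the separator and trim the surrounding whitespace
def docs = combined.split(";").collect { it.trim() }

docs.each { println it }   // prints the three texts as separate items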
Best,
Martin
Hi @mschmitz, not entirely. Splitting on sentences would be fairly easy to do, but I am talking about full-blown documents that can contain multiple sentences and paragraphs, so the typical line break / new line logic wouldn't work here.
When looking at the document result, RapidMiner shows the distinction between the documents (the color differs from doc to doc), but there is no real way to get at the actual tokens / separate docs, unless there is some super secret special symbol used by RapidMiner to mark where one doc (token) ends and the next one starts.
Hey @kayman ,
The Split into Collection operator has a setting to define the split token. If you set it to the token you tokenized on, it should work?
BR,
Martin
Hi @mschmitz, this still requires you to know what is actually separating the tokens, unless I am missing something.
Find attached a simplified example:
Just two nonsense documents stitched together for the exercise, but I want to split them back into separate documents. The document output shows the two documents as separate tokens, but splitting them into actual docs isn't that straightforward, it seems. I've tried a variety of regex expressions, but they always give me either single sentences or nothing, never the whole document.
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="UTF-8"/> <process expanded="true"> <operator activated="true" class="text:create_document" compatibility="8.2.000" expanded="true" height="68" name="Create Document" width="90" x="179" y="34"> <parameter key="text" value="This is document 1 contains some doc 1 blablabla"/> <parameter key="add label" value="false"/> <parameter key="label_type" value="nominal"/> </operator> <operator activated="true" class="text:create_document" compatibility="8.2.000" expanded="true" height="68" name="Create Document (2)" width="90" x="179" y="136"> <parameter key="text" value="This is document 2 contains some doc 2 blablabla"/> <parameter key="add label" value="false"/> <parameter key="label_type" value="nominal"/> </operator> <operator activated="true" class="text:combine_documents" compatibility="8.2.000" expanded="true" height="103" name="Combine Documents" width="90" x="380" y="34"/> <operator activated="true" class="operator_toolbox:split_document_into_collection" compatibility="2.1.000" expanded="true" height="82" name="Split Document into Collection" width="90" x="514" y="34"> <parameter key="split_string" value="\n"/> </operator> <connect from_op="Create Document" from_port="output" to_op="Combine Documents" to_port="documents 1"/> <connect from_op="Create Document (2)" from_port="output" to_op="Combine Documents" to_port="documents 2"/> <connect from_op="Combine Documents" from_port="document" to_op="Split Document into Collection" to_port="document"/> <connect from_op="Split Document into Collection" from_port="collection" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Hi @kayman ,
now I understand. There is nothing to split on, but you want to export the tokens themselves. Something like this Groovy script, I guess?
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="9.1.000-SNAPSHOT" expanded="true" height="68" name="Create Document" width="90" x="179" y="34">
        <parameter key="text" value="This is document 1 contains some doc 1 blablabla"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="9.1.000-SNAPSHOT" expanded="true" height="68" name="Create Document (2)" width="90" x="179" y="136">
        <parameter key="text" value="This is document 2 contains some doc 2 blablabla"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:combine_documents" compatibility="9.1.000-SNAPSHOT" expanded="true" height="103" name="Combine Documents" width="90" x="380" y="34"/>
      <operator activated="true" class="execute_script" compatibility="9.3.001" expanded="true" height="82" name="Execute Script" width="90" x="648" y="34">
        <parameter key="script" value="import java.util.ArrayList; import java.util.List; import com.rapidminer.operator.Operator; import com.rapidminer.operator.OperatorDescription; import com.rapidminer.operator.OperatorException; import com.rapidminer.operator.ports.InputPortExtender; import com.rapidminer.operator.ports.OutputPort; import com.rapidminer.operator.text.Document; import com.rapidminer.operator.text.Token; Document d = input[0]; IOObjectCollection<Document> result = new IOObjectCollection<>(); for( Token t : d.getTokenSequence()){ result.add(new Document( t.getToken())); } // You can add any code here // This line returns the first input as the first output return result;"/>
        <parameter key="standard_imports" value="true"/>
      </operator>
      <operator activated="false" class="operator_toolbox:split_document_into_collection" compatibility="2.3.000-SNAPSHOT" expanded="true" height="82" name="Split Document into Collection" width="90" x="648" y="340">
        <parameter key="split_string" value="\n"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Combine Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Combine Documents" to_port="documents 2"/>
      <connect from_op="Combine Documents" from_port="document" to_op="Execute Script" to_port="input 1"/>
      <connect from_op="Execute Script" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
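For readability, here is the Groovy script from the Execute Script operator above as a standalone block (unused imports trimmed; the operator's "standard imports" parameter is enabled in the process, which is what provides IOObjectCollection). It walks the token sequence of the combined document and turns each token back into its own Document:

import com.rapidminer.operator.text.Document;
import com.rapidminer.operator.text.Token;

// The combined document arrives on the first input port of Execute Script
Document d = input[0];

// Collect one new Document per token of the combined document
IOObjectCollection<Document> result = new IOObjectCollection<>();
for (Token t : d.getTokenSequence()) {
    result.add(new Document(t.getToken()));
}

// Return the collection of single-token documents as the operator's output
return result;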
Thanks @mschmitz, not as out-of-the-box as I was hoping for, but it does do the trick indeed.
@kayman so if I understand correctly, you're actually creating these documents by using Loop Files -> Read Document like this?

If so, I believe Loop Files actually inserts a space between docs (don't ask me why!) so you can just do this:
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="UTF-8"/> <process expanded="true"> <operator activated="true" class="concurrency:loop_files" compatibility="9.3.001" expanded="true" height="82" name="Loop Files" width="90" x="112" y="85"> <parameter key="directory" value="/Users/genzerconsulting/Desktop/documents"/> <parameter key="filter_type" value="regex"/> <parameter key="filter_by_regex" value=".*.txt"/> <parameter key="recursive" value="false"/> <parameter key="enable_macros" value="false"/> <parameter key="macro_for_file_name" value="file_name"/> <parameter key="macro_for_file_type" value="file_type"/> <parameter key="macro_for_folder_name" value="folder_name"/> <parameter key="reuse_results" value="false"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="text:read_document" compatibility="8.2.000" expanded="true" height="68" name="Read Document" width="90" x="112" y="34"> <parameter key="extract_text_only" value="true"/> <parameter key="use_file_extension_as_type" value="true"/> <parameter key="content_type" value="txt"/> <parameter key="encoding" value="SYSTEM"/> </operator> <connect from_port="file object" to_op="Read Document" to_port="file"/> <connect from_op="Read Document" from_port="output" to_port="output 1"/> <portSpacing port="source_file object" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <operator activated="true" class="text:combine_documents" compatibility="8.2.000" expanded="true" height="82" name="Combine Documents" width="90" x="246" y="85"/> <operator activated="true" class="operator_toolbox:split_document_into_collection" compatibility="2.1.000" expanded="true" height="82" name="Split Document into Collection" width="90" x="380" y="85"> <parameter key="split_string" value="\n\s"/> </operator> <connect from_op="Loop Files" from_port="output 1" to_op="Combine Documents" to_port="documents 1"/> <connect from_op="Combine Documents" from_port="document" to_op="Split Document into Collection" to_port="document"/> <connect from_op="Split Document into Collection" from_port="collection" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
The folder I used for this is attached below so you can match my process. You'll have to change the path of Loop Files of course.
Scott
Thanks @sgenzer, though it seems to work for external docs, it gets tricky if there are newlines followed by a space inside a given document. That happens quite a lot with HTML, for instance, so it's not really an option.
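To illustrate the concern, a hypothetical Groovy sketch (not part of the process above): if the separator is a newline followed by whitespace, any document whose own text contains such a sequence, as HTML-extracted text often does, gets split as well:

// Hypothetical example: two documents joined with a newline plus a space,
// where the second document's own text also contains a newline followed by a space
String combined = "doc one text\n doc two, line one\n doc two, line two"

// Splitting on newline + whitespace yields three fragments instead of two documents
def parts = combined.split("\\n\\s")
println parts.length   // 3, not 2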
@mschmitz is there another solution here?