"writing a collection of documents"
mohammadreza
New Altair Community Member
Hi all,
I read an XML file in my process and convert it to a collection of documents in memory. Now I need to write each document as a separate file. Is there any way to do that? (I can think of using "Write Document" in a loop, but I can't figure out the right way to do that.)
Best
Answers
Hi mohammadreza,
What about using either Documents to Data or Combine Documents first?
Hi Martin,
The data is already combined in one big XML file, so I am trying to break it down into several files and write them. The only remaining part is writing the document collection (which is in memory) to the hard drive. Thanks in advance. Here is my process so far:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_xml" compatibility="5.3.013" expanded="true" height="60" name="Read XML" width="90" x="45" y="30">
<parameter key="file" value="C:\home\ebrahimi\Anomaly\seg10Train.xml"/>
<parameter key="xpath_for_examples" value="conversations/conversation"/>
<enumeration key="xpaths_for_attributes">
<parameter key="xpath_for_attribute" value="@id"/>
<parameter key="xpath_for_attribute" value="message/text"/>
<parameter key="xpath_for_attribute" value="message/author"/>
</enumeration>
<list key="namespaces"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="5.3.002" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="30">
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights">
<parameter key="tex" value="1.0"/>
</list>
</operator>
<operator activated="true" class="text:filter_documents_by_content" compatibility="5.3.002" expanded="true" height="76" name="Filter Documents (by Content)" width="90" x="313" y="30">
<parameter key="condition" value="contains match"/>
<parameter key="regular_expression" value="&lt;/text&gt;&lt;text&gt;"/>
</operator>
<operator activated="true" class="loop_collection" compatibility="5.3.013" expanded="true" height="76" name="Loop Collection" width="90" x="447" y="30">
<process expanded="true">
<operator activated="true" class="text:write_document" compatibility="5.3.002" expanded="true" height="76" name="Write Document" width="90" x="112" y="30"/>
<connect from_port="single" to_op="Write Document" to_port="document"/>
<connect from_op="Write Document" from_port="document" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read XML" from_port="output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Filter Documents (by Content)" to_port="documents 1"/>
<connect from_op="Filter Documents (by Content)" from_port="documents" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
This is a very simple example. The trick is to pass the Write Document operator a filename and to set that filename using macros.
It is simple because it just uses the iteration number of the loop operator as the filename. I would recommend using either Extract Macro or Extract Macro from Annotation to get the name under which you'd like each file saved.
You might want to try the ID, the author, or a combination of the two.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_xml" compatibility="6.0.003" expanded="true" height="60" name="Read XML" width="90" x="45" y="30">
<parameter key="file" value="C:\home\ebrahimi\Anomaly\seg10Train.xml"/>
<parameter key="xpath_for_examples" value="conversations/conversation"/>
<enumeration key="xpaths_for_attributes">
<parameter key="xpath_for_attribute" value="@id"/>
<parameter key="xpath_for_attribute" value="message/text"/>
<parameter key="xpath_for_attribute" value="message/author"/>
</enumeration>
<list key="namespaces"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="30">
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights">
<parameter key="tex" value="1.0"/>
</list>
</operator>
<operator activated="true" class="text:filter_documents_by_content" compatibility="6.4.001" expanded="true" height="76" name="Filter Documents (by Content)" width="90" x="313" y="30">
<parameter key="condition" value="contains match"/>
<parameter key="regular_expression" value="&lt;/text&gt;&lt;text&gt;"/>
</operator>
<operator activated="true" class="loop_collection" compatibility="6.4.000" expanded="true" height="76" name="Loop Collection" width="90" x="447" y="30">
<parameter key="set_iteration_macro" value="true"/>
<process expanded="true">
<operator activated="true" class="text:write_document" compatibility="6.4.001" expanded="true" height="76" name="Write Document" width="90" x="179" y="30">
<parameter key="file" value="C:\home\ebrahimi\Anomaly\Output\%{iteration}.txt"/>
</operator>
<connect from_port="single" to_op="Write Document" to_port="document"/>
<connect from_op="Write Document" from_port="document" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read XML" from_port="output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Filter Documents (by Content)" to_port="documents 1"/>
<connect from_op="Filter Documents (by Content)" from_port="documents" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
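For readers who want to see what the `%{iteration}` placeholder in the process above amounts to, here is a rough Python sketch of the substitution idea. It is only an illustration, not RapidMiner's implementation: the helper name `expand_macros`, the choice to leave unknown macros untouched, and counting iterations from 1 are all assumptions made for the example.

```python
import re

def expand_macros(template, macros):
    """Replace each %{name} in template with macros[name].

    Unknown macro names are left as-is (an assumption of this sketch).
    """
    def substitute(match):
        name = match.group(1)
        # Fall back to the original %{name} text when no value is defined.
        return str(macros.get(name, match.group(0)))

    return re.sub(r"%\{([^}]+)\}", substitute, template)

# Same filename pattern as in the Write Document parameter above.
template = r"C:\home\ebrahimi\Anomaly\Output\%{iteration}.txt"

# One output path per loop iteration (assumed here to count from 1).
paths = [expand_macros(template, {"iteration": str(i)}) for i in range(1, 4)]
```

Replacing the `iteration` value with something extracted from the data (an ID, an author) gives data-dependent filenames in the same way.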
Hi Edward,
Thanks. As you correctly mentioned, I need to save each file under its own name (id). Following your explanation (using Extract Macro), I came up with the following process, but I do not know what to choose for the "example index" parameter of the Extract Macro operator so that the macro holds the "id" of each file.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_xml" compatibility="5.3.013" expanded="true" height="60" name="Read XML" width="90" x="45" y="30">
<parameter key="file" value="C:\home\ebrahimi\Anomaly\seg10Train.xml"/>
<parameter key="xpath_for_examples" value="conversations/conversation"/>
<enumeration key="xpaths_for_attributes">
<parameter key="xpath_for_attribute" value="@id"/>
<parameter key="xpath_for_attribute" value="message/text"/>
<parameter key="xpath_for_attribute" value="message/author"/>
</enumeration>
<list key="namespaces"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="5.3.002" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="30">
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights">
<parameter key="tex" value="1.0"/>
</list>
</operator>
<operator activated="true" class="text:filter_documents_by_content" compatibility="5.3.002" expanded="true" height="76" name="Filter Documents (by Content)" width="90" x="313" y="30">
<parameter key="condition" value="contains match"/>
<parameter key="regular_expression" value="&lt;/text&gt;&lt;text&gt;"/>
</operator>
<operator activated="true" class="extract_macro" compatibility="5.3.013" expanded="true" height="60" name="Extract Macro" width="90" x="447" y="30">
<parameter key="macro_type" value="data_value"/>
<list key="additional_macros"/>
</operator>
<operator activated="true" class="loop_collection" compatibility="5.3.013" expanded="true" height="76" name="Loop Collection" width="90" x="648" y="30">
<parameter key="set_iteration_macro" value="true"/>
<process expanded="true">
<operator activated="true" class="text:write_document" compatibility="5.3.002" expanded="true" height="76" name="Write Document" width="90" x="179" y="30">
<parameter key="file" value="C:\home\ebrahimi\Anomaly\Output\%{mine}.txt"/>
</operator>
<connect from_port="single" to_op="Write Document" to_port="document"/>
<connect from_op="Write Document" from_port="document" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read XML" from_port="output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Filter Documents (by Content)" to_port="documents 1"/>
<connect from_op="Filter Documents (by Content)" from_port="documents" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Hi,
Extract Macro can only be applied to example sets. So you might wrap everything in one big Loop Examples operator and extract the macro from each example before converting it to a document.
Best
Martin
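The order of steps suggested here (take the id from each record first, then turn the record into text and write it) can be sketched outside RapidMiner in Python. This is a minimal illustration under assumptions, not a replacement for the process: the XML layout (`conversations` > `conversation id="…"` > `message` > `author`/`text`) is inferred from the XPath settings in the posted Read XML operator, and the function name `write_conversations` and the "author: text" line format are made up for the example.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

def write_conversations(xml_text, out_dir):
    """Write one text file per <conversation>, named after its id attribute."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    root = ET.fromstring(xml_text)
    written = []
    for conversation in root.iter("conversation"):
        # Read the id while the record is still structured data.
        conv_id = conversation.get("id")
        if conv_id is None:
            continue  # nothing to name the file after
        # Only now flatten the record into plain text.
        lines = []
        for message in conversation.findall("message"):
            author = message.findtext("author", default="")
            text = message.findtext("text", default="")
            lines.append(f"{author}: {text}")
        # Replace characters that are unsafe in filenames.
        safe_name = re.sub(r"[^A-Za-z0-9._-]", "_", conv_id)
        target = out_path / f"{safe_name}.txt"
        target.write_text("\n".join(lines), encoding="utf-8")
        written.append(target)
    return written
```

Calling `write_conversations(Path("seg10Train.xml").read_text(encoding="utf-8"), "Output")` would then produce one `<id>.txt` file per conversation, provided the file really has the assumed layout.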