"Loading Adobe/Word into Rapidminer"
Hi All,
I want to load some Adobe documents into RapidMiner so I can calculate word frequencies. I can do this with Excel sheets, but I can't seem to load the Adobe documents. Please let me know which operators I need to load either Adobe or Word documents into RapidMiner to calculate word frequencies.
Thanks.
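Actually, reading DOCX is supported as well. A .docx file is a ZIP archive, so you can open it with Open File, loop over its entries, and read word/document.xml as XML. Please see this sample process.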
<?xml version="1.0" encoding="UTF-8"?>
<process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="7.4.000" expanded="true" height="68" name="Open File" width="90" x="246" y="187">
        <parameter key="filename" value="C:\Users\think\Documents\MyWordDoc.docx"/>
      </operator>
      <operator activated="true" class="loop_zipfile_entries" compatibility="7.4.000" expanded="true" height="82" name="Read Word Document" width="90" x="581" y="187">
        <parameter key="internal_directory" value="word"/>
        <parameter key="filter" value="document\.xml"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="7.4.001" expanded="true" height="68" name="Read Document" width="90" x="179" y="238">
            <parameter key="content_type" value="xml"/>
          </operator>
          <connect from_port="file object" to_op="Read Document" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_port="out 1"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Read Word Document" to_port="file"/>
      <connect from_op="Read Word Document" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
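If it helps to see why this works, here is a rough Python sketch of the same idea outside RapidMiner: treat the .docx as a ZIP archive, pull out word/document.xml, and count words in its text runs. This is only an illustration and not what RapidMiner does internally. The function name `docx_word_frequencies` is made up for this example, and the simple regex tokenizer is an assumption that will not match the tokenization of the Text Processing operators.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_word_frequencies(docx_file):
    """Count words in a .docx given a path or a binary file-like object.

    Hypothetical helper for illustration; the tokenization is deliberately naive.
    """
    # A .docx is a ZIP archive; the main body lives in word/document.xml
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)

    # Join the text runs (w:t) inside each paragraph (w:p) without a separator,
    # because Word may split a single word across several runs
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))

    # Separate paragraphs with a space so their words do not run together
    text = " ".join(paragraphs)

    # Assumption: lowercase everything and keep letters, digits, and apostrophes
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)
```

Calling `docx_word_frequencies(r"C:\Users\think\Documents\MyWordDoc.docx").most_common(10)` would list the ten most frequent words in the document used in the sample process.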
Huh, will you look at that. You taught me a new trick @JEdward! Thanks!
You can load PDF, TXT, HTML, and XML files only. DOCX is not supported.