How to process multiple MS Word files in RapidMiner?

kevinace
Dear all,
I want to process multiple MS Word files.

If I use 'Process Documents from Files' as per the tutorial, the file content looks corrupted. For example, for a file named helloworld.docx that contains only two words, "hello world", RapidMiner produces a bunch of unrelated words as output.
I understand I can use 'Read Office File' to read MS Word documents with their exact content; however, this operator can only be used for one file at a time.
How do I combine these two operators, or are there additional operators I could use? Neither 'Read Office File -> Process Documents from Files -> rest' nor 'Process Documents from Files -> Read Office File -> rest' seems logical.

My goal is to load a batch of MS Word files for readability analysis, for example using indexes such as SMOG and FOG to check the readability of a large amount of content, so that I can gather more data samples for a university research paper.

Thanks a lot!
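For readers unfamiliar with the two indexes named above, here is a rough Python sketch of the standard Gunning fog and SMOG formulas. It is not part of the RapidMiner process discussed in this thread. The function names `count_syllables` and `readability` are hypothetical, and the vowel-group syllable counter is a crude assumption, so the scores are only approximate. SMOG is normally defined on a 30-sentence sample, which is why the formula scales by 30 over the sentence count.

```python
import math
import re


def count_syllables(word):
    # Crude heuristic (an assumption): one syllable per run of vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text):
    # Split into sentences and words with simple regular expressions.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    # "Complex" words are those with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]

    # Gunning fog: 0.4 * (average sentence length + percent complex words).
    fog = 0.4 * (len(words) / len(sentences)
                 + 100 * len(complex_words) / len(words))
    # SMOG: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291.
    smog = 1.0430 * math.sqrt(len(complex_words) * 30 / len(sentences)) + 3.1291
    return {"fog": fog, "smog": smog}


scores = readability("Hello world. This is simple.")
```

The sketch assumes non-empty text with at least one sentence and one word; empty input would raise a division error.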

    Hi,
    Loop Files + Read Office are the two operators you need to combine.

    Best,
    Martin
    Dear Martin,

    How do I set up the parameters of the 'Loop Files' operator to load multiple MS Word files into RapidMiner?
    The setup I used is 'Loop Files' -> 'Read Office File' -> rest.
    Loop Files parameters:
    Directory: C:/Users/user/Downloads/t1
    Filter type: glob
    Filter by glob: .*doc
    Enable parallel execution: checked

    If the glob filter is .*doc, I get: "Not enough iterations: the minimum number of iterations must not be smaller than 1."
    If the glob filter is *.doc, I get: "Input is missing: the previous operator Loop Files did not produce any output."
    There are 3 files in the t1 folder: two .doc files and one .docx file.

    I also looked up on Google how to use Loop Files; however, the parameter settings shown in 2018 YouTube videos no longer seem to be valid for the current version.
    Looking forward to your reply.

    With thanks!

    Kevin

    Hi,
    Don't use glob, use regex instead. That should do the trick :)

    Best,
    Martin
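To illustrate why the filter type matters, here is a small Python sketch that compares how the patterns from this thread behave under glob and under regular-expression matching. It uses Python's `fnmatch` and `re` modules only as stand-ins; that RapidMiner matches the whole file name in the same way is an assumption, and the file names and helper functions are hypothetical.

```python
import fnmatch
import re

# Hypothetical file names standing in for the contents of the t1 folder.
files = ["a.doc", "b.doc", "c.docx"]


def glob_match(pattern, names):
    # Glob semantics: '*' matches any run of characters, '.' is a literal dot.
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def regex_match(pattern, names):
    # Regex semantics, matching the whole name: '.*' matches any characters.
    return [n for n in names if re.fullmatch(pattern, n)]


# As a glob, '.*doc' only matches names that start with a literal dot.
glob_dot_star = glob_match(".*doc", files)

# As a glob, '*.doc' matches the .doc files but not the .docx file.
glob_star_dot = glob_match("*.doc", files)

# As a regex, '.*doc' matches names ending in 'doc' (again not .docx).
regex_dot_star = regex_match(".*doc", files)

# A regex with an optional trailing 'x' covers both extensions.
regex_both = regex_match(r".*\.docx?", files)
```

Under these assumptions, `.*doc` used as a glob matches nothing, which would be consistent with the "not enough iterations" error reported above, while a regex such as `.*\.docx?` would pick up both .doc and .docx files.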
    kevinace (OP)
    Dear Martin,

    I tried what we discussed. What is still missing?
    Please see the attached screenshot.
    The Read Office File parameters are left at their defaults, with 'detect file type' enabled. Thanks.

    (There are only 2 .doc files in the t1 folder.)




    Hi,
    You want to put the Read Office File operator inside the Loop Files operator. An example process is attached:

    Best,
    Martin

    <?xml version="1.0" encoding="UTF-8"?><process version="9.8.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:loop_files" compatibility="9.8.000" expanded="true" height="82" name="Loop Files" width="90" x="514" y="34">
            <parameter key="filter_type" value="regex"/>
            <parameter key="filter_by_regex" value=".*docx"/>
            <parameter key="recursive" value="false"/>
            <parameter key="enable_macros" value="false"/>
            <parameter key="macro_for_file_name" value="file_name"/>
            <parameter key="macro_for_file_type" value="file_type"/>
            <parameter key="macro_for_folder_name" value="folder_name"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="operator_toolbox:read_word_files" compatibility="2.8.000-SNAPSHOT" expanded="true" height="68" name="Read Office File" width="90" x="246" y="34">
                <parameter key="detect_file_type" value="true"/>
                <parameter key="file_extension" value="docx"/>
              </operator>
              <connect from_port="file object" to_op="Read Office File" to_port="file"/>
              <connect from_op="Read Office File" from_port="doc" to_port="output 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">Add directory here</description>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="9.3.001" expanded="true" height="82" name="Documents to Data" width="90" x="715" y="34">
            <parameter key="add_meta_information" value="true"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="use_processed_text" value="false"/>
          </operator>
          <connect from_op="Loop Files" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>