text mining pdf articles omitting references

mlubicz New Altair Community Member
edited November 5 in Community Q&A
A previous post (https://community.rapidminer.com/discussion/53107/text-mining-of-multiple-pdf-files-with-separate-key-word-counts) described an approach for mining multiple PDF files.
If the PDFs are articles, is there a way to exclude the References section from being mined? The section usually starts with the same term (e.g. 'References'), so I tried to define a split or a specific Tokenize option, but without success.
I would be grateful for any suggestions.

Best Answers

  • kayman
    kayman New Altair Community Member
    Answer ✓
    Yeah, an option to filter documents based on content would be nice, but as far as I know it's not available.

    A workaround could be as follows: use the Documents to Data operator and filter on the reference keyword. Then convert back to documents (or carry on working with it as data).

    A very simplified example is attached; it might get you started.
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
            <parameter key="text" value="this one contains reference"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document (2)" width="90" x="112" y="136">
            <parameter key="text" value="this one doesn't"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document (3)" width="90" x="112" y="238">
            <parameter key="text" value="this one doesn't either"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="124" name="Documents to Data" width="90" x="246" y="34">
            <parameter key="text_attribute" value="myDoc"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="9.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="380" y="34">
            <parameter key="parameter_expression" value=""/>
            <parameter key="condition_class" value="custom_filters"/>
            <parameter key="invert_filter" value="false"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="myDoc.does_not_contain.reference"/>
            </list>
            <parameter key="filters_logic_and" value="true"/>
            <parameter key="filters_check_metadata" value="true"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="514" y="34">
            <parameter key="select_attributes_and_weights" value="false"/>
            <list key="specify_weights"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Create Document (2)" from_port="output" to_op="Documents to Data" to_port="documents 2"/>
          <connect from_op="Create Document (3)" from_port="output" to_op="Documents to Data" to_port="documents 3"/>
          <connect from_op="Documents to Data" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    


  • mlubicz
    mlubicz New Altair Community Member
    Answer ✓
    Thank you for the inspiration. In fact, the task was to split each PDF document into main text and references and to run text mining on the main text only, while the references should be saved as an example set (e.g. xlsx) as a desirable by-product.
    I experimented with Split File by Content and Split File by Point, which do the same job; however, it is more convenient to have one file rather than multiple segments.
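
The split described above (main text on one side, reference list on the other) can be sketched in plain Python. This is only an illustration of the logic, not a RapidMiner process or operator: the names `HEADING` and `split_references` are hypothetical, and it assumes the PDF text has already been extracted to a string and that the reference list starts at a line containing only a heading such as "References" or "Bibliography".

```python
import re

# Assumption: the reference list begins at a line that holds nothing but a
# heading like "References" or "Bibliography", optionally numbered ("7. References").
HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?(?:references|bibliography)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_references(text):
    """Return (main_text, references); references is '' when no heading is found."""
    matches = list(HEADING.finditer(text))
    if not matches:
        return text, ""
    # Use the last matching heading so that an earlier heading-like line
    # (for example in a table of contents) does not cut the article short.
    last = matches[-1]
    main_text = text[:last.start()].rstrip()
    references = text[last.end():].strip()
    return main_text, references
```

The first element of the returned pair could then go into the text mining step, while the second could be stored separately. A word such as "references" inside a normal sentence does not trigger the split, because the pattern only matches a line that consists of the heading alone.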
