NUMBER OF EXAMPLES DOES NOT CORRELATE WITH INPUT DATA LOADED FROM PDFs

antonio_heredia
antonio_heredia New Altair Community Member
edited November 2024 in Community Q&A
on="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<parameter key="logverbosity" value="all"/>
<process expanded="true">
<operator activated="true" class="text:process_document_from_file" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="45" y="136">
<list key="text_directories">
<parameter key="Forging vs AM" value="C:\Users\xwb15193\Desktop\L.R AM vs F\ScienceDirect\ScienceDirect_articles_04Jul2018_11-57-34.507"/>
</list>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
<parameter key="mode" value="linguistic tokens"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="246" y="34"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
<description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="77" y="85">Type your comment</description>
<description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="99" y="169">Type your comment</description>
</process>
</operator>
<operator activated="false" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="289">
<list key="filters_list">
<parameter key="filters_entry_key" value="label.contains.and"/>
</list>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

I tried to tokenize PDF articles, but the result contains only 21 examples. Why does this happen? There should be many more. I used "Process Documents from Files" and, inside it, "Tokenize" and "Filter Stopwords (English)". This works, but not across all the documents. What should I do to fix it?

Cheers,

Antonio

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi @antonio_heredia,

    Do you have a lot of files?

    Can you share these files so that we can reproduce what you observe?

    Regards,

    Lionel

    NB: The first line of your XML process is broken; however, I was able to repair it.
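
One way to answer the "how many files" question is to count what is actually in the folder and compare it with the 21 examples. This assumes the operator produces one example per document it manages to read, so a mismatch would point to files that were skipped or not read. A minimal Python sketch follows; the function name and the placeholder path are hypothetical and not part of the process above, and whether subfolders are read is something to verify, which is why recursion is optional here.

```python
from collections import Counter
from pathlib import Path


def count_files_by_extension(directory, recursive=False):
    """Count regular files in `directory`, grouped by lowercase extension.

    With recursive=True, files in subfolders are counted as well.
    """
    root = Path(directory)
    paths = root.rglob("*") if recursive else root.glob("*")
    counts = Counter()
    for path in paths:
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical placeholder: use the folder set under "text_directories".
    folder = Path("path/to/your/pdf/folder")
    if folder.is_dir():
        counts = count_files_by_extension(folder)
        print("PDF files:", counts[".pdf"])
        print("All files:", sum(counts.values()))
```

If the PDF count is well above 21, the next thing to look at would be which files are missing from the result, for example scanned PDFs without a text layer.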
