Cannot filter tokens (sentences) by content using regular expressions

KateSh
KateSh New Altair Community Member
edited November 2024 in Community Q&A
Hello everyone!
I'm new to text mining. A very simple task has turned out to be unsolvable for me :(

I have 50 PDF documents in English. From them I need to extract the sentences that contain at least one modal verb (for further analysis).
Inside the "Process Documents from Files" operator I added a "Tokenize" operator (mode: linguistic sentences) and a "Filter Tokens (by Content)" operator. In "Filter Tokens (by Content)" I entered the verbs separated by vertical bars with no spaces, but it doesn't work: the results are empty. It works fine if I enter only one verb, but not if I enter several verbs separated by vertical bars. I tried every condition the operator offers, and none of them works.
I will be very grateful for help!
Here is my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="313" y="120">
        <list key="text_directories">
          <parameter key="pdf" value="D:\Все\УЧЁБА\ВКР\Материал\Оригинальные"/>
        </list>
        <parameter key="file_pattern" value="*pdf"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="179" y="120">
            <parameter key="mode" value="linguistic sentences"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="380" y="120">
            <parameter key="condition" value="matches"/>
            <parameter key="string" value="can|could|may|might"/>
            <parameter key="regular_expression" value="can|could|may|might"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>


Best Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Answer ✓
    Hi @KateSh,

    Did you try the condition "contains" instead of "matches"?

    Regards,

    Lionel
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Answer ✓
    Hi again @KateSh,

    Otherwise, did you try the Filter Tokens Using Example Set operator? Check the tutorial process for this operator.

    Regards,

    Lionel
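
A note on the suggestions above, as a sketch rather than a confirmed fix: the empty result in the original process is consistent with the "matches" condition testing the whole token against the regular expression. With sentence tokens, can|could|may|might would then only keep a sentence that consists of exactly one of those words. Assuming that behaviour, and assuming the operator uses Java regular expression syntax, the pattern can be widened so that it covers the whole sentence. The fragment below reuses the operator from the question and changes only the regular_expression value; nothing here has been verified against version 5.3.

```xml
<!-- Sketch only: stands in for the Filter Tokens (by Content) operator in the question's process. -->
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="380" y="120">
  <!-- Assumption: "matches" requires the regex to match the entire token, here a full sentence. -->
  <parameter key="condition" value="matches"/>
  <!-- (?s) lets . cross line breaks, (?i) ignores case, \b stops "can" from matching inside "scan". -->
  <parameter key="regular_expression" value="(?is).*\b(can|could|may|might)\b.*"/>
</operator>
```

The leading and trailing .* absorb the rest of the sentence, so any sentence containing one of the listed verbs as a separate word should pass the filter. If your version of the operator lists a "contains match" condition, it may do the same job with the plain can|could|may|might pattern; the thread does not confirm this.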

Answers

  • KateSh
    KateSh New Altair Community Member
    Thank you so much, it helped!
    (Sorry I didn't answer earlier, I was very busy yesterday)
