"filter by upper case letter?"

erocoar
erocoar New Altair Community Member
edited November 2024 in Community Q&A
Hey there,

I recently installed RapidMiner for a university project. I have only worked with R so far, so this is quite new and challenging for me.
I want to extract text from newspaper front pages as part of analyzing agenda setting in German politics.

My question is whether it is possible to filter tokens by an initial upper-case letter. German nouns start with an upper-case letter, and I would like to filter on that. Unfortunately, I have no idea how to do that. Any help is appreciated :)

Answers

  • JEdward
    JEdward New Altair Community Member
    It's a bit early for me today, but you should be able to do it with Filter Tokens (by Content) and a regular expression.

    Don't be scared of regular expressions; this one is especially straightforward.
    - ^ anchors the match at the beginning of the text; as you are filtering within the tokens, the start should be the first character of each token
    - [A-Z] means any uppercase letter between A and Z (this ASCII range does not include Ä, Ö or Ü)
    - . (dot) means any character at all
    - * (asterisk) means any number of the preceding element (in this case the dot)
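
To see what the pattern keeps before running it in RapidMiner, here is a minimal Python sketch of the same idea. It is not RapidMiner code: the helper names `starts_upper_ascii` and `starts_upper_unicode` are made up for illustration, and it assumes that the "matches" condition requires the whole token to match the expression. The second helper shows one way to also catch tokens starting with Ä, Ö or Ü, which `[A-Z]` misses.

```python
import re

# Same pattern as in the process: the token must start with an ASCII uppercase letter.
ASCII_UPPER = re.compile(r"^[A-Z].*")


def starts_upper_ascii(token: str) -> bool:
    """True if the whole token matches ^[A-Z].* (assumed 'matches' semantics)."""
    return ASCII_UPPER.fullmatch(token) is not None


def starts_upper_unicode(token: str) -> bool:
    """True if the first character is uppercase in the Unicode sense (covers Ä, Ö, Ü)."""
    return token[:1].isupper()


# Whitespace splitting stands in for the Tokenize operator here.
tokens = "this is Some text with Capital Letters and mixed with nonCapital letters".split()
kept = [t for t in tokens if starts_upper_ascii(t)]
print(kept)  # ['Some', 'Capital', 'Letters']

german = ["die", "Ärzte", "und", "Öffnung", "Test"]
print([t for t in german if starts_upper_ascii(t)])    # ['Test']
print([t for t in german if starts_upper_unicode(t)])  # ['Ärzte', 'Öffnung', 'Test']
```

If RapidMiner evaluates the expression with Java-style regular expressions (an assumption here), a Unicode class such as `\p{Lu}` in place of `[A-Z]` would be the analogous change inside the operator.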

    Have a play with the example below: copy and paste the XML into the XML view of RapidMiner and press the green tick to load it.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="6.4.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="210">
            <parameter key="text" value="this is Some text with Capital Letters and mixed with nonCapital letters. "/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="6.4.001" expanded="true" height="94" name="Process Documents" width="90" x="179" y="120">
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="120">
                <parameter key="mode" value="linguistic tokens"/>
                <parameter key="language" value="German"/>
              </operator>
              <operator activated="true" class="text:filter_tokens_by_content" compatibility="6.4.001" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="246" y="120">
                <parameter key="condition" value="matches"/>
                <parameter key="regular_expression" value="^[A-Z].*"/>
                <parameter key="case_sensitive" value="true"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
              <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi erocoar,

    If you are interested in German nouns, you can use Filter Tokens (by POS Tags) as well. There you can search specifically for nouns, adjectives etc. German and English are supported. The process below uses it to get the nouns out of the document. Of course you can also use this inside Process Documents. Further details on the tag syntax are available at: http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html

    ~Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.5.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="6.5.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="165">
            <parameter key="text" value="Dies ist ein Test."/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="6.5.000" expanded="true" height="60" name="Tokenize" width="90" x="246" y="165"/>
          <operator activated="true" class="text:filter_tokens_by_pos" compatibility="6.5.000" expanded="true" height="60" name="Filter Tokens (by POS Tags)" width="90" x="447" y="165">
            <parameter key="language" value="German"/>
            <parameter key="expression" value="NN"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by POS Tags)" to_port="document"/>
          <connect from_op="Filter Tokens (by POS Tags)" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
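
As a rough illustration of what filtering on the tag expression does, here is a small Python sketch. It does not run a tagger: the STTS tags are assigned by hand for the example sentences, `filter_by_pos` is a hypothetical helper, and it assumes the expression must match the whole tag. In STTS, NN marks common nouns and NE marks proper nouns.

```python
import re


def filter_by_pos(tagged_tokens, expression):
    """Keep tokens whose POS tag fully matches the regular expression."""
    pattern = re.compile(expression)
    return [token for token, tag in tagged_tokens if pattern.fullmatch(tag)]


# Hand-assigned STTS tags for "Dies ist ein Test." (illustrative, not tagger output).
tagged = [("Dies", "PDS"), ("ist", "VAFIN"), ("ein", "ART"), ("Test", "NN"), (".", "$.")]
nouns = filter_by_pos(tagged, "NN")
print(nouns)  # ['Test']

# Proper nouns carry the tag NE, so an alternation keeps both kinds of noun.
tagged2 = [("Merkel", "NE"), ("besucht", "VVFIN"), ("Paris", "NE"), ("im", "APPRART"), ("Mai", "NN")]
all_nouns = filter_by_pos(tagged2, "NN|NE")
print(all_nouns)  # ['Merkel', 'Paris', 'Mai']
```

Compared with the upper-case filter, this keeps nouns only and drops capitalised sentence-initial words such as "Dies".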
  • erocoar
    erocoar New Altair Community Member
    Oh amazing! Thank you so much :) This really helps a lot. JEdward, how did you manage to switch Filter Tokens from string to regular expression?