Can I use POS expressions (chunks) with the text mining operators?
Hi there,
I know I can use the Filter Tokens (by POS Tags) operator to filter out single POS tags, but how would I generate chunks?
I am, for instance, interested in combinations of adjectives and nouns, or in noun sequences, but this does not seem to work for me.
Let's assume I have a dummy sentence like this one: "I have a broken computer, there is no picture, this thing sucks"
I would like to chunk this using, for instance, (JJ.* NN.*+)|(DT NN.*+)|NN.*+
-> so either an adjective followed by noun(s), or a determiner followed by a noun, or a simple noun phrase
so after some further processing my output would become something like [broken computer], [no picture], [thing sucks]
But the operator seems to accept single POS tags only. Is this correct or am I doing it completely wrong?
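For reference, outside RapidMiner this is the kind of chunking I mean, sketched in plain Python with NLTK (just an illustration; it assumes the NLTK tokenizer and POS tagger models are installed):

import nltk

# the dummy sentence from above
sentence = "I have a broken computer, there is no picture, this thing sucks"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# chunk grammar roughly equivalent to (JJ.* NN.*+)|(DT NN.*+)|NN.*+
grammar = r"""
NP: {<JJ.*><NN.*>+}   # adjective followed by one or more nouns
    {<DT><NN.*>+}     # determiner followed by one or more nouns
    {<NN.*>+}         # plain noun sequence
"""
chunks = nltk.RegexpParser(grammar).parse(tagged)
for subtree in chunks.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))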
Answers
-
What is your tokenizer set to? Non letters? Have you tried setting it to linguistic sentences?
0 -
Hi Thomas,
As I was testing on single, handpicked sentences, I did not use a tokenizer yet. The POS operator works pretty well when selecting single POS tags (like JJ.*|NN.*) but seems unable to handle sequences (so any JJ followed by an NN, for example).
I can do the same with a Python operator, so I am not really stuck if RM does not support it; it would just be nice to be able to do it with the standard operators. Maybe something for the next version?
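For what it's worth, the Python workaround looks roughly like the sketch below (inside an Execute Python operator; this assumes NLTK and its tagger models are available in the Python environment RapidMiner uses, and that the text attribute is called "Text"):

import nltk

# same chunk grammar idea as in my first post
GRAMMAR = r"""
NP: {<JJ.*><NN.*>+}
    {<DT><NN.*>+}
    {<NN.*>+}
"""
parser = nltk.RegexpParser(GRAMMAR)

def extract_chunks(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(str(text)))
    tree = parser.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]

# Execute Python calls rm_main with the input example set as a pandas
# DataFrame and turns the returned DataFrame back into an example set
def rm_main(data):
    # "Text" is the name of my text attribute; adjust to your own data
    data["chunks"] = data["Text"].apply(lambda t: ", ".join(extract_chunks(t)))
    return data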
Or I may be having issues with the syntax, not too sure about that one either.
0 -
It should be able to do it based on your regex structure, but I think it needs to operate inside the Process Documents from Data operator with a Tokenize operator set to linguistic sentences.
Try this:
<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="112" y="34">
<parameter key="connection" value="ThomasOtt"/>
<parameter key="query" value="#iphone"/>
<parameter key="limit" value="10"/>
<parameter key="language" value="en"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.4.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.4.000" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.4.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="34">
<parameter key="prune_method" value="percentual"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="linguistic sentences"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_pos" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by POS Tags)" width="90" x="179" y="34">
<parameter key="expression" value="JJ.*|NN.*"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by POS Tags)" to_port="document"/>
<connect from_op="Filter Tokens (by POS Tags)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
0 -
Hi Thomas,
Maybe I was not clear enough. The example you show works, but it filters on either a JJ or an NN tag, not on a sequence of them. The OR worked for me as well (even without tokenizing on sentences), but I need more of an AND scenario.
What I need to achieve is a filter on JJ, but only if it is followed by one or more NN (or other combinations). Assume I have the following sentences:
"hello what do I need to be able to group multiple pos tokens? Can I use regular groups or is that too complex?"
Using a chunk rule like <JJ><NN.*>+ in Python would return ['multiple pos tokens', 'regular groups'].
Using the expression JJ.*|NN.*, as in the RM example, correctly returns
['able', 'group', 'multiple', 'pos', 'tokens', 'regular', 'groups', 'complex']
So the option to group POS tags would provide much more powerful options, but given that an expression like JJ NN.* returns an empty match, I assume this is not possible.
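To make the difference concrete, here is a rough side-by-side sketch in plain Python with NLTK (only an illustration; the exact tokens you get back depend on the tagger):

import re
import nltk

text = ("hello what do I need to be able to group multiple pos tokens? "
        "Can I use regular groups or is that too complex?")
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# single-tag filtering, i.e. what JJ.*|NN.* does: keep any token whose tag matches
print([word for word, tag in tagged if re.fullmatch(r"JJ.*|NN.*", tag)])

# sequence chunking, i.e. what I am after: an adjective followed by noun(s)
chunks = nltk.RegexpParser(r"NP: {<JJ><NN.*>+}").parse(tagged)
print([" ".join(word for word, tag in subtree.leaves())
       for subtree in chunks.subtrees(lambda t: t.label() == "NP")])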
Hope this makes it clearer.
1 -
Hmm, in this case I'm stumped. Maybe @mschmitz has an idea.
0 -
Phew, this is rather a question for @hhomburg or @RalfKlinkenberg
0