Can I use POS expressions (chunks) with the text mining operators?
Hi there,
I know I can use the Filter Tokens (by POS Tags) operator to filter out single POS tags, but how would I generate chunks?
I am, for instance, interested in combinations of adjectives and nouns, or in noun sequences, but this does not seem to work for me.
Let's assume I have a dummy sentence like this one: "I have a broken computer, there is no picture, this thing sucks"
I would like to chunk this using, for instance, (JJ.* NN.*+)|(DT NN.*+)|NN.*+
-> so either an adjective followed by noun(s), or a determiner followed by a noun, or a simple noun phrase
so after some further processing my output would become something like [broken computer], [no picture], [thing sucks]
But the operator seems to accept single POS tags only. Is this correct or am I doing it completely wrong?
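For reference, outside RapidMiner this is the kind of chunking I mean, sketched in plain Python with NLTK (just an illustration; it assumes the NLTK tokenizer and POS tagger models are installed):

import nltk

# the dummy sentence from above
sentence = "I have a broken computer, there is no picture, this thing sucks"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# chunk grammar roughly equivalent to (JJ.* NN.*+)|(DT NN.*+)|NN.*+
grammar = r"""
NP: {<JJ.*><NN.*>+}   # adjective followed by one or more nouns
    {<DT><NN.*>+}     # determiner followed by one or more nouns
    {<NN.*>+}         # plain noun sequence
"""
chunks = nltk.RegexpParser(grammar).parse(tagged)
for subtree in chunks.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))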
Answers
-
What is your tokenizer set to? Non letters? Have you tried setting it to linguistic sentences?
0 -
Hi Thomas,
As I was testing on single, handpicked sentences, I did not use a tokenizer yet. The POS operator works pretty well when selecting single POS tags (like JJ.*|NN.*) but seems unable to handle sequences (so any JJ followed by an NN, for example).
I can do the same with a Python operator, so I am not really stuck if RM does not support it; it would just be nice to be able to do it with the standard operators. Maybe something for the next version?
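For what it's worth, the Python workaround looks roughly like the sketch below (inside an Execute Python operator; this assumes NLTK and its tagger models are available in the Python environment RapidMiner uses, and that the text attribute is called "Text"):

import nltk

# same chunk grammar idea as in my first post
GRAMMAR = r"""
NP: {<JJ.*><NN.*>+}
    {<DT><NN.*>+}
    {<NN.*>+}
"""
parser = nltk.RegexpParser(GRAMMAR)

def extract_chunks(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(str(text)))
    tree = parser.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]

# Execute Python calls rm_main with the input example set as a pandas
# DataFrame and turns the returned DataFrame back into an example set
def rm_main(data):
    # "Text" is the name of my text attribute; adjust to your own data
    data["chunks"] = data["Text"].apply(lambda t: ", ".join(extract_chunks(t)))
    return data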
Or I may be having issues with the syntax, not too sure about that one either.
0 -
It should be able to do it based on your regex structure, but I think it needs to operate inside the Process Documents from Data operator with a Tokenize operator set to linguistic sentences.
Try this:
<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="112" y="34">
<parameter key="connection" value="ThomasOtt"/>
<parameter key="query" value="#iphone"/>
<parameter key="limit" value="10"/>
<parameter key="language" value="en"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.4.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.4.000" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.4.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="34">
<parameter key="prune_method" value="percentual"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="linguistic sentences"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_pos" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by POS Tags)" width="90" x="179" y="34">
<parameter key="expression" value="JJ.*|NN.*"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by POS Tags)" to_port="document"/>
<connect from_op="Filter Tokens (by POS Tags)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
0 -
Hi Thomas,
Maybe I was not clear enough. The example you show works, but it filters on either a JJ or an NN tag, not on a sequence of them. The OR worked for me as well (even without tokenizing on sentences), but I need more of an AND scenario.
What I need to achieve is a filter on JJ, but only if it is followed by one or more NN (or other combinations). Assume I have the following sentences:
"hello what do I need to be able to group multiple pos tokens? Can I use regular groups or is that too complex?"
Using a chunk rule like <JJ><NN.*>+ in Python would return ['multiple pos tokens', 'regular groups'].
Using the expression JJ.*|NN.*, as in the RM example, correctly returns
['able', 'group', 'multiple', 'pos', 'tokens', 'regular', 'groups', 'complex']
So the option to group POS tags would provide much more powerful options, but given that an expression like JJ NN.* returns an empty match, I assume this is not possible.
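To make the difference concrete, here is a rough side-by-side sketch in plain Python with NLTK (only an illustration; the exact tokens you get back depend on the tagger):

import re
import nltk

text = ("hello what do I need to be able to group multiple pos tokens? "
        "Can I use regular groups or is that too complex?")
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# single-tag filtering, i.e. what JJ.*|NN.* does: keep any token whose tag matches
print([word for word, tag in tagged if re.fullmatch(r"JJ.*|NN.*", tag)])

# sequence chunking, i.e. what I am after: an adjective followed by noun(s)
chunks = nltk.RegexpParser(r"NP: {<JJ><NN.*>+}").parse(tagged)
print([" ".join(word for word, tag in subtree.leaves())
       for subtree in chunks.subtrees(lambda t: t.label() == "NP")])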
Hope this makes it clearer.
1 -
Hmm, in this case I'm stumped. Maybe @mschmitz has an idea.
0 -
Phew, this is rather a question for @hhomburg or @RalfKlinkenberg
0