<operator name="Root" class="Process" expanded="yes">
  <operator name="TextInput" class="TextInput" expanded="yes">
    <list key="texts">
      <parameter key="graphics" value="../data/newsgroup/graphics"/>
      <parameter key="hardware" value="../data/newsgroup/hardware"/>
    </list>
    <parameter key="default_content_encoding" value="ISO-8859-1"/>
    <parameter key="prune_below" value="2"/>
    <list key="namespaces">
    </list>
    <parameter key="create_text_visualizer" value="true"/>
    <parameter key="on_the_fly_pruning" value="3"/>
    <operator name="TokenReplace" class="TokenReplace">
      <list key="replace_dictionary">
        <parameter key="cantaloupe" value="cantaHORST"/>
      </list>
    </operator>
    <operator name="StringTokenizer" class="StringTokenizer">
    </operator>
    <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
    </operator>
    <operator name="TokenLengthFilter" class="TokenLengthFilter">
      <parameter key="min_chars" value="3"/>
    </operator>
    <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
    </operator>
    <operator name="TermNGramGenerator" class="TermNGramGenerator">
    </operator>
  </operator>
</operator>
This does not seem to work.
Here is an up-to-date version:
<operator activated="true" class="process" compatibility="5.0.000" expanded="true" name="Root">
  <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="7.5.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
      <parameter key="text" value="Some text about different kind of dances people might enjoy."/>
    </operator>
    <operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="313" y="34">
      <process expanded="true">
        <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
        <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens" width="90" x="380" y="34">
          <list key="replace_dictionary">
            <parameter key="([a-zA-Z]+)s" value="$1"/>
          </list>
        </operator>
        <connect from_port="document" to_op="Tokenize" to_port="document"/>
        <connect from_op="Tokenize" from_port="document" to_op="Replace Tokens" to_port="document"/>
        <connect from_op="Replace Tokens" from_port="document" to_port="document 1"/>
        <portSpacing port="source_document" spacing="0"/>
        <portSpacing port="sink_document 1" spacing="0"/>
        <portSpacing port="sink_document 2" spacing="0"/>
      </process>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
  </process>
</operator>
Remark: Make sure to download the Text Processing Extension from the Marketplace in order for this solution to work.
Key element:
To extract the part of a token that matches a certain criterion, use the capturing-group feature of regular expressions. Here we identify tokens ending with 's' using the expression ([a-zA-Z]+)s and refer to the targeted substring (the token without its trailing 's') by the group identifier $1.
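If you want to check the expression outside of RapidMiner first, here is a rough Python sketch of the same idea. It is only an approximation of the process above, not how the operators are implemented: the helper names `tokenize` and `strip_trailing_s` are made up for this example, splitting on non-letters is my assumption about what Tokenize does by default, and Python writes the group reference as `\1` where the process uses `$1`.

```python
import re

# Same expression as in the replace_dictionary entry:
# group 1 captures the letters in front of a trailing lowercase "s".
PATTERN = re.compile(r"([a-zA-Z]+)s")


def tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for Tokenize: split on non-letter characters."""
    return re.findall(r"[a-zA-Z]+", text)


def strip_trailing_s(token: str) -> str:
    """Replace every match of the pattern with the content of group 1.

    Python spells the back-reference "\\1"; the process XML uses "$1".
    """
    return PATTERN.sub(r"\1", token)


if __name__ == "__main__":
    text = "Some text about different kind of dances people might enjoy."
    print([strip_trailing_s(t) for t in tokenize(text)])
```

Note that the pattern is not anchored, so it also matches an 's' in the middle of a token (in this sketch "system" becomes "sytem"). If you only want to touch a trailing 's', anchoring the expression, e.g. ^([a-zA-Z]+)s$, may be the safer choice.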
Hope it helps.