<?xml version="1.0" encoding="UTF-8" standalone="no"?><process version="5.3.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process"> <process expanded="true" height="568" width="587"> <operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" height="76" name="Filter Examples" width="90" x="45" y="30"> <parameter key="condition_class" value="attribute_value_filter"/> <parameter key="parameter_string" value="text = .*again.*|.*delivery.*|.*fast.*"/> </operator> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> </process> </operator></process>
<?xml version="1.0" encoding="UTF-8" standalone="no"?><process version="5.2.008"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process"> <process expanded="true" height="341" width="756"> <operator activated="true" class="read_database" compatibility="5.2.008" expanded="true" height="60" name="Read Database" width="90" x="45" y="75"> <parameter key="connection" value="sqlserver"/> <parameter key="query" value="SELECT `Bewertung` FROM `training_schnell`"/> <enumeration key="parameters"/> </operator> <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="75"/> <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="75"> <parameter key="prunde_below_percent" value="5.0"/> <parameter key="prune_above_percent" value="100.0"/> <list key="specify_weights"/> <process expanded="true" height="345" width="774"> <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"> <parameter key="mode" value="specify characters"/> <parameter key="characters" value=".:,:;:!:?:|:"/> </operator> <operator activated="true" class="text:filter_by_length" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="45" y="120"> <parameter key="max_chars" value="9999"/> </operator> <operator activated="true" class="text:filter_stopwords_german" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="45" y="210"/> <operator activated="true" class="text:stem_german" compatibility="5.2.004" expanded="true" height="60" name="Stem (German)" width="90" x="179" y="30"/> <operator activated="false" class="text:filter_tokens_by_content" 
compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="447" y="165"> <parameter key="string" value="schnell "/> <parameter key="regular_expression" value="(schnell)"/> </operator> <connect from_port="document" to_op="Tokenize" to_port="document"/> <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/> <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/> <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Stem (German)" to_port="document"/> <connect from_op="Stem (German)" from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="false" class="text:wordlist_to_data" compatibility="5.2.004" expanded="true" height="76" name="WordList to Data" width="90" x="313" y="210"/> <operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Filter Examples" width="90" x="514" y="30"> <parameter key="condition_class" value="attribute_value_filter"/> <parameter key="parameter_string" value="Bewertung = .*wieder.*|.*lieferung.*|.*schnell.*"/> </operator> <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="514" y="165"> <parameter key="excel_file" value="C:\Users\MP-TEST\Desktop\Rapid_Test\Klein.xls"/> </operator> <connect from_op="Read Database" from_port="output" to_op="Nominal to Text" to_port="example set input"/> <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/> <connect from_op="Process Documents from Data" from_port="example set" to_op="Filter Examples" to_port="example set input"/> <connect from_op="Filter Examples" 
from_port="example set output" to_op="Write Excel" to_port="input"/> <connect from_op="Write Excel" from_port="through" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator></process>
.*dog.*|.*cat.*|.*fish.*
.*(cat|dog|fish).*
It does exactly the same. It reads as 'take whatever you want (the dot), as many times as you like (the asterisk), followed by either cat, dog or fish, and then again followed by whatever, as much as you want'.
This is what we call a greedy pattern: we don't care what we get or how much of it. This is typically no problem when dealing with short sentences, but it can cost you a lot of memory when you have long content.
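To check that the two spellings really behave the same, here is a minimal Python sketch. The sample values are made up, and the use of `re.fullmatch` rests on the assumption that the filter wants the whole attribute value to match the expression; Python's `re` module is only a stand-in for the regex engine behind the filter.

```python
import re

# Both spellings from the thread; the second factors out the repeated ".*".
LONG_FORM = r".*dog.*|.*cat.*|.*fish.*"
SHORT_FORM = r".*(cat|dog|fish).*"

# Hypothetical sample values, not taken from the original data.
samples = ["I own a dog", "fish and chips", "no pets here", "Cat person"]


def keeps(pattern, text):
    # Assumption: the whole value has to match, hence fullmatch.
    return re.fullmatch(pattern, text) is not None


long_hits = [keeps(LONG_FORM, s) for s in samples]
short_hits = [keeps(SHORT_FORM, s) for s in samples]

for text, a, b in zip(samples, long_hits, short_hits):
    print(f"{text!r}: long form={a}, short form={b}")
```

Both forms keep the first two values and drop the last two. "Cat person" is dropped by both because nothing here ignores case yet.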
So, one small improvement already:
^.*?(cat|dog|fish).*
OK, two small changes. The first is the 'hat' (^), which means 'begin at the start of the sentence'. The second is the question mark, which makes the preceding .* lazy, meaning 'end at the first match'. So ^.*? is short for: begin at the start, and stop as soon as you find the first match. This can save quite some time with large texts, as the original pattern first runs all the way to the end of the sentence and then has to work its way back to find a match.
Now, we can still only match lower case. While it is good practice to convert all of your text to either lower or upper case in a text analysis workflow, there are of course occasions where we need the difference. Anyway, to ignore case we use the i flag as follows:
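The difference between the greedy and the lazy version is easiest to see by looking at which keyword the group actually captures. A small Python sketch with a made-up sentence (Python's `re` is used as a stand-in here, the quantifier behaviour is the same in the usual backtracking engines):

```python
import re

# Hypothetical sentence containing all three keywords.
text = "my cat chased the dog past the fish tank"

# Greedy: ".*" grabs everything first, then backs up until a keyword fits.
greedy = re.match(r"^.*(cat|dog|fish).*", text)

# Lazy: ".*?" grows one character at a time and stops at the first keyword.
lazy = re.match(r"^.*?(cat|dog|fish).*", text)

print(greedy.group(1))  # fish (the last keyword in the sentence)
print(lazy.group(1))    # cat  (the first keyword in the sentence)
```

Both patterns accept the same sentences; the lazy one just finds its keyword sooner.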
(?i)^.*?(cat|dog|fish).*
So now it will find cat, Cat, CAT and whatever else, should that be a requirement of course.
You can combine several flags. While the i flag means ignore case, the m flag indicates that your text can have multiple lines: it makes ^ match at the start of every line instead of only at the start of the whole text. Combining them as below means that every line, when using line breaks, gets the same treatment.
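A quick way to see the effect of the i flag is to run the same values through the pattern with and without it. A minimal Python sketch with invented sample values, again assuming the whole value has to match:

```python
import re

CASE_SENSITIVE = re.compile(r"^.*?(cat|dog|fish).*")
IGNORE_CASE = re.compile(r"(?i)^.*?(cat|dog|fish).*")  # flag goes first

# Hypothetical sample values.
samples = ["my cat", "My Cat", "MY CAT", "my bird"]

sensitive_hits = [bool(CASE_SENSITIVE.fullmatch(s)) for s in samples]
insensitive_hits = [bool(IGNORE_CASE.fullmatch(s)) for s in samples]

print(sensitive_hits)    # [True, False, False, False]
print(insensitive_hits)  # [True, True, True, False]
```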
(?im)^.*?(cat|dog|fish).*
The order doesn't matter; (?mi) would work exactly the same.
Now, we still have the problem that we can get things like category or hotdog in the results, so the final part is to use the word boundary, which ensures we only get a match when it is exactly the same word. A word boundary can be anything like a comma, a dot, a space, or the beginning or end of the sentence. Luckily there is a little helper for this again, \b, so the pattern below gives you an exact word match, stops at the first match, and looks at every line you have.
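To see what the m flag adds, it helps to feed the pattern a text with line breaks and collect every match. This Python sketch uses a made-up three-line text; `re.finditer` is just a convenient way to list the matching lines and is not what the filter itself does.

```python
import re

# Hypothetical text with three lines; only lines two and three hold a keyword.
text = "I like birds\nMy Dog barks\nFISH swim"


def matching_lines(pattern):
    # Collect the full text of every match found in the sample.
    return [m.group(0) for m in re.finditer(pattern, text)]


with_m = matching_lines(r"(?im)^.*?(cat|dog|fish).*")
swapped = matching_lines(r"(?mi)^.*?(cat|dog|fish).*")
without_m = matching_lines(r"(?i)^.*?(cat|dog|fish).*")

print(with_m)     # ['My Dog barks', 'FISH swim']
print(swapped)    # same result, flag order is irrelevant
print(without_m)  # [] because ^ only matches before "I like birds"
```

Without m, the hat only matches at the very start of the text, and since the dot does not cross a line break the pattern never reaches the later lines.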
(?im)^.*?\b(cat|dog|fish)\b.*
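The category and hotdog cases can be checked directly by comparing the pattern with and without \b. A minimal Python sketch with invented sample values, assuming once more that the whole value has to match:

```python
import re

LOOSE = re.compile(r"(?im)^.*?(cat|dog|fish).*")
STRICT = re.compile(r"(?im)^.*?\b(cat|dog|fish)\b.*")

# Hypothetical sample values.
samples = ["choose a category", "one hotdog please", "my cat, your dog", "Fish."]

loose_hits = [bool(LOOSE.fullmatch(s)) for s in samples]
strict_hits = [bool(STRICT.fullmatch(s)) for s in samples]

print(loose_hits)   # [True, True, True, True]
print(strict_hits)  # [False, False, True, True]
```

The comma after "cat" and the dot after "Fish" both count as word boundaries, so those two values survive the strict version.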
As an alternative you could use the s flag when you have a lot of line breaks. It lets the dot match line breaks as well, so your text is effectively treated as one single line.
(?is)^.*?\b(cat|dog|fish)\b.*
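The s flag matters when the keyword sits on a later line and the whole multi-line value has to match. A short Python sketch with a made-up three-line value, under the same whole-value assumption as before:

```python
import re

# Hypothetical value with line breaks; the keyword is on the second line.
text = "first line\nsecond line has a cat\nthird line"

# Without s the dot stops at the first line break, so the value cannot match.
no_s = re.fullmatch(r"(?i)^.*?\b(cat|dog|fish)\b.*", text)

# With s the dot also matches line breaks and the whole value is covered.
with_s = re.fullmatch(r"(?is)^.*?\b(cat|dog|fish)\b.*", text)

print(no_s)             # None
print(with_s.group(1))  # cat
```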
FINAL EDIT: it seems the code block messes the content up a bit. All of the symbols of a pattern need to be on one single line, with the flags at the front.