"Problems with filtering attributes with regex"

TobiasNehrig
TobiasNehrig New Altair Community Member
edited November 2024 in Community Q&A

Hi experts,

I have to create a co-occurrence graph, so I created a corpus and an occurrence matrix. I have a problem with the occurrence matrix: I can't get it to keep only words with 3 or more letters for my analysis. When I use, for example, [(0-9)+][-!"#$%&'()*+,./:;<=>?@\[\\\]_`{|}~][(0-9)+] [(a-z){3,}] all columns are deleted.

 

Does anyone have an idea how to fix this problem?

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<parameter key="logfile" value="/home/knecht/Master2017/Rapp/Logfile.log"/>
<parameter key="resultfile" value="/home/knecht/Master2017/Rapp/resultfile.res"/>
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="34">
<parameter key="url" value="http://www.fask.uni-mainz.de/user/rapp/papers/disshtml/main/main.html"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value="http://www.fask.uni-mainz.de/user/rapp/papers/disshtml/.*"/>
<parameter key="follow_link_with_matching_url" value="http://www.fask.uni-mainz.de/user/rapp/papers/disshtml.*"/>
</list>
<parameter key="max_crawl_depth" value="10"/>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="write_pages_to_disk" value="true"/>
<parameter key="output_dir" value="/home/knecht/Crawler"/>
<parameter key="max_pages" value="1000"/>
<parameter key="max_page_size" value="500"/>
<parameter key="user_agent" value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0"/>
<parameter key="ignore_robot_exclusion" value="true"/>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="45" y="136">
<parameter key="link_attribute" value="Link"/>
<parameter key="page_attribute" value="link"/>
<parameter key="random_user_agent" value="true"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="45" y="238">
<parameter key="keep_text" value="true"/>
<list key="specify_weights">
<parameter key="link" value="1.0"/>
</list>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
<parameter key="minimum_text_block_length" value="2"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize Token" width="90" x="45" y="136">
<parameter key="mode" value="linguistic tokens"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:filter_stopwords_german" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="45" y="238"/>
<operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="34"/>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Tokenize Token" to_port="document"/>
<connect from_op="Tokenize Token" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
<connect from_op="Filter Stopwords (German)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="179" y="34">
<parameter key="select_attributes_and_weights" value="true"/>
<list key="specify_weights">
<parameter key="text" value="1.0"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Data to Document" width="90" x="313" y="34"/>
<operator activated="true" class="write_as_text" compatibility="7.6.001" expanded="true" height="82" name="Write Korpus" width="90" x="447" y="34">
<parameter key="result_file" value="/home/knecht/Master2017/Korpus/17-12-01-Rapp-Korpus.res"/>
</operator>
<operator activated="true" class="text:wordlist_to_data" compatibility="7.5.000" expanded="true" height="82" name="WordList to Data" width="90" x="179" y="289"/>
<operator activated="true" class="write_excel" compatibility="7.6.001" expanded="true" height="82" name="Write Excel Wordlist" width="90" x="447" y="391">
<parameter key="excel_file" value="/home/knecht/17-12-01-Rapp-Wordlist.xlsx"/>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="187">
<parameter key="text_attribute" value="text"/>
<parameter key="label_attribute" value="text"/>
<parameter key="data_management" value="memory-optimized"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="313" y="187"/>
<operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="289">
<parameter key="attribute_filter_type" value="regular_expression"/>
<parameter key="regular_expression" value="[(0-9)+][-!&quot;#$%&amp;'()*+,./:;&lt;=&gt;?@\[\\\]_`{|}~][(0-9)+] [(a-z){3,}] "/>
<parameter key="value_type" value="text"/>
<parameter key="use_value_type_exception" value="true"/>
<parameter key="except_value_type" value="text"/>
<parameter key="block_type" value="value_matrix"/>
</operator>
<operator activated="true" class="write_excel" compatibility="7.6.001" expanded="true" height="82" name="Write Excel Korpus" width="90" x="447" y="187">
<parameter key="excel_file" value="/home/knecht/17-12-01-Rapp-RohMatrix.xlsx"/>
</operator>
<connect from_op="Crawl Web" from_port="example set" to_op="Get Pages" to_port="Example Set"/>
<connect from_op="Get Pages" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Data to Document" to_port="input"/>
<connect from_op="Data to Document" from_port="output 1" to_op="Write Korpus" to_port="input 1"/>
<connect from_op="Data to Document" from_port="output 2" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Write Korpus" from_port="input 1" to_port="result 1"/>
<connect from_op="WordList to Data" from_port="word list" to_port="result 4"/>
<connect from_op="WordList to Data" from_port="example set" to_op="Write Excel Wordlist" to_port="input"/>
<connect from_op="Write Excel Wordlist" from_port="through" to_port="result 5"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Write Excel Korpus" to_port="input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="result 3"/>
<connect from_op="Write Excel Korpus" from_port="through" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
</process>
</operator>
</process>

 

17-12-02-Crawler Process.png

Best Answers

  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓

    Inside your "Process Documents from Data" operator, after you have tokenized your words, simply use the "Filter Tokens (by Length)" operator and set the minimum length you want. I think that's a much easier way to accomplish what you are trying to do.
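
    As a rough sketch of that suggestion, the fragment below could be placed inside the "Process Documents from Data" subprocess between "Tokenize Token" and "Filter Stopwords (German)". The class key `text:filter_by_length` and the parameter keys `min_chars` and `max_chars` are assumptions based on the Text Processing extension and should be checked against your installed version. The value 100 for the maximum is an arbitrary placeholder.

    ```xml
    <!-- Sketch only: class key and parameter names are assumed, not taken from the original process -->
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" name="Filter Tokens (by Length)">
      <!-- Keep only tokens with at least 3 characters -->
      <parameter key="min_chars" value="3"/>
      <!-- Arbitrary upper bound so that long German compounds are kept -->
      <parameter key="max_chars" value="100"/>
    </operator>
    <!-- These two connections would replace the direct Tokenize Token to Filter Stopwords (German) connection -->
    <connect from_op="Tokenize Token" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
    ```

    With this in place, short tokens never become columns of the matrix, so no regex filtering is needed afterwards.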

  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Answer ✓

    Hi!

     

    You have a highly complex and very specific regex. I wasn't even able to find a text that it matches.

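    The use of character classes [] and parentheses () the way you're doing it is not very common. Inside square brackets, parentheses and quantifiers such as + or {3,} are treated as literal characters, so [(a-z){3,}] matches only a single character. This would be more standard usage: [a-z()] (if you're really matching lower case characters and the opening and closing parentheses).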

     

    The regexp also has a space at the end.

    In Select Attributes, the regexp must match the whole attribute name. (Usually a regex only needs to match a part of the target; Select Attributes is different in this regard.)

     

    When developing regexes, it's best to start from a simple state and then build up on that, using RapidMiner's testing methods.

     

    If I understand your problem, the regex (\w+-){2}\w+ would be a simple representation of "word-word-word". You can start from this and build upon it. 
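
    To illustrate the points above, here is a minimal sketch of how the Select Attributes parameters might look with a simpler pattern. The pattern `[a-z]{3,}` is only an assumption about what is wanted (attribute names consisting of three or more lowercase letters). German umlauts are not covered by it and would have to be added to the character class, for example `[a-zäöüß]{3,}`.

    ```xml
    <!-- Sketch only: the pattern is an assumed starting point, not a tested solution -->
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" name="Select Attributes">
      <parameter key="attribute_filter_type" value="regular_expression"/>
      <!-- Has to match the whole attribute name: three or more lowercase letters, no trailing space -->
      <parameter key="regular_expression" value="[a-z]{3,}"/>
    </operator>
    ```

    Because the whole name has to match, an attribute called "ab" or "abc1" would be dropped by this pattern, while "abc" would be kept.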

     

    Regards,

    Balázs

Answers

  • TobiasNehrig
    TobiasNehrig New Altair Community Member

    Thank you very much.