<?xml version="1.0" encoding="UTF-8" standalone="no"?><process version="6.5.002"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process"> <process expanded="true"> <operator activated="true" breakpoints="after" class="web:process_web" compatibility="6.5.000" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="120"> <parameter key="url" value="https://fr.wikipedia.org/"/> <list key="crawling_rules"> <parameter key="store_with_matching_url" value=".*"/> <parameter key="follow_link_with_matching_url" value=".*"/> </list> <parameter key="add_pages_as_attribute" value="true"/> <parameter key="max_pages" value="100"/> <parameter key="domain" value="server"/> <parameter key="delay" value="100"/> <parameter key="max_threads" value="6"/> <process expanded="true"> <connect from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="replace" compatibility="6.5.002" expanded="true" height="76" name="Replace" width="90" x="246" y="120"> <parameter key="replace_what" value="(?s)<script.*?</script>"/> </operator> <operator activated="true" class="replace" compatibility="6.5.002" expanded="true" height="76" name="Replace (2)" width="90" x="380" y="120"> <parameter key="replace_what" value="(?s)<style.*?</style>"/> </operator> <operator activated="true" class="replace" compatibility="6.5.002" expanded="true" height="76" name="Replace (3)" width="90" x="514" y="120"> <parameter key="replace_what" value="(?s)<.*?>"/> </operator> <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="76" name="Generate Attributes" width="90" x="648" y="120"> <list key="function_descriptions"> <parameter key="language" value=""de""/> </list> 
</operator> <connect from_op="Process Documents from Web" from_port="example set" to_op="Replace" to_port="example set input"/> <connect from_op="Replace" from_port="example set output" to_op="Replace (2)" to_port="example set input"/> <connect from_op="Replace (2)" from_port="example set output" to_op="Replace (3)" to_port="example set input"/> <connect from_op="Replace (3)" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/> <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="108"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator></process>
Hello, how can I delete an internet link like https://t.co/ghtyd from text? Does anyone know the regular expression?
There are a couple of options when you want to use regex, but you probably need to do it in several steps to be on the safe side.
If your structure is indeed like your example (<<Tag>Tag>), one way is to remove the 'correct' tags first by using this regex:
<\/?\w[^<>].*?>
Read it a bit like: 'select anything starting with a <, optionally followed by the closing slash, then a word character (a letter, digit, or underscore), then one character that is not < or >, then as little as possible up to the first >'.
This will change <Tag> <<Tag>Tag> TEXT to extract <Tag> <<Tag>Tag> into <Tag> TEXT to extract <Tag>, and if you run the same regex again, only your text remains.
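The repeated-pass idea can be sketched in Python as follows. This is only an illustration, not part of the RapidMiner process: the function name `strip_tags`, the `max_passes` limit, and the whitespace clean-up at the end are assumptions added for the example, while the pattern itself is copied as given.

```python
import re

# Pattern copied verbatim from the post.
TAG_RE = re.compile(r"<\/?\w[^<>].*?>")


def strip_tags(text: str, max_passes: int = 10) -> str:
    """Remove tag-like spans repeatedly until a pass changes nothing.

    max_passes is a hypothetical safety limit, not something the post specifies.
    """
    for _ in range(max_passes):
        stripped = TAG_RE.sub("", text)
        if stripped == text:
            break  # nothing left to remove
        text = stripped
    # Collapse the whitespace left behind by the removed tags.
    return " ".join(text.split())


print(strip_tags("<Tag> <<Tag>Tag> TEXT to extract <Tag> <<Tag>Tag>"))
```

Looping until the text stops changing saves you from guessing how deeply the tags are nested. Note that, as written, the pattern needs at least two characters after the `<`, so a one-letter tag such as `<b>` is left alone; `<\/?\w[^<>]*>` would be one way to cover that case.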
Now, typically tags should have a closing indicator (</...) but these are missing in your example, so the regex also works for
<Tag> <<Tag>Tag> TEXT to extract </Tag> </</Tag>Tag> or any combination
Anyway, be careful using regex: if there are actual < or > characters used for greater than / less than instead of HTML tags, you may remove more than needed. All in all, though, it should allow you to get started (and kick the guy who created this bad HTML...).
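To see the greater than / less than problem concretely, here is a small Python check using the same pattern. The sample strings and the helper name `strip_once` are made up for illustration.

```python
import re

# Same pattern as in the post.
TAG_RE = re.compile(r"<\/?\w[^<>].*?>")


def strip_once(text: str) -> str:
    """Apply the tag-removal pattern a single time."""
    return TAG_RE.sub("", text)


# Comparison operators written without spaces look like a tag to the pattern,
# so everything from '<' up to '>' is removed.
print(strip_once("x<y and y>z"))

# With a space after '<' there is no word character to match, so nothing is removed.
print(strip_once("x < y and y > z"))
```

If your text can contain comparisons like the first one, it is probably safer to check a sample of the matches before deleting them.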