HTML Tag Removal using Regular Expression/Replace Tokens
Hello friends,
I have a very large text file containing a huge number of HTML tags. I want to remove all HTML tags with a regular expression, using "Replace Tokens" in RapidMiner, so that only the plain text remains.
Because the file is so big (a U.S. Securities and Exchange Commission annual report text file), I cannot even identify all the HTML tags in it.
Because of the complex tagging (<Tag> <<Tag>Tag> TEXT to extract <Tag> <<Tag>Tag>) and because I cannot "see" all the tags, it is hard for me to find the right regex.
I realised that all text parts basically start with > (end of a tag) and end with < (start of a new tag).
Is there a regular expression that gives me only the >Text< parts, since I only want to extract the text?
Thanks for your help!
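
To make the idea concrete, here is a minimal sketch in Python of the two patterns I have in mind. It only illustrates the regex logic; the function names and the sample string are made up for this post. I assume the first pattern, <[^>]*>, could be used as the "replace what" entry in Replace Tokens with an empty or single-space "replace by", but I have not confirmed that it behaves the same there.

```python
import re
from html import unescape

# Any run that starts with "<" and ends at the next ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]*>")

# The ">Text<" idea: capture whatever sits between a ">" and the next "<".
BETWEEN_RE = re.compile(r">([^<>]+)<")


def strip_tags(text):
    """Remove tags and return the remaining plain text."""
    # Replace each tag with a space so neighbouring words do not run together.
    no_tags = TAG_RE.sub(" ", text)
    # Decode entities such as &amp; and collapse repeated whitespace.
    return " ".join(unescape(no_tags).split())


def text_parts(text):
    """Return the non-empty text fragments found between ">" and "<"."""
    return [part.strip() for part in BETWEEN_RE.findall(text) if part.strip()]


if __name__ == "__main__":
    # Hypothetical sample line, not taken from a real filing.
    sample = "<P><B><FONT size=2>Annual Report</FONT></B> for fiscal year 2010</P>"
    print(strip_tags(sample))
    print(text_parts(sample))
```

One caveat with this sketch: a truly malformed sequence such as <<Tag>Tag> would leave "Tag>" behind after the first pattern is applied, so it probably only works cleanly when every < has exactly one matching >.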