HTML Tag Removal using Regular Expression/Replace Tokens

User: "J_Hering"
New Altair Community Member
Updated by Jocelyn
Hello friends,

I am faced with a very large txt file that contains a great many HTML tags. I want to remove all of them with a regular expression in the "Replace Tokens" operator in RapidMiner so that only the plain text remains.
Because the file is so big (a U.S. Securities and Exchange Commission annual report text file), I cannot even identify all the HTML tags it contains.

Because the tags are nested in complex ways, e.g. <Tag> <<Tag>Tag> TEXT to extract <Tag> <<Tag>Tag>, and because I cannot "see" all of them, it is hard for me to find the right regex.

I realised that every text part basically starts after a > (end of a tag) and ends before a < (start of the next tag).
Is there a regular expression that gives me only the >Text< parts, since I want to extract only the text?

Thanks for your help!
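
For illustration, here is a minimal sketch of the two regex ideas raised above, written in Python rather than RapidMiner so it can be run standalone. The function names `strip_tags` and `text_parts` and the sample string are made up for this example. The pattern `<[^>]*>` matches a single tag and is presumably what would go into Replace Tokens with a space as the replacement, assuming RapidMiner's Java-style regex engine handles it the same way. The pattern `>([^<>]+)<` captures the >Text< parts directly.

```python
import re
from typing import List

# One tag: "<", then any run of characters that are not ">", then ">".
TAG = re.compile(r"<[^>]*>")

# Text that sits between the end of one tag and the start of the next.
TEXT_BETWEEN_TAGS = re.compile(r">([^<>]+)<")


def strip_tags(html: str) -> str:
    """Replace every tag with a space, then collapse runs of whitespace."""
    without_tags = TAG.sub(" ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def text_parts(html: str) -> List[str]:
    """Return the non-empty text fragments found between tags."""
    return [m.strip() for m in TEXT_BETWEEN_TAGS.findall(html) if m.strip()]


# Hypothetical sample input, not taken from an actual SEC filing.
sample = "<html><body><p>Annual <b>Report</b></p><p>Fiscal year 2010</p></body></html>"

print(strip_tags(sample))   # Annual Report Fiscal year 2010
print(text_parts(sample))   # ['Annual', 'Report', 'Fiscal year 2010']
```

Removing the tags is usually the more robust of the two, because it also keeps text that comes before the first tag or after the last one. Neither pattern copes with a ">" inside an attribute value, with comments or script blocks, or with entities such as &amp;amp;, so those would need separate handling.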





