HTML Tag Removal using Regular Expression/Replace Tokens

New Altair Community Member

Jan 25, 2016

Updated Nov 5, 2024 by Jocelyn

Hello friends,

I have a very large txt file that contains a huge number of HTML tags. I want to remove all of the HTML tags with a regular expression, using "Replace Tokens" in RapidMiner, so that only the plain text is left to read.
Because the file is so big (a U.S. Securities and Exchange Commission annual report text file), I cannot even identify all of the HTML tags in it.

The tagging is complex, e.g. <Tag> <<Tag>Tag> TEXT to extract <Tag> <<Tag>Tag>, and since I cannot "see" all of the tags, it is hard for me to find the right regex.

I realised that all text parts basically start with > (the end of a tag) and end with < (the start of a new tag).
Is there a regular expression that gives me only the Text in >Text<, since I want to extract only the text parts?
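
To make the idea concrete, here is a minimal sketch in Python of the two patterns I have in mind. It is an illustration only: the sample string and the helper names strip_tags and text_parts are made up, and I am assuming the same patterns would carry over to the regex field of Replace Tokens, which I have not confirmed. The first pattern deletes anything that looks like a tag; the second keeps only the text sitting between a > and a <. Neither copes with malformed nesting such as <<Tag>Tag>, where the first pattern would leave "Tag>" behind.

```python
import re

# Hypothetical sample; the real SEC filing is far larger and messier.
sample = "<DIV><P><B>Annual Report</B></P><P>Total revenue rose.</P></DIV>"

# One tag: '<', then any run of characters that are not '>', then '>'.
TAG = re.compile(r"<[^>]*>")

# Text part: preceded by '>' and followed by '<', containing no angle brackets.
TEXT = re.compile(r"(?<=>)[^<>]+(?=<)")


def strip_tags(html):
    """Replace every tag with a space, then collapse runs of whitespace."""
    without_tags = TAG.sub(" ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def text_parts(html):
    """Return the non-empty text fragments found between '>' and '<'."""
    return [part.strip() for part in TEXT.findall(html) if part.strip()]


print(strip_tags(sample))
print(text_parts(sample))
```

Replacing each tag with a space rather than with nothing keeps words from neighbouring elements from running together.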

Thanks for your help!

