Using "Cut Document" Operator neglects numbers and punctuation in HTML text

Question

Hi everyone,

I am currently using the "Cut Document" Operator with query type "Regular Region" to extract specific text from locally stored HTML files. This works pretty well so far; however, it seems that all numbers in the text are being dropped. For example:

Original text: Companies Act 2006. Our audit work has been undertaken so that we might state to the company's members those concerning the cost of the fixed asset investment, stated at £51,925 in note 6 to the financial statements.

Text after extraction: Companies Act Our audit work has been undertaken so that we might state to the company s members those concerning the cost of the fixed asset investment stated at Â in note to the financial statements

Punctuation characters such as , and . are dropped as well. Does anyone know whether there is a setting to keep both punctuation characters and numbers?

My code right now looks like this:

Limegreenman900_1 · Answer

OK, it looks like this was caused by the "Tokenize" Operator I used inside "Cut Document". If I run my process without it, I get plain text with punctuation and numbers.

If I use "linguistic tokens - english" as the setting in the "Tokenize" Operator, it also works perfectly.
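
To illustrate why a tokenizer can swallow numbers and punctuation, here is a minimal Python sketch. It is not RapidMiner's implementation. It assumes that the "Tokenize" Operator was running in a mode that treats every non-letter character as a separator, and that the stray "Â" comes from the HTML file's UTF-8 bytes being read as Latin-1. The function names `tokenize_non_letters` and `tokenize_keep_all` and the shortened sample sentence are made up for this example.

```python
import re


def tokenize_non_letters(text):
    """Keep only runs of letters; digits and punctuation act as separators."""
    # [^\W\d_] matches a word character that is neither a digit nor "_",
    # i.e. a (Unicode) letter.
    return re.findall(r"[^\W\d_]+", text)


def tokenize_keep_all(text):
    """Keep numbers (with inner , or .), words, and single punctuation marks."""
    return re.findall(r"\d+(?:[.,]\d+)*|[^\W\d_]+|[^\w\s]", text)


original = "Companies Act 2006. The company's investment is stated at £51,925 in note 6."

# Assumed cause of the stray character: UTF-8 bytes decoded as Latin-1,
# which turns "£" into "Â£". "Â" is a letter, so it survives tokenization.
misread = original.encode("utf-8").decode("latin-1")

print(" ".join(tokenize_non_letters(misread)))
# Companies Act The company s investment is stated at Â in note

print(tokenize_keep_all(original))
```

The first call reproduces the symptoms from the question: "2006", "51,925" and "6" disappear, "company's" becomes "company s", and only "Â" remains of the misread pound sign. The second pattern shows one way a tokenizer can keep numbers and punctuation as separate tokens, which is roughly the effect described for the linguistic setting above.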