"tokenize and keep words with dash"
johannesweber
New Altair Community Member
Hello,
Is there any way to tokenize text into single words without splitting words that contain a dash?
For example, I want to keep the word "state-of-the-art" instead of having four words afterwards.
I saw the option to change the operator's mode to "specific characters"; however, I don't understand the syntax required.
I would much appreciate an answer.
Best regards
Johannes
Answers
The "specific characters" mode is fine: just list the characters that mark word boundaries, e.g. dot, comma, space, question mark: "!? ,.". Think carefully and check the results so that you do not forget any important delimiters.
Best regards,
Marius
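The idea in the answer above, splitting only on an explicit list of delimiter characters so that the dash survives, can be illustrated outside the tool with a small Python sketch. This is not the operator's implementation; the function name `tokenize` and the constant `DELIMITERS` are made up for illustration, and the delimiter set is simply the example string from the answer.

```python
import re

# Assumed delimiter set, copied from the example in the answer: "!? ,."
# The dash is deliberately NOT listed, so hyphenated words stay intact.
DELIMITERS = "!? ,."


def tokenize(text, delimiters=DELIMITERS):
    """Split text on any run of the given delimiter characters."""
    # re.escape keeps every listed character literal inside the class.
    pattern = "[" + re.escape(delimiters) + "]+"
    # Drop the empty strings that appear at the start or end of the text.
    return [token for token in re.split(pattern, text) if token]


print(tokenize("This is state-of-the-art, right? Yes."))
```

Because "-" is not in the delimiter list, "state-of-the-art" comes out as a single token. Any character missing from the list (for example ";" or ":") stays attached to its neighbouring word, which is why checking the results matters.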
This is a really good suggestion and very helpful. I tried using "." to tokenize my document, but now I face the problem that a sentence containing, e.g., the word "u.s." is split right in the middle because "u.s." contains a dot. To take another example, a sentence containing the number "1.3%" is split as well.
So is there a way to also include exceptions in the "specific characters" mode, and what regex would I use for that? Or do I have to add another operator?
Thank you for your great help. This is very much appreciated.
Helen
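One possible way to express the exception asked about here is a regular expression that treats a dot as a delimiter only when it is not directly followed by a letter or digit. The Python sketch below illustrates that rule outside the tool; whether the operator accepts the same pattern is not confirmed in this thread, and the names `SPLIT_PATTERN` and `tokenize_with_exceptions` are hypothetical.

```python
import re

# Delimiters: "!", "?", space and comma always split.
# A dot splits only if the next character is NOT a letter, digit or
# underscore, so the dots inside "u.s" and "1.3" are left alone.
SPLIT_PATTERN = re.compile(r"(?:[!? ,]|\.(?!\w))+")


def tokenize_with_exceptions(text):
    """Split on delimiters, keeping dots that sit inside a token."""
    return [token for token in SPLIT_PATTERN.split(text) if token]


print(tokenize_with_exceptions("The u.s. economy grew 1.3% last year."))
```

With this rule "1.3%" stays whole and "u.s." is no longer cut in the middle, although its trailing dot is still treated as a delimiter, so the token comes out as "u.s". The rule also fails when a sentence ends without a following space (as in "end.Next"), so the results should still be checked against the actual documents.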