"tokenize and keep words with dash"
johannesweber
Hello,
Is there any way to tokenize text into single words without splitting words that contain a dash?
For example, I want to keep the word "state-of-the-art" as one token instead of ending up with four separate words.
I saw the option to change the operator's mode to "specific characters", but I don't understand the required syntax.
I would very much appreciate an answer.
Best regards
Johannes
MariusHelf
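The "specific characters" mode is fine: just list the characters that mark word boundaries, e.g. dot, comma, space, question mark: "!? ,.". Leave the dash out of that list so that words like "state-of-the-art" stay in one piece. Think carefully and check the results so that you don't forget any important delimiters.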
Best regards,
Marius
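To illustrate the idea outside the tool, here is a minimal Python sketch of splitting on a fixed set of delimiter characters. It is not the operator's implementation; the function name `tokenize` and the sample sentence are made up for illustration, and only the delimiter string "!? ,." comes from the answer above.

```python
import re

# Delimiters taken from the answer above: "!", "?", space, "," and ".".
# The dash is deliberately not in the list, so hyphenated words stay whole.
DELIMITERS = "!? ,."

def tokenize(text, delimiters=DELIMITERS):
    """Split text at any run of the given delimiter characters (hypothetical helper)."""
    # re.escape keeps every delimiter literal inside the character class.
    pattern = "[" + re.escape(delimiters) + "]+"
    # Drop empty strings that appear when the text starts or ends with a delimiter.
    return [token for token in re.split(pattern, text) if token]

print(tokenize("This is state-of-the-art, isn't it?"))
```

Because the apostrophe and the dash are not delimiters here, "isn't" and "state-of-the-art" each come out as a single token.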
HelenZ
This is a really good suggestion and very helpful. I tried using "." as a delimiter to tokenize my document. But now I face the problem that a sentence containing, for example, the word "u.s." is split right in the middle, because "u.s." contains a dot. To take another example, a sentence containing the number "1.3%" is split as well.
So is there a way to define exceptions in the "specific characters" mode, and if so, what regex do I use? Or do I have to add another operator?
Thank you for your great help. This is very much appreciated.
Helen
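One possible way to think about such exceptions is to describe what a token looks like instead of listing the characters that separate tokens. The Python sketch below shows that idea with a regular expression. It is only an assumption about what might help; whether the operator's modes accept an expression like this is not confirmed in this thread, and the pattern and sample sentence are made up for illustration.

```python
import re

# Hypothetical token pattern; alternatives are tried from left to right.
TOKEN = re.compile(r"""
    \d+(?:\.\d+)?%?        # numbers such as 1.3 or 1.3%
  | (?:[A-Za-z]\.){2,}     # dotted abbreviations such as u.s.
  | \w+(?:-\w+)*           # plain words and hyphenated words
""", re.VERBOSE)

def tokenize(text):
    """Return every substring that matches the token pattern (hypothetical helper)."""
    return TOKEN.findall(text)

print(tokenize("The u.s. economy grew 1.3% on state-of-the-art tech."))
```

With this pattern the dot in "u.s." and in "1.3%" stays inside the token, while the sentence-final dot is dropped. Real text will contain cases this short pattern does not cover, so the results still need to be checked.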