"tokenize and keep words with dash"

johannesweber New Altair Community Member
edited November 5 in Community Q&A
Hello,

is there any way to tokenize a text into single words without splitting words that contain a dash?

For example, I want to keep "state-of-the-art" as a single token instead of ending up with four separate words.

I saw the option to change the operator's mode to "specific characters", but I don't understand the required syntax.

I would very much appreciate an answer.

Best regards

Johannes

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
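    "Specific characters" is fine: just list the characters that indicate word borders, e.g. dot, comma, space, question mark, etc.: "!? ,.". As long as the dash is not in that list, words such as "state-of-the-art" stay in one piece. Think carefully and check the results so that you don't forget any important delimiters :)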

    Best regards,
    Marius
  • HelenZ
    HelenZ New Altair Community Member
    This is a really good suggestion and very helpful. I tried using "." to tokenize my document. But now I face the problem that a sentence containing, e.g., the word "u.s." is split right in the middle, because "u.s." contains a dot. Or, to take another example, a sentence containing the number "1.3%" is split as well.

    So is there a way to include exceptions in the "specific characters" mode, and what regex term would I use for that? Or do I have to add another operator?


    Thank you for your great help. This is very much appreciated.


    Helen
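
To make the two situations in this thread concrete, here is a minimal sketch in plain Python using the standard `re` module. It is not the Tokenize operator's own syntax, and the function names `tokenize_on_chars` and `tokenize_keep_inner_dots` are made up for illustration. The first function mirrors the "specific characters" idea (split on every listed character, dash not listed). The second shows one possible exception rule, assumed here rather than taken from the thread: a dot only counts as a delimiter when it is followed by whitespace or the end of the text.

```python
import re

# Delimiter list from the answer above; the dash is deliberately not included.
DELIMITERS = "!? ,."


def tokenize_on_chars(text, delimiters=DELIMITERS):
    """Split on any listed character and drop empty tokens."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]


def tokenize_keep_inner_dots(text):
    """Like tokenize_on_chars, but a dot only splits when it is followed
    by whitespace or the end of the text (an assumed exception rule)."""
    pattern = r"(?:[!? ,]|\.(?=\s|$))+"
    return [tok for tok in re.split(pattern, text) if tok]


print(tokenize_on_chars("A state-of-the-art model, really!"))
# ['A', 'state-of-the-art', 'model', 'really']

print(tokenize_on_chars("The u.s. grew 1.3% today."))
# ['The', 'u', 's', 'grew', '1', '3%', 'today']

print(tokenize_keep_inner_dots("The u.s. grew 1.3% today."))
# ['The', 'u.s', 'grew', '1.3%', 'today']
```

Note that this rule keeps "1.3%" and the inner dot of "u.s." intact, but the final dot of "u.s." is still treated as a delimiter. If the goal is to split into sentences rather than words, abbreviations like this would still need separate handling.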