"Retaining selected word pairs when tokenizing"

carl · 2016-12-28T17:27:56+00:00

When tokenizing into single-word tokens, is there a way to keep selected pairs of words together as a single token? For example, in soccer the term "centre forward" makes more sense as a single token. I looked at n-grams, but that also pairs words I do not want paired. I tried using the stem dictionary, but this seems not…
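
To make the goal concrete, here is a minimal sketch in plain Python of the behaviour being asked for: tokenize into single words, then merge only the pairs on an explicit list. It is not tied to any particular text-mining tool. The `KEEP_TOGETHER` list, the `tokenize` function, and the underscore joiner are all assumptions made for illustration.

```python
import re

# Hypothetical list of word pairs to keep together (illustrative only).
KEEP_TOGETHER = {("centre", "forward"), ("goal", "kick")}

def tokenize(text, pairs=KEEP_TOGETHER):
    """Split text into lowercase word tokens, merging only the listed pairs."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        # Merge two adjacent words only if the pair is explicitly listed.
        if i + 1 < len(words) and (words[i], words[i + 1]) in pairs:
            tokens.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(tokenize("The centre forward took the goal kick."))
# → ['the', 'centre_forward', 'took', 'the', 'goal_kick']
```

Unlike general n-gram generation, only the listed pairs are joined, so every other word stays a single-word token.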