
"tokenize BUG (Text processing)"

User: "danong"
New Altair Community Member
Updated by Jocelyn
Hi, I am using the Tokenize operator (Text Processing) with the 'specify characters' option. My specified characters are symbols and digits (.:@/_",*$#!?^ ()<>+-%'"[]{}~`0123456789), so my tokens come out as English words.
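
To illustrate the setup, here is a rough Python sketch of what I understand 'specify characters' tokenization to do (my own regex approximation, not RapidMiner's actual implementation):

```python
import re

# The character list from the Tokenize operator's "specify characters" option.
SPLIT_CHARS = ".:@/_\",*$#!?^ ()<>+-%'\"[]{}~`0123456789"

def tokenize_specified(text):
    # Split on any run of the specified characters and drop empty tokens.
    pattern = "[" + re.escape(SPLIT_CHARS) + "]+"
    return [t for t in re.split(pattern, text) if t]

print(tokenize_specified("apply: applies (v2.0)"))  # ['apply', 'applies', 'v']
```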

However, after applying Filter Stopwords (English) and Stem (Porter), I noticed what looks like a bug: the results are not stemmed correctly.


For example, "apply" and "applies" remain separate tokens instead of being combined, and, even stranger, a new keyword "appli" appears that never existed in the original documents.
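
For reference, a standard Porter stemmer does map all of these forms to the stem "appli" (shown below with NLTK's implementation; I am assuming RapidMiner's Porter stemmer behaves comparably):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["apply", "applies", "applied"]:
    # The Porter algorithm reduces all of these forms to "appli".
    print(word, "->", stemmer.stem(word))
# apply -> appli
# applies -> appli
# applied -> appli
```

So "appli" by itself looks like normal Porter output; the strange part is that the unstemmed forms also survive.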

However, when I use the tokenizer in 'non letters' mode, the stemming is correct and all variants are grouped under the 'apply' keyword.
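
To show how the two modes can tokenize the same text differently, here is a hypothetical example (again a regex approximation, not the actual operators): a character missing from my specified list, such as ';', stays inside a token under 'specify characters', while 'non letters' splits on it.

```python
import re

SPLIT_CHARS = ".:@/_\",*$#!?^ ()<>+-%'\"[]{}~`0123456789"

def tokenize_specified(text):
    # "specify characters" mode: split only on the listed characters.
    return [t for t in re.split("[" + re.escape(SPLIT_CHARS) + "]+", text) if t]

def tokenize_non_letters(text):
    # "non letters" mode: split on any run of non-alphabetic characters.
    return [t for t in re.split("[^A-Za-z]+", text) if t]

sample = "applies;apply"  # ';' is not in my specified character list
print(tokenize_specified(sample))    # ['applies;apply'] -- one merged token
print(tokenize_non_letters(sample))  # ['applies', 'apply']
```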

Is this a bug, or something else?

How can I resolve this problem?

I prefer the 'specify characters' option because I would like some special characters to be retained.



Thanks.

