Removing StopWords using Dictionary
Hyram
Hi
I am using my own dictionary to remove Stopwords. On close analysis, words like "is" are not being removed, although they are in the dictionary. Any clue as to why this is happening?
Thanks,
Hyram
Tags: AI Studio, Text Mining + NLP
All comments
kayman
Can you share your process? No need to add data, just the process itself.
Hyram
Yes sure, thanks
@kayman
Attached
For the dictionary, I am using NLTK stopwords. Not sure if my encoding setting is right?
kayman
The process flow seems correct at first glance, so just some additional questions:
- How do you do word tokenization? If this is set up incorrectly, you might still be taking full sentences as single tokens.
- Do you transform to upper or lower case? Since you are looking for 'is', I assume lower case.
- Next you filter by length. As 'is' contains only 2 characters, I assume you keep everything that is at least 2 characters long; if not, 'is' would already be stripped at this step, so again this is linked to how you do your word tokenization.
- How is your dictionary constructed? Is every stopword on a new line, without any spaces? As you are using the NLTK list, it may contain additional characters that RM doesn't handle well.
You can also use the out-of-the-box 'Filter Stopwords (English)' operator; as far as I know, it's very similar to the NLTK list.
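The checklist above can be sketched in plain Python (not RapidMiner itself, just an illustration of the same steps: tokenize on non-letters, lowercase, length filter, then dictionary lookup). The file name `stopwords.txt` is hypothetical, standing in for the user's dictionary:

```python
import re

def load_stopwords(path):
    """Read a stopword dictionary: one word per line, plain text."""
    with open(path, encoding="utf-8") as f:
        # strip() guards against trailing spaces or \r\n line endings,
        # which would otherwise make 'is ' fail to match 'is'
        return {line.strip().lower() for line in f if line.strip()}

def preprocess(text, stopwords):
    tokens = re.split(r"[^A-Za-z]+", text)       # tokenize on non-letters
    tokens = [t.lower() for t in tokens if t]    # lowercase, drop empties
    tokens = [t for t in tokens if len(t) >= 2]  # keep tokens of length >= 2
    return [t for t in tokens if t not in stopwords]

# Build a tiny example dictionary file, one stopword per line
with open("stopwords.txt", "w", encoding="utf-8") as f:
    f.write("is\nthe\na\n")

stops = load_stopwords("stopwords.txt")
print(preprocess("This is the Word that remains", stops))
# ['this', 'word', 'that', 'remains']
```

Note the `strip()` in `load_stopwords`: if the dictionary file carries invisible characters (trailing spaces, Windows line endings, a BOM), an exact-match filter will silently fail on entries like 'is', which is consistent with the symptom described in the question.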
Hyram
@kayman
thanks for looking. Some answers to your questions:
1. I am using 'non-letters' to tokenise my words, and it seems to work. No full sentences as a result;
2. Correct, I transform to lower case;
3. Correct - I filter by length of 2, i.e. any tokens with fewer than 2 characters are out;
4. You have a good point, as I have not checked this. I basically cut and pasted the list into a Word doc.
I initially used 'Filter Stopwords (English)' but it was excluding words like 'like', which I wanted to keep.
Thanks!
kayman
Yeah, it's a bit tricky sometimes. A word like 'like' can have a big impact on sentiment analysis, for instance, so I personally wouldn't consider it a generic stopword.
What I typically do is combine the out-of-the-box stopwords with my own additions.
Anyway, if 'is' was removed when using the out-of-the-box option but remains with the NLTK doc, there is indeed probably something wrong with the format used and how it's read.
The easiest fix would be to save it as a plain and simple .txt file rather than a .docx file; this way you can be sure nothing is missed or added.
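The advice above (NLTK list minus words you want to keep, plus your own additions, saved as plain text with one word per line) could be sketched like this. The NLTK list is represented by a small hardcoded sample so the snippet runs without `nltk` installed; with NLTK available you would use `nltk.corpus.stopwords.words("english")` instead, and `customword` is a purely hypothetical addition:

```python
# Sample standing in for nltk.corpus.stopwords.words("english")
nltk_sample = ["i", "me", "is", "the", "a", "like", "and"]

keep = {"like"}         # words to keep, e.g. for sentiment analysis
extra = {"customword"}  # hypothetical personal additions

custom_stopwords = sorted((set(nltk_sample) - keep) | extra)

# Write as plain UTF-8 text, one stopword per line, no extra characters,
# so a dictionary-based stopword filter can read it cleanly
with open("stopwords.txt", "w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(custom_stopwords) + "\n")

print(custom_stopwords)
# ['a', 'and', 'customword', 'i', 'is', 'me', 'the']
```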
Hyram
Thanks
@kayman
Really appreciate your help! I will try what the operator notes suggest, which is in line with what you are saying re the txt format.
Hyram
@kayman
Your suggestion re file format worked. Thank you!