Marius wrote:
Hi,
in the post linked in my signature you will find a link to a tutorial site, which also covers text mining. You should have a look at those videos.
Then, you can probably use Process Documents with a Tokenize operator inside it that splits on sentence borders (!?.: etc.). After that, use Filter Tokens to keep only the tokens that contain the word "quality".
The next steps depend a bit on how you define "describe quality". Please come back if you have any further questions, and describe in a bit more detail how the classification should work.
Best,
Marius
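For readers who want to see the tokenize-then-filter idea outside RapidMiner, here is a minimal Python sketch. It is not the RapidMiner process itself: the function name `sentences_mentioning` and the sample review are hypothetical, and the split rule (break after `!`, `?`, `.` or `:` followed by whitespace) is only an assumption about what "sentence borders" means here.

```python
import re

def sentences_mentioning(text, keyword="quality"):
    """Split text into sentence-like tokens and keep those containing keyword."""
    # Break after sentence-ending punctuation (! ? . :) followed by whitespace.
    parts = re.split(r"(?<=[!?.:])\s+", text)
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [p.strip() for p in parts if p.strip() and pattern.search(p)]

# Hypothetical sample text, only for illustration.
review = "The hotel was cheap. Quality of service was poor! Would I return? No."
print(sentences_mentioning(review))
```

Each surviving sentence can then be passed on to whatever step decides how it "describes quality".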
ABSURD, ABSURDITY, ABUSE, ABUSED, ABUSER, ABUSERS, ABUSES, ABUSING, ABUSIVE, ABUSIVELY, ABUSIVENESS, ..., BAD, BADLY, ...
Marius wrote:
Hi,
I don't think a simple word list will give you results as good as a trained model. But let's first deal with the technical issues:
To overcome the negation problem, you can use the n-grams operator, which combines adjacent tokens into new tokens. For example, from "not" and "bad" it would create the new token "not_bad". Furthermore, your list contains a lot of variants of the same word. You can shorten the list by applying a Stemming operator to both the input data and the word list.
Anything beyond this depends on you: how do you want to use the word list? Is a document negative if it contains one element from the bad list? Or 10 elements? Or if more than 5% of its contents is found on the bad list? And without any positive examples, you won't be able to correctly classify documents that are positive but nevertheless contain some words from the list.
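The three ideas above (joining a negation with the next token, stemming both the text and the word list, and scoring a document by the share of its tokens found on the list) can be sketched in plain Python. This is only an illustration under loud assumptions: `crude_stem` is a toy suffix stripper standing in for a real stemmer, `NEGATIONS` is a tiny made-up list, merging only negation pairs is a simplification of a general n-gram step, and all function names are hypothetical.

```python
import re

NEGATIONS = {"not", "no", "never"}  # assumption: tiny illustrative list
SUFFIXES = ("iveness", "ively", "ity", "ing", "ive",
            "ers", "er", "ed", "es", "ly", "s", "e")

def crude_stem(word):
    # Toy stand-in for a real stemmer: strip the first matching suffix,
    # but always keep at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def merge_negations(tokens):
    # Join a negation with the token after it: "not", "bad" -> "not_bad".
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def negative_share(text, bad_words):
    # Fraction of tokens whose stem is on the stemmed bad list.
    bad_stems = {crude_stem(w.lower()) for w in bad_words}
    tokens = merge_negations(tokenize(text))
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if "_" not in t and crude_stem(t) in bad_stems)
    return hits / len(tokens)

BAD_WORDS = ["ABSURD", "ABUSE", "ABUSIVE", "BAD", "BADLY"]  # excerpt of the list
doc_a = "The staff was abusive and the food was bad"
doc_b = "The food was not bad"
for doc in (doc_a, doc_b):
    share = negative_share(doc, BAD_WORDS)
    # The 5% cut-off is one of the possible rules mentioned above, not a recommendation.
    print(doc, "->", "negative" if share > 0.05 else "not negative")
```

Whether the rule should be a share, an absolute count, or a single hit is exactly the open question from the post; the sketch only makes the share variant concrete.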