Problem Processing Data and Filter Stopwords for LDA

User: "LaraNeu"
New Altair Community Member
Updated by Jocelyn
Hi,
I really need the community's help. I have already tried every solution suggested in other community posts about the Filter Stopwords operator, but nothing has worked so far. I have reviews from which I want to extract topics with LDA. I followed tutorials on how to pre-process the data, filter stopwords, etc., but unfortunately it does not seem to work. Despite the Transform Cases operator set to lowercase, I still have words with capital letters in my output, and the stopwords I attached in the .txt file are not filtered out. The Replace Tokens operator does not seem to work either. Because the Filter Tokens (by POS Tags) operator takes a lot of time, I used a sample of only 100 reviews (which can be changed at any time). I also tried the process without Filter Tokens (by POS Tags) and with the whole data set, but that did not work either. I have attached all my files and processes. Could you please help me with my process? Thank you so much!

I am not sure whether this goes too far for one post, but can someone also tell me how to find the ideal number of topics for LDA?

Thank you, Larissa


    User: "jacobcybulski"
    New Altair Community Member
    Accepted Answer
    You have a number of issues in your process. If you do want to use Process Documents from Data to pre-process the text before LDA, then you need to keep the text it generates (the "keep text" option). It also means that LDA must then process the attribute "text" and not your original "Review". "Review" is polynominal and is not automatically of type text, so your intuition to use Nominal to Text was correct, but you need to apply it to "Review". Next, you cannot filter the tokens by POS, as you have not done any stemming and so no POS tags are present (you would need dictionary stemming to get these). Finally, all your stop words would be eliminated by the default English stop word filter anyway, so do you really need your own list? Good luck!
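
    The order of the steps above can be sketched outside RapidMiner in plain Python. This is only a rough illustration, not the RapidMiner process itself: the function name `preprocess`, the tiny stop word set, and the two sample reviews are all made up for the example. It shows that lowercasing has to happen before the stop word filter, and that the topic model must then be given the processed text rather than the original "Review" values.

```python
import re

# Hypothetical stop word list; in the real process this would be the
# default English list or the attached .txt file.
STOPWORDS = {"the", "a", "and", "is", "it", "was", "i", "this"}


def preprocess(review, stopwords=STOPWORDS):
    """Lowercase, tokenize, and drop stop words from one review."""
    lowered = review.lower()                  # transform cases first
    tokens = re.findall(r"[a-z]+", lowered)   # simple letter-only tokenizer
    kept = [t for t in tokens if t not in stopwords]
    return " ".join(kept)                     # the "kept text" for the topic model


# Made-up sample data standing in for the "Review" attribute.
reviews = [
    "The battery is GREAT and it lasts",
    "I was disappointed: the Screen broke",
]

# This list, not `reviews`, is what a topic model should be trained on.
processed = [preprocess(r) for r in reviews]
print(processed)
```

    If the filter ran on the original mixed-case text instead, a token such as "The" would not match the lowercase entry "the" and would survive, which is the same symptom as stop words and capital letters showing up in the LDA output.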