What's the best way to determine the number of topics in the Extract Topics from Data (LDA) operator

New Altair Community Member

May 20, 2020

Updated Nov 5, 2024 by Jocelyn

I have a dataset made of thousands of ways users have listed product names. For example, Apple MacBook, MacBook, MacBookPro, etc. There are all sorts of products included, but I'm trying to group similar ways people have described them into clusters. The Extract Topics from Data operator seems to be doing the trick but I'm manually having to choose the number of groups. Is there a way to determine the number of groups based on similarity? I hope this makes sense.



lionelderkrikor

New Altair Community Member

Accepted Answer

May 20, 2020

Hi @cmoten,

In RapidMiner, as a first approximation, I see the following method (to be confirmed by @mschmitz: the Extract Topics (LDA) operator is Martin's baby...):

Use an Optimize Parameters (Grid) operator and plot the perplexity against the number of topics k.
The lower the perplexity, the better the model.
In the example below, the "optimal" number of topics k is 6:

Image: https://us.v-cdn.net/6030995/uploads/editor/3i/oisvam9pqqx0.png

The attached file contains an example process that finds the optimal number of topics using the Optimize Parameters (Grid) operator.
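For readers who want to see the selection logic outside of RapidMiner, here is a minimal Python sketch of the same idea: compute a perplexity for each candidate k and keep the k with the lowest value. It assumes the usual per-token definition, perplexity = exp(-log-likelihood / number of tokens); whether the operator computes its perplexity in exactly this way is not stated in this thread. The function names and the numbers in `toy_results` are made up for illustration and do not come from the attached process.

```python
import math


def perplexity(log_likelihood: float, n_tokens: int) -> float:
    """Per-token perplexity: exp(-log-likelihood / number of tokens)."""
    return math.exp(-log_likelihood / n_tokens)


def pick_k(results: dict) -> tuple:
    """results maps k -> (log-likelihood, token count) for a model with k topics.

    Returns the k with the lowest perplexity and the full score table,
    which can be plotted to inspect the curve by eye.
    """
    scores = {k: perplexity(ll, n) for k, (ll, n) in results.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Illustrative, made-up values only (not measured on any real dataset).
toy_results = {
    2: (-7200.0, 1000),
    4: (-6900.0, 1000),
    6: (-6700.0, 1000),
    8: (-6750.0, 1000),
    10: (-6800.0, 1000),
}

chosen_k, scores = pick_k(toy_results)
print(chosen_k)
```

Keeping the whole score table, rather than only the minimum, makes it easy to check whether the curve has a clear minimum or is nearly flat across several values of k.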

Regards,

Lionel

Extract_Topics_optimal_k.rmp

cmoten

New Altair Community Member

May 21, 2020

Thank you so much for the example. This helps a lot. It looks like you are splitting the text on commas and saving the pieces as columns. You then flip the data around so the columns are listed as rows, and rename the last column to “text”. Finally, you append all the individual example sets into one.
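The reshaping described above can be sketched in Python roughly as follows. This is only an illustration of the steps as described in this reply (split on commas, one value per row, a single column named "text"), not a translation of the attached .rmp process; `rows_to_documents` and the sample strings are hypothetical.

```python
def rows_to_documents(lines):
    """Split each comma-separated line and return one {"text": ...} row per value."""
    docs = []
    for line in lines:
        for part in line.split(","):
            part = part.strip()
            if part:  # skip empty pieces such as those left by trailing commas
                docs.append({"text": part})
    return docs


# Hypothetical input, loosely modelled on the product names in the question.
sample = [
    "Apple MacBook, MacBook, MacBookPro",
    "iPhone 11, iphone11,",
]

docs = rows_to_documents(sample)
for row in docs:
    print(row["text"])
```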

The Optimize Parameters operator determines that the optimal number of topics is 6, but the number of topics shown on the Extract Topics from Data operator is still 10. The results from Optimize Parameters are being passed through as a parameter for Extract Topics. I think I get how it works.

I tried applying it to my dataset and initially received an error. I think the overall size was too large, so I took a sample of the data and it worked. The results didn’t get me what I was looking for, but I now have another process to add to my tool belt. I’ll keep experimenting with it. Thanks again for the help.


LaraNeu

New Altair Community Member

Jan 9, 2021

Hi, I am so happy I found this post, as I need to find the optimum number of topics for my LDA analysis. Thank you for the process! I ran it on multiple datasets to test it, but strangely the result is always 5 topics for any dataset I use. Am I doing anything wrong? Do I have to adjust something in the process besides changing the dataset? Please let me know if you can help. Thanks a lot!


MartinLiebig

Altair Employee

Jan 11, 2021

Hi @LaraNeu,

That is a somewhat tricky question. Perplexity gives you a hint where to look, but sometimes you just need to check for yourself, because there are sometimes simply multiple 'correct solutions'. My prime example is the article I wrote here: https://towardsdatascience.com/topic-mining-on-amazon-reviews-ae76fc286c61 . With a low number of topics you get a 'hot beverages' topic. With more topics it splits into tea and coffee. Both make sense, but you need to decide what you want.

Metric-wise, I am a fan of exclusivity because I have gotten better at interpreting it.
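To make the idea of exclusivity concrete, here is a minimal Python sketch under one common definition: a word's exclusivity to a topic is its probability in that topic divided by the sum of its probabilities across all topics, and a topic's score is the mean over its top words. This is an assumption for illustration; the thread does not say which formula the operator uses, and the function name and toy distributions are hypothetical.

```python
def topic_exclusivity(topic_word, top_n=2):
    """topic_word: one dict per topic, mapping word -> p(word | topic).

    Returns one score per topic: the mean of p(w | k) / sum_j p(w | j)
    over the topic's top_n most probable words. Values near 1 mean the
    top words appear almost only in that topic.
    """
    totals = {}
    for dist in topic_word:
        for word, p in dist.items():
            totals[word] = totals.get(word, 0.0) + p

    scores = []
    for dist in topic_word:
        top = sorted(dist, key=dist.get, reverse=True)[:top_n]
        scores.append(sum(dist[w] / totals[w] for w in top) / len(top))
    return scores


# Toy distributions (made up), echoing the tea / coffee split mentioned above.
toy_topics = [
    {"tea": 0.5, "cup": 0.3, "hot": 0.2},
    {"coffee": 0.5, "cup": 0.3, "hot": 0.2},
]

excl = topic_exclusivity(toy_topics, top_n=2)
print(excl)
```

In this toy case "tea" and "coffee" are fully exclusive to their topics, while "cup" is shared equally, which pulls each topic's score down.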

Best,

Martin


Muhammed_Fatih_

New Altair Community Member

Feb 17, 2021

Hi @lionelderkrikor,
Hi @mschmitz,

First of all, thank you for your contributions! That is a very interesting approach!

I am interested in the question of to what extent additional quality measures besides perplexity can be considered in RapidMiner, in order to have a more holistic basis for deciding on the optimal number of topics. As you mentioned, optimization problems oftentimes do not have one single solution.

Thank you in advance for your feedback!

Best regards,

Fatih


Muhammed_Fatih_

New Altair Community Member

Mar 30, 2021

Dear community,

Can somebody give feedback on the above-mentioned question regarding evaluation measures for determining the optimal number of topics?

Best regards,

Fatih

