Can I run an LDA model and emotion analysis on Chinese text with RapidMiner?

Polly New Altair Community Member
edited November 5 in Community Q&A
Hi everyone,

I am a newbie here and this is my question.

I need to apply a Latent Dirichlet Allocation (LDA) model and emotion analysis to Chinese text, but I don't know whether RapidMiner can do this, or which additional extensions I would need to install.
I have already searched the discussions about Chinese/Mandarin and installed the Hanminer extensions mentioned in one of them. However, I don't think the Hanminer extensions are enough for both analyses, and no one seems to have asked this question before.

Please give me some suggestions. Any ideas would be much appreciated!

Best,
Polly

Answers

  • MartinLiebig
    Altair Employee
    Hi,
    from my understanding, it should work. But @yyhuang is our Mandarin expert.

    Cheers,
    Martin
  • Polly
    Polly New Altair Community Member
    Hi Martin @mschmitz,

    Thank you for your reply.
    I read other discussions about LDA and, just to make sure: if I want to fit a Latent Dirichlet Allocation model, is 'Linear Discriminant Analysis' the operator I should use? Or is it the 'Extract Topic from Data' operator that most people mention in those discussions?

    Also, which operator should I use for emotion analysis? Is it Singular Value Decomposition (SVD)?

    Besides, in another LDA discussion where the process showed no results, you asked: "is this 'western' text? LDA uses a default tokenization on this tokens like spaces and so on. This may totally fail if this is not in latin alphabet?" So I guess the language of the text has a great influence on the results. To analyze Chinese text, are there any extensions I need to install, or operators I need to combine?

    Sorry for the large number of questions. I would much appreciate any advice you could give me. Thanks in advance!

    Regards,
    Polly

  • MartinLiebig
    Altair Employee
    Hi @Polly,
    the operator you want to use is Extract Topics from Data, not Linear Discriminant Analysis.

    And yes, LDA tokenizes internally. I just realized that the default tokenization splits on \s and cannot be changed, so I guess it is very hard to apply to Mandarin. As I said, I only speak German and English and am not an expert on tokenizing Mandarin/Cantonese, so I don't know whether it would even help if I offered the tokenization as an option.

    Cheers,
    Martin
  • Polly
    Polly New Altair Community Member
    Hi Martin, 

    Thank you for your help :smiley:
    I hope @yyhuang can give me some advice on this.

    Cheers,
    Polly
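
Martin's point about splitting on \s can be illustrated outside RapidMiner. The Python sketch below is not the operator's actual code; it only mimics a whitespace-based tokenizer to show why unsegmented Chinese text collapses into a single token. The helper names (`whitespace_tokens`, `space_out_cjk`) and the sample sentences are made up for illustration, and the character-level workaround is a naive assumption, not a recommended segmentation method.

```python
import re


def whitespace_tokens(text):
    """Split on runs of whitespace, mimicking a tokenizer that only knows \\s."""
    return [t for t in re.split(r"\s+", text) if t]


# Basic CJK Unified Ideographs block (hypothetical, simplified coverage).
_CJK = re.compile(r"([\u4e00-\u9fff])")


def space_out_cjk(text):
    """Naive pre-segmentation: put spaces around every CJK character
    so that a whitespace tokenizer sees one token per character."""
    spaced = _CJK.sub(r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()


english = "I like text mining"
chinese = "我喜欢文本挖掘"  # the same kind of sentence, written without spaces

print(whitespace_tokens(english))                  # four separate words
print(whitespace_tokens(chinese))                  # one token: the whole sentence
print(whitespace_tokens(space_out_cjk(chinese)))   # one token per character
```

If the text is segmented before it reaches the topic model (by a proper Chinese word segmenter rather than the per-character split used here) and the tokens are re-joined with spaces, a whitespace tokenizer would at least receive usable tokens. Whether that is sufficient for the Extract Topics from Data operator is not confirmed in this thread.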