How to use Word2Vec and LSTM to classify sequences of tokens

desa · New Altair Community Member · edited November 5 in Community Q&A
As a first step towards building a chatbot in RapidMiner (for educational purposes), I am trying to loop over a collection of documents with tokenized texts. For each document, which contains a sequence of tokens, I want to 1) translate each token into a word2vec embedding (learned over the complete collection of documents) and then 2) pass the resulting embedded tokens to an RNN (using the Deep Learning extension's LSTM layer).
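To make the intended pipeline concrete, here is roughly what I am after, sketched in plain Python (gensim and Keras are just stand-ins for illustration; I want to build the equivalent with RapidMiner operators):

```python
# Minimal sketch of the intended pipeline (gensim + Keras assumed for illustration).
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
import numpy as np

# Each document is a list of tokens; labels mark the speaker.
docs = [["o", "romeo", "romeo", "wherefore", "art", "thou", "romeo"],
        ["what", "light", "through", "yonder", "window", "breaks"]]
labels = np.array([1, 0])  # 1 = Juliet, 0 = Romeo

# 1) Learn word2vec embeddings over the complete collection.
w2v = Word2Vec(sentences=docs, vector_size=50, min_count=1)

# 2) Translate each token into its embedding; pad sequences to a fixed length.
max_len = max(len(d) for d in docs)
X = np.zeros((len(docs), max_len, 50))
for i, doc in enumerate(docs):
    for t, token in enumerate(doc):
        X[i, t] = w2v.wv[token]

# 3) Pass the embedded token sequences to an LSTM classifier.
model = Sequential([LSTM(32, input_shape=(max_len, 50)),
                    Dense(1, activation="sigmoid")])
model.compile(loss="binary_crossentropy", optimizer="adam")
model.fit(X, labels, epochs=5)
```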

I cannot figure out how to design such a process. One problem I keep stumbling upon is the inability to pass something into a nested process. E.g., once I have learned the word2vec embedding and want to loop over the collection of documents, I need to pass the word2vec model into the Loop operator. Another challenge is that I then need to loop over the individual tokens of a single document, apply the word2vec model to translate each token into its embedding, and pass the embedded tokens on to a Deep Learning process containing the LSTM layer. I keep getting errors because inputs don't match what is expected: e.g., a collection of documents passed to a Loop operator delivers individual documents inside the loop to a Document Windowing operator, which then complains that this is not the right input.

Has anyone done anything like this before and could share their process? My first attempt is to connect the LSTM layer to a fully connected layer and then classify each input document according to the person who spoke its content. I am using Romeo and Juliet, where I extracted all passages spoken by each of the two. The goal is to use the RNN to classify texts as spoken by either Romeo or Juliet, and to compare the performance to a more traditional approach using TF-IDF vectorisation of the documents.
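For the TF-IDF baseline, I have something like the following in mind (a scikit-learn sketch, just to make the comparison concrete; the two quotes stand in for the extracted passages):

```python
# Hypothetical TF-IDF baseline for the Romeo-vs-Juliet comparison (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["O Romeo, Romeo, wherefore art thou Romeo?",
         "But soft, what light through yonder window breaks?"]
labels = [1, 0]  # 1 = Juliet, 0 = Romeo

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["wherefore art thou"]))
```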

Looking forward to any bit of help :smile:

Answers

  • pschlunder · New Altair Community Member · edited April 2020
    Hey,
    I've put together a small process. Mind you, it will not have enough samples for proper training, but it should help you set up the process yourself. I guess you need to convert your text so that each word you want to convert ends up in its own attribute; that way, the "Apply Word2Vec Model" part converts all of the words at once.
    Furthermore, there is a new "ExampleSet To Tensor" operator in version 0.9.3 of the Deep Learning extension that allows you to convert an ExampleSet into a tensor. With that, you can convert word2vec representations into tensors and start training your LSTM.
    For this, you need two ID columns: one indicating which sequence certain Examples/rows belong to (here, which sentence), and one indicating each step within a sequence (here, the words). I've annotated the process, hope it helps you somehow; a small sketch of the ID-column layout follows below.
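    In plain pandas/NumPy terms, the two-ID-column layout amounts to the following (an illustrative sketch only; the column names are made up, and the operator does this conversion for you):

    ```python
    # Illustration of the layout the "ExampleSet To Tensor" operator expects
    # (pandas/NumPy sketch; column names are hypothetical).
    import numpy as np
    import pandas as pd

    # One row per word: sequence_id says which sentence a row belongs to,
    # step_id says the position of the word within that sentence,
    # e0/e1 stand in for the word2vec embedding dimensions.
    examples = pd.DataFrame({
        "sequence_id": [0, 0, 0, 1, 1],
        "step_id":     [0, 1, 2, 0, 1],
        "e0":          [0.1, 0.2, 0.3, 0.4, 0.5],
        "e1":          [0.9, 0.8, 0.7, 0.6, 0.5],
    })

    # Group by sequence and stack the steps -> tensor of shape
    # (num_sequences, max_steps, embedding_dim), padding short sequences with zeros.
    n_seq = examples["sequence_id"].nunique()
    max_steps = examples.groupby("sequence_id").size().max()
    tensor = np.zeros((n_seq, max_steps, 2))
    for (seq, step), row in examples.set_index(["sequence_id", "step_id"]).iterrows():
        tensor[seq, step] = row.values
    print(tensor.shape)  # (2, 3, 2)
    ```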

    Please note that you can also read in an existing Word2Vec model using the "Read Word2Vec" operator. For most scenarios this should be fine, as long as all your tokens are part of the huge corpus used to train the initial Word2Vec model (you can download it here).
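    Outside of RapidMiner, loading such a pretrained model would look roughly like this (a gensim sketch; the file name assumes the common Google News vectors, so adjust it to whichever model you downloaded):

    ```python
    # Sketch of loading a pretrained Word2Vec model outside RapidMiner
    # (gensim assumed; the file path is a placeholder).
    from gensim.models import KeyedVectors

    # The Google News vectors ship in binary word2vec format.
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # Look up the embedding for a single token, skipping out-of-vocabulary words.
    token = "romeo"
    if token in vectors:
        print(vectors[token][:5])  # first 5 of 300 dimensions
    ```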

    Hope this helps,
    Philipp