Keep unique ID when tokenizing

Question

When reading a CSV file with two columns, ID and MESSAGE, is it possible to keep the ID field when using the operator Process Documents from Data?

I use this operator to tokenize messages, but I want to keep the relation between the words and the message via the unique ID column.

So when I tokenize the following sentence:

ID                                   sentence
1                                    Rapidminer rocks the world!

I want this result:
ID    word
1     Rapidminer
1     rocks
1     the
1     world
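
To make the desired output concrete, here is a minimal sketch in plain Python (not a RapidMiner process). The function name `tokenize_with_id` and the letters-only token pattern are assumptions made for illustration; a real tokenizer may split differently.

```python
import re


def tokenize_with_id(rows):
    """Return (id, token) pairs, one per word, keeping the source row's ID."""
    pairs = []
    for row_id, message in rows:
        # Keep runs of letters as tokens; punctuation and digits are dropped.
        for token in re.findall(r"[A-Za-z]+", message):
            pairs.append((row_id, token))
    return pairs


rows = [(1, "Rapidminer rocks the world!")]
for row_id, token in tokenize_with_id(rows):
    print(row_id, token)
```

Each output pair repeats the ID of the message the word came from, which is the relation asked for above.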

Skirzynski · Answer

Hey,

Please do not attach your process and data as screenshots; they do not help us reproduce your problem. Please read this posting, which explains how to provide a process as XML. You can also use the code tags to attach a small sample of data (part of your real data, or artificial data that shows the same problem) that does not work as expected.

Regarding your question: the "Process Documents from Data" operator yields a word vector in which each row represents a message, and a value greater than zero for a word indicates that the message contains that word. And as I said, the other columns are usually retained. If you post your process, I think this will clarify a lot.
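
As a rough illustration of what such a word vector looks like, here is a small Python sketch. It is not how the operator is implemented: the function `word_vectors`, the lowercasing, and the use of raw occurrence counts as cell values are all assumptions, and the operator's own weighting may differ.

```python
import re
from collections import Counter


def word_vectors(rows):
    """Build one row per message: the ID plus a count for every vocabulary word."""
    counts = {
        row_id: Counter(re.findall(r"[a-z]+", message.lower()))
        for row_id, message in rows
    }
    # The vocabulary is the union of all words seen in any message.
    vocabulary = sorted({word for c in counts.values() for word in c})
    # A Counter returns 0 for words that a message does not contain.
    table = [
        {"ID": row_id, **{word: c[word] for word in vocabulary}}
        for row_id, c in counts.items()
    ]
    return vocabulary, table


rows = [(1, "Rapidminer rocks the world!"), (2, "The world rocks")]
vocabulary, table = word_vectors(rows)
for row in table:
    print(row)
```

Every row still carries its ID, and a value greater than zero in a word column means that message contains the word.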

Marcin

OCA · Answer

Hi Marcin,

Thanks for your reply.

See links below for screenshots:
http://postimage.org/image/4paepjga3/
http://postimage.org/image/s5eadmbh1/
http://postimage.org/image/8c3c90ahz/
http://postimage.org/image/pvvx4p4br/

Let's say I have 80 000 messages from different users, posted over the course of one year. Now I want to analyze which subjects were hot in a certain time frame among a selected set of users of a certain age. I want to do this with another data visualisation tool in which I can make selections on the fly. To be able to do this, I need the relation between the message, the user, and the time it was posted.

Now when I tokenize all 80 000 messages, I get one set with the most frequent words, but nothing links each word to the message it came from; I only get the total count. Is there some way to keep the relation with the message?

Skirzynski · Answer

The "Process Documents from Data" operator actually keeps the other columns in its result. Besides, the result of this operator is not the table you posted but a word vector, i.e. an example set with an attribute for every word.
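
If the ID column is kept alongside the word attributes, the per-message relation can be read back from the non-zero cells. The sketch below shows that reshaping step in plain Python on made-up data; the function `unpivot`, the column name "ID", and the sample rows are hypothetical and not taken from any posted process.

```python
def unpivot(table, id_column="ID"):
    """Turn word-vector rows into (id, word, value) triples for non-zero cells."""
    triples = []
    for row in table:
        for column, value in row.items():
            # Skip the ID column itself and words the message does not contain.
            if column != id_column and value > 0:
                triples.append((row[id_column], column, value))
    return triples


# Hypothetical word vector: one row per message, one column per word.
table = [
    {"ID": 1, "rapidminer": 1, "rocks": 1, "the": 1, "world": 1},
    {"ID": 2, "rapidminer": 0, "rocks": 1, "the": 1, "world": 1},
]
for triple in unpivot(table):
    print(triple)
```

The resulting triples can be joined back to user and timestamp columns on the ID in whatever tool is used for the analysis.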

Can you post your process and a small fraction of your data to clarify your problem?