WordList -> Document Operator?

benjamin_peters New Altair Community Member
edited November 5 in Community Q&A

I'm trying to batch process a large group of individual text files so that I can tokenize them. I'm using the Text Processing operator group. I'm processing the files into a single WordList, which I'm then trying to tokenize. Before I can tokenize, I apparently need to convert the WordList into a document, but there doesn't appear to be a Generate Document operator, which is what Quick Fix recommends.

 

Any ideas?

 

Sorry for the beginner's question - I'm brand new to this.

 

Very respectfully,

Ben

Best Answer

  • IngoRM
    IngoRM New Altair Community Member
    Answer ✓

    Hi Ben,

     

    No worries - we all started at some point :)

     

    The wordlist is actually the final result of the text processing operators, i.e. it is produced after all the necessary text processing, such as tokenization, has been done. All those steps happen "inside" the text processing operator. Do you see the little icon in the bottom right corner of the operator? It indicates that you can go "inside" this operator with a double click.
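
    To make the order of steps concrete, here is a minimal sketch in plain Python. It is not RapidMiner's implementation; the function names `tokenize` and `build_word_list` and the two sample documents are made up for illustration. It only shows that tokenization runs on the documents first and the word list comes out at the end.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_word_list(documents):
    """Tokenize each document, then count in how many documents each word occurs."""
    doc_counts = Counter()
    for text in documents:
        # set() so that a word is counted once per document
        doc_counts.update(set(tokenize(text)))
    return dict(sorted(doc_counts.items()))

# Hypothetical input: in practice these would be the contents of the text files.
documents = ["The cat sat.", "The dog sat down."]

# The word list is the output of the processing, not its input.
word_list = build_word_list(documents)
print(word_list)
```

    So there is nothing to convert back: the tokenizing belongs before the word list, on the documents themselves.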

     

    I think it is probably easier if you follow along with one of these resources (there are tons more if you search on Google):

     

    https://rapidminer.com/resource/text-mining-rapidminer/

    https://www.youtube.com/watch?v=6EyQ2TWYsVw

    http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-loading.html

     

    So what is the point of the wordlist then?  It makes sure that you use exactly the same words (and only those) for scoring as for training.  Keeping the two in sync is actually kind of annoying in R, for example, which is why I really prefer to do text analytics in RapidMiner...
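
    As a rough illustration of that idea (again plain Python, not what RapidMiner does internally; `fit_word_list`, `vectorize` and the sample texts are assumptions made up for this sketch), the word list is built once from the training documents and then reused unchanged when new text is scored:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def fit_word_list(training_docs):
    """Collect the sorted set of words seen in the training documents."""
    return sorted({tok for doc in training_docs for tok in tokenize(doc)})

def vectorize(text, word_list):
    """Count only the words in the fixed word list; unseen words are dropped."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in word_list]

# Hypothetical training data
training_docs = ["good movie", "bad movie"]
word_list = fit_word_list(training_docs)

# Scoring a new text: "film" is not in the word list, so it is ignored,
# and the vector has one entry per word in the training word list.
new_vector = vectorize("good good film", word_list)
print(word_list, new_vector)
```

    Because every vector is built against the same word list, the columns at scoring time line up with the columns the model was trained on.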

     

    Cheers,

    Ingo
