"Text Mining Generate n-Grams giving me bad results"

pbailey New Altair Community Member
edited November 5 in Community Q&A

First-time user of RapidMiner, so be gentle.

I have a file of support call notes that I'm trying to text mine to get the most-used 2-word phrases. I've watched a couple of videos and read a couple of posts on how to do this, so I think I have everything set up correctly (but maybe not, since it's not working). Before adding Generate n-Grams, the process returns single words just fine. After I add Generate n-Grams with a max length of 2, the results look bad. The screen caps below give a peek into my setup and results.

 

The Process:

 https://photos.app.goo.gl/ue98yXSvkeMKbuzq9

The results:

 https://photos.app.goo.gl/nJADtcHLDeMH2wMk9

Any help or direction would be greatly appreciated.

Answers

  • sgenzer
    Altair Employee

    Hello @pbailey - welcome to the community. Don't worry...we are actually very gentle here! 

     

    In general, the best way for us to help is for you to post your process XML and at least a little bit of your data (if it's sensitive, people use "dummy" data). That way we can actually run the process, tweak it, and share it with others. You can find instructions on how to do this here.

     

    Looking at your images, I honestly think that it is working. Why do you think it isn't? My hunch is that you have a lot of "junk" tokens, like "aaacds" and "aaba", that you'll probably want to filter out in order to get better results. That's easy to do: just use the "Filter Tokens (by Content)" operator. You may want to play around with the parameters and use the "matches" method with regular expressions. For example:

     

    [Screenshot: Screen Shot 2018-11-02 at 10.19.30 AM.png]

     

    This will filter OUT any token that starts with the letters "aa". Regular expressions are VERY helpful in text mining. :)
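
    If it helps to see the logic outside RapidMiner, here is a rough Python sketch of the same idea. It is only an illustration, not how the operators are implemented: the pattern `aa.*` is an assumption (the exact expression from the screenshot isn't reproduced in the text), the sample notes are made up, and joining the two words with an underscore is just a convention picked for this example. `fullmatch` is used on the assumption that "matches" tests the whole token.

```python
import re
from collections import Counter

# Assumed pattern: any token that starts with "aa" (hypothetical, see note above).
JUNK = re.compile(r"aa.*")


def drop_junk(tokens):
    """Keep only the tokens that do NOT fully match the junk pattern."""
    return [t for t in tokens if not JUNK.fullmatch(t)]


def bigrams(tokens):
    """Pair each token with its right-hand neighbour to form 2-word phrases."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


# Made-up support notes standing in for the real data.
notes = [
    "customer reset password aaacds",
    "aaba customer reset password again",
]

counts = Counter()
for note in notes:
    tokens = drop_junk(note.lower().split())  # tokenize, then filter junk
    counts.update(bigrams(tokens))            # count 2-word phrases

for phrase, n in counts.most_common(3):
    print(phrase, n)
```

    Note that filtering before building the phrases means the words on either side of a removed junk token become neighbours, so the order of the two steps matters.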

     

    Good luck!

     

    Scott