autotagging and autocategorizing text pieces

mayageudens
mayageudens New Altair Community Member
edited November 5 in Community Q&A

Hello RapidMiner community!

First of all, thank you for taking the time to read my question. Secondly, I apologize for my ignorance. I am totally new to data mining, and I have looked around the community but did not find any other post answering my question. Perhaps this is because of my lack of knowledge. Okay, so this is my problem:

 
I have around 5000 text pieces. I have categorized and tagged them. I want to build a rulebook that can autotag and autocategorize new text pieces. I have about 600 tags and about 20 categories. Every snippet can have several tags but only one category. Specifically, I want:

- to analyze the text so I can automatically give the snippet the correct tags (up to 4) from a list I have made myself.
- to analyze the text (or analyze the tags, whichever is easier) and find rules for putting each snippet in a category automatically.


I have no idea how to even begin this process, and I would be forever grateful if someone would be willing to guide me through it!

Answers

  • kypexin
    kypexin New Altair Community Member

    Hi @mayageudens

     

    I could advise you on the second part, text categorizing (I have done this before in a big project for categorizing web sites based on their content and detecting restricted categories like adult, drugs, weapons, etc.), though I am not ready at this moment to advise on tagging the texts, as it seems to be a pretty different task that I haven't ever approached.

     

    1. Start with installing the "Text Processing" RapidMiner extension from the Marketplace, as this is gonna be the main tool for you.

    2. Study the operators "Process Documents from Files" and "Process Documents from Data", and pick one depending on the way your text data is stored. I actually used the first one, as I had all the data stored in text files which were then read by this operator.

    3. The important thing is that you have to vectorize the text data before classification. I used TF-IDF for creating word vectors from the text files.

     

    4. For classifying text documents I found the simplest k-NN classification algorithm could produce really good results. 
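For readers who want to see what steps 3 and 4 amount to outside RapidMiner, here is a minimal sketch in plain Python of TF-IDF vectorizing followed by k-NN voting on cosine similarity. It is not the process from the screenshots: the tokenizer, the smoothed IDF formula, the class name `KnnTextClassifier` and the toy event texts are all assumptions made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic runs only (a deliberately crude tokenizer).
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()

def fit_idf(token_lists):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    # Term frequency times IDF, L2-normalised; terms unseen in training are dropped.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

class KnnTextClassifier:
    def __init__(self, k=3):
        self.k = k

    def fit(self, texts, labels):
        token_lists = [tokenize(t) for t in texts]
        self.idf = fit_idf(token_lists)
        self.vectors = [tfidf_vector(tokens, self.idf) for tokens in token_lists]
        self.labels = list(labels)
        return self

    def predict(self, text):
        # Rank training documents by similarity and let the k nearest vote.
        query = tfidf_vector(tokenize(text), self.idf)
        ranked = sorted(zip(self.vectors, self.labels),
                        key=lambda pair: cosine(query, pair[0]), reverse=True)
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]

# Made-up toy data; real input would be the ~5000 categorized text pieces.
train_texts = [
    "techno dj set all night in the club",
    "house party with a guest dj",
    "pottery workshop for beginners",
    "learn to paint in this weekend workshop",
    "live jazz concert with a quartet",
    "rock band concert on the main stage",
]
train_labels = ["party", "party", "workshop", "workshop", "concert", "concert"]

clf = KnnTextClassifier(k=3).fit(train_texts, train_labels)
print(clf.predict("late night dj party"))  # party
print(clf.predict("jazz concert"))         # concert
```

With 20 categories and thousands of documents, the value of `k` and the tokenizer would need tuning; the sketch only shows how the pieces fit together.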

     

    Here are also some screenshots from the process I used for the task. This doesn't mean that simply copying the structure will do the trick on your data, but at least it can give you many hints about how to approach the problem.

     

    Whole process: 

     

    Screenshot 2017-11-17 13.04.09.png

     

    Process documents from files: 

     

    Screenshot 2017-11-17 13.04.18.png

     

    Vectorizing settings: 

     

    Screenshot 2017-11-17 13.04.29.png

     

    Labelling and files structure (I used a separate directory for storing documents for each category): 

     

    Screenshot 2017-11-17 13.04.44.png
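The directory-per-category layout described above can be read with a few lines of Python, shown here as a rough sketch rather than a copy of the RapidMiner setup. The function name `load_labelled_documents`, the `.txt` extension and the demo files are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

def load_labelled_documents(root):
    """Return (texts, labels), taking each sub-directory name as the category label."""
    texts, labels = [], []
    for category_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for doc_path in sorted(category_dir.glob("*.txt")):
            texts.append(doc_path.read_text(encoding="utf-8"))
            labels.append(category_dir.name)
    return texts, labels

# Demo on a throwaway directory tree with made-up content.
with tempfile.TemporaryDirectory() as root:
    for category, name, body in [
        ("concert", "a.txt", "live jazz quartet"),
        ("party", "b.txt", "techno dj set"),
        ("party", "c.txt", "house night"),
    ]:
        folder = Path(root) / category
        folder.mkdir(exist_ok=True)
        (folder / name).write_text(body, encoding="utf-8")
    texts, labels = load_labelled_documents(root)

print(labels)  # ['concert', 'party', 'party']
```

The appeal of this layout is that the label never has to be stored inside the document: moving a file to another folder is enough to relabel it.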

     

    Cross validation: 

     

    Screenshot 2017-11-17 13.05.11.png
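As a rough illustration of what a cross-validation step does, the sketch below splits the data into k folds, trains on k-1 of them, tests on the remaining one, and averages the accuracy. The helper names and the majority-class placeholder learner are assumptions for the example; in practice the learner would be the text classifier, and the data would usually be shuffled or stratified before splitting.

```python
from collections import Counter

def k_fold_indices(n_items, k):
    # Item i goes to fold i % k, so fold sizes differ by at most one.
    # Assumes n_items >= k, so that no fold is empty.
    folds = [[] for _ in range(k)]
    for i in range(n_items):
        folds[i % k].append(i)
    return folds

def cross_validate(texts, labels, train_fn, k=5):
    """Average accuracy over k folds; train_fn(texts, labels) returns a predict(text) callable."""
    accuracies = []
    for test_idx in k_fold_indices(len(texts), k):
        held_out = set(test_idx)
        train_texts = [t for i, t in enumerate(texts) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        predict = train_fn(train_texts, train_labels)
        correct = sum(predict(texts[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

def train_majority_baseline(train_texts, train_labels):
    # Placeholder learner: always predicts the most frequent training label.
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda text: majority

# Made-up data: 8 "party" documents followed by 2 "concert" documents.
texts = [f"doc {i}" for i in range(10)]
labels = ["party"] * 8 + ["concert"] * 2
score = cross_validate(texts, labels, train_majority_baseline, k=5)
print(round(score, 2))  # 0.8
```

A majority-class baseline like this is also a useful reference point: a real classifier that cannot beat it has not learned anything from the text.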

     

    I am also attaching slides about the whole project, which I presented at the RapidMiner Wisdom 2015 conference in Ljubljana. Maybe this also might be a source of some knowledge :)

     

     

  • Telcontar120
    Telcontar120 New Altair Community Member

    Agreed that the full scope of everything you have requested would be quite a complicated project, and quite likely beyond the scope of a forum answer.   Thanks to @kypexin for a great starting point of resources!

    A few additional comments/questions for your consideration:

    1. What is the purpose of the tagging as opposed to the classification?  It is possible (and in many cases preferable) for machine learning to do the classification component without the tagging (such as the k-NN example already given).  Is tagging really needed, or is it simply an intermediate step to help a human?  If the algorithm can do classification without tagging, is it necessary?
    2. Do you really need 20 separate categories?  The more categories, the harder it will be for any classification model.  Could you simplify your categories to reduce the number?
    3. Do you really need a "rulebook" type of classification?  That will restrict the machine learning algorithms to tree or rule-based learners.  Many other algorithms provide good results for text classification, such as SVM, k-NN, neural nets, and even Naive Bayes, but they will not produce "rulebooks" that are human-interpretable.

      

  • mayageudens
    mayageudens New Altair Community Member

    Thank you for answering!

    I realize now this project is maybe too big for me to handle or to set up. I will give you guys a little more information. I have a website that takes information about a bulk of events and categorizes and tags them. You can see the website here: http://findout.be/.
    - As you can see, the tagging is really necessary. The category in itself is not enough to give people enough information about the event.
    - Sadly, it is also impossible for me to simplify the categories. However, every event takes place in a venue, and every venue has only about 3 possible categories (a club would almost never organize a workshop). Perhaps this will help me along?
    - I really don't need a 'rulebook' if it is possible to set up this system and link it to my website database.
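One way the venue observation above could be used, sketched in Python under several assumptions: some classifier has already produced a score per category, and a venue-to-categories lookup exists. The function `pick_category`, the venue names and the scores are all hypothetical.

```python
def pick_category(scores, venue, venue_categories):
    """Pick the best-scoring category among those the venue is known to host."""
    allowed = venue_categories.get(venue)
    candidates = {c: s for c, s in scores.items() if c in allowed} if allowed else {}
    if not candidates:
        # Unknown venue (or no overlap): fall back to the unrestricted best score.
        candidates = scores
    return max(candidates, key=candidates.get)

# Hypothetical lookup table and classifier scores.
venue_categories = {"Club X": {"party", "concert", "festival"}}
scores = {"workshop": 0.41, "party": 0.38, "concert": 0.21}

print(pick_category(scores, "Club X", venue_categories))        # party
print(pick_category(scores, "Unknown Hall", venue_categories))  # workshop
```

Restricting the choice to about 3 categories per venue turns a 20-way decision into a much easier one, which partly addresses the concern about the number of categories.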

    What do you guys think will be the best way to achieve this? I think I have realized I need help. I would be okay with spending some money on this, but my budget is very, very limited.
    I truly appreciate the help you've already given me!

  • Telcontar120
    Telcontar120 New Altair Community Member
    Yep, this seems like it is more complicated than what you would get in terms of community support, unless you are planning to do a lot of the underlying work yourself.
    One other option you have would be to post this as a project in the RapidMiner Experfy data science channel: https://www.experfy.com/channels/rapid-miner/marketplaces
    There you can post a brief project description and your requirements, provide some sample data, state your budget and timeframe, and invite qualified data scientists to bid on the project. You'll probably be pleasantly surprised as to what you can get there.