Classifying English Articles Based on Difficulty

alaa_albarazi (New Altair Community Member)
edited November 5 in Community Q&A

Hi, the Common European Framework of Reference for Languages (CEFR) categorises language difficulty into three main level groups, A, B, and C, and each group has two sublevels. The levels are A1 (Beginner), A2 (Elementary), ..., C2 (Mastery).

I have thousands of documents that I need to group by difficulty level using RapidMiner or Python. One idea is to take a list of the most commonly spoken words and measure how closely the words in an article match, for example, the most common 1000 words. But this approach ignores grammatical difficulty. In addition to word difficulty, I need to add part-of-speech tagging for each article and the length of each sentence, and then find a way to classify the article as easy or difficult. It would be great if there is a ready-to-use library that can do this.
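The approach described above can be sketched in plain Python. This is only an illustration: the common-word set is a tiny placeholder (a real frequency list of ~1000 words would replace it), and the thresholds are made-up values, not calibrated to CEFR levels.

```python
import re

# Placeholder for a real top-1000 frequency list (assumption for illustration).
COMMON_WORDS = {
    "the", "a", "is", "cat", "on", "sat", "mat", "and", "it", "was",
    "he", "she", "to", "of", "in", "that", "i", "you", "not", "very",
}

def difficulty_features(text):
    """Return (common-word coverage, average sentence length in words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0, 0.0
    coverage = sum(w in COMMON_WORDS for w in words) / len(words)
    avg_len = len(words) / max(len(sentences), 1)
    return coverage, avg_len

def classify(text, min_coverage=0.8, max_avg_len=12):
    """Toy rule: high common-word coverage and short sentences -> 'easy'.
    The thresholds are illustrative, not derived from CEFR data."""
    coverage, avg_len = difficulty_features(text)
    return "easy" if coverage >= min_coverage and avg_len <= max_avg_len else "difficult"

print(classify("The cat sat on the mat."))  # -> easy
```

A real classifier would add POS-based features (e.g. subordinate-clause markers) and learn thresholds from labelled examples instead of hard-coding them.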

What packages could help with this? And what process do you recommend?


Answers

  • kayman (New Altair Community Member)

    If you are a bit familiar with Python, I would recommend using the NLTK toolkit; it works well (and fast) for POS tagging.


    This post shows a practical implementation : https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Filter-Tokens-by-POS-Tags-slow/m-p/43192#M28838

  • rfuentealba (New Altair Community Member)

    Hi @alaa_albarazi,


    I would go with Python and NLTK too, as @kayman suggested. The RapidMiner Text Mining extension can perform some of the preprocessing needed to make the documents easier to analyze once you move to Python, and you can use the Python Scripting extension to connect the two. Just make sure you have the Anaconda Python distribution installed; it already includes the nltk and pattern packages, which can help you.
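    For the RapidMiner side, a hedged sketch of the glue code: the Python Scripting extension's Execute Python operator calls a function named `rm_main`, passing example sets in and out as pandas DataFrames. The column name `text` and the toy feature below are assumptions for illustration, not part of any of the tools mentioned.

```python
import pandas as pd

def rm_main(data):
    """Entry point expected by RapidMiner's Execute Python operator.
    Assumes the incoming example set has a 'text' column (illustrative)."""
    data = data.copy()
    # Toy difficulty feature: average word length per document.
    data["avg_word_len"] = data["text"].apply(
        lambda t: sum(len(w) for w in t.split()) / max(len(t.split()), 1)
    )
    return data
```

    In a real workflow this function would call the NLTK feature extraction discussed above and return the enriched example set to the RapidMiner process.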


    All the best,


    Rodrigo.