"Example - Classify Text Language"

Question

This process assigns a language to documents and RSS feeds. After tokenizing the text, it creates trigrams, which are matched against the training labels. The model, a simple Naive Bayes classifier, then scores new text and assigns a language label. Text that mixes languages (e.g., Spanish and English) can end up marked as either language, depending on how many training examples you use, so you may need a large number of examples for your preferred language. To mark text categories or sentiment, remove the n-gram operator and use topics instead (Finance, Sports, Entertainment).
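For readers who want to see the idea outside the tool, here is a minimal sketch in plain Python of trigram-based Naive Bayes language classification. It is not the process attached to this post: the class name `TrigramNaiveBayes`, the helper `char_trigrams`, the use of character (rather than word) trigrams, the add-one smoothing, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_trigrams(text):
    """Lowercase the text and return its overlapping character trigrams."""
    padded = f"  {text.lower()} "  # pad so word boundaries form trigrams too
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character trigrams."""

    def fit(self, texts, labels):
        self.trigram_counts = defaultdict(Counter)  # label -> trigram counts
        self.doc_counts = Counter(labels)           # label -> number of examples
        for text, label in zip(texts, labels):
            self.trigram_counts[label].update(char_trigrams(text))
        # Vocabulary of all trigrams seen in training, used for smoothing.
        self.vocab = {g for c in self.trigram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.trigram_counts.items():
            # Log prior plus add-one smoothed log likelihood of each trigram.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for gram in char_trigrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up for this sketch; real use needs far more examples).
train_texts = [
    "the quick brown fox jumps over the lazy dog",
    "this is a simple example of english text",
    "where is the nearest train station",
    "el rápido zorro marrón salta sobre el perro perezoso",
    "este es un ejemplo sencillo de texto en español",
    "dónde está la estación de tren más cercana",
]
train_labels = ["English"] * 3 + ["NotEnglish"] * 3

model = TrigramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("the dog is over the fox"))
print(model.predict("el perro está sobre el zorro"))
```

With only six training sentences this is a toy, but it shows why very short or mixed-language text is fragile: the label is decided by a handful of trigram counts.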

B_ · Answer

Haven't done a formal comparison.  It works well enough for my tasks.

rakirk · Answer

I guess I was wondering more about comparative accuracy: primarily, how would Naive Bayes compare to an SVM? An SVM may help account for smaller text files, but could also lead to overfitting.

B_ · Answer

Sebastian,

I just wanted to post a simple example to help people get started.

Rakirk,

Accuracy depends on how many training examples you use and how many categories you classify. I use it to classify text as English or NotEnglish. I have about 1000 entries labeled with one of the two categories: some pure English, some another language, and some mixed English/other language. Some very short text records are misclassified because of English abbreviations or mixed languages, but it works well enough for my application.

If you import text from the web, you may have problems with character encoding, such as converting to/from UTF, etc. You will need to preprocess the text to improve results.
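As a rough illustration of that kind of preprocessing, here is a small Python sketch using only the standard library. The function name `clean_web_text` and the list of encodings to try are assumptions, not part of the original process; adjust them to whatever your sources actually use.

```python
import html
import unicodedata


def clean_web_text(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes, unescape HTML entities, and normalize Unicode."""
    if isinstance(raw, bytes):
        for enc in encodings:
            try:
                raw = raw.decode(enc)
                break
            except UnicodeDecodeError:
                continue
        else:
            # Last resort: latin-1 maps every byte, so this never fails.
            raw = raw.decode("latin-1")
    text = html.unescape(raw)                  # "&amp;" -> "&"
    text = unicodedata.normalize("NFC", text)  # compose accented characters
    return " ".join(text.split())              # collapse runs of whitespace


print(clean_web_text(b"caf\xe9 &amp; t\xe9"))
```

Running the cleanup before tokenizing keeps the same word from producing different trigrams just because it arrived in a different encoding.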