Wrong results

Question

I have a set of news items (XML format) concerning the following categories (in Dutch): Auto, Economie, Politiek, Sport.

These XML items are read with the Read XML operator, resulting in an example set with Categorie as label attribute and Text and Title as regular attributes.

I apply the Naive Bayes, Cross Validation and Performance operators and get odd performance results.

The imported XML content is classified by humans and should be accurate.

So what is going wrong? It looks as if I am making a systematic error in my approach.

If I replace Naive Bayes with k-NN, I get the same performance results.

Does anyone have a clue how to resolve this?

I have attached the RM process and the XML data in a zip file.

Thomas_Ott · Accepted Answer

@AKO Yes, you are getting terrible results. You are not even text processing the data or cleaning it up to extract content. My suggestion is to install the Text Processing and Web Mining extensions, then trawl through the Community for some Text Processing posts and processes. By doing some basic Text Processing I increased your accuracy and recall, so that's where you need to focus.
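
To illustrate why the text processing step matters, here is a minimal sketch in plain Python rather than RapidMiner. It is not the process from the attached zip file. It assumes that "basic Text Processing" means at least lowercasing and tokenizing the text into words, so that the classifier sees word counts instead of one opaque string per news item. The function names (`tokenize`, `train`, `predict`) and the tiny Dutch training set are hypothetical and made up for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def train(docs):
    """Count words per label and documents per label.

    docs is a list of (text, label) pairs.
    """
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the most likely label under multinomial Naive Bayes."""
    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Laplace (add-one) smoothing so unseen words do not zero the score.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for this sketch only.
TRAINING = [
    ("De nieuwe auto heeft een zuinige motor", "Auto"),
    ("De motor van deze auto is elektrisch", "Auto"),
    ("De beurs sloot hoger door de rente", "Economie"),
    ("De rente en inflatie drukken de economie", "Economie"),
    ("Het kabinet debatteert in de kamer", "Politiek"),
    ("De minister verdedigt het kabinet", "Politiek"),
    ("De spits scoorde in de wedstrijd", "Sport"),
    ("De ploeg won de wedstrijd", "Sport"),
]

model = train(TRAINING)
print(predict("De wedstrijd eindigde gelijk", *model))
print(predict("De minister sprak in de kamer", *model))
```

Without tokenization, every news item is a single unique value, so an unseen item shares nothing with the training items and a learner has little to generalize from. That may also be why swapping Naive Bayes for k-NN made no difference. In RapidMiner, the Text Processing extension covers this step with its own operators before the learner is applied.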