Best practices for text mining an academic text

Question

I have long, complex texts which I want to classify into categories such as psychology, history, etc.

Which processes would you recommend using? E.g. tokenization, n-grams, etc.

Thank you

yoram_schaffer · Answer

Thank you very much @kypexin!
I will try the different settings, using your illustration as a source of inspiration. Yes, I have quite good samples, as I've been working on this for a long time (actually, I started with RapidMiner after seeing, to my surprise, how limited Amazon ML is in terms of applying different processes).

I will report back and share with the community once I have some insights about what brings better results, at least for academic texts.

kypexin · Answer

Hi @yoram_schaffer

Well, I did more or less the same task: categorizing site contents (that is, text data) into separate predefined categories. I used all the standard steps (like tokenization and stemming) in my process; see screenshot #2. One thing I didn't use was n-grams, as they would be pretty memory-consuming. Otherwise, your problem looks VERY similar, so I would recommend that you begin by re-creating the process setup I described and see the results (believe me, it really works! :) ). I think one crucial thing here is to have a good training set, meaning a corpus of manually categorized documents (and the effort for this part depends on how many unique categories and total documents you have).
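For readers who want to see the same steps outside RapidMiner, here is a minimal Python sketch of that kind of pipeline: tokenize, stem, then train a classifier on manually labelled documents. It is not the RapidMiner process from the screenshot. The Naive Bayes classifier, the `crude_stem` helper (a rough stand-in for a real stemmer such as Porter), the `NaiveBayes` class and the four toy training documents are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Hypothetical, very rough suffix stripping (NOT a real Porter stemmer)."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def features(text):
    """Tokenize, then stem each token (unigrams only, no n-grams)."""
    return [crude_stem(t) for t in tokenize(text)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative choice)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(features(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for w in features(doc):
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training set; a real one needs many documents per category.
train_docs = [
    "Cognitive therapy reduces anxiety and depression in patients",
    "Memory and attention were measured in a behavioral experiment",
    "The empire collapsed after decades of war and political revolt",
    "Medieval kings ruled the region through treaties and wars",
]
train_labels = ["psychology", "psychology", "history", "history"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("Patients reported less anxiety after the experiment"))
print(model.predict("The king signed treaties after the war"))
```

With only four training documents this proves nothing about accuracy; the point is the order of the steps. The quality of the labelled corpus matters far more than the exact classifier.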

In a more general sense, text mining is one of the most popular topics, so you can find a lot of posts on this forum if you search for 'text mining' and similar terms. Also look at the operator descriptions in the Text Mining RM extension; basically everything is built around it. A Google search for 'text mining rapidminer' also turns up quite a few different resources, and even some tutorial videos.

yoram_schaffer · Answer

Thank you, @kypexin, for taking the time to reply to me.

I read your other reply thoroughly. Did you ever try using some of the other processes, like stemming or POS (part-of-speech) tagging?

The texts I'm analyzing are academic in nature, i.e. I'm not trying to analyze client behavior, nor am I trying to find a dependency between different factors (e.g. weather against purchase habits).

My intention is to categorize texts according to the topic they are dealing with. The texts are usually 100-300 words.

I understand it's beyond your experience. Do you know of any resource which may be helpful for that?