"[SOLVED] Text mining: Two models... Which one to go for?"

User: "kasper2304"
New Altair Community Member
Updated by Jocelyn
Hi guys.

I am currently working on a research project for my master's thesis about text mining within innovation. I have a lot of forum posts that I would like to classify as containing an idea or not. To get started, I had 300 random posts classified from a particular subforum that is known for people writing about ideas. Three judges manually classified the 300 posts, which gives each post a level of agreement of 33%, 66% or 100%. The 66% level gives me 12 positive cases and the 100% level gives me 8 positive cases.
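To make the agreement levels concrete, here is a minimal sketch of how positives could be counted from three judges' votes. The vote data is hypothetical, and the threshold reading ("66%" meaning at least two of three judges) is an assumption, not something the labelling procedure above confirms.

```python
def idea_votes(votes):
    """Number of judges (out of three) who marked each post as an idea."""
    return [sum(v) for v in votes]


def positives_at(votes, min_judges):
    """Count posts where at least `min_judges` judges voted 'idea'.

    Assumption: a scenario is defined by a minimum number of agreeing judges.
    """
    return sum(1 for n in idea_votes(votes) if n >= min_judges)


# Hypothetical votes for five posts (1 = idea, 0 = no idea), one tuple per post.
votes = [(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 0, 0), (0, 1, 1)]

positives_66 = positives_at(votes, 2)   # at least two of three judges
positives_100 = positives_at(votes, 3)  # all three judges
```

Under this reading, every 100% positive is also a 66% positive, so the stricter scenario can never have more positives than the looser one.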

I did all the data preprocessing steps in R and weighted the terms as "binary". I extracted around 50 terms, and PCA reduced this number of variables to 21. I played around a lot in RapidMiner trying out different models, and it turns out that a logistic regression with a dot kernel and parameter optimization does a very good job at modeling the 66% scenario and an OK job at modeling the 100% scenario. I split the 300 posts into a training set of 200 and a test set of 100 with stratified sampling. I also tried the 33% scenario, which gave me bad results in the form of many false positives. So I ended up with two candidate models that can be used as a preliminary filter for manually classifying even more posts and then modeling the bigger training set...
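As an illustration of what the stratified 200/100 split implies for the positive class, here is a minimal sketch in plain Python. It is not the RapidMiner operator used above; the function name, the rounding rule and the seed are assumptions made for the example.

```python
import random


def stratified_split(labels, test_fraction, seed=0):
    """Split indices into train/test while keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)  # per-class share of the test set
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# 300 posts: 12 positives in the 66% scenario, 8 in the 100% scenario.
labels_66 = [1] * 12 + [0] * 288
labels_100 = [1] * 8 + [0] * 292

train_66, test_66 = stratified_split(labels_66, 1 / 3)
train_100, test_100 = stratified_split(labels_100, 1 / 3)

test_pos_66 = sum(labels_66[i] for i in test_66)
test_pos_100 = sum(labels_100[i] for i in test_100)
```

With these numbers the test set holds only a handful of positives in either scenario, so a single misclassified post moves the positive-class hit rate by a large step.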

My problem is the following:
Since these two models are built on a small training set, there is a high risk of biasing my filter. That would in the end be catastrophic for the manual classification, because the cases the manual classification is based on would then be extracted by a biased model. With that in mind, I would like you to consider the characteristics of the following two logistic regression models powered by SVM...

Model 1 - Based on 66% scenario
Confusion matrix: Accuracy is around 98%, and it predicts 3 of 4 positive cases correctly
Weights: Puts a lot of positive weight on one single term, around 0.85, whereas the other positively weighted terms are around 0.25 and below

Model 2 - Based on 100% scenario
Confusion matrix: Accuracy is the same, but it only manages to predict 1 of 3 positive cases correctly
Weights: Still a high weight on the one single term, around 0.75, but now the next terms weigh 0.57 and 0.412 and below.
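Because positives are so rare in the test set, accuracy alone says little about either model. The sketch below compares each model with a trivial "always negative" baseline. The true-positive and false-negative counts come from the confusion matrices above; the false-positive counts are assumptions chosen so that accuracy lands at 98% on a 100-post test set.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


def majority_baseline_accuracy(n_pos, n_total):
    """Accuracy of always predicting the more frequent class."""
    return max(n_pos, n_total - n_pos) / n_total


# Model 1: 3 of 4 positives found; fp=1 is assumed to give 98% accuracy.
model_1 = metrics(tp=3, fp=1, fn=1, tn=95)
# Model 2: 1 of 3 positives found; fp=0 is assumed to give 98% accuracy.
model_2 = metrics(tp=1, fp=0, fn=2, tn=97)

baseline_1 = majority_baseline_accuracy(n_pos=4, n_total=100)
baseline_2 = majority_baseline_accuracy(n_pos=3, n_total=100)
```

Under these assumed counts the baseline is only one or two points behind each model on accuracy, so recall on the positive class is the more informative number for a filter whose job is to surface ideas.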

So which one to pick? Correct me if I am wrong, but isn't this a classic example of the bias-variance tradeoff and of what one risks when the training dataset is too small? Which one would you pick?
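One way to gauge how much the two hit rates (3 of 4 versus 1 of 3) can be trusted is a confidence interval on each proportion. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96); the choice of interval is an assumption made for illustration, not something the analysis above relied on.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low_1, high_1 = wilson_interval(3, 4)  # Model 1: 3 of 4 positives found
low_2, high_2 = wilson_interval(1, 3)  # Model 2: 1 of 3 positives found

# True when the two intervals share some range of plausible recall values.
intervals_overlap = low_1 <= high_2 and low_2 <= high_1
```

With only 3 or 4 positive test cases, both intervals are very wide and they overlap, so the test set by itself cannot separate the two models on recall.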

Best
Kasper
