"[SOLVED] Text mining: Two models... Which one to go for?"
Hi guys.
I am currently working on a research project for my master's thesis about text mining within innovation. The case is that I have a lot of forum posts that I would like to classify as containing an idea or not. To get started, I had 300 random posts from a particular subforum that is known for people writing about ideas classified manually: 3 judges each labelled all 300 posts, which gives each post an agreement level of 33%, 66% or 100% (1, 2 or 3 of the 3 judges marking it as positive). The 66% threshold gives me 12 positive cases and the 100% threshold gives me 8 positive cases.
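For concreteness, the three scenarios can be derived from the judges' votes along these lines (a minimal R sketch only; the data frame posts and the columns judge1-judge3 are placeholder names, with 1 meaning the judge marked the post as containing an idea):

# Derive the three agreement scenarios from the three judges' binary labels
votes <- posts$judge1 + posts$judge2 + posts$judge3

posts$label_33  <- as.integer(votes >= 1)   # at least 1 of 3 judges say "idea" (33%)
posts$label_66  <- as.integer(votes >= 2)   # at least 2 of 3 judges agree (66%)
posts$label_100 <- as.integer(votes == 3)   # all 3 judges agree (100%)

# Count positives per scenario (should give roughly 12 for 66% and 8 for 100%)
colSums(posts[, c("label_33", "label_66", "label_100")])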
I did all the data preprocessing steps in R and weighted the terms as "binary". I extracted around 50 terms, and with PCA this number of variables was reduced to 21. I played around a lot in RapidMiner trying out different models, and it turns out that a logistic regression with a dot kernel and parameter optimization does a very good job of modeling the 66% scenario and an OK job of modeling the 100% scenario. I split the 300 posts into a training set of 200 and a test set of 100 with stratified sampling. I also tried the 33% scenario, which gave me bad results in terms of many false positives. So I ended up with two candidate models that can be used as a preliminary filter for selecting even more posts for manual classification, and then retrain the model on the bigger training set...
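For reference, the preprocessing and split look roughly like this in R (a sketch under assumptions: dtm stands in for my 300 x ~50 document-term matrix, label_66 for the 66% scenario labels, and plain glm() is only a rough stand-in for RapidMiner's kernel logistic regression with a dot kernel):

# Binary term weighting, PCA down to 21 components, stratified 200/100 split
bin_dtm <- 1 * (as.matrix(dtm) > 0)            # binary term weights

pca <- prcomp(bin_dtm, center = TRUE)          # PCA on the binary terms
X   <- pca$x[, 1:21]                           # keep the first 21 components
y   <- factor(posts$label_66)                  # e.g. the 66% scenario labels

set.seed(42)                                   # sample two thirds within each class
idx_pos   <- which(y == "1")
idx_neg   <- which(y == "0")
train_idx <- c(sample(idx_pos, round(2/3 * length(idx_pos))),
               sample(idx_neg, round(2/3 * length(idx_neg))))

train <- data.frame(X[train_idx, ],  y = y[train_idx])
test  <- data.frame(X[-train_idx, ], y = y[-train_idx])

# Plain (unregularized) logistic regression on the components
fit  <- glm(y ~ ., data = train, family = binomial)
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
table(predicted = pred, actual = test$y)       # confusion matrix on the test set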
My problem is the following:
Since these two models are built on a small training set, there is a high risk of biasing my filter, which would in the end be catastrophic for the manual classification, as the cases that classification is based on would then have been extracted by a biased model. With that in mind, I would like you to consider the characteristics of the following two logistic regression models powered by SVM...
Model 1 - Based on 66% scenario
Confusion matrix: Accuracy around 98%, and it predicts 3 of the 4 positive test cases correctly
Weights: Puts a lot of positive weight on one single term (around 0.85), whereas the other positively weighted terms are around 0.25 and below
Model 2 - Based on 100% scenario
Confusion matrix: Accuracy is about the same, but it only predicts 1 of the 3 positive test cases correctly
Weights: Still a high weight on that single term (around 0.75), but now the next terms weigh 0.57, 0.412 and below.
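To put those confusion matrices in perspective, here is a tiny R calculation based on the figures above (assuming 4 positives among the 100 test posts in the 66% scenario): with so few positives, accuracy is dominated by the negative class, so recall on the "idea" class is really what separates the two models.

# Recall (sensitivity) on the positive "idea" class, from the figures above
recall_model1 <- 3 / 4   # Model 1: 3 of 4 test positives found (0.75)
recall_model2 <- 1 / 3   # Model 2: 1 of 3 test positives found (~0.33)

# Accuracy of a trivial model that labels every post "no idea"
# (assuming 4 positives among the 100 test posts, as in the 66% scenario)
baseline_accuracy <- 96 / 100

c(recall_model1, recall_model2, baseline_accuracy)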
So which one to pick? Correct me if I am wrong, but isn't this a classic example of the bias-variance tradeoff, and of what one risks when the training dataset is too small? Which one would you pick?
Best
Kasper