"[SOLVED] Text mining: Two models... Which one to go for?"

kasper2304 New Altair Community Member
edited November 5 in Community Q&A
Hi guys.

I am currently working on a research project for my master's thesis about text mining within innovation. The case is that I have a lot of forum posts that I would like to classify as containing an idea or not. To get started I had 300 random posts classified from a particular sub-forum that is known for people writing about ideas. What I did was to have 3 judges manually classify the 300 posts, giving each post a level of agreement of 33%, 66% or 100%. The 66% threshold gives me 12 positive cases and the 100% threshold gives me 8 positive cases.

I did all the data preprocessing steps in R and weighted the terms as "binary". I extracted around 50 terms, and with PCA this number of variables was reduced to 21. I played around a lot in RapidMiner trying out different models, and it turns out that a logistic regression with a dot kernel and parameter optimization does a very good job of modeling the 66% scenario and an OK job of modeling the 100% scenario. I split the data into a training set of 200 and a test set of 100 with stratified sampling. I also tried the 33% scenario, which gave me bad results in terms of many false positives. So I ended up with two candidate models that can be used as a preliminary filter for manually classifying even more posts, and then model the bigger training set...
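
For reference, a rough sketch of that preprocessing in R could look like the following (the tm and caret packages and all object names here are just illustrative assumptions, not my exact script):

    library(tm)
    library(caret)

    # 'posts' is the character vector of 300 forum posts, 'label' the judges' verdict
    corpus <- VCorpus(VectorSource(posts))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Binary (presence/absence) term weighting, pruned to roughly 50 terms
    dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightBin))
    dtm <- removeSparseTerms(dtm, sparse = 0.95)
    X   <- as.matrix(dtm)

    # PCA to reduce the term space, keeping the first 21 components
    pca    <- prcomp(X, scale. = TRUE)
    scores <- pca$x[, 1:21]

    # Stratified split into roughly 200 training and 100 test cases
    set.seed(42)
    train_idx  <- createDataPartition(label, p = 2/3, list = FALSE)
    train_data <- data.frame(scores[train_idx, ],  label = label[train_idx])
    test_data  <- data.frame(scores[-train_idx, ], label = label[-train_idx])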

My problem is the following:
Since these two models are built on a small training set, there is a high risk of biasing my filter, which would in the end be catastrophic for the manual classification: the cases that the further manual classification is based on would then have been extracted by a biased model. With that in mind, I would like you to consider the characteristics of the following two logistic regression models powered by SVM...

Model 1 - Based on 66% scenario
Confusion matrix: Accuracy around 98%; predicts 3 of 4 positive cases correctly
Weights: Puts a lot of positive weight on one single term, around 0.85, whereas the other positively weighted terms are around 0.25 and below

Model 2 - Based on 100% scenario
Confusion matrix: Accuracy is the same, but it only manages to predict 1 of 3 positive cases correctly
Weights: Still a high weight on that one single term, around 0.75, but now the next terms weigh 0.57 and 0.412 and below.

So which one to pick? Correct me if I am wrong, but isn't this a classic example of the bias-variance tradeoff, and of what one risks when the training set is too small? Which one would you pick?

Best
Kasper

Answers

  • MariusHelf New Altair Community Member
    Hi Kasper,

    One word on accuracy: remember that the accuracy is the probability that a new example is classified correctly by your model. If I get you right, your data is highly unbalanced (i.e. many more negative than positive examples). For ease of speaking, let's assume that you have 90% negatives. Then a model that always predicts "negative" will already achieve an accuracy of 90%. In addition to plain accuracy you may want to have a look at other performance measures, e.g. from the field of ROC analysis. Here a sensible scalar performance value is e.g. the AUC (area under the curve).
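
    As a quick illustration in R (since you did your preprocessing there anyway): with the pROC package you can get the AUC from the predicted confidences. The package choice and the simulated confidences below are just for demonstration, not taken from your setup.

        library(pROC)

        # Roughly your class balance: 288 negatives, 12 positives
        label <- factor(c(rep("negative", 288), rep("positive", 12)))
        mean(label == "negative")   # a model that always says "negative" already scores ~96% accuracy

        # In practice 'conf' would be the model's confidence for the positive class
        set.seed(1)
        conf <- ifelse(label == "positive", runif(300, 0.3, 1), runif(300, 0, 0.7))

        roc_obj <- roc(response = label, predictor = conf)
        auc(roc_obj)                # scalar AUC, insensitive to the class imbalance
        plot(roc_obj)               # the ROC curve itself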

    Furthermore, the Split Validation is not very robust - if by chance your test set contains many "easy" examples, you will overestimate your performance. Consider using a Cross Validation instead (the operator in RapidMiner is called X-Validation).
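
    There is no code behind the X-Validation operator, but if you want to try the same idea in R, a 10-fold cross-validation with caret could look roughly like this (assuming your 21 PCA scores and the label sit in a data frame called train_data; the names and the linear-kernel SVM are assumptions on my side):

        library(caret)

        # 10-fold cross-validation of a linear-kernel SVM, the same idea as X-Validation.
        # 'train_data' holds the 21 PCA scores plus a factor column 'label'.
        ctrl <- trainControl(method = "cv", number = 10,
                             classProbs = TRUE, summaryFunction = twoClassSummary)
        fit  <- train(label ~ ., data = train_data,
                      method = "svmLinear", metric = "ROC", trControl = ctrl)
        fit$results   # cross-validated AUC ("ROC"), sensitivity and specificity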

    Which model to pick? Well, in the first place the X-Validation will give you a more robust performance estimate; maybe you can base your preliminary choice on that. Afterwards, the usual way to go could be to deploy the model (i.e. use it in the real world) and feed the results of the manual judging, which in your case happens after model application, back into the model creation/training data.

    Best, Marius
  • kasper2304 New Altair Community Member
    Hi Marius.

    Thanks again for a nice reply.

    Yes, my data is highly unbalanced: 300 cases with 8 to 12 positive cases gives me a ratio of 2.66% to 4% positives. I have looked into the ROC/AUC chart RapidMiner provides, as well as the ROC comparison node (which I can't get to work).

    Regarding the split validation, I tried it out. I think the reason I did not stick with it is that in my data mining course we just partitioned the sample ourselves, which is also common practice in Rattle. Sounds like I need to take a deeper look at the X-Validation node, as it sounds like the optimal way to go.

    The idea was exactly to use this "bad" model to raise our chances of getting positive cases and thereby make our training set more "balanced". The only concern I have is that choosing a highly biased model now will also result in many biased new training cases. That's why I wanted to go with the model where the weights are more evenly distributed, instead of the model that puts a lot of weight on one single term...

    Best
    Kasper