Text mining: Data cleaning and model ensembling?
kasper2304
New Altair Community Member
Hi guys.
I need some help elaborating on my choice of method, and on how best to do data cleaning and to create and apply several trained models.
My case is the following:
Dataset: 2998 cases -> 337 positives & 2661 negatives
Partitioning: 85% for training and validation and 15% for testing -> 2262/286 (negatives/positives) for training and validation & 399/51 for testing
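As a quick sanity check of those numbers, here is a small Python snippet (plain Python, not RapidMiner) that reproduces the split, assuming the 85% is taken from each class separately and rounded to whole cases:

```python
# Class counts from the dataset described above.
n_pos, n_neg = 337, 2661
train_frac = 0.85  # share used for training and validation

# Assumption: the split is stratified, i.e. 85% is taken per class.
train_neg = round(n_neg * train_frac)
train_pos = round(n_pos * train_frac)
test_neg = n_neg - train_neg
test_pos = n_pos - train_pos

print(train_neg, train_pos, test_neg, test_pos)  # 2262 286 399 51
```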
What I have read is that one can cluster the negative cases and then train one model per cluster, each on that cluster's negatives together with the positive cases, and combine the models at the end. Has anyone applied this method, or can anyone explain a variant that can be done in RapidMiner?
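To make the idea concrete, here is a minimal sketch of how I understand it, in plain Python rather than RapidMiner. Everything in it is an assumption for illustration: the toy 2-D vectors stand in for document vectors, `cluster_ids` is assumed to come from some earlier clustering step, the nearest-centroid learner is only a stand-in for the real model (an SVM in my case), and the combining rule (call a case positive only if every per-cluster model says positive) is just one possible choice.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_nearest_centroid(pos, neg):
    """Toy stand-in learner: predicts 1 if x is closer to the positive centroid."""
    c_pos, c_neg = centroid(pos), centroid(neg)
    return lambda x: 1 if sq_dist(x, c_pos) < sq_dist(x, c_neg) else 0

def train_cluster_ensemble(positives, negatives, cluster_ids):
    """Train one model per negative cluster; each model sees all positives."""
    groups = {}
    for vec, cid in zip(negatives, cluster_ids):
        groups.setdefault(cid, []).append(vec)
    return [train_nearest_centroid(positives, grp) for grp in groups.values()]

def predict_all_agree(models, x):
    """Assumed combining rule: positive only if every model predicts positive."""
    return 1 if all(m(x) == 1 for m in models) else 0

# Hypothetical toy data: positives near the origin, two negative clusters.
positives = [[0, 0], [1, 0], [0, 1]]
negatives = [[5, 0], [6, 0], [-5, 0], [-6, 0]]
cluster_ids = [0, 0, 1, 1]  # assumed output of a clustering step

models = train_cluster_ensemble(positives, negatives, cluster_ids)
print(predict_all_agree(models, [0.5, 0.5]))  # 1
print(predict_all_agree(models, [5, 0.5]))    # 0
```

A plain majority vote would be another way to combine the models, but with truly separate clusters a negative case may only be recognised by the model trained on its own cluster, which is why the sketch requires all models to agree.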
I also looked into how to do data cleaning, but I have no clue which technique to use for text mining, as RapidMiner provides several.
Until now my method has simply been to downsample the majority class of my training and validation set, which has given the best results on my test set so far. I am using an SVM with a linear kernel; the RBF kernel has not yielded better results. For preprocessing my text I used 3-grams and stopword removal.
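For reference, this is roughly what I mean by downsampling, sketched in plain Python rather than RapidMiner. The function name, the `label` key, the 1:1 ratio and the fixed seed are all assumptions for illustration, and the toy rows only mimic the class counts of my training and validation part; the test set is left untouched.

```python
import random

def downsample_majority(rows, label_key="label", majority=0, ratio=1.0, seed=42):
    """Keep every minority row and a random subset of the majority rows.

    ratio is the number of majority rows kept per minority row (assumed 1:1).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    minority_rows = [r for r in rows if r[label_key] != majority]
    majority_rows = [r for r in rows if r[label_key] == majority]
    n_keep = min(len(majority_rows), round(ratio * len(minority_rows)))
    balanced = minority_rows + rng.sample(majority_rows, n_keep)
    rng.shuffle(balanced)
    return balanced

# Toy stand-in for the training and validation part: 2262 negatives, 286 positives.
train = [{"id": i, "label": 0} for i in range(2262)]
train += [{"id": 2262 + i, "label": 1} for i in range(286)]

balanced = downsample_majority(train)
print(len(balanced))  # 572
```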
Best
Kasper