Text mining: Data cleaning and model ensembling?

kasper2304

Hi guys.

I need some help elaborating on my choice of method, and on how best to do data cleaning and to create and apply several trained models.

My case is the following:

Dataset: 2998 cases -> 337 positives & 2661 negatives
Partitioning: 85% for training and validation and 15% for testing -> 2262 negatives / 286 positives for training and validation & 399 negatives / 51 positives for testing
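
As a quick sanity check on these numbers, here is a minimal Python sketch of the arithmetic behind a stratified 85/15 split. It assumes each pair is listed as negatives/positives and that every class is split at the same ratio; the function name `stratified_counts` is made up for illustration and is not a RapidMiner operator.

```python
def stratified_counts(n_pos, n_neg, train_frac):
    """Split each class at the same ratio.

    Returns ((train_neg, train_pos), (test_neg, test_pos)).
    """
    train_pos = round(n_pos * train_frac)
    train_neg = round(n_neg * train_frac)
    return (train_neg, train_pos), (n_neg - train_neg, n_pos - train_pos)


# Class sizes taken from the post: 337 positives, 2661 negatives.
train, test = stratified_counts(n_pos=337, n_neg=2661, train_frac=0.85)
print("train/validation (neg, pos):", train)
print("test (neg, pos):", test)
print("share of positives overall: {:.1%}".format(337 / (337 + 2661)))
```

With the class sizes from the post this reproduces the 2262/286 and 399/51 partitions, so both partitions keep roughly the same share of positives as the full dataset.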

What I have read is that one can cluster the negative cases, train a separate model on each cluster together with the positive cases, and combine the models at the end. Has anyone applied this method, or can anyone explain a variant that can be performed in RapidMiner?
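
To make the idea concrete, here is a rough Python sketch of one possible reading of it, not a RapidMiner process. Several things are assumptions for illustration: plain k-means is used for the clustering, a nearest-centroid classifier stands in for the SVM, the models are combined by majority vote (ties count as negative), and the data is a toy 2-D set rather than text vectors.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Plain k-means; returns the non-empty clusters as lists of vectors.

    Seeding with evenly spaced items keeps the sketch reproducible;
    real implementations usually seed randomly.
    """
    step = len(vectors) // k
    centroids = [vectors[i * step] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: sq_dist(v, centroids[i]))
            clusters[nearest].append(v)
        # Keep the previous centroid if a cluster ended up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def train_centroid_model(positives, negatives):
    """Stand-in classifier (nearest centroid), NOT the SVM from the post."""
    pos_c, neg_c = mean(positives), mean(negatives)
    return lambda v: 1 if sq_dist(v, pos_c) < sq_dist(v, neg_c) else 0


def train_ensemble(positives, negatives, k):
    """One model per negative cluster, each trained with all positives."""
    return [train_centroid_model(positives, cluster)
            for cluster in kmeans(negatives, k)]


def predict(ensemble, v):
    """Majority vote over the models; ties count as negative (assumption)."""
    votes = sum(model(v) for model in ensemble)
    return 1 if votes * 2 > len(ensemble) else 0


# Toy data: positives in the middle, negatives in two separate groups.
positives = [(5, 5), (5, 6), (6, 5), (6, 6)]
negatives = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 0), (10, 1), (11, 0), (11, 1)]
ensemble = train_ensemble(positives, negatives, k=2)
for point in [(5.5, 5.0), (0, 0), (11, 0)]:
    print(point, "->", predict(ensemble, point))
```

Each model only ever sees one group of negatives, so every model trains on a more balanced set than the full data would give. How the votes are combined matters a lot here: a model trained against one cluster can easily mistake negatives from another cluster for positives, which is why this sketch treats ties as negative.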

I also looked into data cleaning, but I have no clue which technique to use for text mining, since RapidMiner provides several.

Until now my method has simply been to downsample the majority class of my training and validation set, which has given the best results on my test set. I am using an SVM with a linear kernel; the RBF kernel has not yielded better results. For preprocessing my text, I used 3-grams and stopword removal.
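
For readers outside RapidMiner, the preprocessing and downsampling described here could look roughly like the Python sketch below. It is only an illustration under assumptions: "3-grams" is read as word n-grams of length 1 to 3, the stopword list is a tiny made-up sample, terms are joined with underscores, and the function names are hypothetical.

```python
import random
import re

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def ngrams(tokens, max_n=3):
    """All word n-grams of length 1..max_n, joined with underscores."""
    return ["_".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def downsample(labels, seed=42):
    """Return sorted indices of a class-balanced subset.

    Every class is randomly reduced to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    target = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, target))
    return sorted(kept)


print(ngrams(tokenize("The model is trained in the cloud")))

labels = [0] * 8 + [1] * 2          # 8 negatives, 2 positives
kept = downsample(labels)
print(kept, [labels[i] for i in kept])
```

The downsampling would be applied to the training and validation partition only, as in the post, so the test partition keeps its original class ratio.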

Best
Kasper