Text Classification/Labeling using Description
Hi All,
I am new to RapidMiner and would like to label rows based on a 'Long Description' column in a CSV file. I will mainly be working with 2 columns, 'Long Description' and 'Label'. The 'Label' is assigned based on the 'Long Description' value. I have 1000 rows, of which 80% already have a 'Label' value and can serve as a training set. I wish to populate the remaining 20% of 'Label' values using the 'Long Description' value.
All Label Values -
Cancellation
Price Increase
Normal Payment
Payoff
Price Decrease
Installer Installation Issue
Past Due Payment
Change Order
Incentive Payment
Assumption
Completion Certificate
Interest
Referral
Example -
Long Description - Please review change order in installation phase - loan amount increasing from USD 21;851.00 to USD 24;501.00
Label - Price Increase
Long Description - Cancellation request with SPV Assignment
Label - Cancellation
How should I proceed with this using RapidMiner and what should be the steps to perform the same?
Thanks
Best Answer
Search the forums for existing threads on text mining; you will find a lot of helpful information there. This is a classic classification problem. You'll use your "long description" as the text, process and tokenize it, and then use the resulting word vectors to predict the label.
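To make the tokenize / word-vector / predict idea concrete outside of RapidMiner, here is a minimal Python sketch. It is not RapidMiner's implementation: it swaps in a small hand-written multinomial Naive Bayes classifier over word counts, and the names `tokenize`, `train`, `predict`, and `labeled_rows` are made up for illustration. The two training rows are the examples from the question; a real run would use all of the labeled rows.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(rows):
    """Count rows per label and word occurrences per label.

    rows: iterable of (description, label) pairs.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for description, label in rows:
        tokens = tokenize(description)
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, description):
    """Return the label with the highest Naive Bayes log-score."""
    label_counts, word_counts, vocab = model
    total_rows = sum(label_counts.values())
    # Words never seen in training carry no information, so skip them.
    tokens = [t for t in tokenize(description) if t in vocab]
    best_label, best_score = None, -math.inf
    for label, n_rows in label_counts.items():
        score = math.log(n_rows / total_rows)
        # Add-one (Laplace) smoothing over the vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# The two labeled examples from the question (assumed stand-in for the 80%).
labeled_rows = [
    ("Please review change order in installation phase - loan amount "
     "increasing from USD 21;851.00 to USD 24;501.00", "Price Increase"),
    ("Cancellation request with SPV Assignment", "Cancellation"),
]

model = train(labeled_rows)
print(predict(model, "Cancellation request received"))
print(predict(model, "Loan amount increasing"))
```

With only two training rows this is a toy, but the shape is the same as the RapidMiner process: turn each description into word counts, learn from the labeled rows, then score the unlabeled ones.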
However, you may find that you need to consolidate labels. You have a lot of distinct values, and classification problems become harder as the number of distinct label values to predict grows. So you may have better success grouping some of the existing labels together into larger categories. That's something you will need to experiment with manually; there's not an easy way to automate it in RapidMiner.
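As a rough illustration of consolidating labels before training, the Python sketch below remaps some of the labels from the question into broader categories. The grouping itself ("Price Change", "Payment") is purely hypothetical; which labels really belong together depends on your data and is the part you would need to work out by hand. The names `LABEL_GROUPS` and `consolidate` are also made up for this example.

```python
# Hypothetical grouping, chosen only for illustration; adjust to your data.
LABEL_GROUPS = {
    "Price Increase": "Price Change",
    "Price Decrease": "Price Change",
    "Normal Payment": "Payment",
    "Past Due Payment": "Payment",
    "Incentive Payment": "Payment",
}


def consolidate(label):
    """Map a label to its broader group; ungrouped labels pass through."""
    cleaned = label.strip()
    return LABEL_GROUPS.get(cleaned, cleaned)


original_labels = [
    "Cancellation", "Price Increase", "Normal Payment", "Payoff",
    "Price Decrease", "Installer Installation Issue", "Past Due Payment",
    "Change Order", "Incentive Payment", "Assumption",
    "Completion Certificate", "Interest", "Referral",
]

consolidated = sorted({consolidate(label) for label in original_labels})
print(len(original_labels), "labels before,", len(consolidated), "after")
```

You would apply the same mapping to the 'Label' column of the training rows before building the model, then compare accuracy with and without the grouping.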