How to improve Classification in Text Mining
mdc
New Altair Community Member
I'm doing classification (15 classes) of technical papers using their abstract.
My processes are simple.
Learning:
+ TextInput
+ String Tokenizer
+ English StopwordFilter
+ TokenLengthFilter
+ Binary2MultiClassLearner
+ LibSVMLearner
+ ModelWriter
Applying:
+ TextInput
+ String Tokenizer
+ English StopwordFilter
+ TokenLengthFilter
+ ModelLoader
+ ModelApplier
+ ExcelExampleSetWriter
I get results, but I'm not satisfied with them. How can I improve them?
I've been searching the forum and have seen that feature selection is one approach. There are plenty of examples using the FeatureSelection operator, but I couldn't find one that writes to a model file. One example from the installer is shown below, but I couldn't figure out where to add the ModelWriter. Or am I thinking about this wrong?
....
+ FeatureSelection
+ XValidation
+ NearestNeighbors
+ OperatorChain
+ ModelApplier
+ Performance
+ ProcessLog
I'm also thinking of forcing some attributes to have bigger weights. Is this a good idea, and how would I do it?
thanks,
Matthew
Answers
-
Hi,
Regarding the feature selection: what you probably want is not a ModelApplier, but rather to save the attribute weights (AttributeWeightsWriter) and apply them later (AttributeWeightsApplier).
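The weights-then-apply idea can be sketched outside RapidMiner. Below is a standard-library Python illustration: the term-scoring rule is invented purely for demonstration (RapidMiner's weighting operators use their own measures, such as information gain), but the write/apply split mirrors AttributeWeightsWriter and AttributeWeightsApplier.

```python
# Stdlib-only sketch: compute per-term weights on training data, write them
# to disk, and later apply them to filter new documents.
# (The scoring rule below is a crude made-up proxy for a real weighting.)
import json
from collections import Counter, defaultdict

train = [("svm kernel margin", "ml"),
         ("tcp routing packet", "net"),
         ("kernel routing", "net")]

# Score each term by how concentrated it is in a single class.
class_counts = defaultdict(Counter)
total = Counter()
for text, label in train:
    for tok in text.split():
        class_counts[label][tok] += 1
        total[tok] += 1
weights = {tok: max(class_counts[c][tok] for c in class_counts) / total[tok]
           for tok in total}

# "AttributeWeightsWriter": persist the weights.
with open("weights.json", "w") as f:
    json.dump(weights, f)

# "AttributeWeightsApplier": load the weights and drop low-weight terms.
with open("weights.json") as f:
    loaded = json.load(f)

def apply_weights(text, threshold=0.99):
    return [t for t in text.split() if loaded.get(t, 0) >= threshold]

print(apply_weights("kernel margin routing unknown"))
```

Here "kernel" appears in both classes, so it falls below the threshold and is dropped, as is the unseen term.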
Regarding the optimization of the setup: there is no general answer. Try optimizing the parameters of the SVM and of the text input, try adding term n-grams, and maybe add a dictionary of synonyms. It depends very much on your texts.
Cheers,
Simon
-
Hi,
Sometimes it is tempting to tweak the answer and forget to ask whether the question makes any sense. Fifteen classes? Think about how many examples would be necessary to represent the problem space.
-
Umh. Yes. Actually, I missed that part in the original post. I agree with haddock. If you have 15 classes it is not particularly surprising that you are not satisfied with the results :-)
-
Thanks, guys, for the answers. I was actually thinking of adding more classes.
So what is the ideal number of classes for text classification? And how do you solve the problem of classifying technical documents into many categories? Is data mining not the solution?
Matthew
-
Also, where would you add the AttributeWeightsWriter operator in this example?
+ FeatureSelection
+ XValidation
+ NearestNeighbors
+ OperatorChain
+ ModelApplier
+ Performance
+ ProcessLog
thanks,
Matthew
-
Jumping in ...
Of course, Data Mining is the solution ;D
Regarding the number of classes: what haddock meant was that you need a lot of examples/documents per category to a) have enough information to distinguish the classes and b) make any statistically reliable performance estimates. So... how many do you have?
Low performance values are an indication that the classes cannot be easily distinguished. Here are some rough ideas:
- If the classes are the leaves of a hierarchy, try going up the hierarchy and merging classes (e.g. "network administration" and "software engineering" into "computer science") to see whether the results improve. Performing feature selection on different levels of the hierarchy and comparing the results manually may give you a better feeling for where the problem lies.
- Merge classes iteratively and perform a one-vs-all classification. During scoring, aggregate the confidence values from the different models (e.g. take the maximum; use the AttributeConstruction operator for that strategy).
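The one-vs-all strategy above might be sketched roughly like this in plain Python. The "binary models" here are stand-in term profiles (not real classifiers), and the max-aggregation step stands in for what AttributeConstruction would do on the confidence attributes:

```python
# Toy stdlib sketch of one-vs-all scoring: one scorer per class, then
# aggregate the per-class confidences by taking the maximum.
from collections import Counter

train = {
    "ml": ["svm kernel margin", "kernel learning"],
    "net": ["tcp routing packet", "routing protocol"],
}

# One "model" per class: its term profile (a stand-in for a binary classifier).
profiles = {c: Counter(tok for d in docs for tok in d.split())
            for c, docs in train.items()}

def confidence(text, cls):
    """Fraction of the document's tokens that the class profile has seen."""
    toks = text.split()
    return sum(1 for t in toks if profiles[cls][t] > 0) / len(toks)

def predict(text):
    # Score the document against every one-vs-all model and keep the max.
    scores = {c: confidence(text, c) for c in profiles}
    return max(scores, key=scores.get)

print(predict("kernel margin trick"))
print(predict("tcp routing"))
```

The first document matches the "ml" profile best, the second the "net" profile; a real setup would replace the profiles with trained binary models and their confidence outputs.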
regards,
Steffen
-
Thanks for clearing that up. I almost lost hope there.
For each category I have close to 100 examples. By the way, what is the ideal number of examples? I'm only working with the abstract section of the documents.
You're right. One reason my classification did not give good results was overlap between the categories. There are categories I should have combined. But is it possible to do hierarchical categorization in RapidMiner? Sort of a superclass for some groups of classes, so that when the program cannot decide between two classes, it chooses their superclass.
> Merge classes iteratively and perform a one vs all classification. During scoring aggregate the confidence-values from the different models (e.g. maximum, use the operator AttributeConstruction for that strategy)
Do you have an example of this?
Last question: what exactly does the attribute weight do? From what I understand, you apply attribute weights to an example set to change the values of its attributes. What else are they used for?
thanks a lot.
Matthew
-
Hello again
> For each category I have close to 100 examples. By the way, what is the ideal number of examples? I'm only working with the abstract section of the documents.
We are talking about a statistical problem here. Let me give you another example: you are given a six-sided die and have to decide whether it is fair. How many times do you have to throw it to tell? (See Wikipedia: statistical hypothesis testing.) In your case of 15 classes, the interesting question is what performance you have to achieve to be better than random (1/15). I cannot cover this topic here, but there is plenty of statistical literature for calculating all these numbers (the number of examples per category, the minimum performance, etc.).
RapidMiner offers the standard t-test... but before we start testing, let's see whether we can achieve some improvements at all.
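To make the better-than-random question concrete, a one-sided exact binomial test can be computed with the Python standard library. The sample size of 150 below (10 held-out abstracts per class) is just an example, not taken from Matthew's setup:

```python
# Is an observed accuracy better than random guessing among 15 classes?
# One-sided exact binomial test: P(X >= k) under the null p0 = 1/15.
from math import comb

def binom_p_value(k, n, p0):
    """Chance of getting at least k of n correct by guessing with prob. p0."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

n = 150          # e.g. a held-out set of 10 abstracts per class
p0 = 1 / 15      # random-guess accuracy for 15 classes

print(binom_p_value(20, n, p0))  # 20/150 correct: clearly above chance?
print(binom_p_value(10, n, p0))  # 10/150 correct: about what guessing gives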
Like haddock once said (oh, I should add this to my signature), "RapidMiner is like Lego". You can achieve nearly anything with the right combination of operators. Here are some hints:
> But is it possible to do hierarchical categorization in RapidMiner?
- AttributeConstruction in combination with ChangeAttributeRole or ExchangeAttributeRoles to aggregate labels
- ProcessBranch to realize an if-else statement
- ValueIterator to iterate over the values of your label attribute
- ProcessLog to log the performance
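The superclass fallback Matthew described could be sketched like this; the class names and the 0.5 threshold are invented purely for illustration:

```python
# Map each leaf class to a parent, and back off to the parent when the
# winning confidence is too weak to trust the leaf-level decision.
hierarchy = {
    "network administration": "computer science",
    "software engineering": "computer science",
    "genomics": "biology",
}

def decide(confidences, threshold=0.5):
    """confidences: {leaf_class: confidence}. Back off to parent if unsure."""
    best = max(confidences, key=confidences.get)
    if confidences[best] >= threshold:
        return best
    return hierarchy.get(best, best)  # fall back to the superclass

print(decide({"network administration": 0.8, "genomics": 0.2}))
print(decide({"network administration": 0.4, "software engineering": 0.35,
              "genomics": 0.25}))
```

In RapidMiner terms, the confidence comparison would live in AttributeConstruction and the branching in ProcessBranch.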
> Last question: what exactly does the attribute weight do? From what I understand, you apply attribute weights to an example set to change the values of its attributes. What else are they used for?
The attribute weight is an indication of how important an attribute is for distinguishing the classes. In the case of FeatureSelection it is always 1 or 0 (use it or don't); other operators (like InformationGainWeighting) provide a less crisp evaluation. Use the AttributeWeightSelection operator to filter the attributes and remove redundant or (worse) misleading information.
As I said above, the optimal feature set may (or rather will) depend on the current "merge situation" of your categories.
I wish you success
regards,
Steffen
PS: If it doesn't work, try this: http://www.youtube.com/watch?v=egfCXLHfw-M (I can't get rid of this song.)
-
I guess I'll need to spend a lot more time with RapidMiner to become familiar with all the operators. In the meantime, I'll try basic classification before moving on to the hierarchical one.
Last question: for the feature selection, do you apply FeatureSelection to one class only, or to more than one class? What I mean is: how many classes should go into the TextInput operator? I tried both. The feature selection with one class runs fast, but the one with many classes failed with "OutOfMemoryError: Java heap space". Is it OK to run FeatureSelection separately for each class and then combine the attribute weight results later on?
thanks,
Matthew
-
Hello Matthew
I suppose that by "one class" you mean "one class vs. all other classes"; otherwise it makes no sense. As noted above, FeatureSelection tries to find a feature set that contains enough of the information (limited to what is available in the data) to separate the classes given the current classification problem, i.e. the label.
That means the feature set will most probably change when you change the label. So there is no single correct strategy; the question is what you want to achieve and (as we have seen above) what can actually be learned.
If you run into memory problems, try the GeneticAlgorithm operator instead, which delivers comparable results.
regards,
Steffen
PS: I have the slight feeling that you are missing some data mining basics. I suggest this book. RapidMiner is a tool for applying a science, so it is better to learn the science first and the tool afterwards. No offense.
-
Hi,
for FeatureSelection you will need all classes of your classification task, because the selection optimizes the feature set for exactly this task. That is why there is a learner and a cross-validation inside: to estimate the performance of the current attribute set on this classification task.
If your data set contains only one class, you don't need any features at all; hence the forward selection is very fast. The performance is simply always 100%, with or without features.
If you need forward selection and the genetic selection doesn't fit your needs, we provide a plugin with an improved and very memory-efficient version of FeatureSelection. You might ask for a quote if you are interested.
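For intuition, the greedy loop inside forward selection looks roughly like the sketch below. The evaluator here is a toy stand-in (fixed per-feature gains); in RapidMiner the score would come from the inner cross-validated performance estimate:

```python
# Minimal greedy forward selection: repeatedly add the feature that most
# improves an estimated performance, stopping when nothing helps.
def forward_selection(features, evaluate):
    selected, best = [], evaluate([])
    improved = True
    while improved:
        improved = False
        for f in features:
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score > best:
                best, chosen, improved = score, f, True
        if improved:
            selected.append(chosen)
    return selected, best

# Toy evaluator: pretend features 'a' and 'c' each add value, 'b' is noise.
gains = {"a": 0.3, "b": 0.0, "c": 0.2}
evaluate = lambda feats: sum(gains[f] for f in feats)

print(forward_selection(["a", "b", "c"], evaluate))
```

The loop picks 'a' first (largest gain), then 'c', and stops because adding 'b' no longer improves the score, which is exactly why the number of evaluations, and hence memory and runtime, grows quickly with many attributes.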
Greetings,
Sebastian
-
Hi,
> PS: I have the slight feeling that you are missing some data mining basics. I suggest this book. RapidMiner is a tool for applying a science, so it is better to learn the science first and the tool afterwards. No offense.
Can you suggest a good text mining book? My application is limited to text mining, and the text mining book I have is not enough to understand most of the operators in RM. I doubt, though, that there is a text mining book that explains most of the RM operators the way the book you suggested does. I'll buy it anyway.
I think Rapid-I should publish a book on data mining with RM. The content of this forum is more than enough to fill a book.
thanks,
Matthew
-
Hi,
you won't believe it, but we are working on a book...
Greetings,
Sebastian
-
Sebastian Land wrote:
> you won't believe it, but we are working on a book...
That's good news. When can we expect this book?
Matthew
-
Hi,
that depends on our workload from other projects and the like. A first introductory part should be published together with the final release. Let's hope we get it done by then...
Greetings,
Sebastian