Combine documents + weighting
simon_knoll
Hello dear RM Team,
it would be a cool feature if the Combine Documents operator could weight incoming documents, so that the terms of one document count more than those of the others.
all the best,
simon
Answers
I wrote a quick implementation of this in the Combine Documents operator source code, and it seems to be working. Any comments?
/*
* RapidMiner
*
* Copyright (C) 2001-2009 by Rapid-I and the contributors
*
* Complete list of developers available at our web site:
*
* http://rapid-i.com
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Affero General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with this program. If not, see http://www.gnu.org/licenses/.
*/
package com.rapidminer.operator.text.io.transformer;
import java.util.ArrayList;
import java.util.List;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.Value;
import com.rapidminer.operator.ports.InputPortExtender;
import com.rapidminer.operator.ports.OutputPort;
import com.rapidminer.operator.text.Document;
import com.rapidminer.operator.text.Token;
/**
* This operator combines several documents by appending their content to a new
* document. The meta data will be added from all documents but the values of
* the first documents will be overwritten by the values of the following.
*
* @author Tobias Malbrecht, Sebastian Land
*/
public class CombineDocumentsOperator extends Operator {

    private InputPortExtender documentInputPorts = new InputPortExtender(
            "documents", getInputPorts());
    private OutputPort documentOutput = getOutputPorts().createPort("document");

    public CombineDocumentsOperator(OperatorDescription description) {
        super(description);
        documentInputPorts.start();
        getTransformer().addGenerationRule(documentOutput, Document.class);
    }

    @Override
    public void doWork() throws OperatorException {
        List<Document> documents = documentInputPorts.getData(true);
        List<Token> tokens = new ArrayList<Token>();
        Document result = new Document(tokens);
        // Inspect the label of each document. If it follows the pattern
        // <label>_weight_<weight>, parse <weight> as a float and multiply
        // every token weight of that document by it.
        for (Document document : documents) {
            String label = (String) document.getMetaDataValue("label");
            // Documents without a label are appended unweighted.
            String[] splitted = (label == null) ? new String[0] : label.split("_weight_");
            if (splitted.length > 1) {
                // Throws NumberFormatException if <weight> is not a valid number.
                float weight = Float.parseFloat(splitted[1]);
                for (Token token : document.getTokenSequence()) {
                    tokens.add(new Token(token.getToken(), token.getWeight() * weight));
                }
            } else {
                tokens.addAll(document.getTokenSequence());
            }
            if (splitted.length > 0) {
                // Cosmetic only: strip the weight suffix from the label.
                document.addMetaData("label", splitted[0],
                        document.getMetaDataType("label"));
            }
            result.addMetaData(document);
        }
        documentOutput.deliver(result);
    }
}
Hi,
we have thought about this and think it is a good idea in general. However, assuming that you have something like "label_weight_0.7" in the annotations looks a bit weird. We should at least have a separate weight meta data field, or something similar, that does not require this parsing step. How are you constructing this string in your case?
Best,
Simon
Hi Simon,
doing the weighting within the label was the easiest way for me to integrate it into my program.
Which string are you talking about?
If you mean the string for the label, then it goes like this.
First, a bit of context:
I want to cluster web services, and for that I have documents related to each service. As not every document has the same importance, I have to weight them.
Now, how I build the label name:
the prefix is always the service id, then comes "_weight_", and then a weight value like 0.5
e.g.: SMSService01_weight_0.5
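The label convention described above can be sketched in plain Java, independent of RapidMiner. This is only a minimal sketch: the class name `WeightedLabel` and its fields are hypothetical, and treating a label without a "_weight_" suffix as weight 1.0 is an assumption chosen to match the operator code earlier in the thread, which leaves such documents unweighted.

```java
// Hypothetical helper, not part of RapidMiner: splits a label of the form
// <serviceId>_weight_<weight> into its two parts.
class WeightedLabel {
    static final String SEPARATOR = "_weight_";

    final String id;
    final float weight;

    WeightedLabel(String id, float weight) {
        this.id = id;
        this.weight = weight;
    }

    // Labels without the separator get the neutral weight 1.0 (assumption).
    // A malformed weight makes Float.parseFloat throw NumberFormatException.
    static WeightedLabel parse(String label) {
        if (label == null) {
            throw new IllegalArgumentException("label must not be null");
        }
        int idx = label.lastIndexOf(SEPARATOR);
        if (idx < 0) {
            return new WeightedLabel(label, 1.0f);
        }
        String id = label.substring(0, idx);
        float weight = Float.parseFloat(label.substring(idx + SEPARATOR.length()));
        return new WeightedLabel(id, weight);
    }

    public static void main(String[] args) {
        WeightedLabel parsed = parse("SMSService01_weight_0.5");
        System.out.println(parsed.id + " -> " + parsed.weight); // prints "SMSService01 -> 0.5"
    }
}
```

Using `lastIndexOf` rather than `split` keeps a service id intact even if the id itself happens to contain the separator.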
all the best,
simon
Hi Simon,
thanks for clarifying this. Actually, I was wondering which operator you are using to construct these strings. Is it an RM operator or your own implementation?
Do you agree that this concatenation of strings is not the most elegant solution if we want to incorporate it into the release?
Best,
Simon
Hi Simon,
The string is not constructed by a RapidMiner operator but by my own code, where I set the label names of the Create Document operators.
But I agree with you that for a release there should be a more elegant and general way, maybe a meta data field that can be set for every document, as you mentioned in your previous post.
This was just quick and dirty coding that fit into my own implementation. I would also appreciate it if, when this makes it into a release, it could be handled through meta data.
all the best,
Simon
Hi,
if you change it so that we have an additional meta data field "weight" which always contains a number, I would copy that into the next release. What do you think?
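A minimal sketch of what such a "weight" meta data field could look like, written in plain Java so it runs without the Text Extension. `Tok` and `Doc` are hypothetical stand-ins for the extension's `Token` and `Document` classes, and the choices to default a missing weight to 1.0 and to drop "weight" from the combined meta data are assumptions, not something decided in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WeightedCombineSketch {
    // Hypothetical stand-in for a token: a term plus its weight.
    static final class Tok {
        final String term;
        final float weight;
        Tok(String term, float weight) { this.term = term; this.weight = weight; }
    }

    // Hypothetical stand-in for a document: tokens plus free-form meta data.
    static final class Doc {
        final List<Tok> tokens = new ArrayList<>();
        final Map<String, Object> metaData = new HashMap<>();
    }

    // Appends all tokens to one document, scaling each document's token
    // weights by its numeric "weight" meta data value (1.0 if absent).
    static Doc combine(List<Doc> documents) {
        Doc result = new Doc();
        for (Doc document : documents) {
            Object raw = document.metaData.get("weight");
            float weight = (raw instanceof Number) ? ((Number) raw).floatValue() : 1.0f;
            for (Tok token : document.tokens) {
                result.tokens.add(new Tok(token.term, token.weight * weight));
            }
            // Later documents overwrite earlier meta data values.
            result.metaData.putAll(document.metaData);
        }
        // The per-document weight is already applied, so it is not carried over.
        result.metaData.remove("weight");
        return result;
    }

    public static void main(String[] args) {
        Doc a = new Doc();
        a.tokens.add(new Tok("sms", 1.0f));
        a.metaData.put("label", "SMSService01");
        a.metaData.put("weight", 0.5);

        Doc b = new Doc();
        b.tokens.add(new Tok("gateway", 1.0f));

        for (Tok t : combine(Arrays.asList(a, b)).tokens) {
            System.out.println(t.term + "=" + t.weight); // prints "sms=0.5" then "gateway=1.0"
        }
    }
}
```

With the weight stored as a number, the label stays a plain identifier and no string parsing is needed at combine time.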
Best,
Simon
Hi Simon,
sorry for the late answer. I would appreciate it if this feature made it into the next release.
When will the next release happen?
all the best
simon
Hi,
we will include weighting in the next major release of the Text Extension. There are many ongoing changes besides this, so it might take some time.
Greetings,
Sebastian