Generate meta data for a document
Dear RapidMiners,
I'd like to know whether it is possible to generate meta data for a document. Here is the context:
In my web/text mining project I have an example set, in which one attribute is a URL request link. I need to mine the content to which this link points and store it back in the original example set.
My current implementation is: I fork my process such that my original example set flows through one branch, while in the other branch I loop over the values of my URL request link attribute, access the content using the Get Page operator, and extract the information I need from there. The problem is: how do I merge the new example set I got here with the original one?
My idea was to merge the two example sets, using the URL request link as a unique identifier. However, I don't know how to do it :-( What I tried was to use the URL that is attached as meta data to the output of the Get Page operator. However, that URL is slightly modified with respect to the one I used as the input to Get Page:
URL in the meta data of the Get Page output:
Although it looks like only a small part of the link is modified, according to one of your developers there is no guarantee about the modification pattern, so I cannot rely on it to reconstruct the original URL.
My next idea is to somehow attach the input URL, which I can access via the %{loop_value} macro, as meta data to the document I get out of Get Page. However, I didn't find a way to do this.
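
To make the idea concrete outside of RapidMiner, here is a minimal Python sketch of what I am trying to achieve: the fetched content is keyed by the URL I passed in (the equivalent of %{loop_value}), not by the URL reported back with the page, and is then joined to the original rows on that key. This is only an illustration, not a RapidMiner process. The names enrich_with_page_content, fake_fetch, request_url and content are hypothetical, and fake_fetch stands in for the actual page request.

```python
from typing import Callable, Dict, List


def enrich_with_page_content(
    rows: List[Dict[str, str]],
    url_field: str,
    fetch: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return copies of rows with a 'content' field fetched from each row's URL."""
    # Key the fetched content by the INPUT URL (the loop value),
    # not by whatever URL the fetcher reports back, which may be modified.
    content_by_url: Dict[str, str] = {}
    for row in rows:
        url = row[url_field]
        if url not in content_by_url:  # fetch each distinct URL only once
            content_by_url[url] = fetch(url)

    # Join the content back onto the original rows using the input URL as the key.
    return [{**row, "content": content_by_url[row[url_field]]} for row in rows]


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a real page request (assumption, no network access).
    return "page for " + url


rows = [
    {"id": "1", "request_url": "http://example.com/a"},
    {"id": "2", "request_url": "http://example.com/b"},
]
enriched = enrich_with_page_content(rows, "request_url", fake_fetch)
```

Because the join key never leaves my control, it does not matter how the URL in the page's meta data is rewritten. What I am missing is the RapidMiner equivalent of carrying the loop value along with the document.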
Does anyone have any idea how I could go about my problem? Any inputs would be much appreciated!
Cheers,
Snežana