"text-processing: extract dates from documents"

Question

Hello everyone!
I have a question about extracting dates from documents and would be very grateful for any help... :)

My problem is as follows: I want to crawl and process web content for subsequent classification. Among other things, I would like to organize the documents by date in order to look for trends or link them to external events. To do this, I need to extract dates from them (that is, from the HTML document or from the document's content itself).

Can anybody give me a hint on how to achieve this? I've seen that there is an "Extract Information" operator, but I don't know how to use it to achieve my goal... :(  (I can't let it match a list of possible dates, which was my first idea...)

Any help is greatly appreciated!
Cheers,
Gero

gero_schwenk · Answer

hi sebastian!
thanks for the hint! I'll get into it...

cheers,
gero

land · Answer

Hi,
in principle yes, but you should really take a look at XPath, for example on Wikipedia.
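
Just to illustrate the idea outside of RapidMiner, here is a minimal Python sketch. It assumes a well-formed XHTML page that keeps its date in a `<meta name="date">` tag; the sample page and the helper name `extract_meta_date` are made up for the example, and your pages may store the date somewhere else entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML page. Real crawled HTML is often not
# well-formed and would need a more tolerant parser than ElementTree.
PAGE = """<html>
  <head>
    <title>Example article</title>
    <meta name="date" content="2010-03-15" />
  </head>
  <body>
    <p>Some text.</p>
  </body>
</html>"""


def extract_meta_date(xhtml):
    """Return the content of <meta name="date">, or None if it is absent."""
    root = ET.fromstring(xhtml)
    # ElementTree supports only a subset of XPath; a simple attribute
    # predicate like this one is within that subset.
    node = root.find(".//meta[@name='date']")
    return node.get("content") if node is not None else None


print(extract_meta_date(PAGE))
```

The same kind of expression should carry over to any tool that accepts XPath queries; only the path to the element holding the date changes from site to site.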

Greetings,
  Sebastian

gero_schwenk · Answer

Hi Sebastian!
Thanks for the hint and your invitation! Unfortunately, I'm traveling on Wednesday, so I will miss it... Just to check whether I get the idea: you're suggesting that I

1) look, for instance, for passages which start with a number between 1 and 31 and end with a 10, using regular regions in the "Cut Document" operator, and

2) extract those passages using "Extract Information" and save them as an attribute (the date), and finally

3) join the table containing the new date attribute with the original term-document matrix by document ID.
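
Outside RapidMiner, I picture these three steps roughly like this in Python. It's only a sketch: the day.month.year pattern, the sample texts, the 20xx rule for two-digit years, and the names `first_date` and `add_dates` are my own assumptions, not how the operators actually behave.

```python
import re
from datetime import date

# Assumed layout: day, month, year separated by "." or "/",
# e.g. "15.03.2010" or "7/4/10".
DATE_RE = re.compile(
    r"\b(0?[1-9]|[12][0-9]|3[01])[./](0?[1-9]|1[0-2])[./]((?:19|20)?\d{2})\b"
)


def first_date(text):
    """Steps 1 and 2: find the first date-like passage and parse it."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = (int(group) for group in match.groups())
    if year < 100:
        year += 2000  # assumption: two-digit years mean 20xx
    try:
        return date(year, month, day)
    except ValueError:
        return None  # matched the pattern but is not a real date


def add_dates(term_rows, texts):
    """Step 3: attach the extracted date to each row, keyed by document ID."""
    return {
        doc_id: {**row, "date": first_date(texts.get(doc_id, ""))}
        for doc_id, row in term_rows.items()
    }


# Hypothetical documents and term counts, keyed by document ID.
texts = {1: "Posted on 15.03.2010 by admin.", 2: "No date in here."}
term_rows = {1: {"trend": 2}, 2: {"trend": 0}}
print(add_dates(term_rows, texts))
```
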

Am I right about this, at least in principle?
Many thanks again, and cheers,
Gero