"text-processing: extract dates from documents"
gero_schwenk
New Altair Community Member
Hello everyone!
I've got a question regarding the extraction of dates from documents and would be very happy for help...
My problem is as follows: I want to crawl and process web content for subsequent classification. Among other things, I would like to organize the documents by date in order to look for trends or link them to external events. To do this, I need to extract dates from them (that is, from the HTML document or from the document's content itself).
Can anybody give me a hint on how to achieve this? I've seen that there is an "Extract Information" operator, but I don't know how to use it to achieve my goal... (I can't let it match a list of possible dates, which was my first idea...)
Any help is greatly appreciated!
Cheers,
Gero
Answers
Hi Gero,
I think a combination of the Cut Document and Extract Information operators will help you. Unfortunately, it is a little tricky to combine them so that they match a certain document structure. If the date is the content of a div tag, try using XPath expressions that specify this tag.
I will give a webinar on this topic on Wednesday, where I will show this in practice. More specifically, I will show how to extract posts from this forum, along with the poster and the date. There are still open slots for participants.
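To illustrate the XPath idea outside RapidMiner, here is a minimal Python sketch. It assumes the date sits in a div with the class "date" (a made-up class name) and that the page is well-formed XHTML; the sample page and the function name are hypothetical. Python's built-in ElementTree supports only a subset of XPath, so real-world HTML would probably need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Real pages are rarely this clean
# and usually need an HTML-tolerant parser before XPath can be applied.
PAGE = """<html><body>
<div class="post"><div class="date">2010-03-17</div><p>Some text</p></div>
<div class="post"><div class="date">2010-03-18</div><p>More text</p></div>
</body></html>"""


def extract_dates(xhtml: str, class_name: str = "date") -> list[str]:
    """Return the text of every div whose class attribute equals class_name."""
    root = ET.fromstring(xhtml)
    # ".//div[@class='date']" selects matching div elements at any depth.
    nodes = root.findall(f".//div[@class='{class_name}']")
    return [(node.text or "").strip() for node in nodes]


if __name__ == "__main__":
    print(extract_dates(PAGE))
```

The same XPath expression should carry over to any tool that accepts XPath, as long as the tag and attribute names are adapted to the actual page.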
Greetings,
Sebastian
Hi Sebastian!
Thanks for the hint and your invitation! Unfortunately, I'll be traveling on Wednesday, so I will miss it... Just to check whether I get the idea: you suggest that I
1) look, for instance, for passages which start with a number between 1 and 31 and end with a 10, using regular regions in the "cut document" operator, and
2) extract those passages using "extract information" and save them as an attribute (the date), and finally
3) join the table containing the new date attribute with the original term-document matrix by document ID.
Am I right about this, at least in principle?
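Outside RapidMiner, the three steps could be sketched roughly like this in Python. The sketch assumes dates are written as day.month.year (for example 17.03.10 or 17.03.2010) and that two-digit years mean 20xx; the document IDs, texts, and term counts are invented for illustration.

```python
import re
from datetime import date

# Assumed format: day.month.year with a two- or four-digit year.
DATE_RE = re.compile(
    r"\b(0?[1-9]|[12]\d|3[01])\.(0?[1-9]|1[0-2])\.(\d{4}|\d{2})\b"
)


def first_date(text):
    """Steps 1 and 2: find the first valid date in the text, or None."""
    for match in DATE_RE.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        if year < 100:
            year += 2000  # assumption: two-digit years are 20xx
        try:
            return date(year, month, day)
        except ValueError:
            continue  # pattern matched, but the date is impossible (e.g. 31.02.)
    return None


# Hypothetical documents and term counts, keyed by document ID.
docs = {
    "doc1": "Posted on 17.03.10 by someone",
    "doc2": "No date here.",
}
term_matrix = {
    "doc1": {"trend": 2, "event": 1},
    "doc2": {"trend": 0, "event": 3},
}

# Step 3: join the extracted date with the term counts by document ID.
joined = {
    doc_id: {"date": first_date(docs[doc_id]), **terms}
    for doc_id, terms in term_matrix.items()
}

if __name__ == "__main__":
    for doc_id, row in joined.items():
        print(doc_id, row)
```

Documents without a recognizable date keep a date of None, so they can be filtered out or handled separately later.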
Many thanks again and cheers:
Gero
Hi,
In principle yes, but you should really take a look at XPath, for example on Wikipedia.
Greetings,
Sebastian
hi sebastian!
thanks for the hint! I'll get into it...
cheers,
gero