Parsing out plain text from the Reuters RCV1 corpus - XPath, XML
I have a question about extracting node content with XPath from a large number of XML files. I am fully aware that there are masses of resources on the internet on this topic, but I still cannot get it to work and it is driving me crazy. I want to extract information from the files of the Reuters RCV1 experimental corpus. All files in this corpus share the same structure. I post the structure here as an example:
<?xml version="1.0" encoding="iso-8859-1" ?>
<newsitem itemid="1000000" id="root" date="xxx" xml:lang="en">
<title>title title title</title>
<headline>headline headline headline</headline>
<byline>Jack Daniels</byline>
<dateline>Blabla</dateline>
<text>
<p> Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 </p>
<p> Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 </p>
<p> Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 </p>
<p> Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 </p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0">
<code code="MEX">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-02-20"/>
</code>
</codes>
<codes class="bip:topics:1.0">
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-08-20"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
</codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
<dc element="dc.creator.location" value="xxx"/>
<dc element="dc.creator.location.country.name" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
</metadata>
</newsitem>
The final goal of my task is to convert these several thousand files into a table or a CSV file. I am doing this by addressing the content of the individual nodes via their XPath expressions. This works without problems for every field but one: the content of <text></text>. With //newsitem/text/p/node() I only ever get the first paragraph back. What I am looking for is a way to extract the plain text of all paragraphs. The CSV file should look approximately like this:
title, headline, date, text, location
titleblabla, headlineblabla, xxx, paragraph 1 paragraph 2 paragraph 3, anywhere
othertitleblabla, otherheadlineblabla, otherdatexxx, other paragraph 1 paragraph 2 paragraph 3, nowhere
The paragraphs of each file should thus be collapsed into a single field. With the query /newsitem/text I get the whole text body, but with all the tags included, which is impractical with so many files.
Could somebody please explain how to achieve this with XPath? An additional difficulty is that I have to extract other information at the same time, so the plain text and the attribute values should end up in the same row of the table.
Thank you very much,
A desperate XML/XPath newbie
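For illustration, here is a minimal sketch of one way the goal described above could be approached. It is not a definitive solution. In XPath 1.0, string(/newsitem/text) yields the concatenated text of all descendant nodes without tags, and normalize-space(/newsitem/text) additionally collapses the whitespace. Whether that expression can be used directly depends on the tool, so the sketch below assumes Python's standard xml.etree.ElementTree, which supports only a subset of XPath, and joins the paragraph text in code instead. The names SAMPLE, FIELDS, newsitem_to_row and rows_from_files are hypothetical, and taking the location from the dc.creator.location entry is an assumption about which field is wanted.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for one corpus file (assumption: real files share this layout).
SAMPLE = """<newsitem itemid="1000000" id="root" date="xxx" xml:lang="en">
<title>title title title</title>
<headline>headline headline headline</headline>
<text>
<p> Paragraph 1 Paragraph 1 </p>
<p> Paragraph 2 Paragraph 2 </p>
</text>
<metadata>
<dc element="dc.creator.location" value="anywhere"/>
</metadata>
</newsitem>"""

FIELDS = ["title", "headline", "date", "text", "location"]


def newsitem_to_row(root):
    """Build one table row (a dict) from a parsed <newsitem> element."""
    # Collect the text of every <p>, dropping tags and collapsing whitespace.
    paragraphs = [
        " ".join("".join(p.itertext()).split())
        for p in root.findall("text/p")
    ]
    # Assumption: the location is the value of the dc.creator.location entry.
    loc = root.find("metadata/dc[@element='dc.creator.location']")
    return {
        "title": root.findtext("title", default="").strip(),
        "headline": root.findtext("headline", default="").strip(),
        "date": root.get("date", ""),
        "text": " ".join(paragraphs),
        "location": loc.get("value", "") if loc is not None else "",
    }


def rows_from_files(paths):
    """Yield one row per XML file; ET.parse reads the encoding declaration itself."""
    for path in paths:
        yield newsitem_to_row(ET.parse(path).getroot())


# Demo on the inline sample; for real data, pass a list of file paths
# to rows_from_files and write to an open file instead of a StringIO.
row = newsitem_to_row(ET.fromstring(SAMPLE))
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
csv_output = buf.getvalue()
print(csv_output)
```

The csv module quotes fields that contain commas or line breaks, so paragraph text with commas still stays in a single column.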