Problem with XPath and CDATA-declaration
colo
New Altair Community Member
Hello everybody,
I am currently expanding my RapidMiner process for extracting information from several websites. I added some new URLs today and ran into problems with a page containing a (valid) CDATA declaration. The XHTML document contains a small script block somewhere in the body area:
<script type="text/javascript">
// <![CDATA[
function popOut() {
window.open("http://newyorker.radio.de/micro/newyorker/index.jsp","radiode","width=473,height=629,scrollbars=no,resizable=no,location=no");
}
// ]]>
</script>
This seems to be valid XHTML syntax to me. I retrieve this document via "Get Page" and want to extract some parts with "Cut Document" using XPath expressions. As soon as I enter any (valid) expression to cut the document into pieces, this results in the following error:
Exception: org.jdom.IllegalDataException
Message: The data "
// <![CDATA[
function popOut() {
window.open("http://newyorker.radio.de/micro/newyorker/index.jsp","radiode","width=473,height=629,scrollbars=no,resizable=no,location=no");
}
// ]]>
" is not legal for a JDOM CDATA section: CDATA cannot internally contain a CDATA ending delimiter (]]>).
Stack trace:
org.jdom.CDATA.setText(CDATA.java:121)
org.jdom.CDATA.<init>(CDATA.java:95)
org.jdom.DefaultJDOMFactory.cdata(DefaultJDOMFactory.java:97)
org.jdom.input.SAXHandler.flushCharacters(SAXHandler.java:652)
org.jdom.input.SAXHandler.flushCharacters(SAXHandler.java:623)
org.jdom.input.SAXHandler.endElement(SAXHandler.java:678)
org.ccil.cowan.tagsoup.Parser.pop(Parser.java:605)
org.ccil.cowan.tagsoup.Parser.etag_basic(Parser.java:574)
org.ccil.cowan.tagsoup.Parser.etag(Parser.java:521)
org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:488)
org.ccil.cowan.tagsoup.Parser.parse(Parser.java:399)
org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
org.jdom.input.SAXBuilder.build(SAXBuilder.java:851)
com.rapidminer.operator.text.tools.queries.XPathQuery.getAllMatches(XPathQuery.java:108)
com.rapidminer.operator.text.io.segmenter.DocumentSegmentingOperator.doWork(DocumentSegmentingOperator.java:89)
com.rapidminer.operator.Operator.execute(Operator.java:768)
com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:368)
com.rapidminer.operator.Operator.execute(Operator.java:768)
com.rapidminer.Process.run(Process.java:863)
com.rapidminer.Process.run(Process.java:770)
com.rapidminer.Process.run(Process.java:765)
com.rapidminer.Process.run(Process.java:755)
com.rapidminer.gui.ProcessThread.run(ProcessThread.java:65)
The script block containing the CDATA declaration is important neither for my information extraction nor for my XPath expressions. I could of course preprocess the document to remove the problematic part (it works well as soon as both CDATA lines are removed), but I think this should work anyway as long as I have valid (X)HTML code. Is something wrong here, or might this be a bug somewhere in the XPath interpreter or in its interaction with RapidMiner? Some information about this would be useful for my future work and the problems I will be facing.
Thanks for any hints.
colo
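For readers who want to try the preprocessing workaround mentioned in the post above, here is a minimal sketch in Python. It assumes the page source is already available as a string and simply drops every script element before any XPath step runs. The function name `strip_script_blocks` and the sample page are made up for illustration and are not part of RapidMiner.

```python
import re

# Matches one complete <script>...</script> element,
# case-insensitively and across line breaks (non-greedy).
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                          re.IGNORECASE | re.DOTALL)


def strip_script_blocks(html: str) -> str:
    """Remove all script elements so the CDATA markers never reach the parser."""
    return SCRIPT_BLOCK.sub("", html)


# Hypothetical sample page modelled on the snippet from the post.
page = """<body>
<p>before</p>
<script type="text/javascript">
// <![CDATA[
function popOut() {}
// ]]>
</script>
<p>after</p>
</body>"""

cleaned = strip_script_blocks(page)
print(cleaned)
```

A regular expression is good enough here only because script elements cannot be nested; it is not a general way to edit HTML.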
Answers
Hi,
I once had this problem, too, but I can't remember how I solved it. Since you already know how to avoid this problem, I would recommend going this way and adding this to our bug tracker. Please attach the source code of the respective document.
Greetings,
Sebastian
Hi Sebastian,
thanks for that note, I have submitted a bug report.
In the meantime I discovered another problem with XPath. I have an HTML table and want to extract only its rows via "Cut Document", using XPath to identify the rows (TR elements). This works fine for the first few rows, but then two hidden input fields follow after every table row. The document isn't using XHTML syntax, so this should be allowed.
I manually removed the inputs between the first two table rows of the example below, which results in both of those rows being selected via XPath. All the following TRs are ignored although they should match my XPath expression. Selecting all of the rows works just fine in Firefinder (a Firebug extension), which I use for testing my expressions on online examples before integrating them into RapidMiner operators. This is the markup:
<tr>
<td scope="row" width="1%" align="center"></td>
<td class="COLUMNns1" width="1%" align="center">
<label for="checkbox0">
</label>
<input name="chkJobClientIds" value="322766|23|0" onclick="javascript:onCheckJob(0,322766,'23','0');" type="checkbox"></td>
<td class="COLUMNns1" colspan="1">
<a href="https://sjobs.brassring.com/1031/ASP/TG/cim_jobdetail.asp?SID=^asYSrg3/_slp_rhc_vL0u8/KnyZNVDK_slp_rhc_KKIC/tvQQ4aBBX1Bo6EHZV7KEdbpgagyEJrVx7JV93f6Ls1jpXWz_C_R__L_F_gzM_slp_rhc_K0u29TytHxBjjVYjZTqVGUVjYy8=&amp;jobId=322766&amp;type=search&amp;JobReqLang=23&amp;recordstart=1&amp;JobSiteId=5041&amp;JobSiteInfo=322766_5041&amp;GQId=0">2 x Mitarbeiter/in Restaurant 60 h/Mon. ab 1.07.2010 befristet gesucht</a></td>
<td class="COLUMNns1" colspan="1">Teilzeit </td>
<td class="COLUMNns1" colspan="1">Befristete </td>
<td class="COLUMNns1" colspan="1">Restaurant (IKEA Food) </td>
<td class="COLUMNns1" colspan="1">Leipzig</td>
<td class="COLUMNns1" colspan="1">10/07/2010</td>
<td class="COLUMNns1" colspan="2" width="10%">28/06/2010</td>
</tr>
<tr>
<td scope="row" width="1%" align="center"></td>
<td class="COLUMNsel1" width="1%" align="center">
<label for="checkbox1">
</label>
<input name="chkJobClientIds" value="322796|23|0" onclick="javascript:onCheckJob(1,322796,'23','0');" type="checkbox"></td>
<td class="COLUMNsel1" colspan="1">
<a href="https://sjobs.brassring.com/1031/ASP/TG/cim_jobdetail.asp?SID=^asYSrg3/_slp_rhc_vL0u8/KnyZNVDK_slp_rhc_KKIC/tvQQ4aBBX1Bo6EHZV7KEdbpgagyEJrVx7JV93f6Ls1jpXWz_C_R__L_F_gzM_slp_rhc_K0u29TytHxBjjVYjZTqVGUVjYy8=&amp;jobId=322796&amp;type=search&amp;JobReqLang=23&amp;recordstart=1&amp;JobSiteId=5041&amp;JobSiteInfo=322796_5041&amp;GQId=0">Mitarbeiter/in Kundenservice/Kasse 65Std./ab sofort befr. bis 28.2.11</a></td>
<td class="COLUMNsel1" colspan="1">Teilzeit </td>
<td class="COLUMNsel1" colspan="1">Befristete </td>
<td class="COLUMNsel1" colspan="1">Kundenservice </td>
<td class="COLUMNsel1" colspan="1">Braunschweig</td>
<td class="COLUMNsel1" colspan="1">05/07/2010</td>
<td class="COLUMNsel1" colspan="2" width="10%">28/06/2010</td>
</tr>
<input name="hidJobSiteId" value="322796_5041" type="hidden">
<input name="hidJobGQId" value="0" type="hidden">
<tr>
<td scope="row" width="1%" align="center"></td>
<td class="COLUMNns1" width="1%" align="center">
<label for="checkbox2">
</label>
<input name="chkJobClientIds" value="330660|23|0" onclick="javascript:onCheckJob(2,330660,'23','0');" type="checkbox"></td>
<td class="COLUMNns1" colspan="1">
<a href="https://sjobs.brassring.com/1031/ASP/TG/cim_jobdetail.asp?SID=^asYSrg3/_slp_rhc_vL0u8/KnyZNVDK_slp_rhc_KKIC/tvQQ4aBBX1Bo6EHZV7KEdbpgagyEJrVx7JV93f6Ls1jpXWz_C_R__L_F_gzM_slp_rhc_K0u29TytHxBjjVYjZTqVGUVjYy8=&amp;jobId=330660&amp;type=search&amp;JobReqLang=23&amp;recordstart=1&amp;JobSiteId=5041&amp;JobSiteInfo=330660_5041&amp;GQId=0">Mitarbeiter/in Buchhaltung gesucht, 30h/Monat, befr. bis 31.08.2011</a></td>
<td class="COLUMNns1" colspan="1">Teilzeit </td>
<td class="COLUMNns1" colspan="1">Befristete </td>
<td class="COLUMNns1" colspan="1">Administration</td>
<td class="COLUMNns1" colspan="1">Regensburg</td>
<td class="COLUMNns1" colspan="1">10/07/2010</td>
<td class="COLUMNns1" colspan="2" width="10%">26/06/2010</td>
</tr>
<input name="hidJobSiteId" value="330660_5041" type="hidden">
<input name="hidJobGQId" value="0" type="hidden">
...
Any suggestions?
Hi,
which input type did you choose for the documents? If you select HTML, the file goes through a parser that corrects illegal constructions. While this is what makes XPath usable at all in many situations, it sometimes changes the original document structure, so that expressions do not work as expected. If you know the document is legal XML, you should select XML instead of HTML.
Greetings,
Sebastian
Hello Sebastian,
thanks for your feedback. I selected HTML because the document is not even valid XHTML (for example, the standalone input elements are not self-closing). In this case I would prefer an XPath parser that does not change the original code, even if that means a few more errors for expressions that cannot be resolved. I remember that this automatic code correction bothered me before. Is it really essential for the correct operation of the parser? If not, is there some way to deactivate this behavior? Or can I at least somehow take a look at the corrected code? This could perhaps enable me to find a cleaner way than deleting all input elements in the table I want to access via XPath.
Thanks for any help.
Regards,
colo
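The fallback described above, deleting the stray input elements before cutting the document, could be sketched roughly as follows in Python. This is only an illustration under the assumption that the markup is available as a string. The helper name `strip_hidden_inputs` and the shortened sample rows are hypothetical, and the pattern removes only inputs whose type is hidden, so the checkbox inputs inside the cells survive.

```python
import re

# Matches an <input ...> tag whose type attribute is "hidden"
# (quoted with " or ', or unquoted), regardless of attribute order.
HIDDEN_INPUT = re.compile(
    r"""<input\b[^>]*\btype\s*=\s*(["']?)hidden\1[^>]*>""",
    re.IGNORECASE,
)


def strip_hidden_inputs(html: str) -> str:
    """Drop hidden input elements, e.g. the ones placed between table rows."""
    return HIDDEN_INPUT.sub("", html)


# Shortened, hypothetical version of the table rows from the post.
rows = (
    '<tr><td><input name="chkJobClientIds" value="322796|23|0" type="checkbox"></td></tr>\n'
    '<input name="hidJobSiteId" value="322796_5041" type="hidden">\n'
    '<input name="hidJobGQId" value="0" type="hidden">\n'
    '<tr><td>Regensburg</td></tr>'
)

cleaned_rows = strip_hidden_inputs(rows)
print(cleaned_rows)
```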
Hi,
as I said, you can turn off the correction by changing the input type to XML. But the available XPath parsers rely on standards-conformant XML and hence won't work if you have damaged XML and turn the correction off.
The corrected text is available in the document view if you set a breakpoint inside the Process Documents operator.
I have already thought about adding some sort of XPath editor to this view, giving you the same comfort Firefox users are used to, perhaps together with a proper XML view, since I stumbled over this problem myself last time...
Greetings,
Sebastian
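To illustrate why a strict XML parser cannot cope with the HTML-style markup discussed in this thread, here is a small sketch using Python's standard `xml.etree.ElementTree` rather than RapidMiner's own parser stack. The two sample strings and the helper `count_rows` are made up for this example. The first string has an unclosed input element between the rows, as in the page above; the second is the same markup with the input self-closed.

```python
import xml.etree.ElementTree as ET

# HTML-style: the <input> between the rows is never closed.
html_style = ("<table><tr><td>a</td></tr>"
              "<input name='x' type='hidden'>"
              "<tr><td>b</td></tr></table>")

# XML-style: identical, except that the <input> is self-closed.
xml_style = ("<table><tr><td>a</td></tr>"
             "<input name='x' type='hidden'/>"
             "<tr><td>b</td></tr></table>")


def count_rows(markup: str):
    """Return the number of tr elements, or None if the markup is not well-formed XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    # ".//tr" selects tr elements at any depth below the root.
    return len(root.findall(".//tr"))


print(count_rows(html_style))  # strict parsing rejects the unclosed <input>
print(count_rows(xml_style))   # both rows are found
```

With a strict parser the choice is therefore between repairing the markup first, which may move elements around, and getting no XPath results at all.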