Hello everybody,
I am currently expanding my RapidMiner process for extracting information from several websites. I added some new URLs today and ran into a problem with a page containing a (valid) CDATA section. The XHTML document contains a small script block somewhere in the body:
<script type="text/javascript">
// <![CDATA[
function popOut() {
window.open("http://newyorker.radio.de/micro/newyorker/index.jsp","radiode","width=473,height=629,scrollbars=no,resizable=no,location=no");
}
// ]]>
</script>
This seems to be valid XHTML syntax to me. I retrieve this document via "Get Page" and want to extract some parts with "Cut Document" using XPath expressions. As soon as I enter any (valid) expression to cut the document into pieces, I get the following error:
Exception: org.jdom.IllegalDataException
Message: The data "
// <![CDATA[
function popOut() {
window.open("http://newyorker.radio.de/micro/newyorker/index.jsp","radiode","width=473,height=629,scrollbars=no,resizable=no,location=no");
}
// ]]>
" is not legal for a JDOM CDATA section: CDATA cannot internally contain a CDATA ending delimiter (]]>).
Stack trace:
org.jdom.CDATA.setText(CDATA.java:121)
org.jdom.CDATA.<init>(CDATA.java:95)
org.jdom.DefaultJDOMFactory.cdata(DefaultJDOMFactory.java:97)
org.jdom.input.SAXHandler.flushCharacters(SAXHandler.java:652)
org.jdom.input.SAXHandler.flushCharacters(SAXHandler.java:623)
org.jdom.input.SAXHandler.endElement(SAXHandler.java:678)
org.ccil.cowan.tagsoup.Parser.pop(Parser.java:605)
org.ccil.cowan.tagsoup.Parser.etag_basic(Parser.java:574)
org.ccil.cowan.tagsoup.Parser.etag(Parser.java:521)
org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:488)
org.ccil.cowan.tagsoup.Parser.parse(Parser.java:399)
org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
org.jdom.input.SAXBuilder.build(SAXBuilder.java:851)
com.rapidminer.operator.text.tools.queries.XPathQuery.getAllMatches(XPathQuery.java:108)
com.rapidminer.operator.text.io.segmenter.DocumentSegmentingOperator.doWork(DocumentSegmentingOperator.java:89)
com.rapidminer.operator.Operator.execute(Operator.java:768)
com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:368)
com.rapidminer.operator.Operator.execute(Operator.java:768)
com.rapidminer.Process.run(Process.java:863)
com.rapidminer.Process.run(Process.java:770)
com.rapidminer.Process.run(Process.java:765)
com.rapidminer.Process.run(Process.java:755)
com.rapidminer.gui.ProcessThread.run(ProcessThread.java:65)
The script block containing the CDATA section is not important for my information extraction, and none of my XPath expressions refer to it. I could of course preprocess the document to remove the problematic part (it works fine as soon as both CDATA lines are removed), but I think this should work anyway as long as the input is valid (X)HTML. Am I doing something wrong, or might this be a bug in the XPath interpreter or in its interaction with RapidMiner? Some information about this would help with my future work and the similar pages I will be facing.
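For reference, the preprocessing workaround I have in mind could look roughly like the sketch below. It is plain Java and not a RapidMiner operator; the class name ScriptStripper and the method stripScripts are made up for illustration, and how to hook such a step in between "Get Page" and "Cut Document" is left open. It simply removes every script block with a regular expression before the document reaches the parser. A regex is a blunt tool for HTML: a self-closing script tag, for example, would confuse it.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of RapidMiner, JDOM, or TagSoup.
public class ScriptStripper {

    // Matches <script ...> ... </script>, case-insensitively and across line breaks.
    // The reluctant ".*?" stops at the first closing tag, so separate blocks stay separate.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the document text with all script blocks removed.
    public static String stripScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page that triggers the exception.
        String page = "<body><p>before</p>\n"
                + "<script type=\"text/javascript\">\n"
                + "// <![CDATA[\n"
                + "function popOut() {}\n"
                + "// ]]>\n"
                + "</script>\n"
                + "<p>after</p></body>";
        System.out.println(stripScripts(page));
    }
}
```

After this step the text no longer contains the "]]>" delimiter that the exception message complains about, which matches what I see when I remove the two CDATA lines by hand.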
Thanks for any hints.
colo