[SOLVED] JDom - Comment data cannot start with a hyphen
Scotty
New Altair Community Member
Good Afternoon,
I am using XPath to extract information from HTML documents that have been saved on my PC from a web crawl.
Everything seems to work OK, except that occasionally I get the following error:
the data "-10" is not legal for a JDOM comment: Comment data cannot start with a hyphen
When inspecting the HTML I find
<!---10-->
which seems to be causing the problem.
Any ideas of how to get around this?
Many Thanks
Scott
Answers
It would appear that this was a bug in JDOM 1.0 that has been fixed in JDOM 1.1:
"Removing check that a comment not start with a hyphen. A careful reading of production 15 in the XML 1.0 spec indicates leading hyphens are in fact allowed."
(Taken from http://jdom.markmail.org/message/b45honrv3crcmqux, posted 4 years ago.)
If this is the case, what does one need to do to solve the problem?
Thanks
S
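For reference, production 15 of the XML 1.0 spec reads Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'. The following is only a rough sketch of that rule as a regular expression, not JDOM's actual check; the function name is made up for illustration, and the character class is looser than the spec's Char set.

```python
import re

# XML 1.0 production 15:
#   Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Comment data is a run of units, each either a single non-hyphen character
# or a hyphen followed by a non-hyphen character.
# Simplification (assumption): [^-] accepts any character, whereas the spec
# limits Char to the legal XML character ranges.
COMMENT_DATA = re.compile(r"(?:[^-]|-[^-])*")


def is_legal_comment_data(data: str) -> bool:
    """Return True if `data` may appear between '<!--' and '-->'."""
    return COMMENT_DATA.fullmatch(data) is not None


print(is_legal_comment_data("-10"))   # leading hyphen, as in <!---10-->
print(is_legal_comment_data("a--b"))  # double hyphen inside the data
print(is_legal_comment_data("10-"))   # trailing hyphen
```

Under this reading the data "-10" is accepted, while a double hyphen inside the data or a trailing hyphen is rejected, which matches the quoted statement that leading hyphens are allowed.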
Here is an example of the problem.
Any ideas?
Thanks
Scott
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.014">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
<process expanded="true" height="449" width="710">
<operator activated="true" class="web:get_webpage" compatibility="5.1.004" expanded="true" height="60" name="Get Page" width="90" x="45" y="75">
<parameter key="url" value="http://www.talktalkmembers.com/forums/forumdisplay.php?f=9&amp;order=desc&amp;page=13"/>
<list key="query_parameters"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.003" expanded="true" height="94" name="Process Documents" width="90" x="179" y="75">
<parameter key="create_word_vector" value="false"/>
<process expanded="true" height="449" width="710">
<operator activated="true" class="text:extract_information" compatibility="5.1.003" expanded="true" height="60" name="Extract Information" width="90" x="112" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Test" value="//h:h1/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Hi, I have the same problem. I'm using the dataset from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-51/www/co-training/data/ and when I try to load the data into RapidMiner using the 'process documents to files' operator, it gives me the same error. I then put the 'remove documents parts' operator inside that operator, with the regular expression <![^>]*> as its parameter, but the error still appears.
I would appreciate your help. Thanks
Hi,
thanks for the hint. At the moment we are using JDOM 1.0, but we will update to the latest library version soon.
Until then you could use the 'Remove Document Parts' operator with this regular expression: <!---.*-->
This removes every comment that starts with a hyphen, which allows the Extract Information operator to work correctly.
Regards,
Nils
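To see what the suggested expression does outside RapidMiner, here is a small sketch. It assumes the operator applies ordinary regex semantics; the sample HTML string is made up, and the non-greedy variant <!---.*?--> is an untested alternative for RapidMiner itself, not something proposed in the thread.

```python
import re

# Pattern from the reply above: comments whose data starts with a hyphen.
GREEDY = re.compile(r"<!---.*-->")

# Hypothetical non-greedy variant: stops at the first closing "-->",
# and DOTALL lets the comment span several lines.
LAZY = re.compile(r"<!---.*?-->", re.DOTALL)

# Made-up sample: one offending comment followed by an ordinary one.
html = "<p>a</p><!---10--><p>b</p><!-- ok --><p>c</p>"

print(GREEDY.sub("", html))  # <p>a</p><p>c</p>
print(LAZY.sub("", html))    # <p>a</p><p>b</p><!-- ok --><p>c</p>
```

With the greedy form, everything between the offending comment and the last "-->" on the same line is removed as well, so content between two comments can be lost. If that matters for your XPath queries, the non-greedy form may be the safer choice.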