Extracting Information With XPath
el_chief
New Altair Community Member
Hello,
I am having trouble getting a value from an HTML document using XPath. This is my process; the resulting value is "?":
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
<process expanded="true" height="505" width="415">
<operator activated="true" class="text:create_document" compatibility="5.0.6" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
<parameter key="text" value="&lt;html&gt; &lt;head&gt; &lt;title&gt;hello&lt;/title&gt; &lt;/head&gt; &lt;body&gt; &lt;div class=&quot;class1&quot;&gt;goodbye&lt;/div&gt; &lt;/body&gt; &lt;/html&gt;"/>
</operator>
<operator activated="true" class="text:extract_information" compatibility="5.0.6" expanded="true" height="60" name="Extract Information" width="90" x="179" y="165">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="some_value" value="/html/head/title"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
What am I doing wrong?
It works when I change the XPath to this:
/h:html/h:head/h:title/text()
Is there a way to get rid of that "h:"?
I suspect it has something to do with the namespace.
Thanks
Neil
Answers
Hi el chief,
you almost answered your question yourself. The different behaviour (with or without "h:") does indeed depend on the namespace. The "Extract Information" operator has an expert parameter, "assume html", which makes the parser more tolerant of element nesting than strict XML allows: it "repairs" the document by adding missing tags so that the result is valid XML-like code. At the same time, the HTML elements are bound to the (X)HTML namespace, and the identifier "h" is assigned to it. Compare the operator documentation on the namespaces parameter: "Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (x)html is bound automatically to the identifier h."
If you uncheck the "assume html" parameter, no namespace binding is done automatically and your process works as you posted it above. You can also define your own namespaces and identifiers in the "namespaces" parameter list. For plain XML-like code with custom elements, you don't need to define a namespace if you want to accept all elements without checking them against one.
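The same namespace effect can be reproduced outside RapidMiner. The following is only a minimal sketch using Python's standard `xml.etree.ElementTree`, not the operator's actual implementation: it assumes the repaired document ends up bound to the XHTML namespace `http://www.w3.org/1999/xhtml`, and the helper name `title_text` is made up for illustration. ElementTree supports only a subset of XPath and evaluates paths relative to the root element, so `head/title` stands in for `/html/head/title`.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace that (X)HTML elements are bound to.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# The same document twice: once with its elements in the XHTML namespace
# (roughly what an HTML-repairing parser would produce), once as plain XML.
namespaced_doc = (
    f'<html xmlns="{XHTML_NS}"><head><title>hello</title></head>'
    '<body><div class="class1">goodbye</div></body></html>'
)
plain_doc = (
    '<html><head><title>hello</title></head>'
    '<body><div class="class1">goodbye</div></body></html>'
)


def title_text(doc, path, namespaces=None):
    """Hypothetical helper: text of the first element matching path, or None."""
    root = ET.fromstring(doc)  # root is the <html> element
    node = root.find(path, namespaces)
    return None if node is None else node.text


# Unprefixed names only match elements that are in no namespace.
unprefixed_on_namespaced = title_text(namespaced_doc, "head/title")
# Binding the identifier "h" to the XHTML namespace makes the query match.
prefixed_on_namespaced = title_text(
    namespaced_doc, "h:head/h:title", {"h": XHTML_NS}
)
# Without any namespace binding, the plain path works as posted.
unprefixed_on_plain = title_text(plain_doc, "head/title")

print(unprefixed_on_namespaced)  # None
print(prefixed_on_namespaced)    # hello
print(unprefixed_on_plain)       # hello
```

The first query finding nothing corresponds to the "?" result in the question; the other two correspond to using the "h:" identifier and to switching off the automatic namespace binding.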
Regards,
Matthias
sounds good. will try it without "assume html", and no h: