[SOLVED] xpath
amypu
New Altair Community Member
Below is an example XML.
<p>
Thisisgood
</p>
<p>
Thisisbad
</p>
<p>
This
<br>
is
<br>
acceptable
</p>
<p>
Thisisfine
</p>
I want the result:
Thisisgood
Thisisbad
Thisisacceptable
Thisisfine
I use the XPath //p/text() in Google Docs (=importXML). Ultimately, I will use //h:p/text() in RapidMiner (with the Extract Information operator). This results in:
Thisisgood
Thisisbad
This is acceptable (appearing in different cells)
Thisisfine
What XPath would give me the result I need? Thank you.
Answers
Well, what result do you need?
Best regards,
Marius
I would like to have the following result:
Thisisgood
Thisisbad
Thisisacceptable
Thisisfine
I DO NOT want:
This is acceptable (appearing in different cells)
Thanks.
Hi,
this is the community forum; for guaranteed response times, please consider getting a support contract. During the holidays our main focus is not on free support.
However, let's focus on your issue: which versions of RapidMiner and the Text and Web extension are you using? I can't reproduce the behavior with text in different cells with Extract Information. In the latest versions, Extract Information delivers only the first result node; for //h:p/text() that would be "This" in the "This is acceptable" case. That is surely not what you want either. So in your case the procedure is to cut the document into its p tags and then extract the content of each p node with Extract Content. Optionally, you can then use Replace to remove the spaces.
Please see the process below for details.
Best regards,
Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="112" y="30">
<parameter key="text" value="&lt;p&gt; Thisisgood &lt;/p&gt; &lt;p&gt; Thisisbad &lt;/p&gt; &lt;p&gt; This &lt;br&gt; is &lt;br&gt; acceptable &lt;/p&gt; &lt;p&gt; Thisisfine &lt;/p&gt; "/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document" width="90" x="246" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="p" value="//h:p"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="5.3.001" expanded="true" height="60" name="Extract Content" width="90" x="179" y="30">
<parameter key="minimum_text_block_length" value="1"/>
</operator>
<operator activated="false" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="313" y="120">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="result" value=" //h:p/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="segment" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data" width="90" x="380" y="30">
<parameter key="text_attribute" value="text"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
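For readers who want to check the expected output outside RapidMiner, the same cut-then-extract idea (split the document into its p elements, join all the text inside each one, then drop the whitespace) can be sketched in plain Python. This is only an illustration under stated assumptions, not what the RapidMiner operators do internally: the class `ParagraphText`, the function `paragraph_texts`, and the `SAMPLE` string are hypothetical names made up for this example, and it uses the standard-library `html.parser` because the sample contains unclosed `<br>` tags that a strict XML parser would reject.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p> element as a single string (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # one entry per finished <p>
        self._parts = None    # text pieces of the <p> that is currently open

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []
        # <br> and any other tags are ignored; only their surrounding text is kept

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            # Join the pieces, then remove all whitespace (the optional "Replace" step)
            self.paragraphs.append("".join("".join(self._parts).split()))
            self._parts = None


def paragraph_texts(html):
    """Return one whitespace-free string per <p> element in the given HTML."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# The sample from the question, with the unclosed <br> tags left as they are
SAMPLE = """<p>
Thisisgood
</p>
<p>
Thisisbad
</p>
<p>
This
<br>
is
<br>
acceptable
</p>
<p>
Thisisfine
</p>
"""

print(paragraph_texts(SAMPLE))
# ['Thisisgood', 'Thisisbad', 'Thisisacceptable', 'Thisisfine']
```

Removing every whitespace character suits this sample, where each paragraph is meant to be one token. For paragraphs that contain ordinary sentences, joining the pieces with `" ".join(...)` instead would probably be the safer choice.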