XPath problem regarding multiple extracts
TK
New Altair Community Member
Hi. I can't manage to extract several items - for example, titles or other text included in an HTML document - with the RapidMiner XPath processor. If you use the XPath query //h:title/text(), the Extract Information operator only extracts the first title into the resulting metadata attribute, while other XPath visualizers (XPather etc.) will show, for example, all three titles for this query. Is this a RapidMiner problem, an operator limitation, or just stupidity on my part?
Answers
-
Hi,
I have also faced the problem that, for XPath and regular expressions, only the first match is delivered. I am not sure if there is a possibility to get a collection of all matches in RapidMiner (Java should allow this easily).
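For illustration only, here is a minimal sketch (outside of RapidMiner) of what "all matches" looks like with the standard javax.xml.xpath API in Java - the document and element names are invented for this example, and namespaces are omitted for brevity:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class AllMatches {
    public static void main(String[] args) throws Exception {
        // Invented example document containing three title elements.
        String xml = "<doc><title>One</title><title>Two</title><title>Three</title></doc>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Asking for a NODESET returns every match, not only the first one -
        // this is the behaviour the external XPath visualizers show.
        NodeList titles = (NodeList) xpath.evaluate(
                "//title/text()", doc, XPathConstants.NODESET);

        for (int i = 0; i < titles.getLength(); i++) {
            System.out.println(titles.item(i).getNodeValue()); // One, Two, Three
        }
    }
}

Evaluating the same expression as a plain string value instead of a NODESET returns only the string value of the first match in document order, which seems to match what the Extract Information operator delivers.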
What operator do you use for the extraction? I worked around this problem by using the document type (instead of an ExampleSet) and extracting all matches as single documents in a collection for the original document. You can easily achieve this with the "Cut Document" operator. This worked well in my case, but I would also like to know if there are other ways to handle multiple matches.
Regards,
Matthias
-
Hi. I use the Extract Information operator inside the Cut Document operator. I think I can't use another Cut Document operator because of the mass of generated documents. The Extract Information operator is, in my opinion, the best one for my needs (extracting author, citation and posting text from a web forum, which is cut into separate documents per posting via Cut Document). Anyone got an idea for my problem? I just can't fix it.
-
Hi,
actually there should be no problem using multiple Cut Document operators nested inside each other. Memory consumption of documents should be fairly low...
Actually, we could add an option to include all matches in the meta information. I will note this down for the next version.
Greetings,
Sebastian
-
OK, great (XPath is somewhat useless in the Extract Information operator without multiple outputs into metadata or attributes ^^). So I guess I'll have to use the Cut Document operator and merge the documents at the end of the process. Thanks for your support!
-
Hi TK,
if you have the forum page split into documents, I suppose that for each posting the HTML structure should always be the same. In this case you could directly address the XPath matches you desire for a specific piece of information using the proper predicates.
Here is a simple example:

<div class="posting">
<div>Author</div>
<div>Time</div>
<div>Text</div>
</div>

If you want to grab the author you can use the default first match: /div/div - but you won't get the other information this way. Using predicates you can extract them easily into different attributes (which in this case might be an advantage because you get named entities instead of a list of matches):

Author: /div/div[1]
Time: /div/div[2]
Text: /div/div[3]

Note that a numeric predicate [1] is simply a short-hand for the boolean predicate [position()=1].
Regards,
Matthias
-
Hi Matthias,
thanks, that's what I tried to do. But there's a problem if the forum post contains an unknown number of n citations (which are, for example, queried by a //div[@style=italic] XPath expression) or - in your example - n authors. The [1] will show you the first author, but how do you handle multiple authors (for an unknown n)?
Usually, XPath queries (e.g. those starting with //) will automatically return every author or citation that fits the expression, but RapidMiner just writes the first matching author into the metadata attribute (with the Extract Information operator; the Cut Document operator seems to work correctly, although you have to find a way to merge the resulting n documents back together - every single one of the n citations has to be assigned to the correct posting it belongs to ???).
-
Hi TK,
if you have an unknown number of relevant elements, the predicate of course isn't of much help. Using "Cut Documents" is the only way I know of to make RapidMiner deliver some sort of the usual enumeration of multiple matches.
To assign the citations (inner "Cut Documents" operator) to the postings (outer "Cut Documents" operator) you could assign an id (unique for each posting, equal for all citations belonging to a posting). A counting variable for the outer "Cut Documents" should do the job; a sketch of that idea follows below. Working with the document type might be hard in this case - perhaps you should consider a conversion to ExampleSets.
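For illustration only, here is a minimal Java sketch of that grouping idea - the HTML snippet, class name and style attribute are invented for this example; in the actual process the nested "Cut Documents" operators plus a counter would play the same roles:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class PostingCitations {
    public static void main(String[] args) throws Exception {
        // Invented forum page: two postings, the first one containing two citations.
        String html = "<page>"
                + "<div class='posting'><div style='font-style:italic'>quote A</div>"
                + "<div style='font-style:italic'>quote B</div><div>text 1</div></div>"
                + "<div class='posting'><div>text 2</div></div>"
                + "</page>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Outer loop corresponds to the outer "Cut Documents": one id per posting.
        NodeList postings = (NodeList) xpath.evaluate(
                "//div[@class='posting']", doc, XPathConstants.NODESET);
        for (int id = 0; id < postings.getLength(); id++) {
            // Inner query corresponds to the inner "Cut Documents": all citations of
            // this posting, each tagged with the posting id for the later merge.
            NodeList citations = (NodeList) xpath.evaluate(
                    "div[@style='font-style:italic']/text()",
                    postings.item(id), XPathConstants.NODESET);
            for (int c = 0; c < citations.getLength(); c++) {
                System.out.println("posting " + id + ": " + citations.item(c).getNodeValue());
            }
        }
    }
}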
Regards,
Matthias
-
Yep, I did it like this (it seems to work), but I still have to implement the id recognition. Any improvements to the process are appreciated:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.11" expanded="true" name="Process">
<process expanded="true" height="415" width="685">
<operator activated="true" class="web:get_webpage" compatibility="5.0.4" expanded="true" height="60" name="Get Page" width="90" x="45" y="255">
<parameter key="url" value="http://forum.spiegel.de/showthread.php?t=22981&page=6"/>
<list key="query_parameters"/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="5.0.7" expanded="true" height="60" name="Cut Document" width="90" x="313" y="120">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Segmenter" value="/h:html/h:body/h:div[4]/h:div[1]/h:div[2]/h:div[2]/h:div[2]/h:div/h:div/h:div/h:div/h:table"/>
</list>
<list key="namespaces">
<parameter key="xx" value="xml"/>
</list>
<parameter key="ignore_CDATA" value="false"/>
<list key="index_queries"/>
<process expanded="true" height="499" width="750">
<operator activated="true" class="text:remove_document_parts" compatibility="5.0.7" expanded="true" height="60" name="Remove Document Parts" width="90" x="112" y="75">
<parameter key="deletion_regex" value="(<br clear="none" />)"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.0.11" expanded="true" height="94" name="Multiply" width="90" x="279" y="97"/>
<operator activated="true" class="text:cut_document" compatibility="5.0.7" expanded="true" height="60" name="Cut Document (2)" width="90" x="447" y="120">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Zitate" value="//h:div[@style='font-style:italic']/text()"/>
</list>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="false"/>
<list key="index_queries"/>
<process expanded="true" height="499" width="750">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:cut_document" compatibility="5.0.7" expanded="true" height="60" name="Cut Document (3)" width="90" x="447" y="255">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Posting" value="//h:table/h:tr[2]/h:td[2]/h:div[2]/text()[2]|/h:table/h:tbody/h:tr[2]/h:td[2]/h:div[2]/text()"/>
</list>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="false"/>
<list key="index_queries"/>
<process expanded="true" height="499" width="750">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="segment" to_op="Remove Document Parts" to_port="document"/>
<connect from_op="Remove Document Parts" from_port="document" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Cut Document (2)" to_port="document"/>
<connect from_op="Multiply" from_port="output 2" to_op="Cut Document (3)" to_port="document"/>
<connect from_op="Cut Document (2)" from_port="documents" to_port="document 1"/>
<connect from_op="Cut Document (3)" from_port="documents" to_port="document 2"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
<portSpacing port="sink_document 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="5.0.7" expanded="true" height="76" name="Documents to Data" width="90" x="581" y="120">
<parameter key="text_attribute" value="Testattr"/>
<parameter key="label_attribute" value="testattribut"/>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
-
Hi Sebastian,
Sebastian Land wrote:
Hi,
actually there should be no problem using multiple Cut Document operators nested inside each other. Memory consumption of documents should be fairly low...
Actually, we could add an option to include all matches in the meta information. I will note this down for the next version.
Has this option been implemented yet? I am currently working with XML files that have multiple matches (e.g. http://dblp.uni-trier.de/rec/bibtex/journals/umuai/ParamythisWM10.xml with 3x <author>).
-
Uhhmn,
sorry, but I don't think so. But if you are familiar with Java, you could probably add this very easily yourself?
Greetings,
Sebastian