"Information Retrieval an weighting by html tags"
simon_knoll
Answers
Hi Simon,
I added it today to the new Text Processing Extension of RapidMiner 5. The current plugin does not support this, so you may have to wait until we release RapidMiner 5.
Greetings,
Sebastian
Hello Sebastian,
could you do me a favor and show me a short example of how I can apply weights for HTML tags, or tell me which operators I need?
Regards,
Simon Knoll
Hi Simon,
the Text Processing Extension contains operators for evaluating XPath queries. The one you need is called Generate Extract. If you have stored the contents of a web page in an ExampleSet, you can use this operator to extract the content of an h4 tag as a new attribute. The current version of the Process Documents from Data operator lets you select the attributes from which the text should be taken. In this list, you can also assign a weight to each attribute. Combining these two things should suit your needs.
If this does not prove helpful, we could think about implementing some sort of weight applier that assigns weights to tokens if they fulfill some condition.
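To make the idea concrete outside RapidMiner, here is a rough Python sketch of "extract the content of a tag, then weight it". It is only an illustration: the page content, the `WEIGHTS` table and both function names are made up for this example and are not part of the Text Processing Extension.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical page; the markup must be well-formed for ElementTree.
PAGE = ("<html><title>Hello title</title><h4>Hello heading</h4>"
        "<p>plain body text</p></html>")

# Assumed weights per extracted tag (illustrative values only).
WEIGHTS = {"h4": 3.0, "p": 1.0}


def extract(page, tag):
    """Return the joined text of every <tag> element, similar to an XPath //tag query."""
    root = ET.fromstring(page)
    return " ".join("".join(el.itertext()).strip() for el in root.iter(tag))


def weighted_term_counts(page, weights):
    """Count lower-cased tokens, scaling each occurrence by the weight of its tag."""
    counts = Counter()
    for tag, weight in weights.items():
        for token in extract(page, tag).lower().split():
            counts[token] += weight
    return counts


counts = weighted_term_counts(PAGE, WEIGHTS)
print(counts["heading"], counts["text"])
```

Tokens from the h4 element count three times as much as tokens from the paragraph, which is roughly what assigning a weight per attribute achieves.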
Greetings,
Sebastian
Thanks for your help.
But I've got some problems with the "Generate Extract" operator. More precisely, I'm not getting proper results; rather, I'm getting empty results :-)
Maybe I'm using it in the wrong way:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="746" width="1091">
<operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="313" y="165">
<parameter key="text" value="&lt;html&gt; &lt;title&gt;Hallo Titel&lt;/title&gt; &lt;h4&gt;Hallo Überschrift 3&lt;/h4&gt; &lt;h3&gt;Hallo Überschrift 3&lt;/h3&gt; &lt;p&gt;&lt;h4&gt;Ein H4&lt;/h4&gt; &lt;span&gt;in einem Paragraph&lt;/span&gt;&lt;/p&gt; &lt;/html&gt;"/>
</operator>
<operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="581" y="75">
<process expanded="true" height="724" width="770">
<connect from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="782" y="75">
<parameter key="source_attribute" value="source_ATTR"/>
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="title_html" value="//h:title/text()"/>
</list>
<list key="namespaces"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
<connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Regards,
Simon
Hi Simon,
the problem with your setup is that the source attribute does not exist. What bothers me is that the operator does not complain about this but simply delivers nothing. I have changed that behavior.
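As a rough illustration of the difference between failing silently and complaining, here is a small Python sketch. The function `generate_extract` and the helper `title_of` are hypothetical stand-ins written for this example; they do not show how the RapidMiner operator is implemented.

```python
def generate_extract(rows, source_attribute, query):
    """Apply `query` to the source attribute of every row.

    Raises KeyError when the attribute is missing instead of returning nothing.
    """
    missing = [i for i, row in enumerate(rows) if source_attribute not in row]
    if missing:
        raise KeyError(
            f"source attribute {source_attribute!r} is missing in rows {missing}"
        )
    return [dict(row, extracted=query(row[source_attribute])) for row in rows]


def title_of(text):
    """Very small stand-in for an XPath title query (illustration only)."""
    start, end = text.find("<title>"), text.find("</title>")
    if start == -1 or end == -1:
        return None
    return text[start + len("<title>"):end]


rows = [{"text": "<html><title>Hallo Titel</title></html>"}]

try:
    generate_extract(rows, "source_ATTR", title_of)
    error = None
except KeyError as exc:
    error = str(exc)

result = generate_extract(rows, "text", title_of)
print(error)
print(result[0]["extracted"])
```

With the wrong attribute name the call fails with a message naming the missing attribute; with `text` it returns the title.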
To get the text into an attribute, you can uncheck the create_word_vector parameter of Process Documents and check keep_text instead. A new attribute called text will then be added, containing the text. You can select it as the source attribute of the Generate Extract operator, and then it works as shown below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="746" width="1091">
<operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="112" y="75">
<parameter key="text" value="&lt;html&gt; &lt;title&gt;Hallo Titel&lt;/title&gt; &lt;h4&gt;Hallo Überschrift 3&lt;/h4&gt; &lt;h3&gt;Hallo Überschrift 3&lt;/h3&gt; &lt;p&gt;&lt;h4&gt;Ein H4&lt;/h4&gt; &lt;span&gt;in einem Paragraph&lt;/span&gt;&lt;/p&gt; &lt;/html&gt;"/>
</operator>
<operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="246" y="75">
<parameter key="create_word_vector" value="false"/>
<parameter key="keep_text" value="true"/>
<process expanded="true" height="724" width="770">
<connect from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="380" y="75">
<parameter key="source_attribute" value="text"/>
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="title_html" value="//h:title/text()"/>
</list>
<list key="namespaces"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
<connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Greetings,
Sebastian
Thanks, now it works for me as well. But I still have some questions.
Why am I getting just one result here, and not every href entry separated by ";"?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<parameter key="logverbosity" value="3"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="1"/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="parallelize_main_process" value="false"/>
<process expanded="true" height="629" width="950">
<operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="112" y="255">
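<parameter key="text" value="&lt;html&gt; &lt;a href=&quot;1&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;2&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;3&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;4&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;5&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;6&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;7&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;8&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;9&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;0&quot;&gt;Details&lt;/a&gt; &lt;/html&gt; "/>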
<parameter key="add label" value="false"/>
<parameter key="label_type" value="0"/>
</operator>
<operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="447" y="255">
<parameter key="create_word_vector" value="false"/>
<parameter key="vector_creation" value="0"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="0"/>
<parameter key="prunde_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_rank" value="5.0"/>
<parameter key="prune_above_rank" value="5.0"/>
<parameter key="datamanagement" value="7"/>
<parameter key="parallelize_vector_creation" value="false"/>
<process expanded="true" height="629" width="950">
<connect from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="648" y="255">
<parameter key="source_attribute" value="text"/>
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<parameter key="attribute_type" value="Nominal"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="DetailsPage" value="//h:a[text()='Details']/@href"/>
</list>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="true"/>
<parameter key="assume_html" value="true"/>
<parameter key="value_seperator" value=";"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
<connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Regards,
Simon
Hi Simon,
as the operator documentation tries to say, if a query results in an enumeration of items, for example "en,de,fr", then these values are separated using the given characters. However, to get more than one attribute you have to enter a search expression more than once, each time with its own attribute name. Where should the operator store the second value if you enter only one attribute?
Greetings,
Sebastian
Hello Sebastian,
unfortunately I don't understand your suggestion. What I want to achieve is the following:
Given this "HTML" code,
I want to extract all the href values (1,2,3,4,5,6,7,8,9,0):
<html>
<a href="1">Details</a>
<a href="2">Details</a>
<a href="3">Details</a>
<a href="4">Details</a>
<a href="5">Details</a>
<a href="6">Details</a>
<a href="7">Details</a>
<a href="8">Details</a>
<a href="9">Details</a>
<a href="0">Details</a>
</html>
Now if I use the following XPath expression: //a/@href
then, from the XPath point of view, this query returns all the hrefs.
To check this you can simply test it at http://www.mizar.dk/XPath/Default.aspx
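The same check can also be done offline. The following is a minimal Python sketch using the standard library; it is not RapidMiner code. One assumption to note: ElementTree supports only a subset of XPath and cannot select attribute nodes directly, so the sketch selects the `a` elements that carry an `href` and then reads the attribute, which gives the same values as `//a/@href`.

```python
import xml.etree.ElementTree as ET

# The sample document from this post, built programmatically.
HTML = (
    "<html>"
    + "".join(f'<a href="{n}">Details</a>' for n in "1234567890")
    + "</html>"
)

root = ET.fromstring(HTML)

# Select every <a> that has an href attribute, then read the attribute value.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]

# Joining with ";" mirrors the separator used in the process above.
joined = ";".join(hrefs)
print(joined)
```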
So my question now is: how can I achieve that in RapidMiner?
Hi,
how do you want to store the href values after retrieving them? Should each href be a single example, or do you want multiple attributes?
This is important because the two approaches differ completely.
Greetings,
Sebastian
For me, both ways would be interesting, as I have to extract different features for different purposes.
Greetings,
Simon
Hi Simon,
sorry for the late answer, but I simply didn't find the time to answer questions here in the forum in the meantime. Please keep in mind the restriction that every example of an example set must have the same attributes, so creating attributes depending on the content of a text cannot be done! Here's a process that shows how both ways work:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="296" width="480">
<operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="3" y="45">
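<parameter key="text" value="&lt;html&gt; &lt;a href=&quot;1&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;2&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;3&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;4&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;5&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;6&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;7&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;8&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;9&quot;&gt;Details&lt;/a&gt; &lt;a href=&quot;0&quot;&gt;Details&lt;/a&gt; &lt;/html&gt;"/>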
</operator>
<operator activated="true" class="text:documents_to_data" expanded="true" height="76" name="Documents to Data" width="90" x="112" y="120">
<parameter key="text_attribute" value="text"/>
</operator>
<operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="246" y="120"/>
<operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="210">
<parameter key="create_word_vector" value="false"/>
<list key="specify_weights"/>
<process expanded="true" height="585" width="904">
<operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document" width="90" x="112" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="unimportant" value="//a/@href"/>
</list>
<list key="namespaces"/>
<parameter key="assume_html" value="false"/>
<process expanded="true" height="585" width="904">
<operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="45" y="30">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="hrefNumber" value="(.*)"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
</operator>
<connect from_port="segment" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="380" y="75">
<parameter key="source_attribute" value="text"/>
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="AttributeName1" value="//a[1]"/>
<parameter key="AttributeName2" value="//a[2]"/>
</list>
<list key="namespaces"/>
<parameter key="assume_html" value="false"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Generate Extract" to_port="Example Set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 2"/>
<connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
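The two result layouts of the process above can be sketched in plain Python. This is only an illustration under assumptions: it reads the `href` value in both cases, and it borrows the names `hrefNumber`, `AttributeName1` and `AttributeName2` from the process purely as labels.

```python
import xml.etree.ElementTree as ET

HTML = (
    "<html>"
    + "".join(f'<a href="{n}">Details</a>' for n in "1234567890")
    + "</html>"
)
root = ET.fromstring(HTML)

# Way 1: one example (row) per match, as with cutting the document into segments.
long_rows = [{"hrefNumber": a.get("href")} for a in root.findall(".//a")]

# Way 2: one example with a fixed set of attributes, one query per attribute.
# The attribute names must be known up front; they cannot depend on the text.
QUERIES = {"AttributeName1": ".//a[1]", "AttributeName2": ".//a[2]"}
wide_row = {}
for name, path in QUERIES.items():
    match = root.find(path)
    wide_row[name] = match.get("href") if match is not None else None

print(len(long_rows), wide_row)
```

The first layout grows with the number of links, while the second always has exactly the attributes that were declared, which is the restriction mentioned above.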
Greetings,
Sebastian
Hi Sebastian,
thank you, this really helped me.
Do you know where I can find the AttributeWeights and AttributeWeightsApplier operators in the RapidMiner GUI?
Greetings,
Simon
Hi,
there are several weighting operators available in the Modeling / Attribute Weighting group. You can then use the Scale by Weights operator to apply these weights.
Greetings,
Sebastian
Hi Sebastian,
thank you, but I have not figured out how I can "create" weights for different attributes and pipe them, for instance, to the "Scale by Weights" operator.
Best regards,
Simon
Hi,
take a look at the Data to Weights operator. With it you can convert an example set into a weight vector. You could create an example set containing these weights, for example with the logging functionality, and finally turn the log into an ExampleSet using the Log to Data operator.
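Conceptually, the two steps look roughly like the Python sketch below. It is not RapidMiner code: the function names only echo the operator names, the column names `attribute` and `weight` are assumptions, and treating attributes without a weight as having weight 1.0 is a choice made for this example.

```python
def data_to_weights(rows):
    """Turn rows of (attribute name, weight) into a name -> weight mapping."""
    return {row["attribute"]: float(row["weight"]) for row in rows}


def scale_by_weights(example, weights):
    """Multiply each numeric attribute by its weight (assumed 1.0 if none is given)."""
    return {name: value * weights.get(name, 1.0) for name, value in example.items()}


# A small table holding one weight per attribute (hypothetical values).
weight_rows = [
    {"attribute": "html_title", "weight": 2},
    {"attribute": "html_linktext", "weight": 1},
]

weights = data_to_weights(weight_rows)
scaled = scale_by_weights({"html_title": 3.0, "html_linktext": 4.0, "other": 5.0}, weights)
print(scaled)
```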
Greetings,
Sebastian
Hey Sebastian,
thank you for your answer, but I don't get it.
In the following process I extract some features from an HTML document (for simplicity, generated here by the "Create Document" operator):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="546" width="1016">
<operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="45" y="165">
<parameter key="text" value="&lt;html&gt; &lt;head&gt;&lt;title&gt;Der Titel ist sehr toll&lt;/title&gt;&lt;/head&gt; &lt;a href=&quot;http://f12010.info&quot;&gt;formel1&lt;/a&gt; &lt;a href=&quot;http://dsds-2009.info&quot;&gt;und einen dritten link&lt;/a&gt; &lt;a href=&quot;http://simonknoll.com&quot;&gt;semmel&lt;/a&gt; &lt;title&gt;Wir Haben auch einen zweitet Titel&lt;/title&gt; &lt;/html&gt;"/>
<parameter key="label_type" value="numeric"/>
</operator>
<operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="179" y="165"/>
<operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents (2)" width="90" x="313" y="255">
<parameter key="create_word_vector" value="false"/>
<process expanded="true">
<operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document (2)" width="90" x="394" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="html_linktext" value="//h:a/text()"/>
</list>
<list key="namespaces"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information (2)" width="90" x="394" y="30">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="use_it" value="(.*)"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
</operator>
<connect from_port="segment" to_op="Extract Information (2)" to_port="document"/>
<connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="document" to_op="Cut Document (2)" to_port="document"/>
<connect from_op="Cut Document (2)" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
<parameter key="create_word_vector" value="false"/>
<process expanded="true">
<operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document" width="90" x="246" y="165">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="html_title" value="//h:title/text()"/>
</list>
<list key="namespaces"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="246" y="30">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="use_it" value="(.*)"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
</operator>
<connect from_port="segment" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="generate_id" expanded="true" height="76" name="Generate ID" width="90" x="447" y="30"/>
<operator activated="true" class="generate_id" expanded="true" height="76" name="Generate ID (2)" width="90" x="447" y="255"/>
<operator activated="true" class="union" expanded="true" height="76" name="Union" width="90" x="581" y="120"/>
<connect from_op="Create Document" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="Process Documents (2)" to_port="documents 1"/>
<connect from_op="Process Documents (2)" from_port="example set" to_op="Generate ID (2)" to_port="example set input"/>
<connect from_op="Process Documents" from_port="example set" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Union" to_port="example set 1"/>
<connect from_op="Generate ID (2)" from_port="example set output" to_op="Union" to_port="example set 2"/>
<connect from_op="Union" from_port="union" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
These extracted features result in the following example set:
Row No.  id   query_key      use_it
------------------------------------------------------------------
1        1.0  html_title     Der Titel ist sehr toll
2        2.0  html_title     Wir Haben auch einen zweitet Titel
3        1.0  html_linktext  formel1
4        2.0  html_linktext  und einen dritten link
5        3.0  html_linktext  semmel
Now my question: how can I add a weighting for the different features that I extracted (e.g. weight html_title with 2 and html_linktext with 1)? This could then result in an example set like the following (or however a weighting looks; I added a weight column just to get the point across):
Row No.  id   query_key      use_it                              weight
--------------------------------------------------------------------------
1        1.0  html_title     Der Titel ist sehr toll             2
2        2.0  html_title     Wir Haben auch einen zweitet Titel  2
3        1.0  html_linktext  formel1                             1
4        2.0  html_linktext  und einen dritten link              1
5        3.0  html_linktext  semmel                              1
Thanks in advance,
Simon
Hi Simon,
if this weight should only depend on the query_key, this is no problem. Simply use the [tt]Generate Attributes[/tt] operator with [tt]if(query_key="html_title",2,1)[/tt] as the expression. Of course, you can nest the [tt]if(...,...,...)[/tt] expressions as you like.
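In plain Python, the effect of that expression on the example set from the question would look roughly like this. It is a sketch for illustration only; the rows are copied from the table in the question, and `weight_for` is a made-up helper name.

```python
# Rows copied from the example set in the question above.
rows = [
    {"id": 1.0, "query_key": "html_title", "use_it": "Der Titel ist sehr toll"},
    {"id": 2.0, "query_key": "html_title", "use_it": "Wir Haben auch einen zweitet Titel"},
    {"id": 1.0, "query_key": "html_linktext", "use_it": "formel1"},
    {"id": 2.0, "query_key": "html_linktext", "use_it": "und einen dritten link"},
    {"id": 3.0, "query_key": "html_linktext", "use_it": "semmel"},
]


def weight_for(query_key):
    """Python equivalent of if(query_key="html_title",2,1)."""
    return 2 if query_key == "html_title" else 1


# Add the new attribute to a copy of every row.
weighted = [dict(row, weight=weight_for(row["query_key"])) for row in rows]

for row in weighted:
    print(row["query_key"], row["weight"])
```

Nesting further conditions corresponds to adding more branches to `weight_for`.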
Kind regards,
Tobias
Thank you for your advice.
My question now is: how can I feed this data to a k-means algorithm if I want to cluster the documents by the extracted features? If I just pass the resulting example set as input, it clusters every single example on its own. But I want to cluster the documents, not the extractions.
Any advice?
Best regards,
Simon
Maybe a screenshot of an example set helps.
Here I have an example set with several examples describing two different objects.
If I want to apply a clustering algorithm to this and cluster these two objects (in reality there are obviously more than two) rather than every single example, what do I have to do?
Best regards,
Simon Knoll