[SOLVED] Strange issues with Replace tokens operator
kayman
New Altair Community Member
Hi there, I'm running some processes to convert e-tailer information into structured tables. It works well, but I stumbled upon something weird.
In short, the process is as follows (using the web mining / text processing functions):
1. Crawl the site(s), clean up the pages, and merge them into one big XHTML document
2. Use some XSLT to convert the XHTML to structured XML
3. Use some additional Replace Tokens operators to perform fine-tuning
4. Store
The issue is that when I store the final content in a repository, there are two content streams inside: the one from before the Replace Tokens block, and the cleaned one.
In other words, the results overview shows the correct information, but the repository stores the input data of the Replace Tokens operator in addition to (and ahead of) the output data.
The XML looks as follows before entering the Replace Tokens operator:
<models>
<model>
.. some elements ..
<productRef>MODELNAME, 121 CM (48 ZOLL), 1080P (FULL HD) LED FERNSEHER</productRef>
.. some elements..
</model>
</models>
The process replaces the productRef element with a few other elements, and the XML looks as follows when leaving the operator:
<models>
<model>
.. some elements ..
<productSubCategory>Full HD</productSubCategory>
<productScreenSize>48 inch</productScreenSize>
<productName>MODELNAME</productName>
.. some elements..
</model>
</models>
Exactly as intended, the productRef element has been replaced with some other elements. However, this is only the case when using 'show document result'. When storing this document, only the original data is provided. When analyzing the repository entry (using a text editor), it is clear that the original data is stored first, followed by the cleaned data. But the cleaned data seems inaccessible.
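As a rough illustration of the productRef rewrite described above, the sketch below applies one rule from the clean4model (2) dictionary in the posted process to the sample productRef value. It uses Python's re module rather than RapidMiner itself, so the group reference $1 is written as \1; the variable names are only for illustration.

```python
import re

# Sample <productRef> value from the post (text before the Replace Tokens operator).
before = "<productRef>MODELNAME, 121 CM (48 ZOLL), 1080P (FULL HD) LED FERNSEHER</productRef>"

# Second rule of the "clean4model (2)" dictionary, copied from the process XML.
pattern = r"<productRef>([^,]*), [0-9]+ CM [^<]*</productRef>"

# RapidMiner (Java regex) writes the captured group as $1; Python's re uses \1.
replacement = r"<productName>\1</productName>"

after = re.sub(pattern, replacement, before)
print(after)
```

This only reproduces the text rewrite; it says nothing about how the operator stores its result internally.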
Any idea what might be causing this? Do I need to convert the content before storing it, or is this a bug?
A full working example can be downloaded here: http://www.freeuploadsite.com/do.php?id=70466
Code used:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.015" expanded="true" height="60" name="Retrieve xhtml" width="90" x="45" y="30">
<parameter key="repository_entry" value="xhtml"/>
</operator>
<operator activated="true" class="subprocess" compatibility="5.3.015" expanded="true" height="76" name="make XML" width="90" x="179" y="30">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="makeTags1" width="90" x="45" y="120">
<parameter key="text" value="<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 	<xsl:output method="xml" version="1.0" encoding="UTF-8" omit-xml-declaration="yes"/> 	<xsl:template match="/"> 		<models> 				<xsl:for-each select="//body/article"> 					<xsl:variable name="theAccount"> 						<xsl:value-of select="normalize-space(div[@id='account'])"/> 					</xsl:variable> 					<xsl:variable name="theUrl"> 						<xsl:value-of select="normalize-space(div[@id='url'])"/> 					</xsl:variable> 					<xsl:variable name="theCat"> 						<xsl:text>Television</xsl:text> 					</xsl:variable> 					<xsl:variable name="theSubCat"> 					</xsl:variable> 					<xsl:variable name="theActivePage"> 						<xsl:choose> 							<xsl:when test="contains($theUrl,'p=')"> 								<xsl:value-of select="substring-before(substring-after($theUrl,'p='),'&amp;')"/> 							</xsl:when> 							<xsl:otherwise> 								<xsl:text>1</xsl:text> 							</xsl:otherwise> 						</xsl:choose> 					</xsl:variable> 					<xsl:for-each select=".//section[@id='san_resultSection']/article"> 						<model> 							<productAccount> 								<xsl:value-of select="$theAccount"/> 							</productAccount> 							<productURL> 								<xsl:value-of select="$theUrl"/> 							</productURL> 							<productCategory> 								<xsl:value-of select="$theCat"/> 							</productCategory> 							<!--<productSubCategory> 								<xsl:value-of select="$theSubCat"/> 							</productSubCategory>--> 							<productActivePage> 								<xsl:value-of select="normalize-space($theActivePage)"/> 							</productActivePage> 							<productRank> 								<xsl:value-of select="position()"/> 							</productRank> 							<productCode> 								<xsl:value-of select="@data-productid"/> 							</productCode> 							<productDescription> 								<xsl:value-of select="normalize-space(a)"/> 							</productDescription> 							<productPage> 								<xsl:value-of select="concat('https://www.otto.de',a/@href)"/> 							</productPage> 							<productVarID> 								<xsl:value-of select=".//@data-price-defining-variation-id"/> 							</productVarID> 							<productJSON> 								<xsl:value-of select="concat('https://www.otto.de',a/@data-json-target)"/> 							</productJSON> 						</model> 					</xsl:for-each> 				</xsl:for-each> 		</models> 	</xsl:template> </xsl:stylesheet>"/>
</operator>
<operator activated="true" class="text:process_xslt" compatibility="5.3.002" expanded="true" height="76" name="Process (6)" width="90" x="180" y="30"/>
<operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="getModelInfo (3)" width="90" x="179" y="120">
<parameter key="text" value="<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 	<xsl:output method="xml" version="1.0" encoding="UTF-8" omit-xml-declaration="yes"/> 	<xsl:param name="smallcase" select="'abcdefghijklmnopqrstuvwxyz®™'"/> 	<xsl:param name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ '"/> 	<xsl:template match="/"> 		<xsl:copy> 			<xsl:apply-templates select="@*|node()"/> 		</xsl:copy> 	</xsl:template> 	<xsl:template match="@*|node()"> 		<xsl:copy> 			<xsl:apply-templates select="@*|node()"/> 		</xsl:copy> 	</xsl:template> 	<xsl:template match="productDescription"> 		<xsl:copy> 			<xsl:apply-templates select="@*|node()"/> 		</xsl:copy> 		<xsl:call-template name="makeProductDescription"/> 	</xsl:template> 	<xsl:template name="makeProductDescription"> 		<xsl:variable name="theString" select="translate(normalize-space(.), $smallcase, $uppercase)"/> 		<xsl:variable name="theBrand"> 			<xsl:choose> 				<xsl:when test="starts-with($theString,'ACOUSTIC SOLUTIONS')"> 					<xsl:text>ACOUSTIC SOLUTIONS</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'ADVANCE ACOUSTIC')"> 					<xsl:text>ADVANCE ACOUSTIC</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'AUDIO PRO')"> 					<xsl:text>AUDIO PRO</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'AUDIO TECHNICA')"> 					<xsl:text>AUDIO TECHNICA</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'BANG &amp; OLUFSEN')"> 					<xsl:text>BANG &amp; OLUFSEN</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'DIGITAL SILENCE')"> 					<xsl:text>DIGITAL SILENCE</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'ENERGY SISTEM')"> 					<xsl:text>ENERGY SISTEM</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'FRESH N REBEL')"> 					<xsl:text>FRESH N REBEL</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'GO PRO')"> 					<xsl:text>GO PRO</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'HARMAN KARDON')"> 					<xsl:text>HARMAN KARDON</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'HOUSE OF MARLEY')"> 					<xsl:text>HOUSE OF MARLEY</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'JOHN LEWIS')"> 					<xsl:text>JOHN LEWIS</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'LIQUID IMAGE')"> 					<xsl:text>LIQUID IMAGE</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'KIDZ GEAR')"> 					<xsl:text>KIDZ GEAR</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'MONSTER CABLE')"> 					<xsl:text>MONSTER CABLE</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'SMS AUDIO')"> 					<xsl:text>SMS AUDIO</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'TED BAKER')"> 					<xsl:text>TED BAKER</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'THUMBS UP')"> 					<xsl:text>THUMBS UP</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'ULTIMATE EARS')"> 					<xsl:text>ULTIMATE EARS</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'URBAN REVOLT')"> 					<xsl:text>URBAN REVOLT</xsl:text> 				</xsl:when> 				<xsl:when test="starts-with($theString,'VIEW QUEST')"> 					<xsl:text>VIEW QUEST</xsl:text> 				</xsl:when> 				<xsl:otherwise> 					<xsl:value-of select="substring-before($theString,' ')"/> 				</xsl:otherwise> 			</xsl:choose> 		</xsl:variable> 		<productBrand> 			<xsl:value-of select="$theBrand"/> 		</productBrand> 		<productRef> 			<xsl:value-of select="normalize-space(substring-after($theString,$theBrand))"/> 		</productRef> 	</xsl:template> </xsl:stylesheet>"/>
</operator>
<operator activated="true" class="text:process_xslt" compatibility="5.3.002" expanded="true" height="76" name="Process (7)" width="90" x="313" y="30"/>
<operator activated="true" class="store" compatibility="5.3.015" expanded="true" height="60" name="sub" width="90" x="447" y="30">
<parameter key="repository_entry" value="xml_tmp"/>
</operator>
<operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="clean4model (2)" width="90" x="581" y="30">
<list key="replace_dictionary">
<parameter key="<productRef>(?:BRAVIA\s|VIERA\s|FINE\sARTS\s)([^<]*)</productRef>" value="<productRef>$1</productRef>"/>
<parameter key="<productRef>([^,]*), [0-9]+ CM [^<]*</productRef>" value="<productName>$1</productName>"/>
<parameter key="<productRef>[^<]*»([^«]*)«[^<]*</productRef>" value="<productName>$1</productName>"/>
<parameter key="<productRef>[^<]*\s"([^"]*)"\s[^<]*<\/productRef>" value="<productName>$1</productName>"/>
<parameter key="<productName>(\d{2})\s?(\w{3})\s?(\d{3,4})\s?(\w{2})([^<]*)</productName>" value="<productName>$1$2$3$4$5</productName>"/>
<parameter key="<productRef>([A-Z0-9]+-[A-Z0-9]+)[^<]*<\/productRef>" value="<productName>$1</productName>"/>
<parameter key="<productRef>(.*?)<\/productRef>" value="<productName>ZZZ-UNDEFINED | $1</productName>"/>
</list>
</operator>
<operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="clean4features (2)" width="90" x="715" y="30">
<list key="replace_dictionary">
<parameter key="(<productDescription>[^<]*\()(\d+(?:,\d+)?)( Zoll[^<]*</productDescription>)" value="$1$2$3<productScreenSize>$2 inch</productScreenSize>"/>
<parameter key="(<productDescription>[^<|\d]*)(\d+)("[^<]*</productDescription>)" value="$1$2$3<productScreenSize>$2 inch</productScreenSize>"/>
<parameter key="(<productDescription>[^<]*Full HD[^<]*</productDescription>)" value="<productSubCategory>Full HD</productSubCategory>$1"/>
<parameter key="(<productDescription>[^<]*HD-ready[^<]*</productDescription>)" value="<productSubCategory>HD-ready</productSubCategory>$1"/>
<parameter key="(<productDescription>[^<]*(?:4K\s|Ultra HD|SUHD)[^<]*</productDescription>)" value="<productSubCategory>4K</productSubCategory>$1"/>
<parameter key="(<productDescription>[^<]*»[A-Z-]+)(\d+)([^<]*</productDescription>)(?!<productScreenSize>)" value="$1$2$3<productScreenSize>$2 inch</productScreenSize>"/>
<parameter key="(<productDescription>[^<]*\()(\d+)("\)[^<]*</productDescription>)" value="$1$2$3<productScreenSize>$2 inch</productScreenSize>"/>
</list>
</operator>
<connect from_port="in 1" to_op="Process (6)" to_port="document"/>
<connect from_op="makeTags1" from_port="output" to_op="Process (6)" to_port="xslt document"/>
<connect from_op="Process (6)" from_port="document" to_op="Process (7)" to_port="document"/>
<connect from_op="getModelInfo (3)" from_port="output" to_op="Process (7)" to_port="xslt document"/>
<connect from_op="Process (7)" from_port="document" to_op="sub" to_port="input"/>
<connect from_op="sub" from_port="through" to_op="clean4model (2)" to_port="document"/>
<connect from_op="clean4model (2)" from_port="document" to_op="clean4features (2)" to_port="document"/>
<connect from_op="clean4features (2)" from_port="document" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="store" compatibility="5.3.015" expanded="true" height="60" name="Store XML" width="90" x="313" y="30">
<parameter key="repository_entry" value="xml"/>
</operator>
<connect from_op="Retrieve xhtml" from_port="output" to_op="make XML" to_port="in 1"/>
<connect from_op="make XML" from_port="out 1" to_op="Store XML" to_port="input"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
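To show how the clean4features (2) dictionary in the process above adds the screen size and sub-category elements, here is a small Python sketch. It assumes the dictionary entries are applied one after another in the listed order, which is an assumption about the operator and not something confirmed in this thread. The sample description text and the apply_rules helper are made up for illustration, and $n is rewritten as \n because Python's re module is used instead of Java regex.

```python
import re

# Three rules from the "clean4features (2)" dictionary, with $n rewritten as \n.
FEATURE_RULES = [
    # "(48 Zoll" -> append <productScreenSize>48 inch</productScreenSize>
    (r"(<productDescription>[^<]*\()(\d+(?:,\d+)?)( Zoll[^<]*</productDescription>)",
     r"\1\2\3<productScreenSize>\2 inch</productScreenSize>"),
    # "Full HD" anywhere in the description -> prepend the sub-category element
    (r"(<productDescription>[^<]*Full HD[^<]*</productDescription>)",
     r"<productSubCategory>Full HD</productSubCategory>\1"),
    # "HD-ready" anywhere in the description -> prepend the sub-category element
    (r"(<productDescription>[^<]*HD-ready[^<]*</productDescription>)",
     r"<productSubCategory>HD-ready</productSubCategory>\1"),
]


def apply_rules(text, rules):
    """Hypothetical helper: apply each (pattern, replacement) pair in list order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Hypothetical description, modelled on the sample product in the post.
sample = ("<productDescription>Brand MODELNAME, 121 cm (48 Zoll), "
          "1080p (Full HD) LED Fernseher</productDescription>")

result = apply_rules(sample, FEATURE_RULES)
print(result)
```

Because every rule captures the whole productDescription element and writes it back, the description itself survives and the new elements are placed next to it.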
If I save the document as XML, the content looks as follows:
<Document>
<default>
<metaData class="linked-hash-map" id="4"></metaData>
<text>my-original-data</text>
<tokenSequence id="5">
<com.rapidminer.operator.text.Token id="6">
<token>my-cleaned-data</token>
<weight>1.0</weight>
</com.rapidminer.operator.text.Token>
</tokenSequence>
</default>
</Document>
So maybe the correct question is more like: how can I get / store the token data as text instead of the (original) text data, which seems to be what is used by default?
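For anyone who wants to check a stored entry outside RapidMiner, the following Python sketch reads a document with the structure shown in this post and separates the original text from the token content. It relies only on the element names visible in the sample (text, tokenSequence, token); the split_document helper is hypothetical, and real repository files may contain more than this.

```python
import xml.etree.ElementTree as ET

# Stored document structure as shown in the post (placeholder values kept as-is).
stored = """<Document>
  <default>
    <metaData class="linked-hash-map" id="4"></metaData>
    <text>my-original-data</text>
    <tokenSequence id="5">
      <com.rapidminer.operator.text.Token id="6">
        <token>my-cleaned-data</token>
        <weight>1.0</weight>
      </com.rapidminer.operator.text.Token>
    </tokenSequence>
  </default>
</Document>"""


def split_document(xml_text):
    """Hypothetical helper: return (original text, list of token strings)."""
    root = ET.fromstring(xml_text)
    original = root.findtext("./default/text")
    # Each child of <tokenSequence> carries one <token> element.
    tokens = [t.text for t in root.findall("./default/tokenSequence/*/token")]
    return original, tokens


original, tokens = split_document(stored)
print("text  :", original)
print("tokens:", tokens)
```

This is an inspection aid only, not a fix inside the process.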
Answers
Well, apparently combining the documents did the trick. I'm not sure why, but doing that 'removed' the original text and provided the tokenized data. Hopefully this is useful for others too.