Writing an XPATH query to retrieve text within quotes
I'm having trouble retrieving text within double quotes from a webpage using information extraction. I already have a number of XPaths that work as expected (all of them work apart from the last one in the XML process code). Does anyone know the syntax for retrieving text that is inside double quotes?
The following XPath works fine in Google Docs but not in RapidMiner. Google Docs still retrieves the text even though it's within quotes; RapidMiner gives blank values.
<parameter key="TEST" value="//*[@class=&quot;single-review&quot;]/text()"/>
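As a side note on the `&quot;` sequences in the parameter above: they are XML escaping for a double quote inside an attribute value, not part of the XPath itself. The Python sketch below is only an illustration outside RapidMiner; it assumes both quotes around the class name are escaped in the exported file, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# One parameter line as it would appear in a well-formed process file.
# Assumption: both quotes around the class name are written as &quot;.
line = '<parameter key="TEST" value="//*[@class=&quot;single-review&quot;]/text()"/>'

element = ET.fromstring(line)

# The XML parser decodes &quot; back to a plain double quote,
# so this is the XPath the operator would actually receive.
xpath = element.get("value")
print(xpath)  # //*[@class="single-review"]/text()
```

So the escaped quotes in the process file should not by themselves change what the query selects.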
Overall process:
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="514" y="34">
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files (2)" width="90" x="246" y="34">
<list key="text_directories">
<parameter key="all" value="C:\Users\heaveya\Desktop\Text-Mining\project_1"/>
</list>
<parameter key="use_file_extension_as_type" value="false"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information (2)" width="90" x="246" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Game Title" value="//*[@class=&quot;id-app-title&quot;]/text()"/>
<parameter key="Date of First Review" value="//*[@class=&quot;review-date&quot;]/text()"/>
<parameter key="Description" value="//*[@jsname=&quot;C4s9Ed&quot;]/text()"/>
<parameter key="No:OfReviews" value="//*[@class=&quot;reviews-num&quot;]/text()"/>
<parameter key="Overall Average Rating" value="//*[@class=&quot;score&quot;]/text()"/>
<parameter key="Game Makers" value="//*[@class=&quot;document-subtitle primary&quot;]/h:span/text()"/>
<parameter key="No. of Downloads" value="//*[@itemprop=&quot;numDownloads&quot;]/text()"/>
<parameter key="Last Updated" value="//*[@itemprop=&quot;datePublished&quot;]/text()"/>
<parameter key="What's new" value="//*[@class=&quot;recent-change&quot;]/text()"/>
<parameter key="What's new 1" value="//h:div[2][contains(@class,'recent-change')]/text()"/>
<parameter key="What's new 2" value="//h:div[3][contains(@class,'recent-change')]/text()"/>
<parameter key="What's new 3" value="//h:div[4][contains(@class,'recent-change')]/text()"/>
<parameter key="TEST" value="//*[@class=&quot;single-review&quot;]/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
<connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files (2)" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
-
Hi again,
Maybe I'll try to explain my problem a little more. As you can see below, the phrase is inside double quotes, and because of this I can't get it to appear in my results by simply appending /text() as I've been doing previously. If anyone knows the syntax to retrieve the text within the quotes, I should be OK. Even if it only works in Google Docs, I might be able to figure it out through trial and error.
Thanks for reading,
Aidan
0 -
Hi,
Would it be possible to share sample data here (specifically, the part of the data that matches the XPath query "TEST") so that I can try to recreate the error and resolve it?
Cheers,
0 -
Hi Pavithra,
My process starts with a Process Documents from Files operator (using a repo; the attached file is one such file).
Inside it I have an Extract Information operator with nominal and XPath chosen. I also have extract text only (content type: txt) and assume html ticked.
The data is from this page (which is the text file too) : https://play.google.com/store/apps/details?id=com.squareenixmontreal.hitmansniperandroid&hl=enl
And then I inspect element for the review to find my class that's in my TEST parameter:
Thanks for looking into it,
Regards,
Aidan
0 -
Hi Aidan,
Thanks, the details you shared were helpful.
When I checked the page in the browser's developer tools, the HTML for each review looks nested, so the XPath you used must be modified further to extract the review text.
The following website may come in handy in helping you build the XPath.
https://www.scrapehero.com/how-to-scrape-amazon-product-reviews/
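To illustrate why nesting matters, here is a minimal Python sketch (not RapidMiner, and with hypothetical, simplified markup rather than the real page). A trailing /text() step only returns text nodes that are direct children of the matched element, so an element whose text lives inside child elements yields nothing; collecting descendant text, which //text() or descendant::text() expresses in XPath, picks it up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure may differ.
html = (
    '<div class="single-review">'
    '<div class="review-body">'
    '<span class="review-title">Great</span> Lots of fun'
    '</div>'
    '</div>'
)

root = ET.fromstring(html)  # root is the outer single-review div

# Rough equivalent of .../text(): only text sitting directly in the element.
# The outer div starts with a child element, so there is no direct text.
direct_text = root.text

# Rough equivalent of ...//text(): text gathered from every descendant.
all_text = "".join(root.itertext())

print(repr(direct_text))  # None
print(all_text)           # Great Lots of fun
```

If the real review markup is nested in a similar way, a query that steps into the inner element (or gathers descendant text) would be the thing to try.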
Also, you could use "Process Documents from Web" instead of "Process Documents from Files" to extract data directly from any web page. I would suggest using the Cut Document operator inside Process Documents from Web, with Extract Information and Extract Content inside the Cut Document subprocess.
Hope this helps!
*** Useful webinars around Text Processing in RapidMiner***
https://www.youtube.com/user/RapidIVideos/search?query=text+mining
Also, this Text and Web Mining course really helped me understand how easily and efficiently I could do Web and Text analysis in RapidMiner!
https://rapidminer.com/training/
Cheers,
1 -
Hi Pavithra,
Thanks a lot, your post was very helpful. The Amazon review site is very similar to mine, but I still haven't been able to figure it out. If anyone else has any ideas, great; if not, I can have another go tomorrow with a clear head!
Rgds,
Aidan
1 -
Hi again,
I've played around with this again and not been able to get it to work.
The games that have review content are coming up blank, whereas games without content are coming up with a question mark. So, I believe it's working, it's just not spitting out any content for me. I've also tried //h:div[@class='review-text']/descendant::text() and also //h* at the start and also //h:div[1][contains(@class,'review-text')]/text(), which seem to be the correct syntax but don't display values.
Also //h:span[@class='review-title']/text() displays values that are not in quotes (very small amount).
Would anyone have any more suggestions for me?
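For reference, the sketch below shows in plain Python (not RapidMiner) how a prefix such as h: behaves when it is bound to the XHTML namespace, combined with gathering descendant text. The markup is hypothetical, and the assumption that the h: prefix in the queries above maps to the XHTML namespace is mine, not something confirmed in this thread.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical XHTML snippet; the real page structure may differ.
doc = (
    f'<html xmlns="{XHTML}"><body>'
    '<div class="review-text"><span>Nice</span> game</div>'
    '</body></html>'
)
root = ET.fromstring(doc)

# Assumed prefix binding: h -> XHTML namespace.
ns = {"h": XHTML}

# Without a prefix, "div" only matches elements in no namespace: no hits.
unprefixed = root.findall(".//div[@class='review-text']")

# With the prefix, the namespaced div is found.
prefixed = root.findall(".//h:div[@class='review-text']", ns)

# Gather descendant text, similar in spirit to descendant::text().
texts = ["".join(d.itertext()) for d in prefixed]

print(len(unprefixed))  # 0
print(texts)            # ['Nice game']
```

If a prefixed query matches elements but still shows blank values, checking whether the review text sits in a deeper child element than the one being matched may be a useful next step.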
0