"Web Mining crawling prices of an internet page"

User: "luiz_vidal"
New Altair Community Member
Updated by Jocelyn

Guys, 

 

I am trying to create a process that crawls the pages of a site to collect the prices of a variety of products. My idea is the following: I created a loop so that the crawl fetches the pages one by one and saves each one to disk; after that I want to read the saved HTML and extract only the product name and the price, for example. However, I have not been able to get this working. Would you please help me?
I can retrieve the pages in sequence, but somehow I cannot save them to disk: the files keep being overwritten.

 

First I want to collect the pages:

https://www.buscape.com.br/cerveja?pagina=1

https://www.buscape.com.br/cerveja?pagina=2

...

https://www.buscape.com.br/cerveja?pagina=200

Here is my process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="8.0.001" expanded="true" height="103" name="Loop" width="90" x="179" y="34">
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="238">
<parameter key="url" value="https://www.buscape.com.br/cerveja?pagina=%{page}"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value="cerveja"/>
</list>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="write_pages_to_disk" value="true"/>
<parameter key="output_dir" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja"/>
</operator>
<operator activated="true" class="generate_macro" compatibility="8.0.001" expanded="true" height="82" name="Generate Macro" width="90" x="112" y="34">
<list key="function_descriptions">
<parameter key="page" value="%{page}"/>
</list>
</operator>
<connect from_port="input 1" to_op="Generate Macro" to_port="through 1"/>
<connect from_op="Crawl Web" from_port="example set" to_port="output 2"/>
<connect from_op="Generate Macro" from_port="through 1" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<connect from_op="Loop" from_port="output 1" to_port="result 2"/>
<connect from_op="Loop" from_port="output 2" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
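
To make my intent clearer, here is a rough Python sketch (outside RapidMiner; the directory and file names are just illustrative) of what I am trying to reproduce in the loop: fetch each page and save it under a file name that contains the page number, so that one page does not overwrite another.

# Rough sketch only: fetch pages 1..200 and save each one under a file
# name that includes the page number, so nothing gets overwritten.
# Requires the third-party "requests" package.
import os
import requests

OUTPUT_DIR = r"C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja"

def crawl_pages(first=1, last=200):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for page in range(first, last + 1):
        url = f"https://www.buscape.com.br/cerveja?pagina={page}"
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # The page number in the file name keeps every saved page distinct.
        path = os.path.join(OUTPUT_DIR, f"cerveja_pagina_{page}.html")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(response.text)

if __name__ == "__main__":
    crawl_pages()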

After all the pages have been collected, I was planning to use XPath to extract only the fields I need from the HTML.

But somehow, when I copy and paste an XPath from Google Chrome, it does not work.
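
Roughly, again as a Python sketch (the XPath expressions below are placeholders, not the real ones for the Buscapé page), the extraction step I have in mind looks like this:

# Rough sketch only: read each saved HTML file and pull out product name
# and price with XPath. The XPath expressions are placeholders, not the
# real Buscapé ones. Requires the third-party "lxml" package.
import glob
import os
from lxml import html

SAVED_DIR = r"C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja"
NAME_XPATH = "//a[contains(@class, 'nome-produto')]/text()"   # placeholder
PRICE_XPATH = "//span[contains(@class, 'preco')]/text()"      # placeholder

def extract_products(saved_dir=SAVED_DIR):
    rows = []
    for path in glob.glob(os.path.join(saved_dir, "*.html")):
        tree = html.parse(path)
        names = [n.strip() for n in tree.xpath(NAME_XPATH)]
        prices = [p.strip() for p in tree.xpath(PRICE_XPATH)]
        rows.extend(zip(names, prices))
    return rows

if __name__ == "__main__":
    for name, price in extract_products():
        print(name, price)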

 

Could you please help me create a simple example process for this?

 

Thanks in advance.
