Dealing with JSON (downloading files with Crawl Web)
Loky
New Altair Community Member
Hi everyone.
Since I'm very new to RapidMiner, I have a few questions for you.
Here is what I'm trying to do:
I have an Excel file filled with URLs, and each of those URLs is going to be crawled. Everything went fine until now: all my tests with HTML pages worked perfectly. The problem is that my URLs now return JSON files. I'm trying to store those files, but I get no results.
Here is my process:
Do any of you have any ideas for me? Maybe some User-Agent tricks so I can actually "see" those JSON files as text?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<process expanded="true" height="550" width="1150">
<operator activated="true" class="retrieve" compatibility="5.1.011" expanded="true" height="60" name="Retrieve (2)" width="90" x="112" y="300">
<parameter key="repository_entry" value="URLs"/>
</operator>
<operator activated="true" class="loop_examples" compatibility="5.1.011" expanded="true" height="94" name="Loop Examples" width="90" x="514" y="300">
<process expanded="true" height="969" width="547">
<operator activated="true" class="extract_macro" compatibility="5.1.011" expanded="true" height="60" name="Extract Macro" width="90" x="380" y="300">
<parameter key="macro" value="website_url"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="statistics" value="max"/>
<parameter key="attribute_name" value="A"/>
<parameter key="example_index" value="%{example}"/>
</operator>
<operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="380" y="390">
<parameter key="url" value="%{website_url}"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".*"/>
<parameter key="store_with_matching_url" value=".*"/>
</list>
<parameter key="output_dir" value="C:\Users\ls\Desktop\test"/>
<parameter key="extension" value="json"/>
<parameter key="max_pages" value="5"/>
<parameter key="max_depth" value="1"/>
<parameter key="domain" value="server"/>
<parameter key="max_page_size" value="5000"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0.2) Gecko/20100101 Firefox/6.0.2 "/>
</operator>
<connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_port="example set"/>
<connect from_op="Crawl Web" from_port="Example Set" to_port="output 1"/>
<portSpacing port="source_example set" spacing="234"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve (2)" from_port="output" to_op="Loop Examples" to_port="example set"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="54"/>
</process>
</operator>
</process>
Thanks a lot in advance,
Loky.
Answers
Managed to make this work. All you need to do is generate an attribute and use it for the generated file names. If you don't, your stored files will overwrite one another.
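For anyone who wants the same idea outside RapidMiner, here is a minimal Python sketch of that fix: fetch each URL as JSON and derive a distinct filename per row (the loop index plays the role of the generated attribute), so downloads never overwrite each other. The URLs, output directory, and function names here are placeholders, not part of the original process.

```python
import json
import urllib.request
from pathlib import Path

def unique_name(index, prefix="result", ext="json"):
    """Build a distinct filename per example so stored files don't overwrite."""
    return f"{prefix}_{index}.{ext}"

def save_json_responses(urls, out_dir, user_agent="Mozilla/5.0"):
    """Fetch each URL and store the JSON response under a unique filename."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        # Some servers only serve JSON cleanly to browser-like user agents,
        # which mirrors the User-Agent parameter on the Crawl Web operator.
        req = urllib.request.Request(
            url,
            headers={"User-Agent": user_agent, "Accept": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        path = out / unique_name(i)  # e.g. result_0.json, result_1.json, ...
        path.write_text(json.dumps(data, indent=2))
        saved.append(path.name)
    return saved
```

Since every filename embeds the row index, five URLs yield five distinct files instead of one file overwritten five times.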