Text Mining - Industry 4
Hey,
I want to extract all the text from this page: http://www.plattform-i40.de/I40/Navigation/Karte/SiteGlobals/Forms/Formulare/EN/map-use-cases-formular.html and build a table of factors extracted from that text: each row is a case, and each column is one piece of data extracted from the text. I think I'll use 6 columns: Value Creation, Product Examples, Region, ...
Then I want to compare these data to find out which cases best match an external given case. For instance: given case X matches the company on line 35 at 80%, the company on line 118 at 60%, etc.
Do you know how I can do all of that?
It's for my Master Thesis.
Thanks a lot,
Charles
Answers
-
To summarize the first part of your question: you want to scrape this web page and obtain the information it contains. So how do you do this?
This web page is clearly built from a combination of HTML, CSS and JavaScript. See the picture. All of the information is there, but not all of it is in plain HTML, so "traditional" web scraping doesn't give the required results. The data is still available, but you have to do something smarter, such as using XPath on the page document to find and retrieve each individual piece of (AJAX/JavaScript) data. You can do that in RapidMiner: have a look at the tutorial by a guy called El Chief on YouTube. https://www.youtube.com/watch?v=vKW5yd1eUpA
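For the second part of the question (how well a given case X fits each row of the table), one possible starting point is a set-overlap score. The sketch below is only an illustration in Python, not a RapidMiner process: it uses Jaccard similarity, and the row numbers, attribute values and the names `cases`, `case_x` and `rank_matches` are all made up for the example. A real thesis would probably want weighting per column or a text-similarity measure instead.

```python
def jaccard(a, b):
    """Share of attribute values two cases have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical rows of the extracted table: row number -> attribute values.
cases = {
    35: {"software solution", "production", "Bavaria", "sme"},
    118: {"software solution", "logistics", "Hesse", "sme"},
    7: {"robotics", "design", "Berlin"},
}


def rank_matches(given, table):
    """Return (row, score) pairs, best match first."""
    scores = [(row, jaccard(given, values)) for row, values in table.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical external case to compare against the table.
case_x = {"software solution", "production", "Bavaria", "large enterprise"}
ranking = rank_matches(case_x, cases)

for row, score in ranking:
    print(f"line {row}: {score:.0%}")
```

The score is the number of shared values divided by the number of distinct values across both cases, so it reads directly as a percentage fit.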
0 -
Hey,
Thanks a lot for answering. I didn't manage to extract data from the HTML page. The link you sent seems very useful, but the classes used are not exactly the same and I can't find the correct XPath to extract the data.
Could you help me, if you know how to extract data from HTML correctly?
From this page: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/082-research-and-development-center-in-the-field-of-industrial-automation/article-research-and-development-center-in-the-field-of-industrial-automation.html , I want to extract Manufacturing industry and to automatically link it with Application example.
Thanks,
Charles
0 -
hello @charlesmrt - welcome to the community. It was my hope that @ey's nice "Read HTML Table" operator would do the trick here but alas it did not. However using "Get Page" and "Extract Content" gets you pretty far:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.000-BETA">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.000-BETA" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="85">
<parameter key="url" value="http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/082-research-and-development-center-in-the-field-of-industrial-automation/article-research-and-development-center-in-the-field-of-industrial-automation.html"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="313" y="85">
<parameter key="minimum_text_block_length" value="2"/>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Scott
0 -
Hey,
Thanks for answering. In the attached file you can see the HTML. I just want to extract "software solution". I tried "//*[contains(.,'Product example')]/../span[last()]" and "//*[contains(.,'Product example')]/../span[1]", but neither works. How can I do this?
Thanks,
Charles
0 -
Oh, that seems very complicated. I would use RegEx.
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="85">
<parameter key="url" value="http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/082-research-and-development-center-in-the-field-of-industrial-automation/article-research-and-development-center-in-the-field-of-industrial-automation.html"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="246" y="85">
<parameter key="minimum_text_block_length" value="2"/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="7.5.000" expanded="true" height="68" name="Cut Document" width="90" x="380" y="85">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="Product Example" value="(?&lt;=Product\sexample\s).*(?=Value\screation)"/>
</list>
<list key="regular_region_queries">
<parameter key="Product Example" value="Product\\sexample\\s.\\sValue"/>
</list>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents" width="90" x="514" y="85"/>
<connect from_op="Get Page" from_port="output" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
<connect from_op="Combine Documents" from_port="document" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Scott
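To see what the lookbehind/lookahead idea in the Cut Document query does outside RapidMiner, here is a small Python sketch. The sample string is a hypothetical stand-in for what "Extract Content" returns, not the real page text. It also differs slightly from the query above: it uses a lazy `.*?` and puts a `\s` in the lookahead so the captured value has no trailing space.

```python
import re

# Hypothetical stand-in for the extracted page text (labels followed by values).
page_text = (
    "Application example Research and development center "
    "Product example software solution Value creation Production"
)

# (?<=...) requires the label just before the match without including it;
# (?=...) requires the next label just after the match without consuming it.
pattern = re.compile(
    r"(?<=Product\sexample\s).*?(?=\sValue\screation)",
    re.DOTALL,
)

match = pattern.search(page_text)
product_example = match.group(0) if match else None
print(product_example)
```

If the label is missing, `search` returns `None`, so the sketch falls back to `None` rather than raising an error.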
0 -
Thanks,
I found another way to do it: I downloaded the HTML pages to my computer with "Download them all", then used text processing and Extract Information with a regular expression. I obtained a table containing all the information.
But I still have a question. With a regular expression, I can extract only one expression per column of my table, because each query expression is unique, but sometimes there are several matches for one attribute name. How can I get multiple matches into one column? I used "|", but that gives a disjunction of the elements, not an accumulation.
Thanks,
Charles
0
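On the last question: in a regular expression, `|` is alternation, so it matches one branch or the other and cannot accumulate values. The usual approach is to collect every match of the same pattern and then join them into one cell. The Python sketch below only illustrates that idea. The sample text, the `Label:` layout and the `; ` separator are assumptions, and it is not a statement about how RapidMiner's Extract Information operator behaves.

```python
import re

# Hypothetical document in which one attribute name occurs more than once.
text = (
    "Product example: software solution "
    "Product example: sensor kit "
    "Region: Bavaria"
)

# Capture the value after each "Product example:" label, stopping at the next
# known label or at the end of the text.
pattern = re.compile(
    r"Product example:\s*(.+?)(?=\s+(?:Product example|Region):|$)"
)

# findall returns every captured value, not just the first one.
values = pattern.findall(text)

# Accumulate all matches into a single table cell.
cell = "; ".join(values)
print(cell)
```

The same cell could then be split again on `; ` later if the individual values are needed for matching.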