Hi,
I have an Excel file with 330+ links to web pages. I need to extract the text from all of these pages for a clustering task.
I can't achieve this with the usual process:
Get Pages >> Data to Documents >> Loop Collection >> Extract Content >> Documents to Data.
The problem is that, for every page, the operators only retrieve what "view page source" shows in the browser, not the text the browser displays once the page's JavaScript has run. As a result, the Text attribute comes out empty.
I tested with a single link (https://dre.pt/dre/detalhe/despacho/3219-2020-130112149) using the Get Page operator. This is what I get in the extracted document:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="format-detection" content="telephone=no" />
<script type='text/javascript'>window.OutSystemsApp = { basePath: '/dre/' };</script>
<meta http-equiv="Content-Security-Policy" content="base-uri 'self'; child-src * gap:; frame-src * gap:; connect-src *; default-src 'self' 'unsafe-inline' *.google-analytics.com *.hotjar.com *.googletagmanager.com *.dre.pt *.hotjar.io *.doubleclick.net *.knightlab.com *.google.com *.google.pt gap: 'unsafe-inline' 'unsafe-eval'; font-src 'self' data:; img-src * blob:; script-src 'unsafe-inline' * 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; frame-ancestors *.incm.pt *.dre.pt 'self' gap:; report-uri /SecurityUtils/rest/Report/ReportViolations?Params=RyAoRWX6RljInm%2B3hwnwrmeQNc96mBOvSkaT58%2FC4zhhZv0xIQAa3h3ft2scL67pequ212Wx6csuqpGp8%2B%2F%2B%2Bw%3D%3D; " />
<meta http-equiv="X-Content-Security-Policy" content="base-uri 'self'; child-src * gap:; frame-src * gap:; connect-src *; default-src 'self' 'unsafe-inline' *.google-analytics.com *.hotjar.com *.googletagmanager.com *.dre.pt *.hotjar.io *.doubleclick.net *.knightlab.com *.google.com *.google.pt gap: 'unsafe-inline' 'unsafe-eval'; font-src 'self' data:; img-src * blob:; script-src 'unsafe-inline' * 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; frame-ancestors *.incm.pt *.dre.pt 'self' gap:; report-uri /SecurityUtils/rest/Report/ReportViolations?Params=RyAoRWX6RljInm%2B3hwnwrmeQNc96mBOvSkaT58%2FC4zhhZv0xIQAa3h3ft2scL67pequ212Wx6csuqpGp8%2B%2F%2B%2Bw%3D%3D; " />
<meta http-equiv="X-WebKit-CSP" content="base-uri 'self'; child-src * gap:; frame-src * gap:; connect-src *; default-src 'self' 'unsafe-inline' *.google-analytics.com *.hotjar.com *.googletagmanager.com *.dre.pt *.hotjar.io *.doubleclick.net *.knightlab.com *.google.com *.google.pt gap: 'unsafe-inline' 'unsafe-eval'; font-src 'self' data:; img-src * blob:; script-src 'unsafe-inline' * 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; frame-ancestors *.incm.pt *.dre.pt 'self' gap:; report-uri /SecurityUtils/rest/Report/ReportViolations?Params=RyAoRWX6RljInm%2B3hwnwrmeQNc96mBOvSkaT58%2FC4zhhZv0xIQAa3h3ft2scL67pequ212Wx6csuqpGp8%2B%2F%2B%2Bw%3D%3D; " />
<meta name="viewport" content="viewport-fit=cover, width=device-width, initial-scale=1" />
<script type="text/javascript">
(function () {
function appendMetaTagAttributes(metaTag, attribute, values) {
var elem = document.querySelector("meta[name=" + metaTag + "]");
if (elem) {
var attrContent = elem.getAttribute(attribute);
elem.setAttribute(attribute, (attrContent ? attrContent + "," : "") + values.join(","));
}
}
if (navigator && /OutSystemsApp/i.test(navigator.userAgent)) {
// If this app is running on the native shell, we want to disable the zoom
appendMetaTagAttributes("viewport", "content", ["user-scalable=no", "minimum-scale=1.0"]);
}
})();</script>
<script type="text/javascript" src="/dre/scripts/OutSystemsManifestLoader.js?3F3fZzzNKkqKoP2DsjtxFw"></script>
<script type="text/javascript" src="/dre/scripts/OutSystems.js?RnlDcii3Xz75iIHHERIZtA"></script>
<script type="text/javascript" src="/dre/scripts/OutSystemsReactView.js?0bmp5RZ49TZneVNXnO6ymw"></script>
<script type="text/javascript" src="/dre/scripts/cordova.js?7KqI9_oL9hClomz1RdzTqg"></script>
<script type="text/javascript" src="/dre/scripts/NullDebugger.js?pG_2wlzY3NYiuKZRtoLyQQ"></script>
<script type="text/javascript" src="/dre/scripts/DRE.appDefinition.js?otw_Nv9Nr+Q7EbWK92qVcw"></script>
<script type="text/javascript" src="/dre/scripts/OutSystemsReactWidgets.js?IdWooa_erXOfwU01FQUTuA"></script>
<link type="text/css" rel="stylesheet" href="/dre/css/_Basic.css?EqGzAe81QbZLXJyfY3oLwA"></link>
<script type="text/javascript">OSManifestLoader.indexVersionToken = "8Ah8C5iCZm4zS2Ya5zFMJg";
</script>
</head>
<body>
<div id="reactContainer"></div>
<noscript><span>JavaScript is required</span></noscript>
<script type="text/javascript" src="/dre/scripts/DRE.index.js?IcuQoXtODBlF5z87QVycVQ"></script>
</body>
</html>
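To check that this is not just a RapidMiner issue, here is a minimal Python sketch (standard library only, run outside RapidMiner) that pulls the visible text out of a trimmed copy of the source above. The names `VisibleText`, `visible_text` and `STATIC_SOURCE` are my own, and skipping `<noscript>` is an assumption meant to mimic a browser with JavaScript enabled. It suggests the static source simply carries no text: the body is an empty `reactContainer` div that the scripts fill in later.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that sits outside <script>, <style> and <noscript>."""

    # Assumption: <noscript> is skipped to mimic a JavaScript-enabled browser.
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Trimmed copy of the document returned for the test link.
STATIC_SOURCE = """<!DOCTYPE html>
<html>
<head>
<script type='text/javascript'>window.OutSystemsApp = { basePath: '/dre/' };</script>
</head>
<body>
<div id="reactContainer"></div>
<noscript><span>JavaScript is required</span></noscript>
<script type="text/javascript" src="/dre/scripts/DRE.index.js?IcuQoXtODBlF5z87QVycVQ"></script>
</body>
</html>"""

print(repr(visible_text(STATIC_SOURCE)))  # prints ''
```

So any tool that only downloads the HTML, without executing the scripts, would presumably end up with the same empty result.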
How can I get the text from this? Can you help me, please?