Process Web Spanish

Xannix
Xannix New Altair Community Member
edited November 5 in Community Q&A
Hi everyone!
I'm trying the "Process Web" operator on Spanish-language pages and I'm having problems with accented characters.
The web page declares "charset=iso-8859-1", so I set the encoding parameter to "iso-8859-1", but it doesn't work. (I have tried all the usual encodings.)
The curious thing is that "Crawl Web" works, but only if I check "write pages into files"; if I don't, it doesn't work either.

Is this a bug?

Does anyone know how I can solve it?

Thanks : )

Answers

  • Xannix
    Xannix New Altair Community Member
    I've discovered that this problem happens only sometimes, and I don't know why.
    In this process you can see that the attribute "Introduccion" has different values depending on the method:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" expanded="true" name="Process">
       <parameter key="encoding" value="ISO-8859-1"/>
       <process expanded="true" height="325" width="685">
         <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
           <parameter key="url" value="http://www.madrimasd.org/informacionidi/noticias/default.asp?Page=1&amp;Tipo=2"/>
           <list key="crawling_rules">
             <parameter key="2" value="http://www.madrimasd.org/noticias/.*"/>
             <parameter key="0" value="http://www.madrimasd.org/noticias/.*"/>
           </list>
           <parameter key="add_pages_as_attribute" value="true"/>
           <parameter key="output_dir" value="C:\"/>
           <parameter key="extension" value="htm"/>
           <parameter key="max_pages" value="3"/>
           <parameter key="delay" value="100"/>
           <parameter key="max_threads" value="3"/>
           <parameter key="max_page_size" value="1000"/>
         </operator>
         <operator activated="true" class="web:process_web" expanded="true" height="60" name="Process Web" width="90" x="45" y="120">
           <parameter key="url" value="http://www.madrimasd.org/informacionidi/noticias/default.asp?Page=1&amp;Tipo=2"/>
           <list key="crawling_rules">
             <parameter key="2" value="http://www.madrimasd.org/noticias/.*"/>
             <parameter key="0" value="http://www.madrimasd.org/noticias/.*"/>
           </list>
           <parameter key="add_pages_as_attribute" value="true"/>
           <parameter key="max_pages" value="3"/>
           <parameter key="delay" value="100"/>
           <parameter key="max_threads" value="3"/>
           <process expanded="true" height="422" width="752">
             <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="112" y="30"/>
             <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="313" y="30">
               <parameter key="query_type" value="XPath"/>
               <list key="string_machting_queries"/>
               <list key="regular_expression_queries"/>
               <list key="regular_region_queries"/>
               <list key="xpath_queries">
                 <parameter key="Introduccion" value="//h:p/text()"/>
               </list>
               <list key="namespaces"/>
               <list key="index_queries"/>
             </operator>
             <connect from_port="document" to_op="Transform Cases" to_port="document"/>
             <connect from_op="Transform Cases" from_port="document" to_op="Extract Information" to_port="document"/>
             <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" breakpoints="after" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="246" y="30">
           <parameter key="source_attribute" value="Page"/>
           <parameter key="query_type" value="XPath"/>
           <list key="string_machting_queries">
             <parameter key="parrafismo" value="&lt;p&gt;.&lt;/p&gt;"/>
           </list>
           <list key="regular_expression_queries">
             <parameter key="Jurjur" value="Sin(.*)Blasco"/>
           </list>
           <list key="regular_region_queries"/>
           <list key="xpath_queries">
             <parameter key="Introduccion" value="//h:p/text()"/>
           </list>
           <list key="namespaces"/>
           <list key="index_queries"/>
           <parameter key="value_seperator" value="***"/>
         </operator>
         <connect from_op="Crawl Web" from_port="Example Set" to_op="Generate Extract" to_port="Example Set"/>
         <connect from_op="Process Web" from_port="example set" to_port="result 2"/>
         <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>

  • land
    land New Altair Community Member
    Hi,
    I think this is an issue with the encoding of the web page. It's rather difficult to always detect the correct encoding if the web page doesn't specify it. We usually assume UTF-8 if nothing is specified in the HTML document.
    You could try requesting the web pages manually in an appropriate terminal program and check whether the encoding is correct. If not, you might add a bug to the tracker with a detailed example process. This would make my life much easier and would speed up the fix :)

    Greetings,
      Sebastian
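
The manual check suggested above could also be scripted. The following is a minimal sketch in Python, not RapidMiner's actual detection logic: it looks for a charset in the HTTP Content-Type header value first, then in a meta tag of the raw page bytes, and falls back to UTF-8 as described in the reply. The function names and the 4096-byte search window are assumptions made for illustration.

```python
import re

# Matches charset=... inside a <meta ...> tag, with or without quotes.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)
HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    if not content_type:
        return None
    match = HEADER_CHARSET.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(raw_html):
    """Return the charset declared in a <meta> tag of the raw bytes, or None."""
    # Only the start of the page is searched (assumed window of 4096 bytes).
    match = META_CHARSET.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def pick_encoding(content_type, raw_html, default="utf-8"):
    """Prefer the header, then the meta tag, then the assumed default."""
    return (
        charset_from_header(content_type)
        or charset_from_meta(raw_html)
        or default
    )
```

The raw bytes and the header value can be obtained, for example, with `urllib.request.urlopen(url)`, reading `response.headers.get("Content-Type")` and `response.read()`. If the header carries no charset and only the meta tag names iso-8859-1, a reader that stops at the header would fall back to UTF-8 and garble the accents.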
  • Xannix
    Xannix New Altair Community Member
    I'm not sure if I understand you...

    I can see these pages in my browser, and I've seen this in the source code of the page:
    <META http-equiv=Content-Type content="text/html; charset=iso-8859-1"> (I'm not sure if this is what you are referring to)

    You told me to request the web pages in an appropriate terminal program... (a browser? Sorry, I don't know what you are trying to tell me.)

    In the example, you can see that the "Process Web" operator replaces the accents with a symbol, but with the "Crawl Web" operator the accents are written correctly (but only if "write pages into files" is checked).

    I would like to help fix it, but I don't know how.

    Thanks for everything
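
The symptom described in this post (accents replaced by a symbol) is what typically appears when ISO-8859-1 bytes are decoded as UTF-8. The following Python sketch reproduces it on a sample word; whether this is the actual cause inside the operator is an assumption, and `decode_page` is a hypothetical helper, not part of any RapidMiner API.

```python
def decode_page(raw_bytes, encoding):
    """Decode raw page bytes, replacing undecodable bytes with U+FFFD."""
    return raw_bytes.decode(encoding, errors="replace")


# Sample text as the server would send it in ISO-8859-1 (the accented o is byte 0xF3).
raw = "Introducción".encode("iso-8859-1")

as_utf8 = decode_page(raw, "utf-8")         # wrong guess: accent becomes U+FFFD
as_latin1 = decode_page(raw, "iso-8859-1")  # declared charset: accent preserved

print(as_utf8)
print(as_latin1)
```

The byte 0xF3 is not a valid standalone UTF-8 sequence, so the decoder substitutes the replacement character, while decoding with the charset declared by the page keeps the accent.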
  • land
    land New Altair Community Member
    Hi,
    I have added a bug to the bug tracker. We will solve it as soon as possible.

    Greetings,
      Sebastian
  • Xannix
    Xannix New Altair Community Member
    Thanks for everything : )