How to set DTD parameter in FeatureExtraction (rapidminer UI)

skarab
skarab New Altair Community Member
edited November 5 in Community Q&A
I ask because I keep getting an IOException thrown by FeatureExtraction:

Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Regards,
skarab

Answers

  • land
    land New Altair Community Member
    Hi,
    I'm sorry, but what exactly are you doing? It would be easiest if you posted the process along with a short explanation. And to motivate other users to answer your questions, it could be a smart move to start your message with something like "hello"...

    Greetings,
      Sebastian
  • skarab
    skarab New Altair Community Member
    I am parsing an HTML page; here is the code:
    <operator name="FeatureExtraction" class="FeatureExtraction" breakpoints="before,within,after">
        <list key="texts">
            <parameter key="tmp_file" value="%{parent_path}\tmp%{file_name}\%{file_name}"/>
        </list>
        <parameter key="default_content_type" value="html"/>
        <parameter key="default_content_encoding" value="UTF-8"/>
        <parameter key="default_content_language" value="pl"/>
        <parameter key="use_content_attributes" value="true"/>
        <parameter key="id_attribute_type" value="long"/>
        <list key="attributes">
            <parameter key="html" value="/h:html"/>
        </list>
        <list key="namespaces">
            <!-- I tried to set it in namespaces -->
            <parameter key="html" value="C:\\workspace-rapidminer\xhtml1-transitional.dtd"/>
        </list>
    </operator>
  • land
    land New Altair Community Member
    Hi,
    I don't think the namespace is needed, nor is it correctly defined, so the easiest solution would be to remove this parameter.
    In any case, it is only used for XPath queries on more complicated XML objects. I have never had to use it for HTML.

    Greetings,
      Sebastian
  • skarab
    skarab New Altair Community Member
    Hi,

    Defining the namespace makes no difference in my case; I still get this exception. I am using Java 1.6.0.16 on Windows Vista.

    Regards
    Skarab
  • skarab
    skarab New Altair Community Member
    Hi,

    I solved the problem.

    First I removed the existing DOCTYPE declaration, matched by the regex
    <!DOCTYPE html PUBLIC [^>]*>
    using TextCleaner.

    After that I prepended a DOCTYPE that points to a local copy of the DTD, using SingleTextObjectInput with this text:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "C:\workspace-rapidminer\xhtml1-transitional.dtd" >%{loop_value}

    Here is my brute-force solution (I get an HTML page as a TextObject):

    <operator name="TextCleaner" class="TextCleaner">
        <parameter key="deletion_regex" value="&lt;!DOCTYPE html PUBLIC [^&gt;]*&gt;"/>
    </operator>
    <operator name="TextObject2ExampleSet" class="TextObject2ExampleSet">
        <parameter key="keep_text_object" value="true"/>
        <parameter key="text_attribute" value="my_doc_text"/>
        <parameter key="label_attribute" value="my_doc_label"/>
    </operator>
    <operator name="ValueIterator" class="ValueIterator" expanded="yes">
        <parameter key="attribute" value="my_doc_text"/>
        <operator name="SingleTextObjectInput" class="SingleTextObjectInput">
            <parameter key="text" value="&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;C:\workspace-rapidminer\xhtml1-transitional.dtd&quot; &gt;%{loop_value}"/>
        </operator>
    </operator>



    Regards,
    Wojtek
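
For readers who want to see the text transformation outside RapidMiner, the two string steps from the workaround (delete the original DOCTYPE, then prepend one whose system identifier is a local file) can be sketched in plain Java. This is only an illustration of the string handling, not RapidMiner code: the class name DoctypeRewriter and its method names are made up for this example, and the local path is simply the one quoted in the thread. Whether a given XML parser accepts that path as a system identifier is not tested here.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; not part of RapidMiner.
class DoctypeRewriter {
    // Same pattern as the TextCleaner deletion_regex in the thread.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE html PUBLIC [^>]*>");

    // DOCTYPE pointing at a local DTD copy; the path is the one quoted in the
    // thread and is an assumption about where the file was saved.
    static final String LOCAL_DOCTYPE =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"C:\\workspace-rapidminer\\xhtml1-transitional.dtd\" >";

    // Step 1: remove any existing DOCTYPE declaration that matches the pattern.
    static String stripDoctype(String html) {
        return DOCTYPE.matcher(html).replaceAll("");
    }

    // Step 2: prepend the DOCTYPE that references the local DTD.
    static String withLocalDoctype(String html) {
        return LOCAL_DOCTYPE + stripDoctype(html);
    }

    public static void main(String[] args) {
        String page =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        System.out.println(withLocalDoctype(page));
    }
}
```

The pattern only matches declarations that start with exactly "<!DOCTYPE html PUBLIC", so pages with a different DOCTYPE form would pass through the first step unchanged and end up with two declarations.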