"[SOLVED] Accessing Text Data in Blob in MySQL"

Datadude
Datadude New Altair Community Member
edited November 5 in Community Q&A
I'm trying to access text stored in a BLOB field in MySQL. However, as I'm putting my job together, I'm getting an error message:

The example set must contain at least one text attribute.  

It doesn't seem like RapidMiner understands that there might be text data in that binary field. I'm trying to figure out how to do the conversion so that RapidMiner understands what is going on here. I'm attempting to connect my Read Database component to my Process Documents from Data component so that Process Documents from Data can operate on the text residing in the binary field. Is there a way to do this conversion in RapidMiner?

Answers

  • Hello,

    It's probably because you need to change the type of the attribute you're interested in to text. Use the "Nominal to Text" operator.

    Regards,

    Andrew
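
    If the BLOB content still does not come through as readable text, one option is to decode it in the query itself. This is a minimal sketch, assuming the column holds UTF-8 text; the `utf8` charset and the `your-page-id` value are placeholders and do not come from this thread. Nominal to Text would still be needed afterwards, since this only controls how MySQL decodes the bytes.

    ```xml
    <operator activated="true" class="read_database" compatibility="5.2.008" expanded="true" name="Read Database">
      <parameter key="connection" value="Local MySQL Nutch"/>
      <!-- CONVERT(... USING utf8) asks MySQL to decode the BLOB bytes as UTF-8 text. -->
      <!-- The utf8 charset and the 'your-page-id' value are placeholders (assumptions). -->
      <parameter key="query" value="SELECT CONVERT(content USING utf8) AS content&#10;FROM webpage WHERE Id = 'your-page-id'"/>
      <enumeration key="parameters"/>
    </operator>
    ```

    Keeping the alias as `content` means the downstream `attribute` parameter would not need to change.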
  • Datadude
    Datadude New Altair Community Member
    OK, so I did that and I seem to be getting to the next step. Thanks, but now I'm getting another exception:

    Dec 16, 2012 11:39:00 PM SEVERE: Process failed: operator cannot be executed (java.lang.String cannot be cast to org.jdom.Text). Check the log messages...
    Dec 16, 2012 11:39:00 PM SEVERE: Here:          Process[1] (Process)
              subprocess 'Main Process'
                +- Read Database[1] (Read Database)
                +- Nominal to Text[1] (Nominal to Text)
                +- Process Documents from Data[1] (Process Documents from Data)
              subprocess 'Vector Creation'
          ==>        +- Extract Information[1] (Extract Information)

    Is there something I need to do before I can process an XML string with XPath?
  • The best thing to do is to post your process so we can see the details.

    Andrew
  • Datadude
    Datadude New Altair Community Member
    Here it is:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <parameter key="logfile" value="/Users/wardloving/Documents/Data Mining/log.out"/>
        <parameter key="resultfile" value="/Users/wardloving/Documents/Data Mining/results.out"/>
        <process expanded="true" height="116" width="614">
          <operator activated="true" class="read_database" compatibility="5.2.008" expanded="true" height="60" name="Read Database" width="90" x="45" y="30">
            <parameter key="connection" value="Local MySQL Nutch"/>
            <parameter key="query" value="SELECT content&#10;FROM webpage where Id = 'org.episcopalchurch.www:http/parish/all-saints-episcopal-church-vista-ca'"/>
            <enumeration key="parameters"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="246" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="content"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="514" y="30">
            <parameter key="create_word_vector" value="false"/>
            <list key="specify_weights"/>
            <process expanded="true" height="252" width="1095">
              <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="112" y="75">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="Name" value="substring-before(//title, ',')"/>
                  <parameter key="Staff Name" value="substring-before(//*[@class = 'field field-type-text field-field-clergy']/div/div/node()[not(self::div)],',')"/>
                  <parameter key="Staff Title" value="substring-after(//*[@class = 'field field-type-text field-field-clergy']/div/div/node()[not(self::div)], ',')"/>
                  <parameter key="Address Line 1" value="//*[@class = 'street-address']"/>
                  <parameter key="City" value="//*[@class = 'locality']"/>
                  <parameter key="State" value="//*[@class = 'region']"/>
                  <parameter key="Zip" value="//*[@class = 'postal-code']"/>
                  <parameter key="Email" value="//*[@class = 'field field-type-text field-field-email']/div/div/node()[not(self::div)]"/>
                  <parameter key="Phone" value="//*[@class = 'field field-type-text field-field-phone']/div/div/node()[not(self::div)]"/>
                  <parameter key="Fax" value="//*[@class = 'field field-type-text field-field-fax']/div/div/node()[not(self::div)]"/>
                  <parameter key="URL" value="//*[@class = 'field field-type-text field-field-fax']/div/div/node()[not(self::div)]"/>
                  <parameter key="Twitter" value="//*[@class = 'field field-type-text field-field-twitter']/div/div/node()[not(self::div)]"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="36"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Database" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • Hello

    I tried with some local data, and it seems the output from the Read Database operator is already of the right type, so the conversion is not necessary. So I've learned something :)

    Try unchecking "assume html" on the Extract Information operator.

    Andrew
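
    In the process XML, unchecking that box would probably show up as a parameter on the Extract Information operator. This is a minimal sketch, assuming the key is spelled `assume_html`; that name is inferred from the GUI label and does not appear in the posted process, and only one of the original XPath queries is repeated here for brevity.

    ```xml
    <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" name="Extract Information">
      <parameter key="query_type" value="XPath"/>
      <!-- Assumed key name; value="false" corresponds to unchecking "assume html". -->
      <parameter key="assume_html" value="false"/>
      <list key="xpath_queries">
        <parameter key="City" value="//*[@class = 'locality']"/>
      </list>
    </operator>
    ```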
  • Datadude
    Datadude New Altair Community Member
    Thanks, awchisholm, for the tip. When I unchecked this option, the error stopped showing up. This is good. Unfortunately, it revealed that the HTML content in my database has been truncated or corrupted, which makes parsing with XPath difficult. Sometimes you just can't win, but at least I know what is going on.