"HTML Content Extraction Error"

mgstick New Altair Community Member
edited November 5 in Community Q&A
I'm crawling the web and then attempting to extract the content of the downloaded HTML pages using the Web Mining -> HTML Processing -> Extract Content operator.

My process successfully crawls the web and writes the HTML pages to disk. It then processes the Documents data set returned by the crawler with the Loop Collection operator, or at least it appears to execute the Unescape HTML Document operator on each Document. The sub-process inside the Loop Collection operator begins with the Unescape HTML Document operator and is then supposed to pass each Document to the Extract Content operator. At that point I get the following error (the full error output is further below):

          Process failed: org/apache/commons/lang/StringEscapeUtils (ProcessThread.run())
            java.lang.NoClassDefFoundError: org/apache/commons/lang/StringEscapeUtils

I've downloaded the Apache Commons Lang jar file (commons-lang-2.5.jar) and tried to make it available to RapidMiner, with no luck. I tried three things: adding it to my default CLASSPATH, adding it to the CLASSPATH in my Terminal via the Set command, and passing it explicitly on the RapidMiner command line, i.e. java -classpath lib/commons-lang-2.5.jar -jar lib/rapidminer.jar. In every case I still get the java.lang.NoClassDefFoundError: org/apache/commons/lang/StringEscapeUtils error.
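
A note on why these attempts may have had no effect: when the java launcher is started with -jar, the named jar is the source of all user classes, and both the CLASSPATH variable and the -classpath option are ignored. The small check below is only a diagnostic sketch, not part of RapidMiner. The class name ClasspathCheck and the helper isLoadable are made up for illustration. It prints the class path the JVM actually received and reports whether a given class can be found on it.

```java
public class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    // The class is looked up without running its static initializers.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the error message above.
        String name = args.length > 0
                ? args[0]
                : "org.apache.commons.lang.StringEscapeUtils";

        // Show what the JVM was really given as its class path.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Compiled and run with the jar listed on -cp together with the directory holding ClasspathCheck.class (and without -jar), the check should report the class as loadable. If it does, the jar itself is fine and the problem lies in how RapidMiner is being launched.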

I don't know what to try next. Any help would be greatly appreciated.

Thanks in advance for your help.


2010-10-28 11:58:03 SEVERE: Process failed: org/apache/commons/lang/StringEscapeUtils (ProcessThread.run())
  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringEscapeUtils
      com.rapidminer.operator.web.html.HTMLTextExtractionOperator.doWork(HTMLTextExtractionOperator.java:324)
      com.rapidminer.operator.text.io.AbstractTokenProcessor.doWork(AbstractTokenProcessor.java:60)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
      com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
      com.rapidminer.operator.collections.CollectionIterationOperator.doWork(CollectionIterationOperator.java:90)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
      com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
      com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:368)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.Process.run(Process.java:863)
      com.rapidminer.Process.run(Process.java:770)
      com.rapidminer.Process.run(Process.java:765)
      com.rapidminer.Process.run(Process.java:755)
      com.rapidminer.gui.ProcessThread.run(ProcessThread.java:65)
Caused by:
  java.lang.ClassNotFoundException: org.apache.commons.lang.StringEscapeUtils
      java.net.URLClassLoader$1.run(URLClassLoader.java:202)
      java.security.AccessController.doPrivileged(Native Method)
      java.net.URLClassLoader.findClass(URLClassLoader.java:190)
      java.lang.ClassLoader.loadClass(ClassLoader.java:307)
      java.lang.ClassLoader.loadClass(ClassLoader.java:248)
      com.rapidminer.operator.web.html.HTMLTextExtractionOperator.doWork(HTMLTextExtractionOperator.java:324)
      com.rapidminer.operator.text.io.AbstractTokenProcessor.doWork(AbstractTokenProcessor.java:60)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
      com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
      com.rapidminer.operator.collections.CollectionIterationOperator.doWork(CollectionIterationOperator.java:90)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
      com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
      com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:368)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.Process.run(Process.java:863)
      com.rapidminer.Process.run(Process.java:770)
      com.rapidminer.Process.run(Process.java:765)
      com.rapidminer.Process.run(Process.java:755)
      com.rapidminer.gui.ProcessThread.run(ProcessThread.java:65)
2010-10-28 11:58:03 SEVERE: Here:          Process[1] (Process)
          subprocess 'Main Process'
            +- Crawl Web[1] (Crawl Web)
            +- Multiply[1] (Multiply)
            +- Data to Documents[1] (Data to Documents)
            +- Loop Collection[1] (Loop Collection)
          subprocess 'Iteration'
                  +- Unescape HTML Document[1] (Unescape HTML Document)
      ==>        +- Extract Content[1] (Extract Content)
                  +- Write Document[0] (Write Document) (ProcessThread.run())

Answers

  • mgstick
    mgstick New Altair Community Member
    Hi,

    I found a way to add jar files to RapidMiner's CLASSPATH by modifying the RapidMinerGUI script and then launching RapidMiner using that script.

    Thanks, though no one replied :(
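
    The post does not show the actual script change. As a rough sketch of the general idea only, a launch command that puts the extra jar on the class path uses -cp with an explicit main class instead of -jar. Everything named here is an assumption for illustration: the class LaunchCommand, the helper build, and the main class com.rapidminer.gui.RapidMinerGUI, which is guessed from the script name and not confirmed by the thread.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LaunchCommand {

    // Builds: java -cp <jar1><sep><jar2>... <mainClass>
    // The separator is platform specific (":" on Unix-like systems, ";" on Windows).
    static List<String> build(String mainClass, List<String> jars) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(String.join(File.pathSeparator, jars));
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) {
        // Assumed main class and the jar paths from the question; adjust to the real install.
        List<String> cmd = build(
                "com.rapidminer.gui.RapidMinerGUI",
                Arrays.asList("lib/rapidminer.jar", "lib/commons-lang-2.5.jar"));
        System.out.println(String.join(" ", cmd));
    }
}
```

    The printed line is the kind of command a modified start script would end up running. The point of the design is that -cp is honored only when -jar is absent.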
  • land
    land New Altair Community Member
    Hi,
    sorry for that, but we are quite busy these days...

    One question: Do you have the latest RapidMiner AND TextProcessingExtension version installed?

    Greetings,
      Sebastian