Problem with text processing plugin
Hi all,
I encountered a problem with the RapidMiner (RM) text processing plugin.
The program was working fine before, but it now fails for some text files that contain non-ASCII characters.
The setup uses the "Process Documents from Files" operator, which contains the following chain:
Transform Cases -> Tokenize -> Filter Stopwords -> Stem -> Filter Tokens (by Length)
Is it a bug in the text processing plugin, or is something wrong with my setup/program? Thanks.
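For anyone unfamiliar with the operators, here is a rough plain-Java sketch of what the chain above does to a document. It is only an illustration and does not use the RapidMiner API: the class name, the stopword list, and the toy suffix-stripping stemmer are all assumptions made up for this example, and the real operators behave differently in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TokenPipelineSketch {

    // Hypothetical stopword list, for illustration only.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "is", "was"));

    // Toy stemmer (assumption): strips a few common English suffixes.
    // This is NOT the Porter stemmer and NOT what RapidMiner's Stem operator does.
    static String stem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    static List<String> process(String text, int minLength, int maxLength) {
        List<String> result = new ArrayList<>();
        // Transform Cases: lower-case the whole document.
        String lower = text.toLowerCase(Locale.ROOT);
        // Tokenize: split on anything that is not a Unicode letter,
        // so non-ASCII letters stay inside their tokens.
        for (String token : lower.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Filter Stopwords
            if (STOPWORDS.contains(token)) {
                continue;
            }
            // Stem
            String stemmed = stem(token);
            // Filter Tokens (by Length)
            if (stemmed.length() >= minLength && stemmed.length() <= maxLength) {
                result.add(stemmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "caf\u00e9" contains a non-ASCII letter (e with acute accent).
        System.out.println(process("The program was working fine: caf\u00e9 files FAILED.", 3, 25));
    }
}
```

Note that none of these steps involves XML, while the exceptions below are thrown from the HTML file reader (TagSoup/JDOM), before any of the inner operators get to see the document.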
--------------------------------------------------------------------------------------------------
SEVERE: Process failed: operator cannot be executed (The name "lnêäûð6ûonxßvâisÿˆïqwòb-ûfåàãwcû-kžîžìeî" is not legal for JDOM/XML Namespace prefixs: Namespace prefixes cannot contain the character "ˆ".). Check the log messages...
org.jdom.IllegalNameException: The name "lnêäûð6ûonxßvâisÿˆïqwòb-ûfåàãwcû-kžîžìeî" is not legal for JDOM/XML Namespace prefixs: Namespace prefixes cannot contain the character "ˆ".
...
...
---------------------------------------------------------------------------------------------------
Exception in thread "main" org.jdom.IllegalNameException: The name "home" is not legal for JDOM/XML attributes: XML names cannot begin with the character "h".
at org.jdom.Attribute.setName(Attribute.java:361)
at org.jdom.Attribute.<init>(Attribute.java:228)
at org.jdom.Attribute.<init>(Attribute.java:276)
at org.jdom.DefaultJDOMFactory.attribute(DefaultJDOMFactory.java:93)
at org.jdom.input.SAXHandler.startElement(SAXHandler.java:544)
at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:388)
at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:851)
at com.rapidminer.operator.text.io.filereader.HTMLFileReader.readStream(HTMLFileReader.java:72)
at com.rapidminer.operator.text.io.filereader.AbstractFileReader.readFile(AbstractFileReader.java:37)
at com.rapidminer.operator.text.io.FileDocumentInputIterator.next(FileDocumentInputIterator.java:94)
at com.rapidminer.operator.text.io.FileDocumentInputIterator.next(FileDocumentInputIterator.java:43)
at com.rapidminer.operator.text.io.AbstractDocumentInputOperator.doWork(AbstractDocumentInputOperator.java:228)
at com.rapidminer.operator.Operator.execute(Operator.java:833)
at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:379)
at com.rapidminer.operator.Operator.execute(Operator.java:833)
at com.rapidminer.Process.run(Process.java:925)
at com.rapidminer.Process.run(Process.java:848)
at com.rapidminer.Process.run(Process.java:807)
at com.rapidminer.Process.run(Process.java:802)
at com.rapidminer.Process.run(Process.java:792)
at Filter.filter(PornFilter.java:84)
at Filter.main(PornFilter.java:128)