"simple text extraction"
xtraplus
Hi,
I have one folder (I call it "prime" here) containing many subfolders, some of which contain HTML files. I want to read "prime" with the "Process Documents from Files" operator. Inside this operator I use "Extract Information" with the XPath //h:*[contains(.,"@")]/. Basically I want to extract the email addresses from my files.
I just give "Process Documents from Files" the path to "prime" as the text directory. Is that correct? I want the process to find the subfolders with the files in there.
This is the code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
<process expanded="true" height="161" width="279">
<operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="179" y="75">
<list key="text_directories">
<parameter key="all" value="C:\Users\Home\Desktop\Sites"/>
</list>
<parameter key="extract_text_only" value="false"/>
<parameter key="create_word_vector" value="false"/>
<process expanded="true" height="414" width="762">
<operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="279" y="96">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Mail" value="//h;*[contains(.,"@&quot;)]/."/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="36"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
When I start the process, it finishes after 0 seconds without anything being extracted.
How do you get it to work properly?
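(For reference, outside RapidMiner: a minimal sketch of what the intended XPath does, using Python's lxml and a made-up XHTML snippet; the prefix h is assumed to be bound to the XHTML namespace, as in the queries later in this thread.)

# Sketch only: try //h:*[contains(.,"@")] against a small sample page.
from lxml import etree

html = b'''<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>Contact: <a href="mailto:abc@abc.com">abc@abc.com</a></p></body>
</html>'''

tree = etree.fromstring(html)
ns = {"h": "http://www.w3.org/1999/xhtml"}  # namespace prefix used in the query

for node in tree.xpath('//h:*[contains(., "@")]', namespaces=ns):
    print(node.tag)  # prints the qualified tags of html, body, p and a

Note that contains(., "@") matches every element whose text content includes the @ sign, so all enclosing elements are returned as well; the posts further down narrow this to the a-tags.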
Answers
-
Hello,
As a start, replace the extract information operator with the tokenize operator.
regards
Andrew
-
Hi Andrew,
I would like to do it as demonstrated in this video:
http://www.youtube.com/watch?v=vKW5yd1eUpA&feature=player_embedded
When I hit start I get a "process failed" message: "A DocType cannot be added after the root element"
What does this mean?
When I add a /* to my directory I don't get this message, but this is different from the video. However,
when I start the process, it finishes after 0 seconds without anything being extracted.
Why should I use "tokenize" instead? I want to use a complex XPath query to extract certain information.
-
Hello
The XPath has to work on something, but I can't work out what the input XML looks like (and XPath is one of the "dirty dozen development" things that mere mortals should never have to worry about).
Knowing this "eternal verity", I tend to make everything look like a spreadsheet and then work from there.
Try tokenize and see what happens (it might not help but without the input data, it's difficult to say).
Cheers,
Andrew
-
Hi,
I tried tokenize, but nothing gets extracted. My input is just HTML files downloaded via "web crawl".
How do I make the HTML files look like spreadsheets, please?
-
Hello
It's difficult to say without the data but I would try some simpler XPath first and build from there. You could also set a breakpoint before and after the extract operator to see if this gives insight into what is happening.
regards
Andrew
-
xtraplus wrote:
When I hit start I get a "process failed" message: "A DocType cannot be added after the root element"
What does this mean?
Hi,
I received this error from time to time when crawling lots of pages, where some of them generated script errors. If possible, visit the URL of the page causing the process to stop in your browser. When the problem appeared for me, there were PHP error messages contained on the page. They were put at the very beginning of the generated HTML document, thus making the document invalid. The XPath interpreter seems to be restrictive about that. The error message says that a doctype is declared at a point where this isn't allowed, which means something must have been found prior to the declaration (which should usually be the first line). I wish this error would be ignored and the page would be skipped, but unfortunately it aborts the whole process.
You can probably use "Handle Exception" to keep the process running, but since a page may contain interesting content even though an error was generated, I used another approach. I just used a "Replace" operator for each page, replacing "(?is).*?(<!doctype)" by "$1", which should remove anything in front of the doctype declaration. This needs some additional computing time, but it helped me a lot.
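A minimal sketch of that replacement outside RapidMiner (plain Python re, with a made-up sample page), just to show what the expression does:

# Sketch: strip everything a broken page emits before its doctype declaration.
import re

page = "Warning: some PHP error message\n<!DOCTYPE html><html><body>abc@abc.com</body></html>"

# (?i) = case-insensitive, (?s) = dot matches newlines; the group keeps the matched "<!doctype"
cleaned = re.sub(r"(?is).*?(<!doctype)", r"\1", page, count=1)
print(cleaned)  # <!DOCTYPE html><html><body>abc@abc.com</body></html>

In RapidMiner the same pattern goes into the Replace operator with "$1" as the replacement, as described above.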
Regards
Matthias
-
Hi Andrew and Matthias,
Thanks, the breakpoint method seemed to work. I found corrupted HTML files.
When I filter with
//h:a[contains(@href,"@")]
I get this with RapidMiner:
<a xmlns="http://www.w3.org/1999/xhtml" shape="rect" href="mailto:abc@abc.com">abc@abc.com</a>
When I do the same XPath query with Google Docs, I just get:
abc@abc.com
How can I get the Google Docs result in RapidMiner?
Where do I have to place the "Handle Exception" operator in order to catch the error? It didn't seem to work in the places I tried.
Is there more involved in catching exceptions than placing the "Handle Exception" operator?
-
Hi,
do you receive the results from Google as plain text or as a hyperlink? Maybe the HTML code is just converted into a link? The XPath expression you are using should usually give you the whole a-tag, not just the text or the href-attribute. To get those, you can either append /text() for the link text, or preferably /@href for the content of the href-attribute.
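A quick sketch of the three variants outside RapidMiner (Python's lxml as a stand-in XPath engine, sample data made up):

from lxml import etree

html = (b'<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        b'<a shape="rect" href="mailto:abc@abc.com">abc@abc.com</a></body></html>')
tree = etree.fromstring(html)
ns = {"h": "http://www.w3.org/1999/xhtml"}

a = tree.xpath('//h:a[contains(@href, "@")]', namespaces=ns)[0]
print(etree.tostring(a).decode())  # the whole a-tag, like the RapidMiner result above
print(tree.xpath('//h:a[contains(@href, "@")]/text()', namespaces=ns))  # ['abc@abc.com']
print(tree.xpath('//h:a[contains(@href, "@")]/@href', namespaces=ns))   # ['mailto:abc@abc.com']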
If you put the operators that may cause an exception inside the "Handle Exception" operator, this should work, but I tested this only once some time ago. Later I always tried to adjust or "repair" the data that might cause problems for some operators. But the success of this depends on the error source and how creative it is.
Regards
Matthias
-
Hi Matthias,
Thanks, though "Handle Exception" around the offending operator does not prevent the process from failing.
It could be that I have too many corrupted HTML files to sort them out by hand.
Unfortunately the replacing takes too much processing time;
sorting them out by hand is probably my last option.
I get the "A DocType cannot be added after the root element" exception in an erratic manner:
one time the process fails at application 70, then I sort number 70 out, and next the process fails at 69, and so on.
-
Hi,
xtraplus wrote:
One time the process fails at application 70, then I sort 70 out and next I get the process failing at 69 and so on.
exactly because of this fact, replacing the errors seemed like a good solution to me. But I must agree, the runtime of the replace expression I first posted is far too high, since the whole document is scanned for the pattern. I also faced the runtime problem in my first attempts, and the solution wasn't very tricky. I'm sorry, it seems I copied the regex from one of the early processes. You just have to add one symbol so that only the beginning of the document is checked. Try this, it should speed things up dramatically:
(?is)^.*?(<!doctype)
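A tiny illustration (again plain Python, made-up data) of the difference the ^ makes: the pattern is now only tried once, anchored at the very start of the document, instead of at every position.

import re

broken = "PHP Warning: something went wrong<!DOCTYPE html><html/>"
clean = "<!DOCTYPE html><html/>"

pattern = r"(?is)^.*?(<!doctype)"
print(re.sub(pattern, r"\1", broken))  # junk before the doctype is removed
print(re.sub(pattern, r"\1", clean))   # pages that already start with a doctype stay unchanged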
Regards
Matthias
-
Hi,
quickly written - and without testing it - this might work:
<a[^>]*href\s*=\s*['"](.*?)['"][^>]*>
But the processing time will certainly be pretty high...
Replacing with my previously posted regex should at least eliminate all errors from this type: "A DocType cannot be added after the root element"
Regards
Matthias
Edit: Oops, the check for the @ sign is of course missing. That would require the use of an assertion, and I currently don't have the time to look this up, since I don't use them often. Otherwise you could collect all href values as above and then check them in a second step.
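A sketch of that two-step idea (plain Python, sample data made up): first collect all href values with the regex posted above, then keep only those containing an @ sign.

import re

html = '<p><a href="mailto:abc@abc.com">mail me</a> <a href="http://example.com">a site</a></p>'

# Step 1: collect every href value
hrefs = re.findall(r'''<a[^>]*href\s*=\s*['"](.*?)['"][^>]*>''', html)
# Step 2: keep only the ones that contain an @ sign
mails = [h for h in hrefs if "@" in h]
print(hrefs)  # ['mailto:abc@abc.com', 'http://example.com']
print(mails)  # ['mailto:abc@abc.com']
-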
Hi
thanks
Do you know a good site where I can look up regular expressions, please?
-
Hi,
this one should contain some useful information: http://www.regular-expressions.info/
There are some other sites in German, which should not be a problem for you:
http://www.regenechsen.de/phpwcms/index.php?regex_allg
http://www.sql-und-xml.de/regex/
Regards
Matthias