"simple text extraction"
Hi,
I have one folder (I'll call it "prime") containing many subfolders, some of which contain HTML files. I want to read "prime" with the "Process Documents from Files" operator. Inside this operator I use "Extract Information" with the XPath query //h:*[contains(.,"@")]/. Basically, I want to extract the email addresses from my files.
I just give "Process Documents from Files" the path to "prime" as the text directory. Is that correct? I want the process to find the subfolders there and the files inside them.
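To make the goal concrete, here is roughly what I am after, written as a plain Python sketch rather than as RapidMiner operators. The names extract_emails and EMAIL_RE are made up for illustration, the regular expression is deliberately simple and will miss some valid addresses, and I am assuming the files are readable as UTF-8:

```python
import re
from pathlib import Path

# Deliberately simple pattern (an assumption for this sketch);
# it does not cover every valid email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(root):
    """Return {file path: sorted unique addresses} for HTML files under root."""
    found = {}
    # rglob("*") walks root and all nested subfolders.
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            # Ignore undecodable bytes instead of failing on odd encodings.
            text = path.read_text(encoding="utf-8", errors="ignore")
            emails = sorted(set(EMAIL_RE.findall(text)))
            if emails:
                found[str(path)] = emails
    return found
```

Calling extract_emails(r"C:\Users\Home\Desktop\Sites") would then give one entry per HTML file that contains at least one address. That per-file result is what I would like the RapidMiner process to produce.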
When I start the process, it finishes after 0 s without extracting anything.
This is the code of the process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
<process expanded="true" height="161" width="279">
<operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="179" y="75">
<list key="text_directories">
<parameter key="all" value="C:\Users\Home\Desktop\Sites"/>
</list>
<parameter key="extract_text_only" value="false"/>
<parameter key="create_word_vector" value="false"/>
<process expanded="true" height="414" width="762">
<operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="279" y="96">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Mail" value="//h:*[contains(.,&quot;@&quot;)]/."/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="36"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
How do you get it to work properly?