[SOLVED] Importing data from a text file
Hi all,
I wonder if someone could give me some advice? I am looking to import data from a text file based on pattern/text matching. For example, given a text file like the one below, I want to extract the path after "Directory of" and the file count and size from the corresponding "File(s) ... bytes" summary line.
So, based on the text file below, I would have three records:
Path | Files | Size |
C:\Windows\addins | 1 | 802 |
C:\Windows\assembly | 14 | 5,582,046 |
C:\Windows\AppPatch\en-US | 1 | 292,352 |
Any help or hints would be greatly appreciated.
Carl
Directory of C:\Windows\addins
14/07/2009 06:32 <DIR> .
14/07/2009 06:32 <DIR> ..
10/06/2009 22:20 802 FXSEXT.ecf
1 File(s) 802 bytes
Directory of C:\Windows\assembly
12/05/2012 15:24 <DIR> .
12/05/2012 15:24 <DIR> ..
10/06/2009 21:39 66,728 big5.nlp
10/06/2009 21:39 82,172 bopomofo.nlp
10/06/2009 21:39 116,756 ksc.nlp
04/01/2012 04:34 4,567,040 mscorlib.dll
10/06/2009 21:40 59,342 normidna.nlp
10/06/2009 21:40 45,794 normnfc.nlp
10/06/2009 21:40 39,284 normnfd.nlp
10/06/2009 21:40 66,384 normnfkc.nlp
10/06/2009 21:40 60,294 normnfkd.nlp
10/06/2009 21:40 83,748 prc.nlp
10/06/2009 21:40 83,748 prcp.nlp
10/06/2009 21:40 262,148 sortkey.nlp
10/06/2009 21:40 20,320 sorttbls.nlp
10/06/2009 21:40 28,288 xjis.nlp
14 File(s) 5,582,046 bytes
Directory of C:\Windows\AppPatch\en-US
16/04/2011 03:24 <DIR> .
16/04/2011 03:24 <DIR> ..
20/11/2010 13:02 292,352 AcRes.dll.mui
1 File(s) 292,352 bytes
Answers
Hi all,
Almost there. I can get it to process individual files and extract the information, but not a single file containing multiple entries.
Below are the working code and sample files. Any help or hints would be greatly appreciated. Cheers,
Carl
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<parameter key="logverbosity" value="all"/>
<process expanded="true" height="100" width="145">
<operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
<list key="text_directories">
<parameter key="Folder" value="D:\RapidMiner\New folder"/>
</list>
<parameter key="file_pattern" value="*.txt"/>
<parameter key="extract_text_only" value="false"/>
<parameter key="create_word_vector" value="false"/>
<parameter key="keep_text" value="true"/>
<process expanded="true" height="719" width="1022">
<operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="447" y="30">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="Path" value="Directory of ([A-Za-z0-9:\\-]*)"/>
<parameter key="Files" value="([0-9]*) File\(s\)"/>
<parameter key="Size" value="([0-9,]*) bytes"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
File1.txt
Directory of C:\Windows\addins
14/07/2009 06:32 <DIR> .
14/07/2009 06:32 <DIR> ..
10/06/2009 22:20 802 FXSEXT.ecf
1 File(s) 802 bytes
File2.txt
Directory of C:\Windows\assembly
12/05/2012 15:24 <DIR> .
12/05/2012 15:24 <DIR> ..
10/06/2009 21:39 66,728 big5.nlp
10/06/2009 21:39 82,172 bopomofo.nlp
10/06/2009 21:39 116,756 ksc.nlp
04/01/2012 04:34 4,567,040 mscorlib.dll
10/06/2009 21:40 59,342 normidna.nlp
10/06/2009 21:40 45,794 normnfc.nlp
10/06/2009 21:40 39,284 normnfd.nlp
10/06/2009 21:40 66,384 normnfkc.nlp
10/06/2009 21:40 60,294 normnfkd.nlp
10/06/2009 21:40 83,748 prc.nlp
10/06/2009 21:40 83,748 prcp.nlp
10/06/2009 21:40 262,148 sortkey.nlp
10/06/2009 21:40 20,320 sorttbls.nlp
10/06/2009 21:40 28,288 xjis.nlp
14 File(s) 5,582,046 bytes
File3.txt
Directory of C:\Windows\AppPatch\en-US
16/04/2011 03:24 <DIR> .
16/04/2011 03:24 <DIR> ..
20/11/2010 13:02 292,352 AcRes.dll.mui
1 File(s) 292,352 bytes
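A possible explanation for the behaviour described above, offered as an assumption rather than confirmed operator behaviour: if Extract Information keeps only one match per query and document, a file holding several directory blocks would yield just the first block's values. The Python sketch below mimics that difference with `re.search` versus `re.findall`; the variable names and the shortened listing are illustrative only.

```python
import re

# Two directory blocks in one string, shortened from the sample listing.
listing = (
    " Directory of C:\\Windows\\addins\n"
    "10/06/2009 22:20 802 FXSEXT.ecf\n"
    " 1 File(s) 802 bytes\n"
    " Directory of C:\\Windows\\assembly\n"
    "04/01/2012 04:34 4,567,040 mscorlib.dll\n"
    " 14 File(s) 5,582,046 bytes\n"
)

files_pattern = re.compile(r"([0-9]+) File\(s\)")

# A single search over the whole text stops at the first hit.
first_only = files_pattern.search(listing).group(1)

# The text actually contains one hit per directory block.
all_hits = files_pattern.findall(listing)

print(first_only)  # 1
print(all_hits)    # ['1', '14']
```

Getting one record per block therefore needs either a "find all" step or a split into one document per block before extracting.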
Hey Carl,
The Cut Document operator can probably give you the final push you need to accomplish your task.
Best, Marius
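For readers unfamiliar with the idea behind this hint, here is a minimal Python analogy of cutting one document into regions. It is not the operator's implementation: the region pattern (from a "Directory of" header to the next "File(s) ... bytes" summary) and all names are assumptions chosen for the sample listing in this thread.

```python
import re

# Shortened version of the multi-directory listing.
listing = (
    " Directory of C:\\Windows\\addins\n"
    "10/06/2009 22:20 802 FXSEXT.ecf\n"
    " 1 File(s) 802 bytes\n"
    " Directory of C:\\Windows\\assembly\n"
    "04/01/2012 04:34 4,567,040 mscorlib.dll\n"
    " 14 File(s) 5,582,046 bytes\n"
)

# Non-greedy match from a header line to the first summary line after it.
# re.DOTALL lets "." cross line breaks inside one region.
region = re.compile(r"Directory of .*?File\(s\)\s+[0-9,]+ bytes", re.DOTALL)

segments = region.findall(listing)

for number, segment in enumerate(segments, start=1):
    print(f"--- segment {number} ---")
    print(segment)
```

Each segment then contains exactly one path, one file count and one size, so a first-match extraction per segment is enough.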
Hi Marius,
Thanks for the hint. I have managed to split the main file into chunks, and for each chunk I can get the three fields I need. However, the output is an IOObjectCollection containing documents.
Any advice on how to convert/extract the values Path, Files and Size into a nice table?
regards,
Carl
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<parameter key="logverbosity" value="status"/>
<process expanded="true" height="386" width="882">
<operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="84" y="39">
<parameter key="file" value="D:\RapidMiner\New folder\import.txt"/>
<parameter key="extract_text_only" value="false"/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="246" y="75">
<parameter key="query_type" value="Regular Region"/>
<list key="string_machting_queries"/>
<parameter key="attribute_type" value="Binominal"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries">
<parameter key="Directory" value=" Directory of [A-Z]:\\\\.([0-9]*) bytes"/>
</list>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<process expanded="true" height="750" width="1022">
<operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information (3)" width="90" x="241" y="81">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="Path" value="Directory of ([A-Za-z0-9:\\-]*)"/>
<parameter key="Files" value="([0-9]*) File\(s\)"/>
<parameter key="Size" value="([0-9,]*) bytes"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="segment" to_op="Extract Information (3)" to_port="document"/>
<connect from_op="Extract Information (3)" from_port="document" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Hi Carl,
Try moving the Extract Information operator into a Process Documents operator of its own, as in the process below.
Best,
Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.009">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.009" expanded="true" name="Process">
<parameter key="logverbosity" value="status"/>
<process expanded="true" height="403" width="413">
<operator activated="false" class="text:read_document" compatibility="5.2.005" expanded="true" height="60" name="Read Document" width="90" x="45" y="165">
<parameter key="file" value="D:\RapidMiner\New folder\import.txt"/>
<parameter key="extract_text_only" value="false"/>
</operator>
<operator activated="true" class="text:create_document" compatibility="5.2.005" expanded="true" height="60" name="Create Document" width="90" x="14" y="32">
<parameter key="text" value=" Directory of C:\Windows\addins 14/07/2009 06:32 &lt;DIR&gt; . 14/07/2009 06:32 &lt;DIR&gt; .. 10/06/2009 22:20 802 FXSEXT.ecf 1 File(s) 802 bytes Directory of C:\Windows\assembly 12/05/2012 15:24 &lt;DIR&gt; . 12/05/2012 15:24 &lt;DIR&gt; .. 10/06/2009 21:39 66,728 big5.nlp 10/06/2009 21:39 82,172 bopomofo.nlp 10/06/2009 21:39 116,756 ksc.nlp 04/01/2012 04:34 4,567,040 mscorlib.dll 10/06/2009 21:40 59,342 normidna.nlp 10/06/2009 21:40 45,794 normnfc.nlp 10/06/2009 21:40 39,284 normnfd.nlp 10/06/2009 21:40 66,384 normnfkc.nlp 10/06/2009 21:40 60,294 normnfkd.nlp 10/06/2009 21:40 83,748 prc.nlp 10/06/2009 21:40 83,748 prcp.nlp 10/06/2009 21:40 262,148 sortkey.nlp 10/06/2009 21:40 20,320 sorttbls.nlp 10/06/2009 21:40 28,288 xjis.nlp 14 File(s) 5,582,046 bytes Directory of C:\Windows\AppPatch\en-US 16/04/2011 03:24 &lt;DIR&gt; . 16/04/2011 03:24 &lt;DIR&gt; .. 20/11/2010 13:02 292,352 AcRes.dll.mui 1 File(s) 292,352 bytes"/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="5.2.005" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
<parameter key="query_type" value="Regular Region"/>
<list key="string_machting_queries"/>
<parameter key="attribute_type" value="Binominal"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries">
<parameter key="Directory" value=" Directory of [A-Z]:\\\\.([0-9]*) bytes"/>
</list>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<process expanded="true" height="403" width="299">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.2.005" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
<process expanded="true" height="421" width="778">
<operator activated="true" class="text:extract_information" compatibility="5.2.005" expanded="true" height="60" name="Extract Information (3)" width="90" x="246" y="30">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="Path" value="Directory of ([A-Za-z0-9:\\-]*)"/>
<parameter key="Files" value="([0-9]*) File\(s\)"/>
<parameter key="Size" value="([0-9,]*) bytes"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information (3)" to_port="document"/>
<connect from_op="Extract Information (3)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
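As a cross-check outside RapidMiner, the same two-step idea (cut the text into one segment per directory, then extract three fields per segment into one row) can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `extract_rows`, the segment pattern, and the choice to take the rest of the header line as the path are all hypothetical, and the listing is shortened from the sample in this thread.

```python
import re

# Sample input, shortened from the listing in the thread.
LISTING = r"""
 Directory of C:\Windows\addins
14/07/2009 06:32 <DIR> .
14/07/2009 06:32 <DIR> ..
10/06/2009 22:20 802 FXSEXT.ecf
 1 File(s) 802 bytes
 Directory of C:\Windows\assembly
04/01/2012 04:34 4,567,040 mscorlib.dll
 14 File(s) 5,582,046 bytes
 Directory of C:\Windows\AppPatch\en-US
20/11/2010 13:02 292,352 AcRes.dll.mui
 1 File(s) 292,352 bytes
"""

# Step 1 (the "cut" step): one segment per directory block.
SEGMENT = re.compile(r"Directory of .*?File\(s\)\s+[0-9,]+ bytes", re.DOTALL)

# Step 2 (the "extract" step): one pattern per output column.
# Assumption: the path is the rest of the header line, so hyphens are kept.
FIELDS = {
    "Path": re.compile(r"Directory of (.+)"),
    "Files": re.compile(r"([0-9]+) File\(s\)"),
    "Size": re.compile(r"([0-9,]+) bytes"),
}


def extract_rows(text):
    """Return one dict per directory block with Path, Files and Size."""
    rows = []
    for segment in SEGMENT.findall(text):
        # Every segment matched SEGMENT, so each field pattern finds a hit.
        row = {
            name: pattern.search(segment).group(1).strip()
            for name, pattern in FIELDS.items()
        }
        rows.append(row)
    return rows


rows = extract_rows(LISTING)
for row in rows:
    print(f'{row["Path"]} | {row["Files"]} | {row["Size"]} |')
```

Each segment becomes one row, which is the table shape the original question asked for. The sizes stay as strings with thousands separators; converting them to numbers would be a separate step.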
All I can say is thanks, and solved!
Carl