A program to recognize and reward our most engaged community members
@sgenzer So Martin posted a comment in the related product idea that I think showed me the solution. You can use the "Edit Annotations" parameter to manually specify that certain lines are to be treated as comments (even if they don't have the comment character, which is where I was getting hung up before). See the following:
That effectively skips those lines on import and it starts reading on the specified line, and interprets the correct number of columns! And at that point, the extra work with the macros to rename the attributes isn't needed either, since you can specify the relevant line to be read as names--but I'm sure going to save that process anyways because it could come in handy in other more complex data files in the future. Thanks again!
josef.frank wrote:Hi,when importing csv files, is there any method to skip lines in the beginning of a file?Something that behaves like R's "read.table"-option "skip".Quite some time I'd like to read in csv files that have a structure like windows' ini-files:[Header]some garbage heresome non-relevant info here[Data]Var1,Var2,Var3,...0,1,0,.......I'd like to start reading the file in the line "Var1,Var2,.."Just can't find any option to get this done :-(BestJosef
<process version="5.1.006"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process"> <process expanded="true" height="370" width="145"> <operator activated="true" class="read_csv" compatibility="5.1.006" expanded="true" height="60" name="Read CSV" width="90" x="45" y="210"> <parameter key="csv_file" value="C:\Users\jedward\Desktop\RapidiCSVTest.csv"/> <parameter key="column_separators" value=","/> <parameter key="skip_comments" value="true"/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Comment"/> <parameter key="1" value="Name"/> <parameter key="2" value="Comment"/> <parameter key="3" value="Comment"/> <parameter key="4" value="Comment"/> <parameter key="5" value="Comment"/> </list> <list key="data_set_meta_data_information"/> </operator> <connect from_op="Read CSV" from_port="output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator></process>
JEdward wrote:Hi JEdwardThe below process worked fine for me. What have I missed?
none1,none2,none3nothingvar1,var2,var3,var4,var51,2,3,4,52,3,4,5,63,4,5,6,7
<?xml version="1.0" encoding="UTF-8" standalone="no"?><process version="5.1.006"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process"> <process expanded="true" height="370" width="279"> <operator activated="true" class="read_csv" compatibility="5.1.006" expanded="true" height="60" name="Read CSV" width="90" x="179" y="165"> <parameter key="csv_file" value="C:\Users\jedward\Desktop\TestFile.csv"/> <parameter key="column_separators" value=","/> <parameter key="skip_comments" value="true"/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Comment"/> <parameter key="1" value="Comment"/> <parameter key="2" value="Name"/> </list> <list key="data_set_meta_data_information"> <parameter key="0" value="var1.true.integer.attribute"/> <parameter key="1" value="var2.true.integer.attribute"/> <parameter key="2" value="var3.true.integer.attribute"/> <parameter key="3" value="var4.true.integer.attribute"/> <parameter key="4" value="var5.true.integer.attribute"/> </list> </operator> <connect from_op="Read CSV" from_port="output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator></process>
darn you, @Telcontar120 - now that you've thrown down the gauntlet I am going to tackle this. You know my weakness for bizarre ETL puzzles. Stay tuned.
Scott
how's this? test2.csv is attached
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process"> <process expanded="true"> <operator activated="true" class="set_macro" compatibility="7.6.001" expanded="true" height="68" name="Set Macro" width="90" x="45" y="34"> <parameter key="macro" value="skip"/> <parameter key="value" value="4"/> <description align="center" color="transparent" colored="false" width="126">set n rows to %{skip}</description> </operator> <operator activated="true" class="read_csv" compatibility="7.6.001" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34"> <parameter key="csv_file" value="/Users/GenzerConsulting/Desktop/test2.csv"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"/> <list key="data_set_meta_data_information"/> </operator> <operator activated="true" class="extract_macro" compatibility="7.6.001" expanded="true" height="68" name="Extract Macro" width="90" x="313" y="34"> <parameter key="macro" value="NEx"/> <list key="additional_macros"/> <description align="center" color="transparent" colored="false" width="126">get number of examples %{NEx}</description> </operator> <operator activated="true" class="generate_macro" compatibility="7.6.001" expanded="true" height="82" name="Generate Macro" width="90" x="447" y="34"> <list key="function_descriptions"> <parameter key="startRow" value="1+eval(%{skip})"/> </list> <description align="center" color="transparent" colored="false" width="126">%{startRow} = skip + 1</description> </operator> <operator activated="true" class="filter_example_range" compatibility="7.6.001" expanded="true" height="82" name="Filter Example Range" width="90" x="581" y="34"> <parameter key="first_example" value="%{startRow}"/> <parameter key="last_example" value="%{NEx}"/> <description align="center" color="transparent" colored="false" width="126">filter startRow to NEx</description> </operator> <operator activated="true" class="rename_by_example_values" compatibility="7.6.001" expanded="true" height="82" name="Rename by Example Values" width="90" x="715" y="34"/> <connect from_op="Read CSV" from_port="output" to_op="Extract Macro" to_port="example set"/> <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/> <connect from_op="Generate Macro" from_port="through 1" to_op="Filter Example Range" to_port="example set input"/> <connect from_op="Filter Example Range" from_port="example set output" to_op="Rename by Example Values" to_port="example set input"/> <connect from_op="Rename by Example Values" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator></process>
Brilliant, I really appreciate you working on this @sgenzer! Very creative use of macros to do this. This approach is promising, but I think it only works if there are enough delimiters in the first line so it reads the correct number of columns. If you modify the csv so the first line has only text (unfortunately more typical with title rows), then it only reads one column in, and the rest of the process works but only on that first column. And after a bit of testing it seems like it will generally read however many columns as it detects from the first row. So if you can figure out how to tell RapidMiner how many columns to read (not based on the first row of data), then something like this should do the trick.
Happy Thanksgiving!
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process"> <process expanded="true"> <operator activated="false" class="set_macro" compatibility="7.6.001" expanded="true" height="68" name="Set Macro" width="90" x="45" y="187"> <parameter key="macro" value="skip"/> <parameter key="value" value="4"/> <description align="center" color="transparent" colored="false" width="126">set n rows to %{skip}</description> </operator> <operator activated="true" class="read_csv" compatibility="7.6.001" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34"> <parameter key="csv_file" value="/Users/genzerconsulting/Desktop/test2 mod.csv"/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"/> <list key="data_set_meta_data_information"/> </operator> <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34"> <list key="filters_list"> <parameter key="filters_entry_key" value="att1.contains.,"/> </list> </operator> <operator activated="true" class="split" compatibility="7.6.001" expanded="true" height="82" name="Split" width="90" x="447" y="34"/> <operator activated="false" class="extract_macro" compatibility="7.6.001" expanded="true" height="68" name="Extract Macro" width="90" x="581" y="187"> <parameter key="macro" value="NEx"/> <list key="additional_macros"/> <description align="center" color="transparent" colored="false" width="126">get number of examples %{NEx}</description> </operator> <operator activated="false" class="generate_macro" compatibility="7.6.001" expanded="true" height="82" name="Generate Macro" width="90" x="715" y="187"> <list key="function_descriptions"> <parameter key="startRow" value="1+eval(%{skip})"/> </list> <description align="center" color="transparent" colored="false" width="126">%{startRow} = skip + 1</description> </operator> <operator activated="false" class="filter_example_range" compatibility="7.6.001" expanded="true" height="82" name="Filter Example Range" width="90" x="849" y="187"> <parameter key="first_example" value="%{startRow}"/> <parameter key="last_example" value="%{NEx}"/> <description align="center" color="transparent" colored="false" width="126">filter startRow to NEx</description> </operator> <operator activated="true" class="rename_by_example_values" compatibility="7.6.001" expanded="true" height="82" name="Rename by Example Values" width="90" x="983" y="34"/> <connect from_op="Read CSV" from_port="output" to_op="Filter Examples" to_port="example set input"/> <connect from_op="Filter Examples" from_port="example set output" to_op="Split" to_port="example set input"/> <connect from_op="Split" from_port="example set output" to_op="Rename by Example Values" to_port="example set input"/> <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/> <connect from_op="Generate Macro" from_port="through 1" to_op="Filter Example Range" to_port="example set input"/> <connect from_op="Rename by Example Values" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator></process>
oh darn - of course I work on the next one and then see your note. Leave it to @mschmitz to one-up me. Perfect solution.
Happy Thanksgiving.
I don't see how the macros can manage a blank row in the csv file. Looking to skip the first 2 lines (first is comment, second is blank) without editing the csv file. I didn't have luck with blank row via annotations, but perhaps I overlooked a setting. Any help with skipping blank lines between comment and column names?
Hi @matt_yauch,
you can manually say in the operator parameters to skip first x lines and just read it. RM will then automatically detect the att names etc.
BR,
Martin