Reading data using field name
I am reading a file into RapidMiner (RM) that has no header row; instead, each field has its name included in the field value.
So where a typical CSV file would be:
ice_cream,chocolate,candy
1,4,5
6,4,2
My files look like:
"ice_cream"="1","chocolate"="4","candy"="5"
"ice_cream"="6","chocolate"="4","candy"="2"
Various other data mining programs offer a "retain name" function for this. How does one deal with it inside RapidMiner?
The problem I face is that these files are large: reading them in with the field names still attached and then stripping the names with an operator uses more than the available system memory.
On Linux I would use the stream editor (sed) and run:
sed 's/"ice_cream"="/"/g'
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="productivity:execute_program" compatibility="9.0.002" expanded="true" height="103" name="Execute Program" width="90" x="246" y="136">
<parameter key="command" value="sed 's/&quot;ice_cream&quot;=&quot;/&quot;/g'"/>
<parameter key="working_directory" value="/Users/robinmeisel/sweets/sweets.flatfile.1"/>
<list key="env_variables"/>
</operator>
<connect from_op="Execute Program" from_port="out" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
But this is a Windows machine I am working on.
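Where sed is not available, the same substitution can be sketched in Python, which also runs on Windows. This is only an illustration, not a RapidMiner feature: the names `NAME_PREFIX`, `strip_field_names` and `convert_file` are made up here, the pattern is generalised from the single-field sed expression to any quoted field name, and it assumes that neither names nor values contain a double quote. The file is processed one line at a time, so memory use should not grow with file size.

```python
import re

# Matches a quoted field name, the '=' sign, and the opening quote of the
# value, e.g. "ice_cream"=" (assumes names contain no double quotes).
NAME_PREFIX = re.compile(r'"[^"]*"="')


def strip_field_names(line: str) -> str:
    """Replace every "name"=" prefix with a single opening quote."""
    return NAME_PREFIX.sub('"', line)


def convert_file(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path line by line, stripping field names."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_field_names(line))
```

The resulting file has no header row, so the column names would still have to be set when the file is read.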
Hi @robin!
If the data is too big to fit into memory in one go, you could try a more "manual" approach. As I described in this thread, you can use the text extension to split the CSV files into lines and the lines into separate values. It should then also be possible to modify each cell value before it is put into an example set.
Cheers
Jan
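The idea of splitting the file into lines and each line into separate values can be illustrated outside RapidMiner with a short Python sketch. It is not the text extension workflow itself: `PAIR`, `parse_line` and `to_csv` are hypothetical names, and the sketch assumes that every field has the form "name"="value" with no double quotes inside either part. The header is taken from the names on the first line, and only one line is held in memory at a time.

```python
import csv
import io
import re

# One "name"="value" pair; assumes neither part contains a double quote.
PAIR = re.compile(r'"([^"]*)"="([^"]*)"')


def parse_line(line):
    """Split one line into a list of (name, value) tuples."""
    return PAIR.findall(line)


def to_csv(src, dst):
    """Read name=value lines from src and write a CSV with a header to dst."""
    writer = csv.writer(dst, lineterminator="\n")
    header_written = False
    for line in src:
        pairs = parse_line(line)
        if not pairs:
            continue  # skip blank or unparseable lines
        if not header_written:
            # Use the field names from the first data line as the header row.
            writer.writerow([name for name, _ in pairs])
            header_written = True
        writer.writerow([value for _, value in pairs])


# Usage with the two sample lines from the question.
sample = io.StringIO(
    '"ice_cream"="1","chocolate"="4","candy"="5"\n'
    '"ice_cream"="6","chocolate"="4","candy"="2"\n'
)
out = io.StringIO()
to_csv(sample, out)
print(out.getvalue())
# ice_cream,chocolate,candy
# 1,4,5
# 6,4,2
```

For real files, `src` and `dst` would be file objects opened with `open(...)`; the output file for `csv.writer` should be opened with `newline=""`.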
This format looks very weird. Why is it being used? It produces a ton of overhead in storage.
Anyway, is the ordering of the fields always the same? If yes, you can just read the columns as polynominals and replace the name prefix in each value.
BR,
Martin
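Whether the ordering really is the same on every line can be checked before committing to the read-and-replace route. The following Python sketch is one way to do that; `FIELD_NAME` and `first_order_mismatch` are hypothetical names, and it assumes field names contain no double quotes. It accepts any iterable of lines, including an open file, so the file does not have to fit into memory.

```python
import re

# Captures the name in each "name"=" prefix.
FIELD_NAME = re.compile(r'"([^"]*)"="')


def first_order_mismatch(lines):
    """Return the 1-based number of the first line whose field names differ
    from those on the first non-empty line, or None if all lines agree."""
    expected = None
    for number, line in enumerate(lines, start=1):
        names = FIELD_NAME.findall(line)
        if not names:
            continue  # ignore blank lines
        if expected is None:
            expected = names  # the first data line defines the reference order
        elif names != expected:
            return number
    return None
```

A result of `None` means every line lists the same fields in the same order, in which case a positional read followed by a prefix replacement is safe.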