How to work with a very large .csv file?
Hi folks, it looks like there have been a similar post or two in the past, but they are years old at this point, so I thought it would be helpful to refresh...
I need to load a huge .csv file (4.72GB, ~23MM lines) and break it up into smaller .csv files according to one polynominal attribute. This is public State of Texas data, so the attribute by which I want to split into smaller data sets is "County", and I want those new .csv files to be kicked out onto my local disk.
What's the best, most computationally efficient way to do this?
Thanks!!
cc @sgenzer
Answers
-
Hi @ncjanes,
This is an interesting topic. You have several options; here are some ideas off the top of my head:
- shell script: split the file by the value of a2. Please refer to this post for examples (a rough Python equivalent is sketched after this list):
while IFS=, read -r a1 a2 a3 a4; do
  echo "$a1,$a2,$a3,$a4" >> "$a2".csv
done < file.csv
- RDB (relational database): load the csv data into a MySQL or PostgreSQL table and run an in-database SQL query to select the subsets
- scale up: if you have access to an environment with lots of RAM, you can run a RapidMiner process that loads the big csv data and splits the tabular data with (a) the "Loop Values" operator (enable parallel execution to make it run faster); inside the loop, use the iteration macro with "Filter Examples" to find the subset and then "Write CSV" with a macro-defined file name to store the individual small csv files, or (b) the "Group Into Collection" operator with "Loop Collection", writing each example set from the collection to a separate csv inside the loop
- scale out: HDFS or RDD are popular solutions nowadays for really huge data (though 4.72GB is not that big)
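If shell is not your thing, a minimal pandas sketch of the same split is below; it reads the file in chunks so memory use stays flat. The column name "County", the chunk size and the file names are just placeholders here:
import os
import pandas as pd

# read the big csv in chunks and append each county's rows to its own file
for chunk in pd.read_csv("file.csv", chunksize=100_000):
    for county, rows in chunk.groupby("County"):
        out = f"{county}.csv"
        # write the header only when the county file is first created
        rows.to_csv(out, mode="a", index=False, header=not os.path.exists(out))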
HTH!
YY
-
Hi,
Just my 5 cents... I would say the most computationally efficient way is the second one suggested by YY.
You can actually break it down into two parts:
1) Use a simple Python (or whatever else) script to read the arbitrarily large file in chunks and then save them into a SQL table; see a Python example here: http://odo.pydata.org/en/latest/perf.html
2) Then make SQL queries on subsets, using any SQL related tool, or RapidMiner.
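For illustration, a minimal sketch of both steps with pandas and SQLite; the table name "texas", the "County" column and the file names are only placeholders:
import sqlite3
import pandas as pd

con = sqlite3.connect("texas.db")

# 1) load the huge csv into a SQL table chunk by chunk
for chunk in pd.read_csv("file.csv", chunksize=100_000):
    chunk.to_sql("texas", con, if_exists="append", index=False)

# 2) query any subset later and write it out as its own csv
subset = pd.read_sql_query("SELECT * FROM texas WHERE County = ?", con, params=("Travis",))
subset.to_csv("Travis.csv", index=False)
con.close()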
This approach also lets you easily keep the data available for any future processing, including using it within different RM processes.
-
Hi,
this reminds me of an operator I wanted to implement: Stream Lines
Basically, it reads a file line by line and gives it to you as a document. You could then filter it and move forward.
Now I just need to find the time to implement this.
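Outside of Studio the idea is basically this (a toy Python sketch; the file name and the column position are made up):
# read one line at a time, filter it, and pass it on without loading the whole file
def stream_lines(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

for line in stream_lines("file.csv"):
    if line.split(",")[1] == "Travis":  # keep only one county
        print(line)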
@kypexin : Do you think this is useful?
~Martin
-
Hi Martin,
Well, some time ago I was asking more or less the same question about processing long csv: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Partial-retrieve-of-example-set/m-p/42320
So that kind of operator you describe would be really helpful in my opinion. At least it can give an alternative option for handling such data.
-
Hi @kypexin,
well, doing something like this on a repository item is harder, since these are either DB tables or serialized Java objects. Doing it on flat files is easier.
~Martin
-
But this is still a good alternative solution for the problem, so I think I would use such an operator!
-
I think it would be a very useful operator to add! I still miss the similar "Stream DB" which used to exist but was deprecated for some unknown reason back in v 7 somewhere...
-
The most elegant and RapidMiner-like solution would be a two-part operator like Handle Exception.
In the left part you could put an operator returning an example set (like Read Database, Read CSV, Read Hive). The output would be processed by the right part in batches of selectable size.
The input operators could be updated to know about the batch size (internal parameter?) and would adhere to it, saving memory in the process. Operators that aren't yet updated would just return everything, so this would be like Loop Batches with the data source outside.
Balázs
-
That would require that the user specifies the batching manually on the left-hand side?
If yes, you can do this with a normal Loop operator?
Best,
Martin
-
As long as the input operators don't have the optional batch size, this is just an elegant frontend for our existing loops or loop batches.
My idea is more futuristic.