"Extract information from weblog (how to handle 31 text files totalling 3GB)"

Question

Hi all,

I want to extract the IP and agent information from 31 zipped log files (around 320MB in total).
My steps are as follows:
1) Unzip to about 3GB of text files (it seems zipped files cannot be read by RapidMiner?)
2) Use the Read Server Log process (it works fine for a few files only;
   it seems the process reads all files into RAM, and 3GB of text cannot be handled well)
3) Process: store to repository
4) Process: aggregate
5) Process: export to CSV

Can anyone give me some tips, please?

makchishing · Answer

THW Mark  wrote:
I'm not too familiar with the inner workings of RapidMiner, but I assume that if you read a 3GB log file, it will occupy at least that amount of memory. Given that you are doing some manipulations/selections on the data, I would guess that you need more than 3GB of RAM available, more like 6GB. If you don't have 6GB of RAM, you can split up the processing by splitting your log files.

Save the extracted data into the repository (part1, part2, ... partn), and when you are done, combine the repository files to make the final entry containing your extracted data.

Thanks Mark.

Actually, I want to do something very simple: read through the 3GB of text, then extract and aggregate some substrings from it.
If RapidMiner has to run process by process in RAM (I mean, it must read all the text into RAM first),
I would rather load the 3GB of text into a database first and do the aggregation myself.
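
For what it's worth, the "read through, extract, aggregate" job described here does not need the whole text in RAM. Below is a minimal sketch outside RapidMiner, in plain Python, that streams each file line by line and keeps only the running counts. It assumes the logs are in the Apache/NCSA "combined" format and that the compressed files are gzip (.gz); neither is stated in the thread, and the function names (open_log, count_ip_agent, write_csv) are made up for illustration.

```python
import csv
import gzip
import re
from collections import Counter

# Assumption: Apache/NCSA "combined" log layout, i.e.
# ip ident user [time] "request" status size "referer" "agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \S+ \S+ "[^"]*" "([^"]*)"'
)


def open_log(path):
    """Open a plain or gzip-compressed log file as text (no unzip step)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def count_ip_agent(paths):
    """Stream the files line by line and count (ip, agent) pairs.

    Only the counter is kept in memory, never the raw log text.
    Lines that do not match the assumed format are skipped.
    """
    counts = Counter()
    for path in paths:
        with open_log(path) as handle:
            for line in handle:
                match = LINE_RE.match(line)
                if match:
                    counts[match.groups()] += 1
    return counts


def write_csv(counts, out_path):
    """Export the aggregated counts, most frequent pair first."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["ip", "agent", "hits"])
        for (ip, agent), hits in counts.most_common():
            writer.writerow([ip, agent, hits])
```

Something like write_csv(count_ip_agent(list_of_31_paths), "ip_agent.csv") would then cover the extract, aggregate and export steps in one pass. Memory use grows with the number of distinct IP/agent pairs, not with the size of the logs.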

THW_Mark · Answer

I'm not too familiar with the inner workings of RapidMiner, but I assume that if you read a 3GB log file, it will occupy at least that amount of memory. Given that you are doing some manipulations/selections on the data, I would guess that you need more than 3GB of RAM available, more like 6GB. If you don't have 6GB of RAM, you can split up the processing by splitting your log files.

Save the extracted data into the repository (part1, part2, ... partn), and when you are done, combine the repository files to make the final entry containing your extracted data.
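
To illustrate the split-and-combine idea outside RapidMiner, here is a rough Python sketch. It splits one large text log into numbered part files, and separately shows how per-part aggregates saved as small CSV files can be summed into a final result. The helper names (split_log, save_part, combine_parts), the part file naming, and the three-column CSV layout (ip, agent, hits) are assumptions made for this example, not anything RapidMiner prescribes.

```python
import csv
from collections import Counter


def split_log(path, lines_per_part, prefix):
    """Split one large text log into numbered part files; return their paths."""
    part_paths = []
    out = None
    with open(path, "r", encoding="utf-8", errors="replace") as src:
        for i, line in enumerate(src):
            if i % lines_per_part == 0:
                # Start a new part every `lines_per_part` lines.
                if out:
                    out.close()
                part_path = f"{prefix}{len(part_paths) + 1}.log"
                part_paths.append(part_path)
                out = open(part_path, "w", encoding="utf-8")
            out.write(line)
    if out:
        out.close()
    return part_paths


def save_part(counts, part_path):
    """Write one part's (ip, agent) -> hits mapping to a CSV file."""
    with open(part_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for (ip, agent), hits in counts.items():
            writer.writerow([ip, agent, hits])


def combine_parts(part_paths):
    """Sum the per-part counts into one final Counter."""
    total = Counter()
    for part_path in part_paths:
        with open(part_path, newline="", encoding="utf-8") as handle:
            for ip, agent, hits in csv.reader(handle):
                total[(ip, agent)] += int(hits)
    return total
```

Because counts are additive, the parts can be processed one at a time (or on different machines) and combined in any order; only one part's data needs to fit in memory at once.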