Monday, December 6, 2010

PHP - Managing Memory When Processing Large Files

I found out something interesting about PHP while processing huge files: the garbage collector doesn't always work the way it is intended to, and the common tricks don't always work either.

What's even more frustrating is that the common method of reading a file line by line caused huge memory leaks for me.
Here are my findings and solutions (feel free to correct me if I'm wrong, even though it worked for me):

Common method fopen+fgets fails:
Using fopen() with fgets() to read a file of almost a million lines, line by line, caused a crazy memory leak. It took only 10 seconds before the script consumed pretty much 100% of the system memory and went into swap.
   My solution:
   Shelling out to "head -N {file} | tail -1000" to pull one 1000-line chunk at a time (N is the last line of the chunk) is much less memory intensive. The best chunk size varies depending on the system speed. I had it set to 2000 and it was running very smoothly.
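
The chunking idea above can be sketched as a pair of small helpers. This is only a rough sketch, not the exact code I ran: the names chunk_command() and read_chunk() are made up for illustration, and it assumes a Unix-like system where head and tail are on the PATH.

```php
<?php
// Hypothetical helper: build the shell command that extracts one chunk.
// "head -n $lastLine" keeps lines 1..$lastLine, and "tail -n $chunkSize"
// keeps the last $chunkSize of those.
function chunk_command(string $file, int $lastLine, int $chunkSize): string
{
    return sprintf(
        'head -n %d %s | tail -n %d',
        $lastLine,
        escapeshellarg($file), // quote the file name for the shell
        $chunkSize
    );
}

// Hypothetical helper: run the command and return the chunk as an array of lines.
function read_chunk(string $file, int $lastLine, int $chunkSize): array
{
    $out = shell_exec(chunk_command($file, $lastLine, $chunkSize));
    if (!is_string($out) || $out === '') {
        return []; // command failed or produced no output
    }
    // Drop the trailing newline so explode() does not add an empty last element.
    return explode("\n", rtrim($out, "\n"));
}
```

Calling read_chunk('blah.xml', 4000, 2000) would return lines 2001 to 4000. Note that head has to re-read the file from the top for every chunk, so this trades CPU and disk time for memory.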

Garbage Collector fails:
PHP's garbage collector fails to clean up memory after each loop iteration, even if unset() is used (or the variables are set to null). The memory just keeps piling up. Unfortunately gc_collect_cycles(), which forces a garbage collection cycle to run, is only available in PHP 5.3 and later.
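
If you are on PHP 5.3 or later, a forced collection looks roughly like the sketch below. The helpers make_cycle() and collect_if_available() are made-up names for illustration; the circular references are only there to give the collector something that unset() alone cannot free.

```php
<?php
// Hypothetical helper: creates two objects that point at each other,
// so reference counting alone cannot free them when the function returns.
function make_cycle()
{
    $a = new stdClass();
    $b = new stdClass();
    $a->other = $b;
    $b->other = $a;
}

// Hypothetical helper: force a collection run where the function exists.
// Returns the number reported by gc_collect_cycles(), or null on older PHP.
function collect_if_available()
{
    if (!function_exists('gc_collect_cycles')) {
        return null;
    }
    return gc_collect_cycles();
}

if (function_exists('gc_enable')) {
    gc_enable(); // make sure the cycle collector is switched on
}

for ($i = 0; $i < 100; $i++) {
    make_cycle();
}

$before    = memory_get_usage();
$collected = collect_if_available();
$after     = memory_get_usage();

printf("collected: %s, memory before: %d, after: %d\n",
    var_export($collected, true), $before, $after);
```

Calling this once per loop iteration is one way to try it out; whether it helps depends on whether the leaked memory is actually held in reference cycles.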

Example Code:
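// Walk the file in 2000-line chunks: head keeps the first $i lines, tail keeps the last 2000 of those
for ($i=2000; $i<=1000000; $i+=2000) {
    $data = explode("\n", shell_exec("head -$i blah.xml | tail -2000"));
    //parse using simplexml
    unset($data);
}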

My Solution:
You can FORCE the cleanup by wrapping the processing in a function. PHP does clean up memory after each function call, because the function's local variables are released when it returns. So if the above code example is rewritten this way, memory usage will happily hover around 0.5% constantly.

Example Code:
for ($i=2000; $i<=1000000; $i+=2000) {
    $data = shell_exec("head -$i blah.xml | tail -2000");
    process($data);
    unset($data);
}

function process($data) {
    $data = explode("\n", $data);
    //parse using simplexml
    unset($data);
}
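
To check whether memory really stays flat, you can log memory_get_usage() after each chunk. The sketch below is self-contained and uses made-up names (process_lines(), memory_after_chunks()) plus generated dummy data instead of a real file, so treat it as a way to measure, not as my original script.

```php
<?php
// Hypothetical stand-in for the parsing step: all work happens on local
// variables, which are released when the function returns.
function process_lines($raw)
{
    $lines = explode("\n", $raw);
    return count($lines);
}

// Process a number of generated chunks and record memory use after each one.
function memory_after_chunks($chunks, $linesPerChunk)
{
    $readings = array();
    for ($i = 0; $i < $chunks; $i++) {
        // Dummy data standing in for one chunk of the real file.
        $raw = str_repeat("<row>value</row>\n", $linesPerChunk);
        process_lines($raw);
        unset($raw);
        $readings[] = memory_get_usage();
    }
    return $readings;
}

$readings = memory_after_chunks(5, 2000);
foreach ($readings as $n => $bytes) {
    printf("after chunk %d: %d bytes\n", $n + 1, $bytes);
}
```

If the numbers keep climbing from chunk to chunk, something is still holding on to the data; if they stay roughly level, the function wrapper is doing its job.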

1 comment:

  1. This is quite interesting. I haven't found myself needing to read files that big, but I have asked myself how efficient fread() would be. I've been using file_get_contents() lately to process the result faster in a single line of code (mostly JSON formatted files). I'm also a daily WoW player (indeed I got to your blog from wowhead). Too bad you quit writing a year ago, you've got very interesting material here. I hope you read my comment so you know there are still some people like me hunting for this kind of info.

    Hope you're all right and have found a way to harmonize both WoW and your family time.

    Greetings from Chile
    Sebastian McFindling

    PS: Look for my name on Google so you can find my blog. I'm not posting the URL so your blog won't block me.
