Monday, December 6, 2010

PHP - Managing Memory When Processing Large Files

Something interesting I found out about PHP while processing huge files: the garbage collector doesn't always work the way it's intended to, and the common tricks don't always work either.

What's even more frustrating is that the common method of reading a file line by line causes huge memory leaks.
Here are my findings and solutions (feel free to correct me if I'm wrong, even though this worked for me):

Common method fopen()+fgets() fails:
Using fopen() with fgets() to read line by line on a file that contains almost a million lines causes a crazy memory leak (a sketch of the pattern is shown below). It takes only about 10 seconds before it consumes pretty much 100% of the system memory and the machine goes into swap.
   My solution:
   Shelling out to "head -1000 {file} | tail -1000" to read the file in chunks is much less memory intensive. The exact number of lines to process per chunk varies depending on the system speed; I had it set to 2000 and it ran very smoothly.

Garbage Collector fails:
PHP's garbage collector fails to clean up memory after each loop iteration even if unset() is used (or the variables are set to null). The memory just keeps piling up. Unfortunately gc_collect_cycles(), which forces a garbage collection cycle to run, is only available in the PHP 5.3 branch.

Example Code:
for ($i=2000; $i<=1000000; $i+=2000) {
    //grab the next 2000-line chunk of the file
    $data = explode("\n", shell_exec("head -$i blah.xml | tail -2000"));
    //parse using simplexml
    unset($data); //memory still keeps piling up despite this
}
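
For completeness: on the PHP 5.3 branch you could try forcing a collection cycle inside the loop instead; a minimal sketch (untested on my end, since I wasn't on 5.3):

for ($i=2000; $i<=1000000; $i+=2000) {
    $data = explode("\n", shell_exec("head -$i blah.xml | tail -2000"));
    //parse using simplexml
    unset($data);
    gc_collect_cycles(); //force a GC run - requires PHP 5.3+
}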

My Solution
You can FORCE the cleanup to happen by wrapping the processing in a function. PHP does clean up memory after each function call. So if the above code example is re-written as below, memory will happily hover around 0.5% the whole time.

Example Code:
for ($i=2000; $i<=1000000; $i+=2000) {
    $data = shell_exec("head -$i blah.xml | tail -2000");
    process($data); //each chunk is handled inside a function call
    unset($data);
}

function process($data) {
    $data = explode("\n", $data);
    //parse using simplexml
    unset($data); //local variables are freed when the function returns
}
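
If you want to verify this on your own setup, PHP's built-in memory_get_usage() can be dropped into the loop; a small sketch (reuses the placeholder file name and the process() function from above):

for ($i=2000; $i<=1000000; $i+=2000) {
    $data = shell_exec("head -$i blah.xml | tail -2000");
    process($data);
    unset($data);
    //print current/peak usage every 100 chunks to confirm memory stays flat
    if (($i / 2000) % 100 == 0) {
        echo memory_get_usage(true) . " / " . memory_get_peak_usage(true) . " bytes\n";
    }
}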

Comments:

  1. This is quite interesting. I haven't found myself in need of reading files that big, but I have asked myself how efficient fread() would be. I've been using file_get_contents() lately to process the result faster in a single line of code (mostly JSON formatted files). I'm also a daily WoW player (indeed I got to your blog from wowhead). Too bad you quit writing a year ago, you have very interesting material here. I hope you read my comment so you know there are still some people like me haunting this kind of info.

    Hope you're alright and have found a way to harmonize both WoW and your family time.

    Greetings from Chile
    Sebastian McFindling

    PS: Look for my name in Google so you can find my blog. I'm not posting the url so your blog won't block me.
