Monday, December 6, 2010

PHP - Managing Memory When Processing Large Files

I found out something interesting about PHP while processing huge files: the memory garbage collector doesn't always work the way it is intended, and the common tricks don't always work either.

What's even more frustrating is that the common method of reading a file line by line causes huge memory leaks.
Here are my findings and solutions (feel free to correct me if I'm wrong, even though this worked for me):

Common method fopen+fgets fails:
Using fopen() with fgets() to read a file of almost a million lines line by line causes a crazy memory leak. It took only 10 seconds before it consumed pretty much 100% of the system memory and went into swap.
   My solution:
   Using "head -1000 {file} | tail -1000" is much less memory intensive. The exact number of lines to process per chunk varies depending on system speed; I had it set to 2000 and it ran very smoothly.

Garbage Collector fails:
PHP's garbage collector fails to clean up memory after each loop iteration, even if unset() is used (or variables are set to null). The memory just keeps piling up. Unfortunately, gc_collect_cycles(), which forces a garbage collection cycle to run, is only available in the PHP 5.3 branch.

Example Code:
for ($i=2000; $i<=1000000; $i+=2000) {
    $data = explode("\n", shell_exec("head -$i blah.xml | tail -2000"));
    //parse using simplexml
    unset($data);
}
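
For completeness, on PHP 5.3 or later, where gc_collect_cycles() is available, the collector can also be forced explicitly without the function-wrapping trick. A hedged sketch based on the loop above (blah.xml, the 2000-line chunk size, and the total line count are carried over from the example):

```php
<?php
// Hypothetical sketch for PHP >= 5.3: the same chunked loop, but calling
// gc_collect_cycles() explicitly instead of relying on a wrapper function.
$totalLines = 1000000; // assumed file length, as in the example above
for ($i = 2000; $i <= $totalLines; $i += 2000) {
    $data = explode("\n", shell_exec("head -$i blah.xml | tail -2000"));
    // ...parse the chunk using simplexml...
    unset($data);
    gc_collect_cycles(); // force a collection cycle (available since PHP 5.3)
}
```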

My Solution
You can FORCE the garbage collector to run by wrapping the work in a function. PHP does clean up memory after each function call. So if the above example is rewritten this way, memory usage will happily hover around 0.5% constantly.

Example Code:
for ($i=2000; $i<=1000000; $i+=2000) {
    $data = shell_exec("head -$i blah.xml | tail -2000");
    process($data);
    unset($data);
}

function process($data) {
    $data = explode("\n", $data);
    //parse using simplexml
    unset($data);
}

Friday, March 19, 2010

New Google Search on Chrome

Google certainly loves its own browser more than any other. Who doesn't love their own child?

Today, Google changed the entire search experience just for Chrome. If you navigate to http://www.google.com, you'll see a slightly restyled layout. But as you proceed with your search, you'll soon find that the search results are completely rearranged, restyled, and reworked.

It looks like Google is stepping up against Bing's decision-driven search engine, and also stepping up in terms of user friendliness.
The left rail is now filled with search filters and options. The most-used ones, "Everything", "News", and "Maps", are very useful.

Here are a couple of screenshots:


Thursday, March 4, 2010

WoW Account Restored

I have finally gotten my hacked WoW account restored by Blizzard. After a lengthy investigation and restoration, Blizzard's "specialists" were kind enough to restore most of my items. They did the best they could to restore all my lost gear; for the other items, they simply gave me 2500g per level-80 character, plus 14 Emblems of Frost and 70 Emblems of Triumph to cover the loss.

I am very pleased with Blizzard's customer service quality, except that their hotline was nearly impossible to get through to. Everything was restored just in time for ICC.

Now the biggest challenge I'm facing is squeezing out time to play WoW, as I'm slowly getting cornered by other priorities in my life. Time has become extremely precious for me. With LA's traffic, by the time I get back home it is already 7:30 or 8:00 PM. After dinner and playing with my son, there's pretty much no time left for anything other than going to bed. It has been a few days since I even touched my gaming machine. Strangely, I don't feel sad at all. I do miss WoW, but I value activities with my family far more than quality WoW time. Yah, there it is, I've said it, and maybe one day I will quit WoW as well (that will probably be a while :p)

Wednesday, March 3, 2010

SSH Clients on Windows

One of the things I really love about Mac and Linux over Windows is the built-in terminal. Unfortunately, the majority of computer users are on Windows, and programmers who work for larger corporations are pretty much stuck with it.

When it comes to using SSH on Windows, there aren't that many great choices out there. The built-in Command Prompt just wouldn't cut it even if Microsoft somehow managed to build in SSH support directly.

Currently, I'm switching between PuTTY and mintty. Once I figure out how to overcome Windows 7's permission restrictions, I may drop PuTTY completely for mintty.

Here is a list of popular SSH clients on Windows:

  1. PuTTY
    This is probably the most popular SSH client on Windows. It's extremely lightweight and straightforward to use. After more than a decade in development, it is still in beta. There is a long wishlist, and most items have been in pending status for a long time, like the popular wish for tabs. 
    However, tabbed PuTTY isn't just a dream: there is currently an alternative solution called PuTTY Connection Manager.
    A little trick: once you've downloaded putty.exe, move it to the \Windows\ directory. This way you can launch PuTTY by simply typing "putty" in the Start->Run prompt. 
  2. SecureCRT
    Although not free, it does come with pretty much every feature you'll ever need from an SSH terminal. For larger corporations with enough budget, offering SecureCRT to programmers will definitely put smiles on their faces.
  3. mintty
    mintty is a small but excellent terminal emulator for Cygwin. I'm not a particularly huge fan of Cygwin, but mintty does offer some of the natural features you'll find in the terminals on Mac and Linux. Believe it or not, mintty is based on code from PuTTY, so, as you'd expect, it doesn't have tabs.

Wednesday, February 3, 2010

PHP Session Storage

This is a continuation of my older article PHP Memcache Extension - Lesser Known Pitfalls. Reader Tony pointed out a very strong statement I made about not using memcache as PHP session storage.
I have to admit that was a strong statement to make when I wrote that post. Thank you, Tony, for pointing it out. :) I have since changed the wording a bit so as not to deny all usages of memcache as a replacement for sessions.

Here are a couple of situations where I think using memcache as the primary session storage is a bad idea:

  1. Putting shopping cart data in memcache on a high-volume e-commerce site. If the memcache server runs out of memory or crashes, your customers will instantly lose everything in their carts.
  2. Putting user session data in memcache. This one is arguable; again, in a high-volume situation, if the memcache server goes away, just imagine how many queries will be triggered against your database.

I'm sure it's very arguable that none of those will be a problem when using clustered memcache servers with 64GB of memory each (God forbid all 64GB of data gets lost because of a power outage).

However, if the size of each session is controllable, the growth of session data is foreseeable, and, in case of failure, the data is recoverable without triggering a major disaster, then I'm definitely pro using memcache as the session storage. There are many ways to achieve this. Here are a couple of ideas of mine:

  1. Controlling the size of each session can easily be achieved by optimizing your code. A lot of the time I see people toss everything into the session (like user data) just in case a piece of data might be needed somewhere, sometime in the future. This leads to a lot of unnecessary data being stored in the session. 
  2. Optimize the query that fetches the data before it is set into the session. In case of failure, you need to ensure that your database can handle the traffic required to recover the data. If each lost session requires a major join query to rebuild, you can well guess how long the database will last.
  3. Have a backup plan for the session data. If you have the luxury of using storage mounts backed by NAS, utilize it. Put your session data in memcache for fast access, and leave a copy on the NAS for faster and safer recovery. (Remember, I don't mean leave a copy permanently.)
  4. Again, if you have the luxury of using storage mounts backed by NAS, try using SQLite for your user data. Each user gets his/her own SQLite file, and whenever data needs to be retrieved, the SQLite file gets hit first. Imagine the load spread across thousands of disks. 
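
The backup idea above can be sketched with a custom session handler. This is a minimal, hypothetical sketch assuming the classic Memcache extension; the host, port, TTL, and the /mnt/nas/sessions path are all made up for illustration:

```php
<?php
// Hypothetical sketch: memcache as the primary session store, with a
// plain-file copy on an assumed NAS mount for recovery after a cache loss.
$mc = new Memcache();
$mc->connect('127.0.0.1', 11211);
$backupDir = '/mnt/nas/sessions'; // assumed NAS-backed mount

function sess_open($path, $name) { return true; }
function sess_close()            { return true; }

function sess_read($id) {
    global $mc, $backupDir;
    $data = $mc->get("sess_$id");
    if ($data === false && is_file("$backupDir/$id")) {
        // memcache lost the session: recover the copy from the NAS mount
        $data = file_get_contents("$backupDir/$id");
        $mc->set("sess_$id", $data, 0, 1440); // re-prime the cache
    }
    return $data === false ? '' : $data;
}

function sess_write($id, $data) {
    global $mc, $backupDir;
    $mc->set("sess_$id", $data, 0, 1440);       // fast primary store
    file_put_contents("$backupDir/$id", $data); // safety copy on the NAS
    return true;
}

function sess_destroy($id) {
    global $mc, $backupDir;
    $mc->delete("sess_$id");
    @unlink("$backupDir/$id");
    return true;
}

function sess_gc($maxlifetime) { return true; } // NAS cleanup left to a cron job

session_set_save_handler('sess_open', 'sess_close', 'sess_read',
                         'sess_write', 'sess_destroy', 'sess_gc');
session_start();
```

The point of the sketch is simply that reads fall back to the NAS copy only when memcache misses, so the disk is touched on the failure path rather than on every request.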
All in all, I'm not opposed to any form of session storage as long as all sides of it are well thought out and planned.
Sunday, January 31, 2010

Fried Video Card

Things are just not heading north as I wished. Ever since I had my WoW account hacked, it seems like every little thing can turn south at the most unexpected time.

Just as I was trying to finish up the dev work on the three Chrome extensions, my video card gave up on me. Sadly, I'm on a Dell XPS400 that was built 4 years ago, and the video card is only a GeForce 6 series. I know, it's really a joke to most of you guys. It's definitely a legacy video card now, considering it's only PCI-Express x16 without the buzzword "2.0". I guess I should've replaced it when the fans started making noises.... Everything is too late now. It finally froze my screen and gave up on itself.

I'm actually writing this post from my wife's laptop, which I'm not allowed to use for programming. It's a netbook bought strictly for her to read online novels (those Chinese love novels.... are they really that good?)

I have placed an order for a new video card, a GeForce 9800 GT, from newegg.com. It should arrive by tomorrow. For those of you who came from the support link on my extensions, please be patient. I'll be back on track in a couple of days, promise.

Friday, January 22, 2010

First Tweet from Space



Just about 10 hours ago, Timothy J. (T.J.) Creamer, a NASA astronaut, tweeted from the International Space Station. This marks the first tweet from space.

According to the statement released today by NASA, "Astronauts aboard the International Space Station received a special software upgrade this week – personal access to the Internet and the World Wide Web via the ultimate wireless connection."

This personal Web access, called the Crew Support LAN, takes advantage of existing communication links to and from the station and gives astronauts the ability to browse and use the Web. The system will provide astronauts with direct private communications to enhance their quality of life during long-duration missions by helping to ease the isolation associated with life in a closed environment.


During periods when the station is actively communicating with the ground using high-speed Ku-band communications, the crew will have remote access to the Internet via a ground computer. The crew will view the desktop of the ground computer using an onboard laptop and interact remotely with their keyboard touchpad.


OK, they don't have full access to the Internet all the time, but probably a few hours per day as the station orbits the Earth. Also, the access isn't directly through a Dell/Mac they brought with them, but rather through a computer on the ground that the astronauts access using remote desktop. Still, this is really cool. This has got to be the most interesting place to use Twitter, and it marks the first step toward full Internet access in space.

Now, here come some fun facts from the statement:

Astronauts will be subject to the same computer use guidelines as government employees on Earth


This translates to: no porn, no WoW, and the other 100,000 noes from the government computer use guidelines handbook.