Scaling PHP Apps

Scaling PHP Apps by Steve Corona
English | 2015 | ISBN: n/a | 210 Pages | True PDF, EPUB | 12 MB

Scaling LAMP doesn’t have to suck. You’re not exactly the next Facebook, but you’re big enough, and it costs real money when your site goes down in the middle of the night.

When I started at Twitpic, I quickly learned how big of a mess it was. The infrastructure was a mix of different servers and FreeBSD versions, backed up by a steaming pile of PHP-inlined HTML. Remember, Twitpic was built in a Red Bull-fueled weekend, a side project for Noah to share pictures with a couple of his friends. It wasn’t meant to win the prettiest-code beauty pageant. But, wow, it was bad. Think of the type of PHP your 14-year-old little brother might write— spaghetti, pagename.php files, no MVC framework, and a handful of include_once’s at the top of each file.
Ditching the LAMP Stack
What’s wrong with LAMP?
LAMP (Linux, Apache, MySQL, PHP) is the most popular web development stack in the world. It’s robust, reliable and everyone knows how to use it. So, what’s wrong with LAMP? Nothing. You can go really far on a single server with the default configurations. But what happens when you start to really push the envelope? When you have so much traffic or load that your server is running at full capacity?
You’ll notice tearing at the seams, and in a pretty consistent fashion too. MySQL is always the first to go—I/O bound most of the time. Next up, Apache. Loading the entire PHP interpreter for each HTTP request isn’t cheap, and Apache’s memory footprint will prove it. If you haven’t crashed yet, Linux itself will start to give up on you—all of those sane defaults that ship with your distribution just aren’t designed for scale.
What can we possibly do to improve on this tried-and-true model? Well, the easiest thing is to get better hardware (scale vertically) and split the components up (scale horizontally). This will get you a little further, but there is a better way: scale intelligently. Optimize, swap pieces of your stack for better software, customize your defaults, and build a stack that’s reliable and fault tolerant. You want to spend your time building an amazing product, not babysitting servers.
The Scalable Stack
After lots of trial and error, I’ve found what I think is a generic, scalable stack. Let’s call it LHNMPRR… nothing is going to be as catchy as LAMP!
Linux
We still have Old Reliable, but we’re going to tune the hell out of it. This book assumes the latest version of Ubuntu Server 12.04, but most recent distributions of Linux should work equally well. In some places you may need to substitute apt-get with your own package manager, but the kernel tweaks and overall concepts should apply cleanly to RHEL, Debian, and CentOS. I’ll include kernel and software versions where applicable to help avoid any confusion.
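As a taste of the kind of kernel tuning involved, here is a minimal sketch of a few /etc/sysctl.conf settings that busy web servers commonly raise. The keys are standard kernel parameters, but the values are illustrative assumptions on my part, not recommendations from the book.

    # /etc/sysctl.conf -- illustrative values, tune for your own workload
    # Let more connections queue up before the kernel starts refusing them
    net.core.somaxconn = 4096
    net.ipv4.tcp_max_syn_backlog = 4096

    # Reclaim sockets stuck in FIN-WAIT-2 sooner on busy servers
    net.ipv4.tcp_fin_timeout = 15

    # Raise the system-wide file descriptor ceiling
    fs.file-max = 200000

Changes take effect after running sysctl -p, and any single key can be inspected with, for example, sysctl net.core.somaxconn.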
HAProxy
We’ve dumped Apache and split its job up. HAProxy acts as our load balancer—it’s a great piece of software. Many people use nginx as a load balancer, but I’ve found that HAProxy is a better choice for the job. Reasons why will be discussed in depth in Chapter 3.
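As a rough sketch of that role (my own minimal example, with placeholder names and addresses rather than anything from the book), an haproxy.cfg that balances HTTP traffic across two web servers looks something like this:

    # Minimal HAProxy HTTP load balancer -- addresses are placeholders
    defaults
        mode    http
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend www
        bind *:80
        default_backend web_servers

    backend web_servers
        balance roundrobin
        # Health checks pull dead servers out of rotation automatically
        server web1 10.0.0.11:80 check
        server web2 10.0.0.12:80 check

The frontend accepts traffic on port 80 and spreads it round-robin across the backend servers, skipping any that fail their health checks.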
nginx
nginx takes over the web serving half of Apache’s old job: it handles the HTTP connections, serves static files, and hands PHP requests off to PHP-FPM over FastCGI, all while using far less memory per connection than Apache.
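To make that concrete, here is a bare-bones nginx server block, my own illustrative sketch rather than the book’s configuration, that serves static files and forwards .php requests to a local PHP-FPM socket (the paths and socket location are assumptions):

    # /etc/nginx/sites-available/example -- illustrative only
    server {
        listen 80;
        server_name example.com;
        root /var/www/example/public;
        index index.php;

        # Serve static files directly, falling back to the front controller
        location / {
            try_files $uri $uri/ /index.php?$query_string;
        }

        # Hand PHP scripts to PHP-FPM over FastCGI
        location ~ \.php$ {
            include fastcgi_params;
            fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
            fastcgi_pass unix:/var/run/php5-fpm.sock;
        }
    }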
PHP 5.5 / PHP-FPM
There are several ways to serve PHP applications: mod_php, CGI, lighttpd’s FastCGI. None of these solutions come close to PHP-FPM, the FastCGI Process Manager that’s been bundled with PHP since 5.3. What makes PHP-FPM so awesome? Well, in addition to being rock-solid, it’s extremely tunable, provides real-time stats, and logs slow requests so you can track and analyze slow portions of your codebase.
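For a sense of what that tunability looks like, below is a sketch of a PHP-FPM pool file with a few of those knobs filled in. The directives are standard PHP-FPM options; the values are placeholders for illustration, not the book’s recommendations.

    ; /etc/php5/fpm/pool.d/www.conf -- illustrative values only
    [www]
    listen = /var/run/php5-fpm.sock
    user = www-data
    group = www-data

    ; Cap memory usage by capping the number of worker processes
    pm = dynamic
    pm.max_children = 50
    pm.start_servers = 10
    pm.min_spare_servers = 5
    pm.max_spare_servers = 15

    ; Log a backtrace for any request that takes longer than two seconds
    request_slowlog_timeout = 2s
    slowlog = /var/log/php5-fpm.slow.log

    ; Real-time stats endpoint (restrict access to it in nginx)
    pm.status_path = /fpm-status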
MySQL
Most folks coming from a LAMP stack are going to be pretty familiar with MySQL, and I will cover it pretty extensively. For instance, in Chapter 5 you’ll learn how you can get NoSQL performance out of MySQL. That being said, this book is database agnostic, and most of the tips can be similarly applied to any database. There are many great databases to choose from: Postgres, MongoDB, Cassandra, and Riak to name a few. Picking the correct one for your use case is outside the scope of this book.
DNS: Why you need to care
The first layer that we are going to unravel is DNS. DNS? What!? I thought this was a book on PHP? DNS is one of those things that we don’t really think about until it’s too late, because when it fails, it fails in the worst ways.
Don’t believe me? In 2009, Twitter’s DNS was hijacked and redirected users to a hacker’s website for an hour. That same year, SoftLayer was hit with a massive DDoS attack that took down their DNS servers for more than six hours. As a big SoftLayer customer, we dealt with this firsthand because (at the time) we also used their DNS servers.
The problem with DNS downtime is that it provides the worst user experience possible—users get a generic error or a page timeout and have no way to contact you. It’s as if you don’t exist anymore, and most users won’t understand (or likely care) why.
As recently as September 2012, GoDaddy’s DNS servers were attacked and became unreachable for over 24 hours. The worst part? Their site was down too, so you couldn’t move your DNS elsewhere until their website came back up (24 hours later).
Too many companies are using their domain registrar’s or hosting provider’s DNS servers — that is WRONG! Want to know how many of the top 1000 sites use GoDaddy’s DNS? None of them.
So, should I run my own DNS server?
It’s certainly an option, but I don’t recommend it. Hosting DNS “the right way” involves having many geographically dispersed servers and an Anycast network. DDoS attacks on DNS are extremely easy to launch and if you half-ass it, your DNS servers will be the Achilles’ heel to your infrastructure. You could have the best NoSQL database in the world but it won’t matter if people can’t resolve your domain.
Almost all of the largest websites use an external DNS provider. That speaks volumes as far as I’m concerned—the easiest way to learn how to do something is to imitate those that are successful.

Case Study: HTTP Caching and the nginx FastCGI Cache
This entire book has been, for the most part, dedicated to scaling our backend services— making our frontend faster by improving server performance, caching at the database, and getting the result back to the client as fast as possible. Generally this is a good strategy, but we’ve skipped over an entirely different side of caching: HTTP Caching. What if we could have the client cache our output, in the browser or API client, and skip the trip to the server altogether? That would be most excellent.
This is exactly what HTTP Caching does— it provides a mechanism for us to define rules on when the client should check back for a new version in the future.
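As a minimal sketch of that mechanism (my example, not code from the book), a PHP script can declare how long clients may reuse a response before checking back:

    <?php
    // cacheable.php -- illustrative example of HTTP cache headers

    // Browsers and API clients may reuse this response for five
    // minutes without making another request to the server.
    header('Cache-Control: public, max-age=300');

    // An explicit expiry timestamp for older HTTP/1.0 clients and proxies.
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 300) . ' GMT');

    echo "<h1>This page is cacheable for five minutes</h1>";

Within that five-minute window a repeat visit never touches PHP, MySQL, or even the network; the nginx FastCGI cache covered in this case study applies the same idea on the server side.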