scaling drupal - an open-source infrastructure for high-traffic drupal sites

the authors of drupal have paid considerable attention to performance and scalability. consequently, even a default install running on modest hardware can easily handle the demands of a small website. my four-year-old pc in my garage, running a full lamp install, will happily serve up 50,000 page views in a day, providing solid end-user performance without breaking a sweat.
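to put that figure in perspective, here is a rough back-of-the-envelope calculation in python. the peak factor of 5 is purely an assumption for illustration; real traffic is never spread evenly and your own profile may look quite different.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(page_views_per_day):
    """Average page views per second, assuming traffic is spread evenly."""
    return page_views_per_day / SECONDS_PER_DAY


def peak_rate(page_views_per_day, peak_factor=5.0):
    """Estimated peak rate. peak_factor is an assumed multiplier, not a measurement."""
    return average_rate(page_views_per_day) * peak_factor


if __name__ == "__main__":
    print(f"average: {average_rate(50_000):.2f} page views/sec")
    print(f"assumed peak: {peak_rate(50_000):.2f} page views/sec")
```

in other words, 50,000 page views a day averages out to well under one page view per second, which is why modest hardware copes.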

when the time comes for scalability: moving out of the garage

if you are lucky, eventually the time comes when you need to service more users than your system can handle. your initial steps should clearly focus on getting the most out of the built-in drupal optimization functionality, considering drupal performance modules, optimizing your php (including considering op-code caching) and working on database performance. John VanDyk and Matt Westgate have an excellent chapter on this subject in their new book, "pro drupal development".

once these steps are exhausted, inevitably you'll start looking at your hardware and network deployment.

a well-designed deployment will not only increase your scalability, but will also enhance your redundancy by removing single points of failure. implemented properly, an unmodified drupal install can run on this new deployment, blissfully unaware of the clustering, routing and caching going on behind the scenes.

incremental steps towards scalability

in this article, i outline a step-by-step process for incrementally scaling your deployment, from a simple single-node drupal install running all components of the system, all the way to a load-balanced, multi-node system with database-level optimization and clustering.

since you almost certainly don't want to jump straight from your single node system to the mother of all redundant clustered systems in one step, i've broken this down into 5 incremental steps, each one building on the last. each step along the way is a perfectly viable deployment.

tasty recipes

i give full step-by-step recipes for each deployment that, with a decent working knowledge of linux, should allow you to get a working system up and running. my examples are for apache2, mysql5 and drupal5 on debian etch, but may still be useful for other versions and flavors.

note that these aren't battle-hardened production configurations, but rather illustrative minimal configurations that you can take and iterate to serve your specific needs.

the 5 deployment configurations

the table below outlines the properties of each of the suggested configurations:
property                          step 0  step 1  step 2  step 3  step 4  step 5
separate web and db               no      yes     yes     yes     yes     yes
clustered web tier                no      no      yes     yes     yes     yes
redundant load balancer           no      no      no      yes     yes     yes
db optimization and segmentation  no      no      no      no      yes     yes
clustered db                      no      no      no      no      no      yes
scalability                       poor-   poor    fair    fair    good    great
redundancy                        poor-   poor-   fair    good    fair    great
setup ease                        great   good    good    fair    poor    poor-

step 0 - a basic drupal install

in step 0, i outline how to install drupal, mysql and apache to get a basic drupal install up and running on a single node. i also go over some of the basic configuration steps that you'll probably want to follow, including cron scheduling, enabling clean urls, setting up a virtual host etc.
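as a small illustration of the cron scheduling part: drupal's periodic tasks are triggered by requesting cron.php at the root of the site. the python below is a minimal sketch of a script you could run from your system scheduler. the function names and the example hostname are made up for this example, and the timeout is an arbitrary choice.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def cron_url(site_base_url):
    """Build the URL of drupal's cron.php from the site's base URL."""
    base = site_base_url if site_base_url.endswith("/") else site_base_url + "/"
    return urljoin(base, "cron.php")


def run_cron(site_base_url, timeout=60):
    """Request cron.php and return the HTTP status code."""
    with urlopen(cron_url(site_base_url), timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # www.example.com is a placeholder; substitute your own site
    print(cron_url("http://www.example.com"))
```

calling run_cron("http://www.example.com") once an hour would be a reasonable starting point for a small site.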


step 1 - a dedicated data server

in step 1, i go over a good first step to scaling drupal: creating a dedicated data server. by "dedicated data server" i mean a server that hosts both the database and a fileshare for node attachments etc. this splits the database server load from the web server, and lays the groundwork for a clustered web server deployment.


step 2 - sticky load balancing with apache mod_proxy

in step 2, i go over how to cluster your web servers. drupal generates a considerable load on the web server and can quickly become resource constrained there. having multiple web servers also increases the redundancy of your deployment.
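the idea behind "sticky" balancing can be sketched in a few lines of python. this is only a conceptual toy, not how apache implements it: new visitors are spread round-robin across the web servers, and returning visitors are sent back to the server named in their cookie. the class name and the backend names web1 and web2 are assumptions for the example.

```python
import itertools


class StickyBalancer:
    """Toy sticky balancer: new visitors are spread round-robin,
    returning visitors go back to the backend named in their cookie."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next_backend = itertools.cycle(self.backends)

    def route(self, cookie_route=None):
        """Return (backend, route value to store in the visitor's cookie)."""
        if cookie_route in self.backends:
            # returning visitor whose backend is still alive: stay put
            return cookie_route, cookie_route
        # new visitor, or the remembered backend is gone: pick the next one
        backend = next(self._next_backend)
        return backend, backend

    def remove(self, backend):
        """Drop a failed backend; its visitors are reassigned on their next request."""
        self.backends.remove(backend)
        self._next_backend = itertools.cycle(self.backends)


if __name__ == "__main__":
    balancer = StickyBalancer(["web1", "web2"])
    print(balancer.route())                     # first visitor
    print(balancer.route(cookie_route="web2"))  # returning visitor
```

stickiness matters less if sessions live in a shared store, but it keeps each visitor's requests on one server.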


step 3 - using heartbeat to implement a redundant load balancer

in step 3, i discuss clustering your load balancer. one way to do this is to use heartbeat to provide instant failover to a redundant load balancer should your primary fail. while the method suggested below doesn't increase the load balancer's scalability, which shouldn't be an issue for a reasonably sized deployment, it does increase your redundancy.
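the failover logic itself is simple, and a toy model in python may help make it concrete. this is a sketch of the general idea only, not heartbeat's actual implementation: the standby listens for heartbeats from the primary and claims the shared virtual ip once the primary has been silent for too long. the class name, the deadtime parameter and the timings are assumptions for the example, and the clock is passed in by hand to keep the logic easy to follow.

```python
class StandbyMonitor:
    """Toy model of a standby load balancer watching its primary.

    deadtime is how long (in seconds) the primary may stay silent
    before the standby claims the shared virtual IP.
    """

    def __init__(self, deadtime, now=0.0):
        self.deadtime = deadtime
        self.last_heartbeat = now
        self.owns_virtual_ip = False

    def heartbeat_received(self, now):
        """Record a heartbeat from the primary."""
        self.last_heartbeat = now

    def check(self, now):
        """Take over the virtual IP if the primary has been silent too long."""
        if not self.owns_virtual_ip and now - self.last_heartbeat > self.deadtime:
            # a real cluster manager would bring up the virtual IP here
            self.owns_virtual_ip = True
        return self.owns_virtual_ip


if __name__ == "__main__":
    standby = StandbyMonitor(deadtime=10)
    standby.heartbeat_received(now=5)
    print(standby.check(now=12))  # primary heard from recently
    print(standby.check(now=16))  # primary silent for longer than deadtime
```

note that this sketch never hands the ip back when the primary returns; whether to fail back automatically is a policy decision.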


step 4 - database segmentation using mysql proxy

in step 4, i discuss scaling the database tier up and out. i compare database optimization and different database clustering techniques. i go on to explore the idea of database segmentation as a possibility for moderate drupal scaling.
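to give a feel for what segmentation means, here is a minimal python sketch of a router that sends each query to a database server based on the tables it touches. this is a conceptual illustration only, not mysql proxy code: the server names db1 and db2, the choice of which tables to move, and the deliberately naive regular expression are all assumptions for the example.

```python
import re

# hypothetical assignment of drupal tables to database servers
SEGMENTS = {
    "watchdog": "db2",
    "sessions": "db2",
}
DEFAULT_SERVER = "db1"

# naive table-name extraction; good enough for simple queries only
TABLE_PATTERN = re.compile(r"\b(?:from|into|update|join)\s+`?(\w+)`?", re.IGNORECASE)


def tables_in(query):
    """Return the lower-cased table names that the pattern finds in a query."""
    return [name.lower() for name in TABLE_PATTERN.findall(query)]


def choose_server(query):
    """Pick the server for a query, refusing queries that span segments."""
    servers = {SEGMENTS.get(table, DEFAULT_SERVER) for table in tables_in(query)}
    if len(servers) > 1:
        raise ValueError("query spans segments: " + query)
    return servers.pop() if servers else DEFAULT_SERVER


if __name__ == "__main__":
    print(choose_server("SELECT * FROM node WHERE nid = 1"))
    print(choose_server("INSERT INTO watchdog (message) VALUES ('x')"))
```

the ValueError highlights the main cost of segmentation: a join can no longer cross the boundary between segments.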


step 5 - the holy grail?

the holy grail of drupal database scaling might very well be a drupal deployment on mysql cluster. if you've tried this, plan to try this or have opinions on the feasibility of an ndb "port" of drupal, i'd love to hear about it.

I'm creating a Drupal site

I'm creating a Drupal site that *may* receive a lot of traffic in the future. Should I decide the infrastructure from the outset or can I adapt the database and Drupal installation at a later date if I need to scale the site? Is the configuration of Drupal largely independent of the MySQL configuration? I'm particularly interested in the database as my application makes a lot of SELECT statements so I may need to scale at a later date.

BTW, Step 4 seems to lead to an invalid link.

my recommendation would be

my recommendation would be to build your site quickly, make a note of where you do anything that might have negative performance implications, and then if/when you receive a lot of traffic, make performance improvements to your code and infrastructure.

this, of course, assumes that you'll receive a fairly gradual increase in traffic and will have time to respond. if this is not the case, you need to carefully plan and simulate load in a sandbox environment. much more work.

I may be wrong about this,

I may be wrong about this, as I'm not deeply technical, but we looked into scaling options and wrote off MySQL clustering. For reasons I forget right now and can't be bothered to investigate, MySQL clustering is great if you have a relatively small catalogue, but the chances are if you have been going for a few years your catalogue will be fairly large, and apparently it starts to suck with a big catalogue. All advantages are lost. =(

Drupal actually does far, FAR more select queries than DML "queries", so instead we considered replication - that is multiple "read" nodes behind a load balancer and a single "write" node with database replication happening on a gigabit LAN, making updates to the "write" node appear *almost* instantly on the "read" nodes. Instantly enough that users never know there was a slight delay.
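A minimal Python sketch of that read/write split, purely as an illustration: SELECT queries rotate across the "read" nodes and everything else goes to the single "write" node. The class name and node names are made up for the example, and a real setup would also have to deal with replication lag and with reads inside transactions.

```python
import itertools


class ReplicatedPool:
    """Toy query router: SELECTs go to read nodes, everything else to the write node."""

    def __init__(self, write_node, read_nodes):
        self.write_node = write_node
        self._read_nodes = itertools.cycle(list(read_nodes))

    def node_for(self, query):
        """Return the name of the node that should run this query."""
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._read_nodes)  # round-robin across the read nodes
        return self.write_node


if __name__ == "__main__":
    pool = ReplicatedPool("write1", ["read1", "read2"])
    print(pool.node_for("SELECT title FROM node WHERE nid = 1"))
    print(pool.node_for("UPDATE node SET title = 'x' WHERE nid = 1"))
```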

Hi John, i host my drupal

Hi John,
I host my Drupal site with Bluehost. After reading this article, I think I'll rent my own server now :) Thanks for the tips!

I'm just brainstorming here,

I'm just brainstorming here, but what about using HA-JDBC for the database cluster? I confess, I'm a java-developer first with only a passing interest in PHP and Drupal. To make this work you would only need to make a PHP PEAR:DB driver that interacts with a JDBC driver. Downside is that you would be constricted by the functionality provided by the JDBC MySQL driver and HA-JDBC driver. Upside is, and I quote "An HA-JDBC database cluster can lose a node without failing/corrupting open transactions." If that's indeed correct, it's a very nice upside from MySQL's "the cluster can handle failures of individual data nodes with no other impact than that a small number of transactions are aborted due to losing the transaction state".

My company just went through

My company just went through a long and involved process evaluating Drupal and MySQL/NDB. The end result is that we do not believe this is a realistic solution for any Drupal installation. One thing about NDB is it is actually not optimized well for highly relational databases. This is obviously a huge problem for any Drupal installation, esp if you are highly dependent on CCK. It also still requires you to have enough RAM to hold your entire dataset, since it runs the entire database out of memory (this was supposed to go away in 5.1 but it is still a limitation.) On top of those things, there are a lot of other annoyances, like not being able to do schema changes without bringing the whole cluster into single-user mode, which would make it impossible to do things like install/update modules without scheduled downtime.

We never did actually get a cluster up with a Drupal implementation to see how bad the performance is (we decided it wasn't for us before getting to that point.) It would be interesting to find out in case they make it more flexible down the road. We're now going more for something like your step 4, which is probably the best anyone can hope for at this point.

thanks for the summary of

thanks for the summary of your experience. very interesting.

i'd be interested in how you are planning to do your segmentation, i.e. what technologies are you planning to use and how are you planning to segment your database? ... maybe drop me a quick email if you get a moment? take it easy. john.

What are your opinions on

What are your opinions on opcode caching in the Drupal environment? PHP Accelerator, eAccelerator, etc. I've heard mixed reviews about this. Thanks.

-Jay

jay, optimizing the

jay, optimizing the performance of your application (including op-code caching) and database tier are clearly very important for scalability. these are things that you'll almost certainly want to investigate before working on your infrastructure. however, since both of these are large subjects in themselves, they're outside the scope of the network infrastructure discussion. i'm planning to tackle these in upcoming blogs.

Great article, John. If you

Great article, John. If you have any rough rule-of-thumb on how many page views/mo. each Step can handle, that would be very helpful as well.

brian, it's really hard to

brian, it's really hard to say, since so much of that depends on the specifics of your drupal application, the hardware you've chosen etc.

it would be nice to benchmark these configurations on reference hardware to at least determine relative load capacity of the various steps.

Great article. Luckily for

Great article. Luckily for me it has shown that I don't need to consider scaling yet for the websites I am working on.

Your Step 3 gets a little

Your Step 3 gets a little expensive, having a cluster sitting there with a single node just waiting for the first load balancer to fail. With heartbeat version 2, you can have more than 2 nodes. This means that you can have ALL your servers in a single cluster, all doing some work, including the db server! When I found out the possibilities with this scenario, it took me a couple of days to realize the implications:
1- all nodes do something
2- if the load balancer fails, then one of the other web servers *switches* roles and becomes the load balancer.
3- if additional servers fail, the remaining servers take up the slack

The advantage here is obvious: you don't *waste* 2 machines on load balancing.

Enjoy!
Dave

I have tried out a number of

I have tried out a number of software load-balancers (back to resonate clustering on solaris 2.6-8) and I feel that when you're just doing simple http distribution, just go ahead and use a piece of network hardware rather than doing it on computers. Chances are your hosting provider might even have it built into the switch you're using anyway, and if not, you can buy some old one on ebay. Doing load balancing in software just isn't worth futzing around with in my opinion, and hardware load balancers are much less likely to go out than the crappy nics in the ghetto 1Us one would probably use for lvs.

Come to think of it, if you have your webservers using heartbeats to do ip failover, you could just use round robin dns and skip the load-balancers altogether, as long as you're storing your php sessions in some shared location (e.g. nfs exported /mnt/phpsess).

oliver, yea, if you've got

oliver, yea, if you've got the money to spend, i totally agree with you. dedicated hardware load-balancers are very easy to set up (including auto-failover), reliable and performant. they also typically come with good options for sticky session management etc. i've had luck using these to balance drupal systems.

as we all slowly drift towards virtualized environments, including hosted environments like amazon's ec2, i wonder if we'll get pushed back towards the software approach ...

dave, yea, good points about

dave, yea, good points about economizing on your hardware. i've tried to outline a simple, canonical configuration, that as you point out, is a bit wasteful in practice. i'm about to introduce another database load balancer as a separate machine, which again, hardly needs a dedicated piece of hardware.

using xen is a great way to keep this canonical configuration, but re-deploy the vm's on separate physical hardware as the need arises.

Wow! Very nice, senks!!!!

Wow! Very nice, thanks!!!! I'll write an overview on my blog tomorrow.

thanks aleks :)

thanks aleks :)

John, Nice little

John,

Nice little article..

It would be nice if you also included details on hosting, e.g. step 0 is hosted from your garage using 256k upload, step 1 is hosted remotely, step 2 is hosted remotely on a dedicated host, step ? on a colocated host...

Tom
