lamp performance on the elastic compute cloud: benchmarking drupal on amazon ec2

amazon's elastic compute cloud, "ec2", provides a flexible and scalable hosting option for applications. while ec2 is not inherently suited to application stacks that rely on a relational database, such as lamp, it does offer many advantages over traditional hosting.

in this article we get a sense of lamp performance on ec2 by running a series of benchmarks against the drupal cms. these benchmarks establish read-throughput numbers for logged-in and logged-out users on each of amazon's hardware classes.

we also look at op-code caching, and gauge its performance benefit in cpu-bound lamp deployments.

the elastic compute cloud

amazon uses xen based virtualization technology to implement ec2. the cloud makes provisioning a machine as easy as executing a simple script command. when you are through with the machine, you simply terminate it and pay only for the hours that you've used.
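as a sketch, launching and tearing down an instance with the ec2 api command-line tools of the day looks something like this. the ami id, keypair name and instance id below are placeholders, not values from these tests:

```shell
# launch a small instance from a machine image (ami id and keypair are placeholders)
ec2-run-instances ami-xxxxxxxx --instance-type m1.small -k my-keypair

# list your instances to find the instance id and public hostname
ec2-describe-instances

# when you are through with the machine, terminate it; billing stops
ec2-terminate-instances i-xxxxxxxx
```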

ec2 provides three types of virtual hardware that you can instantiate. these are summarized in the table below.

machine type           hourly cost   memory   cpu units   platform
small instance         $0.10         1.7 GB   1           32-bit
large instance         $0.40         7.5 GB   4           64-bit
extra large instance   $0.80         15 GB    8           64-bit
note: one compute unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

target deployments

to keep things relatively simple, the target deployment for our load test is basic: the full lamp stack runs on a single server. this is step zero in the five deployment steps that i outlined in "an open-source infrastructure for high-traffic drupal sites".

our benchmark

our benchmark consists of a base drupal install, with 5,000 users and 50,000 nodes of content-type "page". nodes are evenly distributed across three sizes: 1 KB, 3 KB and 22 KB. the total database size is 500 MB.

during the test, 10 threads read nodes continually over a 5 minute period. 5 threads operate logged-in. the other 5 threads operate anonymously (logged-out). each thread reads nodes randomly from the pool of 50,000 available.
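the read pattern of the test can be sketched as follows. this is an illustrative stand-in for the jmeter plan, not the actual test harness: `fetch` represents an http get of a node page, and in the real test the logged-in threads additionally carry a session cookie.

```python
import random
import threading

NODE_COUNT = 50000  # size of the node pool created for the benchmark


def reader(fetch, reads, hits):
    # one benchmark thread: request random node pages back to back
    for _ in range(reads):
        nid = random.randint(1, NODE_COUNT)
        fetch(f"/node/{nid}")
        hits.append(nid)


def run_load(fetch, threads=10, reads_per_thread=100):
    # start all threads, wait for them to finish, return the node ids requested
    hits = []
    workers = [
        threading.Thread(target=reader, args=(fetch, reads_per_thread, hits))
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return hits
```

the real test runs each thread for a fixed 5-minute window rather than a fixed request count; throughput is then requests completed divided by elapsed time.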

this test is a "maximum" throughput test. it creates enough load to saturate the critical server resource (cpu, in this case). throughput and response times are measured at that load. tests measuring performance under varying load conditions would also be very interesting, but are outside the scope of this article.

the tests are designed to benchmark the whole lamp stack, rather than weighting it towards apache. consequently they do not load external resources: images, css, javascript files, etc. are not fetched, only the initial text/html page. this effectively simulates drupal running behind an external content server or cdn.

the benchmark runs in apache jmeter. jmeter runs on a dedicated small-instance on ec2.
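for reference, a plan like this is typically run headless from the load-generator instance. the plan filename here is a placeholder:

```shell
# run the test plan without the gui, logging results for later analysis
jmeter -n -t drupal-read-test.jmx -l results.jtl
```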

benchmarking is done with op-code caching on and off. since our tests are cpu bound, op-code caching makes a significant difference to php's cpu consumption.
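for reference, enabling apc in this era amounts to a few lines of php.ini. the shared-memory size below is an illustrative value, not necessarily the one used in these tests:

```ini
; load apc and give it a shared-memory segment for compiled op-codes
extension=apc.so
apc.enabled=1
; apc 3.0.x takes the segment size in megabytes
apc.shm_size=64
```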

our testing environment

the tests use a debian etch xen instance, running on ec2. this instance is installed with:
  • MySQL: 5.0.32
  • PHP: 5.2.0-8
  • Apache: 2.2.3
  • APC: 3.0.16
  • Debian Etch
  • Linux kernel: 2.6.16-xenU

the tests use a default drupal installation. drupal's caching mode is set to "normal". no performance tuning was done on apache, mysql or php.

the results

all the tests ran without error. each of the tests resulted in the server running at close to 100% cpu capacity. the tests typically reached steady state within 30s. throughputs reported by jmeter were sanity-checked against the http and mysql logs. the raw results of the tests are shown in the table below.
instance   apc?   logged-in throughput   logged-in response   logged-out throughput   logged-out response
small      off    194                    1.50                 664                     0.45
large      off    639                    0.46                 2,703                   0.11
xlarge     off    1,360                  0.20                 3,741                   0.08
small      on     905                    0.30                 3,838                   0.07
large      on     3,106                  0.10                 8,033                   0.04
xlarge     on     4,653                  0.06                 12,548                  0.02

note: response times are in seconds, throughputs are in pages per minute
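as a quick cross-check, the headline ratios can be computed straight from the table above (the figures below are transcribed from it):

```python
# throughput figures (pages/minute) transcribed from the results table,
# keyed by (instance, apc_enabled): (logged_in, logged_out)
throughput = {
    ("small",  False): (194, 664),
    ("large",  False): (639, 2703),
    ("xlarge", False): (1360, 3741),
    ("small",  True):  (905, 3838),
    ("large",  True):  (3106, 8033),
    ("xlarge", True):  (4653, 12548),
}

for instance in ("small", "large", "xlarge"):
    in_off, out_off = throughput[(instance, False)]
    in_on, out_on = throughput[(instance, True)]
    # apc speedup per user class, and the anonymous-vs-logged-in gap with apc on
    print(f"{instance}: apc speedup (logged-in) {in_on / in_off:.1f}x, "
          f"(logged-out) {out_on / out_off:.1f}x, "
          f"logged-out vs logged-in {out_on / in_on:.1f}x")
```

the per-instance apc speedups range from roughly 3x to 6x, and the logged-out threads run roughly 3x to 4x faster than the logged-in threads, consistent with the rounded figures quoted below.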

the results - throughput

the throughput of the system was significantly higher for the larger instance types. throughput for the logged-in threads was consistently around 3x lower than for the logged-out threads. this is almost certainly due to the drupal page cache (set to "normal"), which serves cached pages to anonymous users only.

throughput was also increased by about 4x with the use of the apc op-code cache.


the results - response times

the average response times were good in all the tests. the slowest test yielded an average of 1.5s. again, response times were significantly better on the larger instances, and were reduced further by the use of apc.


conclusions

drupal systems perform very well on amazon ec2, even with a simple single machine deployment. the larger hardware types perform significantly better, producing up to 12,500 pages per minute. this could be increased significantly by clustering as outlined here.

the apc op-code cache increases performance by a factor of roughly 4x.

these results are directly applicable to other cpu-bound lamp application stacks. applications bound by other resources, such as database queries, need more consideration. in a database-bound system, for example, drupal's built-in cache would improve performance more significantly, creating a bigger divergence between logged-out and logged-in throughput and response times.

although performance is good on ec2, i'm not recommending that you rush out and deploy your lamp application there. there are significant challenges in doing so and ec2 is still in beta at the time of writing (Jan 08). it's not for the faint-of-heart. i'll follow up in a later blog with more details on recommended configurations.

update 22 May 2008: check out my ec2 update for some new lamp-friendly additions to ec2.

update 26 Aug 2008: more good stuff in the ebs ec2 update.

tech blog

if you found this article useful, and you are interested in other articles on linux, drupal, scaling, performance and LAMP applications, consider subscribing to my technical blog.


jmeter on ec2

Hi John,

Did you get distributed testing working (that is running jmeter servers in the cloud, and a jmeter client locally)? I've been trying, but because ec2 uses its private IP and tells everything that that is the IP, it is creating major difficulties for me... I tried everything, but just can't get it to boot up with the public IP. Let me know if you sorted this.

P.S. I have no problem running a test on the server, but I'd prefer to run it from the client using the distributed test functionality

Thanks,
Jacob

jacob. i didn't experiment

jacob. i didn't experiment with distributed tests on ec2. good luck with your issues.

amazon have recently made

amazon have recently made some encouraging announcements on ec2. take a look at my ec2 update for more information.

continued from here >

continued from here

> True...estimating that is tough!

> Pardon my ignorance here, but why do you say ec2 is not set up for rdbmses (cf your other post),
> or db driven apps? Is it because of the virtual infrastructure not being able to eke out the
> performance you might get, or is it because the motivation behind the service is to not have
> always-on server-type apps?

> At the outset, it looks like it would make life a lot easier to gracefully scale as load
> grows (using your examples > of scaling via load-b heartbeats and clustered dbs).

> Thanks!

yea, you are absolutely right, that ec2 makes it easy to scale many applications gracefully.

the big problem with ec2 is the lack of persistent storage. if you run a lamp stack on an amazon instance and, for example, accidentally shut it down after it's been running for 2 days, you'll lose everything: every piece of data in the database, every configuration change you've made to the system. clearly you can "solve" this by doing regular backups and moving them off the machine to e.g. amazon's s3 or remote storage, but how often will you take a backup? every 12 hours? you are still exposed to some serious data loss ....

so to solve it properly, you need to do something like pseudo real-time replication of your primary system to a standby database, and then backup the standby database every 1/2 hour, moving the standby backups off site. that way, you can suffer a catastrophic failure of your main system, and automagically cut over to your standby system. in that scenario, you are running without a secondary-standby system, so you need to be very quickly notified that you are running on the standby and quickly rebuild and move back to your primary system. ideally you'd never allow yourself to run without a standby, so it may be prudent to use multiple hot backup systems.
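the backup half of that scheme can be sketched as a cron job on the standby. the bucket name is a placeholder, and s3cmd is just one of several s3 upload tools you could use:

```shell
# crontab entry on the standby: dump every 30 minutes and ship the dump to s3
# (% must be escaped in crontab; --single-transaction gives a consistent innodb snapshot)
*/30 * * * * mysqldump --single-transaction drupal | gzip > /tmp/drupal.sql.gz && s3cmd put /tmp/drupal.sql.gz s3://my-backup-bucket/drupal-$(date +\%H\%M).sql.gz
```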

this is all do-able, but my feeling is that it makes more sense to host on traditional providers until amazon improve their offering for systems that rely on persistent storage.

Amazon finally understand

Amazon finally understands the mysql problem, with Elastic Block Store.
I bookmarked a technical article on "Running MySQL on Amazon EC2 with Elastic Block Store"

The url is here : http://developer.amazonwebservices.com/connect/entry.jspa?externalID=166...

Can someone test it?

Nice - thanks once again for

Nice - thanks once again for the detail! I'm curious if you've tried any of the existing offerings out there to provide persistent storage. I noticed that PersistentFS and ElasticDrive have free trials, and ED has a 5GB free drive.

I'm guessing it is only a matter of time before either Amazon comprehensively addresses this sucker, or we have a FUSE or something that can address this issue easily.

Keep the posts coming!

Cheers.

This benchmark could use a

This benchmark could use a comparison against a small dedicated server. My own experience was that a single user on an ec2 environment would take 5 to 10 times longer to get a response than on a dedicated environment. This is fine if your dedicated response times for a single user are on the order of .01 seconds. If your application is heavy on large datasets, and your single-user response times are on the order of one or two seconds in a dedicated environment, then this would put ec2 response times around 10 to 20 seconds, which matched my experience. Scaling can maintain the 10 to 20 seconds for many users, but it sure can't improve it.

thanks for your comments. it

thanks for your comments. it sounds like you got quite different results from me. how did you run your benchmarks?

in these tests, xlarge amazon instances were producing response times of 0.02s giving a throughput of 12,500 pages/min using 10 threads to generate requests.

the slowest response times that i saw were 1.5s for logged in users on the small instance at a throughput of around 200 pages/min

the anecdotal performance of the drupal site when accessed using my dsl line was very good, even while the benchmarks were running.

regarding your "small dedicated servers" question. when i tested non-virtual servers similar to the power of the small amazon instance, they produced results very similar to the small amazon instance.

john, could you post your

john, could you post your jmeter file so we all do the same test?

Thanks

matias, certainly.

matias, certainly. unfortunately I won't have access to the system that they are on until next week. i'll post them in the next week or so.

i have a couple of jmeter files, one that populates the system with the test data, and one that runs the actual test.

john, we made some tests in

john, we made some tests in a similar environment (debian etch xen instance, running on a small ec2 instance + eAccelerator), but we're not getting the same results for the same test. Our results are the following:

logged-in throughput: 976.42
logged-in response: 305 ms
logged-out throughput: 217.93 req/min
logged-out response: 275 ms

I don't know if we're making different tests or we are missing something in our configuration. What would you say?
Is your instance a public one? Can you give us an ami so we can create an image from it?
I'll appreciate any help!

Matias.

matias, your results look a

matias, your results look a bit wacky to me. your logged-out throughput should always be higher than your logged-in throughput with drupal caching on. secondly, with a fixed number of threads, throughput should be very roughly inversely proportional to response time, but your numbers show the opposite relationship.
  1. are you sure you turned drupal caching on?
  2. did you run your tests long enough to stabilize? e.g. 5+ minutes

Nevermind, my tests were

Nevermind, my tests were wrong. Anyway, I still can't get close to your numbers on Fedora; I can on Debian. Do you know if that is possible? Is Debian around 20% faster than Fedora?

Please note, this entry has been closed to new comments.