scaling drupal step one B - nfs vs rsync

i got some good feedback on my dedicated data server step towards scaling. kris buytaert in his everything is a freaking dns problem blog points out that nfs creates an unnecessary choke point. he may very well have a point.

having said that, i have run the suggested configuration in a multi-web-server, high-traffic production setting for 6 months without a glitch, and feedback on his blog gives examples of other large sites doing the same thing. for even larger configurations, or if you just prefer, you might consider another method of synchronizing files between your web servers.

kris suggests rsync as a solution, and although luc stroobant points out the delete problem, i still think it's a good, simple solution. see the diagram above.

the delete problem is that you can't simply use the --delete flag on rsync: in an x->y synchronization, a delete on node x looks exactly like an addition on node y, so rsync has no way to tell which node is right.

i speculate that you can partly mitigate this issue with some careful scripting, using a source-of-truth file server to which you first pull only additions from the source nodes, and then do another run over the nodes with the delete flag (to remove any newly deleted files from your source-of-truth). unfortunately you can't do the delete run on a live site (due to timing problems if additions happen after your first pass and before your --delete pass), but you can do this as a regularly scheduled maintenance task when your directories are not in flux.

i include a bash script below to illustrate the point. i haven't tested this script, or the theory in general. so if you plan to use it, be careful.

you could call this script from cron on your data server, say every 5 minutes for a smallish deployment. even though this causes a 5 minute delay in file propagation, the use of sticky sessions ensures that users will see files that they create immediately, even if there is a slight delay for others. additionally, you could schedule it with the -d flag during system downtime.
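for example, the crontab entries on the data server might look something like this (the install path and the downtime window are just placeholders for your own setup):

```shell
# /etc/crontab fragment on the data server (paths are illustrative)
# addition-only sync every 5 minutes
*/5 * * * *   root   /usr/local/bin/synchronizeFiles
# delete sync once a night, in a window when the files directory is static
30 4 * * *    root   /usr/local/bin/synchronizeFiles -d
```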

the viability of this approach depends on many factors including how quickly an uploaded file must be available for everyone and how many files you have to synchronize. this clearly depends on your application.

synchronizeFiles -- a bash script to keep your drupal web server's files directory synchronized

#!/bin/bash

# synchronizeFiles -- a bash script to keep your drupal web server's files directory
#                     synchronized - http://www.johnandcailin.com

# bail if anything fails
set -e

# don't synchronize deletes by default
syncDeletes=false

sourceServers="192.168.1.24 192.168.1.25"
sourceDir="/var/www/drupal/files"
sourceUser="www-data"
targetDir="/var/drupalFiles"

# function to print a usage message and bail
usageAndBail()
{
   echo "Usage: synchronizeFiles [OPTION]"
   echo "     -d       synchronize deletes too (ONLY use when directory contents are static)"
   exit 1;
}

# process command line args
while getopts hd o
do     case "$o" in
        d)     syncDeletes=true;;
        h)     usageAndBail;;
        [?])   usageAndBail;;
       esac
done

# do an initial addition-only synchronization run from sourceServers to the target
for sourceServer in ${sourceServers}
do
   echo "bidirectionally syncing files between ${sourceServer} and local"

   # pull any new files to the target (trailing slashes sync directory contents)
   rsync -a "${sourceUser}@${sourceServer}:${sourceDir}/" "${targetDir}/"

   # push any new files back to the source
   rsync -a "${targetDir}/" "${sourceUser}@${sourceServer}:${sourceDir}/"
done

# synchronize deletes (only use if directory contents are static)
if test "${syncDeletes}" = "true"
then
   for sourceServer in ${sourceServers}
   do
      echo "DELETE syncing files from ${sourceServer} to ${targetDir}"

      # pull from the source, deleting from the source of truth if necessary
      rsync -a --delete "${sourceUser}@${sourceServer}:${sourceDir}/" "${targetDir}/"
   done
fi

tech blog

if you found this article useful, and you are interested in other articles on linux, drupal, scaling, performance and LAMP applications, consider subscribing to my technical blog.

There is also csync2 (http://oss.linbit.com/csync2/). It seems like a good tool for solving the web-server synchronization problem.
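For what it's worth, a minimal csync2 configuration for two web servers could look like this (host names are invented, and this sketch is untested; check the csync2 documentation before relying on it):

```shell
# /etc/csync2.cfg -- hypothetical two-web-server group
group drupalfiles
{
    host web1 web2;
    key /etc/csync2.key;
    include /var/www/drupal/files;
}
```

with that in place on each host, a cron job running "csync2 -x" would check for changes and push them to the other hosts in the group.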

What we really need is an easy way to attach MogileFS to be the file repository for Drupal. For those of you that might not have been following MogileFS, check the link: http://www.danga.com/mogilefs/

There is also an 'experimental' Fuse binding for it so it will look like a 'real' filesystem. Have a look at this:
http://www.spicylogic.com/allenday/blog/2008/07/14/mogilefs-fuse-bigfile...

Here is a PHP binding for it as well:
http://projects.usrportage.de/index.fcgi/php-mogilefs

I am going to be testing this as a scalable storage platform for a planned high-performance Drupal deployment. MogileFS solves lots of problems including replication and disk I/O. If anyone else has used it with Drupal please drop a reply.

Hi Robert,

Was just investigating the use of MogileFS for our Drupal front-end deployments for scaling, to share common Drupal static and dynamic /files amongst all the Drupal FEs. What have the results been so far? Would be grateful if you could share your findings.

Thanks
George

Robert, that sounds like a really solid idea. I'd love to hear how it goes.

I'm afraid there is no way to get this working in a production environment. Not only because of the timing issues (which get worse as you add more webservers), but also because of the I/O and CPU load caused by rsync. Imagine you have to use this on 5 webservers with a large files directory: a large part of the server resources will be used just to keep everything in sync.
As long as you can't predict which server ALL new files are uploaded to, rsync won't scale this way.

luc, thanks for stopping by.

i see you're a drupal on debian on xen (on debian) fan too :) god bless xen ...

and i agree with you about large deployments ... that's why i said that "the viability of this approach depends on many factors including how quickly an uploaded file must be available for everyone and how many files you have to synchronize. this clearly depends on your application."

there are plenty of drupal applications that this would work for. e.g. those that have say < 1000 images that need to get sync'd every few hours.

it clearly doesn't work (as you point out) for large file areas that need frequent synchronization.

... or you use OCFS2. OCFS2 totally spanks anything out there right now, both in terms of raw performance and in terms of ease of looking after. I used to look after a large webfarm, and all our troubles seemed so far away, once we moved to OCFS2.

Or you could use "unison", which is like rsync, except that it keeps state between runs, so it knows what's a delete, and what's an add - so can do bidirectional synchronization.

http://www.cis.upenn.edu/~bcpierce/unison/
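A minimal unison profile matching the setup in the article might look like this (the server address and paths are taken from the script above, and this sketch is untested; see the unison manual for the details):

```shell
# ~/.unison/drupalfiles.prf -- hypothetical profile
# local source-of-truth copy
root = /var/drupalFiles
# one web server's files directory, over ssh
root = ssh://www-data@192.168.1.24//var/www/drupal/files
# run non-interactively
batch = true
```

you'd then run "unison drupalfiles" from cron, and because unison keeps state between runs, deletes and adds propagate correctly in both directions.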

yea, this does look very interesting. the stateless approach (rsync) gets very inefficient as either your directory size gets big or your synchronization period gets small.
