using a guardian to ensure your lamp site is always up

to guarantee maximum uptime for your site, it's a good idea to periodically check the health of your system and restart failing components. you can use a simple program to do this automatically. i like to call this type of program a "guardian".

clearly, guardians shouldn't be used as a crutch for a badly configured system. used appropriately, however, they can decrease downtime due to unexpected events or administrator error.

in this article, i describe how to implement, install and configure a guardian using a lightweight bash script, and then how to use it to watch over your lamp install. please note that all code and configurations have been tested on debian etch but should be useful for other *nix flavors with minor modifications.

a simple guardian implementation for lamp

our guardian is a simple shell script that takes a few command line parameters. it also reads a configuration file on startup defining what checks should be run.

this design allows the definition of new checks by simply adding lines to a configuration file. entirely new check types are created by adding standard "handler" functions to the script. more on that later.

the guardian runs each check in turn, restarting the corresponding daemon if a check fails. once the guardian is done with all checks, it reschedules another instance of itself to run later, using the at scheduler.

configuring the guardian

checks are statically defined in a check file. an example check file might look like:
httpCheck,    apache2,  apache2,        http://www.example.com, "login here"
this check tells the guardian to monitor the url http://www.example.com for the string "login here", and to restart apache2 if it isn't found. lamp installs can benefit from a guardian periodically checking the health of both mysql and apache. see the end of this article for a specification of a lamp check file.
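for example, a mysql check line, taken from the full checkfile at the end of this article (the credentials and database name are placeholders), looks like:
mysqlCheck,   mysql,    mysqld,         mysql://drupal:password@localhost/exampledb, "select name from users limit 1"
this asks the guardian to run the query against the local mysql server and to restart mysql if it fails.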

installing the guardian

to install the guardian, copy the guardian source (below) into a file called "guardian" and the lamp checks source (below) into a file called "guardianChecks". now install these into an appropriate place, for example:
# cp guardian /usr/bin ; chmod u+x /usr/bin/guardian
# cp guardianChecks /usr/lib

running the guardian

you can invoke the guardian on the command line as follows:
# guardian -c /usr/lib/guardianChecks
and terminate it with:
# guardian -t
you can always get a full help message with guardian -h.
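since the guardian reschedules itself with at on its own queue ("g" by default), you can also confirm that a run is pending using the standard at tools, for example:
# atq -q g
an empty listing means no guardian run is currently scheduled.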

running the guardian from cron

you should consider running the guardian from cron, to ensure that it will be reinvoked if accidentally terminated. constantly re-invoking the guardian is safe, since if the guardian is already running, subsequent invocations are simply ignored.

an example cron configuration is as follows:

# make sure that the guardian is still running, once an hour
05 * * * * /usr/bin/guardian -b -c /usr/lib/guardianChecks  > /dev/null 2>&1
this allows you to terminate the guardian during system maintenance, safe in the knowledge that should you forget to turn it back on, it will revive itself.
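if you're unsure where to put this entry, the simplest approach (assuming you want the guardian to run as root) is to add the line above to root's crontab:
# crontab -e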

guardian logging

the guardian logs to syslog. the initial invocation also outputs to stderr, allowing you to see the first pass at each check running.

if you'd like to change the logfile from the default /var/log/syslog to a custom file, e.g. /var/log/guardian, add the following to /etc/syslog.conf:

# guardian logging: log all local0's messages to /var/log/guardian
local0.*                        /var/log/guardian
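after editing /etc/syslog.conf, restart the syslog daemon so that it picks up the change. on debian etch the stock daemon is sysklogd, so something like the following should work (adjust if you run a different syslog implementation, e.g. syslog-ng):
# /etc/init.d/sysklogd restart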
you should see log statements like the following:
# cat /var/log/guardian
Dec 17 15:55:00 myhost guardian: PASS: logfile (/var/log/apache2/access.log) clean: line 27846 to EOF
Dec 17 15:55:01 myhost guardian: PASS: found "login here" in http://www.example.com
Dec 17 15:55:01 myhost guardian: PASS: executed "select name from users limit 1" in localhost
Dec 17 15:55:01 myhost guardian: scheduling another check to run in 10 minutes

when something fails

when the guardian detects a problem and restarts a daemon, you should see log statements similar to the following:
Dec 17 16:05:01 myhost guardian: PASS: logfile (/var/log/apache2/access.log) clean: line 28120 to EOF
Dec 17 16:05:01 myhost guardian: FAIL: could NOT LOCATE "login here" in http://www.example.com
Dec 17 16:05:02 myhost guardian: CRITICAL PROBLEMS with daemon (apache2), RESTARTING.
Dec 17 16:05:04 myhost guardian: daemon (apache2) restarted.
Dec 17 16:05:05 myhost guardian: PASS: executed "select name from users limit 1" in localhost
Dec 17 16:05:05 myhost guardian: scheduling another check to run in 10 minutes

another word on syslog

all the guardian log statements are issued on facility local0 at priority debug, warning, or crit. you can configure syslog to map these messages to any output that you choose, for example mail, the console, or a remote machine. see man syslog.conf for more information. i wrote a detailed how-to article on syslog email integration that you might find useful if you want real-time messages informing you of critical guardian activity.
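for example, to also copy critical guardian messages to the console and to a central loghost (the hostname below is a placeholder), you could add lines like these to /etc/syslog.conf:
# send critical guardian messages to the console and a remote loghost
local0.crit                     /dev/console
local0.crit                     @loghost.example.com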

new check types

a new check type (e.g. postgresqlCheck()) can easily be added to the script by defining a new handler function similar to mysqlCheck() and httpCheck(). these handler functions implement a simple interface: handlerFunction(checkSource, checkParameters).
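as an illustration, a minimal postgresqlCheck() handler might look like the sketch below. this is not part of the script that follows: it assumes psql is on the path, treats checkSource as a plain database name rather than a full connect string, and assumes the user running the guardian can connect without a password:
# hypothetical handler: check that postgresql answers a query
# usage: postgresqlCheck databaseName query
postgresqlCheck ()
{
   databaseName=$1
   query=$2

   # assumes passwordless access (e.g. ident auth) for the user running the guardian
   psql --dbname=${databaseName} --command="${query}" > /dev/null 2>&1
   postgresqlCheckResult=$?
   if test ${postgresqlCheckResult} -eq 0
   then
      ${logDebug} "PASS: executed \"${query}\" in ${databaseName}"
   else
      ${logWarning} "FAIL: could NOT EXECUTE \"${query}\" in ${databaseName}"
   fi

   return ${postgresqlCheckResult}
}
a matching (hypothetical) checkfile line might then read:
postgresqlCheck, postgresql, postgres, exampledb, "select 1"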

example lamp checkfile source

the checkfile below shows a configuration that you might use for a lamp install. the string "really bad segfault" is included to illustrate the point. in practice, if you were using an unstable op-code cache, you might look for "exit signal Segmentation fault" in apache's error log, as discussed in the 2bits article on php op-code cache issues.
#
# a checkfile for the guardian
#
# currently 3 types of check are supported: logfileCheck, httpCheck, mysqlCheck
#
# each check takes 3 general arguments
#    - checkType: the type of the check
#    - daemonName: the name of the daemon's init script in /etc/init.d, used to restart it
#    - executableName: the name of the daemon's executable
#
# each check takes 2 arguments specific to the check: checkSource and checkParameters.
#    these have the following values for each check
#
# logfileCheck - check for the presence of an expression in a logfile
#    - checkSource: logfile name
#    - checkParameters: expression to check for
#
# httpCheck - check for the presence of an expression in an http response
#    - checkSource: URL name
#    - checkParameters: expression to check for
#
# mysqlCheck - check for a response to a mysql statement
#    - checkSource: connect string
#    - checkParameters: sql to check
#
#check type, daemonName, executableName, checkSource, checkParameters
logfileCheck, apache2,  apache2,        /var/log/apache2/access.log, "really bad segfault"
httpCheck,    apache2,  apache2,        http://www.example.com, "login here"
mysqlCheck,   mysql,    mysqld,         mysql://drupal:password@localhost/exampledb, "select name from users limit 1"
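if you're worried about op-code cache segfaults specifically, you might swap the logfile check above for something like the following (the error log path and message wording are assumptions; check your own apache error log first):
logfileCheck, apache2,  apache2,        /var/log/apache2/error.log, "exit signal Segmentation fault"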

the guardian source

here's the source for the guardian script. please feel free to use, modify, plagiarize, hack up any of this bash code as your mood takes you.
#!/bin/bash

# guardian - a script to watch over application system dependencies, restarting things
#            as necessary:  http://www.johnandcailin.com/john
#
#            this script assumes that at, logger, sed and wget are available on the path.
#            it assumes that it has permissions to kill and restart daemons including
#            mysql and apache.
#           
#            Version: 1.0:    Created
#                     1.1:    Updated logfileCheck() not to assume that files are rotated
#                             on restart.

checkInterval=10                         # MINUTES to wait between checks

# some general settings
batchMode=false                          # was this invoked by a batch job
terminateGuardian=false                  # should the guardian be terminated

# setting for logging (syslog)
loggerArgs=""                            # what extra arguments to the logger to use
loggerTag="guardian"                     # the tag for our log statements

# the at queue to use. use "g" for guardian. this queue must not be used by another
# application for this user.
atQueue="g"

# the name of the file containing the checks to run
checkFile="./checks"

# function to print a usage message and bail
usageAndBail()
{
   cat << EOT
Usage: guardian [OPTION]...
Run a guardian to watch over processes. Currently this supports apache and mysql. Other
processes can be added by simple modifications to the script. Invoking the guardian will run
an instance of this script every n minutes until the guardian is shut down with the -t option.
Attempting to re-invoke a running guardian has no effect.

All activity (debug, warning, critical) is logged to the local0 facility on syslog.

The checks are listed in a checkfile, for example:

   #check type, daemonName, executableName, checkSource, checkParameters
   logfileCheck, apache2,       apache2,        /var/log/apache2/mainlog, "segmentation fault"

This checkfile specifies a periodic check of apache's mainlog for a string containing
"segmentation fault", restarting the apache2 process if it fails.

This script should be run on each host running the service(s) to be watched.

  -i        set the check interval to MINUTES
  -c        use the specified check file
  -b        batch mode. don't write to stderr ever
  -t        terminate the guardian
  -h        print this help

Examples:
To run a guardian every 10 minutes using checks in "./myCheckFile"
$ guardian -c ./myCheckFile -i 10

EOT

   exit 1;
}

# parse the command line arguments (i and c each take a param)
while getopts i:c:hbt o
do     case "$o" in
        i)     checkInterval="$OPTARG";;
        c)     checkFile="$OPTARG";;
        h)     usageAndBail;;
        t)     terminateGuardian=true;;
        b)     batchMode=true;;        # never manually pass in this argument
        [?])   usageAndBail
       esac
done

# only output logging to standard error when running from the command line
if test ${batchMode} = "false"
then
   loggerArgs="-s"
fi

# setup logging subsystem. using syslog via logger
logCritical="logger -t ${loggerTag} ${loggerArgs} -p local0.crit"
logWarning="logger -t ${loggerTag} ${loggerArgs} -p local0.warning"
logDebug="logger -t ${loggerTag} ${loggerArgs} -p local0.debug"

# delete all outstanding at jobs
deleteAllAtJobs ()
{
   for job in `atq -q ${atQueue} | cut -f1`
   do
      atrm ${job}
   done
}

# are we to terminate the guardian?
if test ${terminateGuardian} = "true"
then
   deleteAllAtJobs

   ${logDebug} "TERMINATING on user request"
   exit 0
fi

# check to see if a guardian job is already scheduled; return 0 if one is, 1 if not.
isGuardianAlreadyRunning ()
{
   # if there are one or more jobs running in our 'at' queue, then we are running
   numJobs=`atq -q ${atQueue} | wc -l`
   if test ${numJobs} -ge 1
   then
      return 0
   else
      return 1
   fi
}

# make sure that there isn't already an instance of the guardian running
# only do this for user initiated invocations.
if test ${batchMode} = "false"
then
   if isGuardianAlreadyRunning
   then
      ${logDebug} "guardian invoked but already running. doing nothing."
      exit 0
   fi
fi

# get the nth comma separated token from the line, trimming whitespace
# usage getToken line tokenNum
getToken ()
{
   line=$1
   tokenNum=$2

   # get the nth comma separated token from the line, removing whitespace
   token=`echo ${line} | cut -f${tokenNum} -d, | sed 's/^[ \t]*//;s/[ \t]*$//'`
}

# check http. get a page and look for a string in the result.
# usage: httpCheck sourceUrl checkString
httpCheck ()
{
   sourceUrl=$1
   checkString=$2

   wget -O - --quiet ${sourceUrl} | egrep -i "${checkString}" > /dev/null 2>&1
   httpCheckResult=$?
   if test ${httpCheckResult} -eq 0
   then
      ${logDebug} "PASS: found \"${checkString}\" in ${sourceUrl}"
   else
      ${logWarning} "FAIL: could NOT LOCATE \"${checkString}\" in ${sourceUrl}"
   fi

   return ${httpCheckResult}
}

# check to make sure that mysql is running
# usage: mysqlCheck connectString query
mysqlCheck ()
{
   connectString=$1
   query=$2

   # get the connect params from the connectString
   userAndPassword=`echo ${connectString} | sed "s/.*\/\/\(.*\)@.*/\1/"`
   mysqlUser=`echo ${userAndPassword} | cut -f1 -d:`
   mysqlPassword=`echo ${userAndPassword} | cut -f2 -d:`
   mysqlHost=`echo ${connectString} | sed "s/.*@\(.*\)\/.*/\1/"`
   mySqlDatabase=`echo ${connectString} | sed "s/.*@\(.*\)/\1/" | cut -f2 -d\/`

   mysql -e "${query}" --user=${mysqlUser} --host=${mysqlHost} --password=${mysqlPassword} --database=${mySqlDatabase} > /dev/null 2>&1
   mysqlCheckResult=$?
   if test ${mysqlCheckResult} -eq 0
   then
      ${logDebug} "PASS: executed \"${query}\" in ${mysqlHost}"
   else
      ${logWarning} "FAIL: could NOT EXECUTE \"${query}\" in database ${mySqlDatabase} on ${mysqlHost}"
   fi

   return ${mysqlCheckResult}
}

# check to make sure that a logfile is clean of critical errors
# usage: logfileCheck logFile errorString
logfileCheck ()
{
   logFile=$1
   errorString=$2
   logfileCheckResult=0
   marker="__guardian marker__"
   mark="${marker}: `date`"

   # make sure that the logfile exists
   test -r ${logFile} || { ${logCritical} "logfile (${logFile}) is not readable. CRITICAL GUARDIAN ERROR."; exit 1; }

   # see if we have a marker in the log file
   grep "${marker}" ${logFile} > /dev/null 2>&1
   if test $? -eq 1
   then
      # there is no marker, therefore we haven't seen this logfile before. add the
      # marker and consider this check passed
      echo ${mark} >> ${logFile}
      ${logDebug} "PASS: new logfile"
      return 0
   fi

   # pull out the "active" section of the logfile, i.e. the section between the
   # last run of the guardian and now: between the marker and the end of the file

   # get the last marker line number
   lastMarkerLineNumber=`grep -n "${marker}" ${logFile} | cut -f1 -d: | tail -1`

   # grab the active section
   activeSection=`cat ${logFile} | sed -n "${lastMarkerLineNumber},$ p"`

   # check for the error expression in the logFile's active section
   echo ${activeSection} | egrep -i "${errorString}" > /dev/null 2>&1
   if test $? -eq 1
   then
      ${logDebug} "PASS: logfile (${logFile}) clean: line ${lastMarkerLineNumber} to EOF"
   else
      ${logWarning} "FAIL: logfile (${logFile}) CONTAINS CRITICAL ERRORS"
      logfileCheckResult=1
   fi

   # mark the newly checked section of the file
   echo ${mark} >> ${logFile}

   return ${logfileCheckResult}
}

# restart daemon, not taking no for an answer
# usage: restartDaemon executableName initdName
restartDaemon ()
{
   executableName=$1
   initdName=$2
   restartScript="/etc/init.d/${initdName}"

   # make sure that the restart script is there and executable
   test -x ${restartScript} || { ${logCritical} "restart script (${restartScript}) is not executable. CRITICAL GUARDIAN ERROR."; exit 1; }

   # try a polite stop
   ${restartScript} stop > /dev/null

   # get medieval on its ass
   pkill -x ${executableName} ; sleep 2 ; pkill -9 -x ${executableName} ; sleep 2

   # restart the daemon
   ${restartScript} start > /dev/null

   if test $? -ne 0
   then
      ${logCritical} "failed to restart daemon (${executableName}): CRITICAL GUARDIAN ERROR."
      exit 1
   else
      ${logDebug} "daemon (${executableName}) restarted."
   fi
}

#
# things look good, let's do our checks and then schedule a new one
#

# make sure that the checkFile exists
test -r ${checkFile} || { ${logCritical} "checkfile (${checkFile}) is not readable. CRITICAL GUARDIAN ERROR."; exit 1; }

# loop through each of the daemons that need to be managed
for daemon in `cat ${checkFile} | egrep -v "^#.*" | cut -f2 -d, |  sed 's/^[ \t]*//;s/[ \t]*$//' | sort -u`
do
   # execute all the checks for the daemon in question
   cat ${checkFile} | egrep -v "^#.*" | while read line
   do
      getToken "${line}" 2 ; daemonName=${token}

      if test ${daemonName} = ${daemon}
      then
         # get the check definition
         getToken "${line}" 1 ; checkType=${token}
         getToken "${line}" 3 ; executableName=${token}
         getToken "${line}" 4 ; checkSource=${token}
         getToken "${line}" 5 ; checkParams=${token}

         # remove quotes
         checkSourceQuoteless=`echo ${checkSource} | sed "s/\"//g"`
         checkParamsQuoteless=`echo ${checkParams} | sed "s/\"//g"`

         # call the appropriate handler for the check
         ${checkType} "${checkSourceQuoteless}" "${checkParamsQuoteless}"

         if test $? -ne 0
         then
            ${logCritical} "CRITICAL PROBLEMS with deamon (${daemonName}), RESTARTING."
            restartDaemon ${executableName} ${daemonName}
         fi
      fi
   done
done

# delete all at jobs (avoid race conditions that could leave duplicate jobs queued)
deleteAllAtJobs

# schedule a new instance of this sucker
${logDebug} "scheduling another check to run in ${checkInterval} minutes"
at -q ${atQueue} now + ${checkInterval} minutes > /dev/null 2>&1 << EOT
$0 $* -b
EOT

tech blog

if you found this article useful, and you are interested in other articles on linux, drupal, scaling, performance and LAMP applications, consider subscribing to my technical blog.

Hi John:

I noticed a few posts here that talk about monitoring a system and handling actions on certain events (e.g. restarting a service, or sending out emails). What are your thoughts on using a solution like Nagios for conducting some of these tasks, and developing plugins for that. The literature out there seems to suggest that it is pretty powerful, though I'm sure it may be overkill for certain deployments. Do you have any insight on Nagios or any other alternatives?

Thanks for the posts!

thanks for bringing this up.

to manage a site of any complexity, you really need a good monitoring solution. although i'm not in love with nagios, most people i've talked to agree that it's the best open-source monitoring solution available.

as you suggest, it's certainly possible to use nagios to implement a guardian. the event handler functionality is designed to allow you to do exactly that.

whether it's a good idea to do this, rather than use a separate light-weight guardian, isn't clear to me. i can think of pros and cons to each approach. i'd be interested to hear your experience if you give it a try ...

Since I'm starting this deployment from scratch, the question is whether to build the monitoring scripts from ground up, or go for a 'heavier' solution and grow into it. Right now, for a bunch of our staging boxes, we only have cron watch, and now need something uhh.. better!

We have a test Nagios box set up and are now setting up other servers. Still have to dig into it, but it seems fairly light weight and not too complex to set up! Haven't set the guardian up yet. Will definitely keep you posted.

I did look at Zenoss and Zabbix, but I really didn't see the need to set up a db and manage and monitor that for a smaller deployment. The other offerings seemed to be much heavier weight, with auto-discovery, etc. Having cut my teeth in sys mgmt, I cringe at such a heavy weight solution for <20 boxes, including production!

Oh well... details to follow as things progress...

Thanks!
