tech blogs
log4drupal now available on github
both the 5.x and 6.x versions are now available for download on github. sorry, i just can't do CVS anymore. to download:
- start by going here: http://github.com/cailinanne/log4drupal
- then click the all tags drop-down and choose the appropriate version
- then click the download button
a full description of the module is available here
- cailin's blog
- 1 comment
- 1293 reads
breadth-first graph search using an iterative map-reduce algorithm
i've noticed two trending topics in the tech world today: social graph manipulation and map-reduce algorithms. in the last blog, i gave a quickie guide to setting up hadoop, an open-source map-reduce implementation, and an example of how to use hive - a sql-like database layer on top of that. while this is one reasonable use of map-reduce, this time we'll explore its more algorithmic uses, while taking a glimpse at both of these trendy topics!
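as a rough illustration of the idea in the title, here is a minimal single-process sketch of breadth-first search written as repeated map and reduce rounds. it is not the hadoop code from the full article: the function names (`map_node`, `reduce_node`, `bfs_map_reduce`) and the in-memory "shuffle" are assumptions made for this example.

```python
from collections import defaultdict

INF = float("inf")


def map_node(node, record):
    """emit the node's own record, plus a candidate distance for each neighbour."""
    distance, neighbours = record
    yield node, ("node", distance, neighbours)
    if distance != INF:
        for neighbour in neighbours:
            yield neighbour, ("dist", distance + 1, None)


def reduce_node(values):
    """keep the adjacency list and the smallest distance seen for one node."""
    best, neighbours = INF, []
    for kind, distance, adjacency in values:
        if kind == "node":
            neighbours = adjacency
        best = min(best, distance)
    return best, neighbours


def bfs_map_reduce(graph, source):
    """run map/reduce rounds until no distance changes (hypothetical driver)."""
    state = {n: (0 if n == source else INF, list(adj)) for n, adj in graph.items()}
    while True:
        grouped = defaultdict(list)  # stands in for the shuffle phase
        for node, record in state.items():
            for key, value in map_node(node, record):
                grouped[key].append(value)
        new_state = {node: reduce_node(values) for node, values in grouped.items()}
        if new_state == state:
            return {node: dist for node, (dist, _) in new_state.items()}
        state = new_state
```

each round pushes the search frontier out by one hop, so the number of rounds grows with the depth of the graph; that is the price of expressing a graph traversal as a sequence of stateless jobs.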
- cailin's blog
- 1 comment
- 3131 reads
san francisco nosql meetup
the meetup discussed the limitations of traditional relational database technology at scale and the open-source alternatives currently available with similar functionality to amazon's dynamo and google's bigtable.
- john's blog
- 2469 reads
exploring apache log files using hive and hadoop
if you're exploring hive as a technology, and are looking to move beyond "hello, world", here's a little recipe for a simple but satisfying first task using hive and hadoop. we'll work through setting up a clustered installation of hive and hadoop, and then import an apache log file and query it using hive's SQL-like language.
unless you happen to have three physical linux servers at your disposal, you may want to create your base debian linux servers using a virtualization technology such as xen. for a good guide on setting up xen, go here. for the remainder of this tutorial, i'll assume that you have three debian (lenny) servers at your disposal.
let's get started
- cailin's blog
- 2861 reads
log4drupal - an updated logging api for drupal 6
drupal 6 included an upgrade to the built in logging functionality (watchdog). drupal 6 exposes a new hook, hook_watchdog which modules may implement to log Drupal events to custom destinations. it also includes two implementations, the dblog module which logs to the watchdog table, and the syslog module which logs to syslog.
with these upgrades, log4drupal is a less critical addition to a drupal install, and i hesitated before providing a drupal 6 upgrade. however, eventually i decided that log4drupal is still a useful addition to a drupal development environment, as it provides the following features still not provided by the upgraded drupal 6 watchdog implementation :
- a java-style stacktrace including file and line numbers, showing the path of execution
- automatic recursive printing of all variables passed to the log methods
- ability to change the logging level on the fly
in addition, the drupal 6 version of log4drupal includes the following upgrades from the drupal 5 version
- all messages sent to the watchdog method are also output via log4drupal
- severity levels have been expanded to conform to RFC 3164
- log module now loaded during the drupal bootstrap phase so that messages may be added within hook_boot implementations.
you may download the drupal 6 version here. see below for general information on what this module is about and how it works.
- cailin's blog
- 3 comments
- 3067 reads
easy-peasy-lemon-squeezy drupal 6 installation on debian linux
installing drupal is pretty easy, but it's even easier if you have a step by step guide. i've written one that will produce a basic working configuration with drupal6 on debian lenny with php5, mysql5 and apache2.
all commands that follow assume that you are the root user.
let's get started!
- cailin's blog
- 4 comments
- 4941 reads
smartly purge your old backup files on linux
if you back up your *nix box, eventually you'll get into the business of purging your old backup files to preserve disk space. a reasonable way to do this is to use the find command to identify old backups and delete them. you should, however, consider doing something a little smarter than this.
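for reference, the baseline "find old files and delete them" approach can be sketched as follows. this is a python stand-in for the find command rather than the smarter strategy the full article goes on to describe; the function names, the flat directory layout and the dry-run default are assumptions made for this example.

```python
import os
import time


def find_old_backups(directory, max_age_days, now=None):
    """return paths of regular files in `directory` older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    old = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            old.append(entry.path)
    return sorted(old)


def purge_old_backups(directory, max_age_days, dry_run=True):
    """delete old backups; with dry_run=True only report what would be deleted."""
    victims = find_old_backups(directory, max_age_days)
    if not dry_run:
        for path in victims:
            os.remove(path)
    return victims
```

defaulting to a dry run is a deliberate choice here: a purge script that deletes by age alone will happily remove your last good backup if the backup job itself has been failing for a while.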
- john's blog
- 4 comments
- 2840 reads
using google analytics advanced segments to separate direct and organic traffic
traffic to a website can be divided into four major sources : direct, paid, organic and referrals. unsurprisingly, google analytics segments the traffic sources reports accordingly.
there is, however, a small catch. the ever growing popularity of search engines has led to an odd use case : users who use a search engine to search for exactly your domain name, instead of simply typing www.mydomain.com into their web browser. these users have just reached your site via an "organic search" and google analytics will classify them accordingly.
technically this is correct, but semantically it's troubling. the users who have reached your site by typing "mydomain" into Google have far more in common with the users that entered www.mydomain.com into their URL bar and far less in common with those users that reached your site by typing "my optimized search term" into Google. and the population of these users is not small - on one of the commercial drupal sites that i maintain these "mydomain" Google searchers account for over one third of the supposedly organic traffic.
before the release of google analytics advanced segments, one could estimate the volume of "True Organic" pageviews by starting with the organic search volume, then using the keyword report to subtract all the "mydomain" keywords (mydomain, mydomain.com, and, my personal favorite www.mydomain.com).
thankfully, advanced segments now gives us an easy way to create a "True Direct" and "True Organic" segment - in which all the "mydomain" organic searches have been removed from the organic segment, and stuck in the direct segment instead.
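the arithmetic behind the two segments can be sketched like this. it only illustrates the bookkeeping, not how advanced segments are configured in google analytics; the function names and the visit counts in the usage lines are made up for this example.

```python
def split_organic(keyword_visits, brand_terms):
    """split organic visits into branded ("mydomain") and true organic counts."""
    brand = {term.lower() for term in brand_terms}
    branded = sum(v for k, v in keyword_visits.items() if k.lower() in brand)
    total = sum(keyword_visits.values())
    return branded, total - branded


def true_segments(direct_visits, keyword_visits, brand_terms):
    """move branded organic searches out of organic and into direct."""
    branded, true_organic = split_organic(keyword_visits, brand_terms)
    return {"true_direct": direct_visits + branded, "true_organic": true_organic}


# illustrative numbers only, not real traffic data
organic_keywords = {
    "mydomain": 200,
    "www.mydomain.com": 100,
    "my optimized search term": 600,
}
brand_terms = ["mydomain", "mydomain.com", "www.mydomain.com"]
segments = true_segments(500, organic_keywords, brand_terms)
```

with these made-up figures, 300 of the 900 "organic" visits are really people looking for the site by name, which matches the one-third share described above.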
- cailin's blog
- 7 comments
- 3147 reads
stokereport.com : drupal powered web 2.0 site for surfers
recently launched, stokereport.com is starting to make waves in the san francisco surfing community, as the first san francisco surf report website powered by user-generated content.
powered by drupal 5.3 under the hood, stokereport is web 2.0 to the core. all content is user-generated, and users may submit reports via SMS, Twitter, mobile web or a traditional web browser. users may post pictures with their report, and vote for their favourites. this feature has quickly led to a great collection of san francisco surf pics.
stokereport is also a bit of a "mash-up" - combining data from the national weather service, weather underground, noaa and other regional weather services to provide current and forecast conditions for swell, wind and temperature.
and finally, if you can't quite get motivated to get in the water yourself, but still like to dream, check out stokereport's user-submitted "rants" - a great collection of news, videos and offbeat fun from the world of surfing.
- cailin's blog
- 8 comments
- 2672 reads
what is twitter and why should you care
an unfathomable number of people around the world are hooked on a new(ish) service called twitter, and an equally unfathomable number still have no idea what it is. twitter is . . . a bit hard to explain. one way to think of twitter is as a blog hosting website. however, there is one twist : each entry (called a "tweet") may be no longer than 140 characters. and there are two unique features : 1) you can send in your blog updates by sending a text message (SMS) to twitter, and 2) your friends can sign up to receive your blog updates on their phones.
to get a better idea what i'm talking about, check out the twitter feed for my crazy husband, or an important person like barack obama. or, check out a few of the many twitter visualizations. twittervision shows some very small percentage of all the tweets received, and where they are coming from. it's best to look at this at an hour of the day when asia is asleep. i also like twistori which shows all incoming tweets containing certain keywords like "wish".
tweets started out in plain text, but it didn't take long for folks to think . . . gosh . . . i'd love to include a snapshot with my random thought of the day . . . and hence was born twitpic. and, of course, there are lots of handy applications to send photo-enabled tweets from your cellphone. i like twitterlator.
how you might use twitter depends on who you are. if you are . . .
- incapable of sending an SMS, or don't know what that is :
forget it, stay away.
- capable of sending an SMS, but too lazy to setup a blog
twitter's a great way to join the nation's new favorite pastime - generating as much useless information as quickly as possible.
- a non-technical blogger
twitter is a great companion to a traditional blog. if you're blogging using a standard blogging technology (wordpress, blogger, etc.) then you can easily add your twitter micro-blog as a sidebar to your regular blog. it's easy, it's fun, and it keeps your blog "fresh" with little effort on your part.
- a geek
if you have any geeky tendencies, you'll likely rapidly develop a love/hate relationship with twitter. love the platform, hate the implementation. at the very least, you'll capture your tweets and display them (sensibly!) as part of your blog. or, hell, you might write an entire surf conditions report website that uses twitter as its underlying technology.
- cailin's blog
- 6 comments
- 1286 reads
amazon release their elastic block store, ebs
a while ago i posted some performance benchmarks for drupal running on a variety of servers in amazon's elastic compute cloud.
amazon have just released ebs, the final piece of technology that makes their ec2 platform really viable for running lamp stacks such as drupal.
ebs, the "elastic block store", provides sophisticated storage for your database instance, with features including:
- high io throughput
- data replication
- large storage capacity
- hot backups using snapshots
- instance type portability e.g. quickly swapping your database hardware for a bigger machine.
- john's blog
- 4 comments
- 4865 reads
a new jmeter book from packt
recently i posted a couple of introductory articles on jmeter, a great apache open-source tool that allows you to measure the performance and scalability of a wide variety of services, especially web-applications.
i wrote these articles because although the online documentation provides reasonable reference material, it doesn't serve well as a jmeter introduction or tutorial.
things have changed a bit since then. the uk-based publishing house packt publishing were kind enough to send me a copy of emily halili's newly published book on jmeter, which, as far as i can tell, is the first book dedicated to the subject.
- john's blog
- 3129 reads
lamp on amazon ec2 shaping up nicely
recently i posted some encouraging performance benchmarks for drupal running on a variety of servers in amazon's elastic compute cloud. while the performance was encouraging, the suitability of this environment for running lamp stacks was not. ec2 had some fundamental issues including a lack of static ip addresses and no viable persistent storage mechanism.
amazon are quickly rectifying these problems, and recently announced elastic ip addresses: a "static" ip address that you own and can dynamically point at any of your instances.
today amazon indicated that persistent storage will soon be available.
- john's blog
- 7 comments
- 5988 reads
zicasso launches drupal-powered web2.0 travel site
three weeks ago, zicasso.com launched a drupal-powered free personalized online travel service that aims to connect travelers to a global network of quality, pre-screened travel companies. unlike many internet travel sites which provide cheap fares or packages, zicasso is targeted for busy, discerning travelers who want to plan and book complex trips (the ones with multiple destination stops or activities).
zicasso was favorably reviewed in popular web publications including pc magazine, techcrunch, ars technica and the san jose business journal.
zicasso chose to build their application using the open-source cms, drupal, to leverage the wide array of web2.0 functionality provided by the open source community.
the application was rapidly constructed by a small development team led by cailin nelson and jenny dickinson. the team took advantage of "core" drupal modules including cck, panels, views, imagecache, workflow and actions.
- john's blog
- 5 comments
- 5352 reads
backing up your xen domains
backups are boring, but we all know how important they are. backups can also be quite powerful when working with xen virtualization, since xen allows for convenient back-up and restore of entire systems.
i've recently been working on a flexible, general-purpose script enabling incremental backups of complete xen guests, optimized for secure, distributed environments: xenBackup. if you're working with xen, you might find it useful.
the xenBackup script leverages open-source components like ssh, rsync, and rdiff-backup to create a simple, efficient and functional solution.
- john's blog
- 23 comments
- 27616 reads
lamp performance on the elastic compute cloud: benchmarking drupal on amazon ec2
amazon's elastic compute cloud, "ec2", provides a flexible and scalable hosting option for applications. while ec2 is not inherently suited for running application stacks with relational databases such as lamp, it does provide many advantages over traditional hosting solutions.
in this article we get a sense of lamp performance on ec2 by running a series of benchmarks on the drupal cms system. these benchmarks establish read throughput numbers for logged-in and logged-out users, for each of amazon's hardware classes.
we also look at op-code caching, and gauge its performance benefit in cpu-bound lamp deployments.
- john's blog
- 13 comments
- 14151 reads
load test your drupal application scalability with apache jmeter: part two
i recently posted an introductory article on using jmeter to load test your drupal application. if you've read this article and are curious about how to build a more sophisticated test that mimics realistic load on your site, read on.
the previous article showed you how to set up jmeter and create a basic test. to produce a more realistic test you should simulate "real world" use of your site. this typically involves simulating logged-in and logged-out users browsing and creating content. jmeter has some great functionality to help you do this.
- john's blog
- 14 comments
- 12051 reads
load test your drupal application scalability with apache jmeter
there are many things that you can do to improve your drupal application's scalability, some of which we discussed in the recent scaling drupal - an open-source infrastructure for high-traffic drupal sites article.
when making scalability modifications to your system, it's important to quantify their effect, since some changes may have no effect or even decrease your scalability. the value of advertised scalability techniques often depends greatly on your particular application and network infrastructure, sometimes creating additional complexity with little benefit.
apache jmeter is a great tool to simulate load on your system and measure performance under that load. in this article, i demonstrate how to set up a testing environment, create a simple test and evaluate the results.
- john's blog
- 4 comments
- 36212 reads
how to setup real-time email-notification for critical syslog events
it is often important for system administrators to get real time notification of critical events. unfortunately, it isn't immediately obvious how to do this in the syslog framework. in this article i show you step-by-step how to do this.
- john's blog
- 6 comments
- 12493 reads
supercharge your css code with m4
one of the biggest problems with css is the lack of constants. how many times have you wanted to code something like this? light_grey = #CCC. instead you are forced to repeat #CCC throughout your css. this quickly creates difficult-to-maintain and difficult-to-read code.
an elegant solution to the problem is to use a general purpose preprocessor like m4. m4 gives you a full range of preprocessing capability, from simple constants to sophisticated macros.
- john's blog
- 8 comments
- 3856 reads
using a guardian to ensure your lamp site is always up
to guarantee maximum uptime for your site, it's a good idea to periodically check the health of your system and restart failing components. you can use a simple program to do this automatically. i like to call this type of program a "guardian".
clearly guardians shouldn't be used as a crutch for a badly configured system. used appropriately, however, they can decrease downtime due to unexpected events or administrator-error.
in this article, i describe how to implement, install and configure a guardian using a lightweight bash script. i go on to describe how to watch over your lamp install using this guardian. please note that all code and configurations have been tested on debian etch but should be useful for other *nix flavors with subtle modifications.
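the check-and-restart loop at the heart of a guardian can be sketched as below. the article's guardian is a bash script; this python version only illustrates the control flow, and the `guard` function, its parameters and the health-check and restart callables are all hypothetical names chosen for this example.

```python
import time


def guard(services, max_rounds=None, interval_seconds=60, sleep=time.sleep):
    """periodically check each service and restart the ones that look unhealthy.

    services maps a name to an (is_healthy, restart) pair of callables.
    max_rounds=None means run forever; a number is handy for testing.
    returns the names of the services that were restarted, in order.
    """
    restarted = []
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for name, (is_healthy, restart) in services.items():
            if not is_healthy():
                restart()
                restarted.append(name)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)  # wait before the next health check
    return restarted
```

in practice `is_healthy` might fetch a page from apache or run a trivial mysql query, and `restart` might call the relevant init script; those details are left out here because they depend on your install.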
- john's blog
- 3 comments
- 4889 reads
cck witch - multi-page cck forms for drupal
the blessing and curse of cck is the ability to quickly create very complex node types within drupal. it doesn't take very long before the input form for a complex node type has become unmanageably long, requiring your user to do a lot of scrolling to get to the bottom of the form. the obvious solution is to break your form into multiple pages, but there is no easy way to do this. there do exist two proposed solutions to this, the cck wizard module and a drupal handbook entry. however, the well-intentioned cck wizard module doesn't seem to work, and the example code in the drupal handbook becomes tedious to repeat for each content type. to fill the void, i bring you cck witch.
cck witch is based on the same premise as the handbook entry : the most natural way to divide a cck form into pages is to use field groups. from there, however, cck witch diverges, taking a relatively lazy, yet effective approach to the problem of multi page forms: on every page we render the entire form, but then simply hide the fields and errors that do not belong to the current step. it also offers an additional feature : when the form is complete and the node is rendered, an individual edit link is provided for each step - allowing the user to update the information only for a particular page in the form, without having to step through the entire wizard again.
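the "render everything, hide what doesn't belong to this step" idea can be sketched in a few lines. this is not the module's php code: the `render_step` function and its plain-dictionary data structures are assumptions made for this example, with one field group per wizard page.

```python
def render_step(form_fields, field_groups, errors, current_step):
    """render every field, marking those outside the current step as hidden.

    form_fields: dict of field name -> value (the whole form)
    field_groups: list of lists of field names, one list per step
    errors: dict of field name -> validation message
    """
    visible = set(field_groups[current_step])
    rendered = [
        {"name": name, "value": value, "hidden": name not in visible}
        for name, value in form_fields.items()
    ]
    # errors for hidden fields are suppressed until the user reaches that step
    shown_errors = {name: msg for name, msg in errors.items() if name in visible}
    return rendered, shown_errors
```

because the full form is always present, the same function also covers the per-step edit links: jumping straight to step n is just a call with `current_step=n`.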
if you've now read enough to be curious to see the goods, then please, be my guest and skip straight to the live demo.
- cailin's blog
- 13 comments
- 13804 reads
never lose your data again: backup remotely using rsync ssh and rdiff-backup
if you've ever lost precious data after a hard drive failure, you've probably learned your lesson and are now automatically backing up your system.
your treasured pictures, videos and documents may still be at risk. your computer could be stolen, destroyed by flood or fire or chopped into small pieces by a jealous ex-lover.
using a remote backup service is a good way to mitigate this type of problem. for around $10 a month, you can find companies willing to store 10GB of data for you. your data is usually accessible using a variety of methods, including rsync, vpn and ftp. to see some of these services, type remote backup rsync service into google.
in this article, i discuss using open source software to take advantage of these services in an efficient and secure manner, allowing the backup of large directories over a dsl-speed line while you sleep.
- john's blog
- 4536 reads
the fantastic four - drupal's unofficial core
using the term "content management system" to describe the drupal cms understates its full potential. i prefer to consider drupal a web-application development-system, particularly suitable for content-heavy projects.
what are the fantastic four?
drupal's application development potential is provided in large part by a set of "core" modules that dovetail to provide an application platform that other modules and applications build on. these modules have become a de-facto standard: drupal's fantastic four. our superheroes are cck, views, panels and cck field types and widgets. if you are considering using drupal to build a website of any sophistication, you can't overlook these.
- john's blog
- 7 comments
- 18669 reads
scaling drupal step four - database segmentation using mysql proxy
if you've set up a clustered drupal deployment (see scaling drupal step three - using heartbeat to implement a redundant load balancer), a good next step is to scale your database tier.
in this article i discuss scaling the database tier up and out. i compare database optimization and different database clustering techniques. i go on to explore the idea of database segmentation as a possibility for moderate drupal scaling. as usual, my examples are for apache2, mysql5 and drupal5 on debian etch. see the scalability overview for related articles.
- john's blog
- 10 comments
- 24114 reads
log4drupal - a logging api for drupal
UPDATE: for the drupal 6 version, please go here.
if your career as a developer has included a stay in the j2ee world, then when you arrived at drupal one of your initial questions was "where's the log file?". eventually, someone told you about the watchdog table. you decided to try that for about five minutes, and then were reduced to using a combination of <pre> and print_r to scrawl debug data across your web browser.
when you tired of that, you learned a little php, did a little web research and discovered the PEAR log package and debug_backtrace(). the former is comfortably reminiscent of good old log4j and the latter finally gave you the stacktrace you'd been yearning for. still, separately, neither gave you quite what you were looking for : a log file in which every entry includes the filename and line number from which the log message originated. put them together though, and you've got log4drupal.
log4drupal is a simple api that writes messages to a log file. each message is tagged with a particular log priority level (debug, info, warn, error or emergency) and you may also set the overall log threshold for your system. only messages with a priority level above your system threshold are actually printed to your log file. the system threshold may be changed at any time, using the log4drupal administrative interface. you may also specify whether or not a full stack trace is included with every message. by default, a stack trace is included for messages with a priority of error and above. the administrative options are illustrated below :
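the threshold and stack-trace behaviour can be sketched as follows. this is a python illustration of the concept, not the log4drupal php api: the `SketchLogger` class is hypothetical, it keeps entries in a list instead of a file, and it assumes a message is written when its level is at or above the threshold.

```python
import inspect
import os
import traceback

# priority levels from the description above, lowest to highest
LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3, "emergency": 4}


class SketchLogger:
    def __init__(self, threshold="warn", trace_from="error"):
        self.threshold = threshold    # may be changed at any time
        self.trace_from = trace_from  # lowest level that gets a stack trace
        self.lines = []               # stands in for the log file

    def log(self, level, message):
        # assumption: "at or above the threshold" is what gets written
        if LEVELS[level] < LEVELS[self.threshold]:
            return
        caller = inspect.currentframe().f_back
        filename = os.path.basename(caller.f_code.co_filename)
        entry = f"[{level.upper()}] {filename}:{caller.f_lineno} {message}"
        if LEVELS[level] >= LEVELS[self.trace_from]:
            # drop the last frame so the trace ends at the caller, not here
            entry += "\n" + "".join(traceback.format_stack()[:-1])
        self.lines.append(entry)
```

tagging every entry with the caller's file and line is the part that the watchdog table alone doesn't give you; the stack trace is the same information repeated for every frame on the way there.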
- cailin's blog
- 15 comments
- 8457 reads
scaling drupal step one B - nfs vs rsync
i got some good feedback on my dedicated data server step towards scaling. kris buytaert in his everything is a freaking dns problem blog points out that nfs creates an unnecessary choke point. he may very well have a point.
having said that, i have run the suggested configuration in a multi-web-server, high-traffic production setting for 6 months without a glitch, and feedback on his blog gives examples of other large sites doing the same thing. for even larger configurations, or if you just prefer, you might consider another method of synchronizing files between your web servers.
- john's blog
- 9 comments
- 9464 reads
beef up your drupal security with apache mod_rewrite and SSH
install.php, you were right. your bum was hanging squarely out of the window, and you should probably consider beefing up your security.
drupal's default exposure of files like install.php and cron.php presents inherent security risks, for both denial-of-service and intrusion. combine this with critical administrative functionality available to the world, protected only by user defined passwords, broadcast over the internet in clear-text, and you've got potential for some real problems.
- john's blog
- 3 comments
- 4887 reads
better css for the drupal hovertip module
don't get me wrong, i'm a happy customer of the drupal hovertip module. everything worked out of the box, and i've enjoyed using it to cram even more pictures into my website. however, the included default css leaves a little to be desired for the following reasons :
- it's too specific. it assigns a very particular look and feel to your tooltips, complete with background colors, fixed widths and font sizes. sure, in theory, you can override all that in your theme css. but if css specificity is not your thing, you're going to be tearing your hair out trying to figure out how to do it.
- the ui element chosen to indicate "hover here" is non-standard. the "hover here" directive is admittedly fairly new, but the emerging standard seems to be the dashed-underline (certainly not the italic font used in the drupal hovertip module).
- the clicktip css does not work on ie6. the link to close the clicktip has mysteriously gone missing.
you can download a more generic, flexible version of the necessary hovertip module css that solves all these issues here. here are some examples of how to use it.
- cailin's blog
- 4001 reads
scaling drupal - an open-source infrastructure for high-traffic drupal sites
the authors of drupal have paid considerable attention to performance and scalability. consequently even a default install running on modest hardware can easily handle the demands of a small website. my four year old pc in my garage, running a full lamp install, will happily serve up 50,000 page views in a day, providing solid end-user performance without breaking a sweat.
when the time comes for scalability: moving out of the garage
if you are lucky, eventually the time comes when you need to service more users than your system can handle. your initial steps should clearly focus on getting the most out of the built-in drupal optimization functionality, considering drupal performance modules, optimizing your php (including considering op-code caching) and working on database performance. John VanDyk and Matt Westgate have an excellent chapter on this subject in their new book, "pro drupal development". once these steps are exhausted, inevitably you'll start looking at your hardware and network deployment.
- john's blog
- 19 comments
- 38568 reads





