cailin's blog

welcome to cailin nelson's blog. there's more about me on our about us page. if you're more interested in the ladybug, try ava's blog instead.

don't hesitate to contact me about anything.



breadth-first graph search using an iterative map-reduce algorithm

i've noticed two trending topics in the tech world today: social graph manipulation and map-reduce algorithms. in the last blog, i gave a quickie guide to setting up hadoop, an open-source map-reduce implementation, and an example of how to use hive - a sql-like database layer on top of that. while this is one reasonable use of map-reduce, this time we'll explore its more algorithmic uses, while taking a glimpse at both of these trendy topics!
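to give a flavour of the idea, here is a minimal sketch in plain python (no hadoop involved) of breadth-first search written as repeated map and reduce passes. the function names, the record layout and the little driver loop are all assumptions made for illustration, not the code from the full post. each pass pushes known distances one hop further, and the loop stops when a pass changes nothing.

```python
from collections import defaultdict

INF = float("inf")  # marks a node that has not been reached yet


def map_node(node, record):
    """Re-emit the node's own record, plus a candidate distance for each neighbour."""
    distance, neighbours = record
    yield node, ("node", distance, neighbours)
    if distance != INF:
        for neighbour in neighbours:
            yield neighbour, ("dist", distance + 1)


def reduce_node(node, values):
    """Keep the adjacency list and the smallest distance seen for this node."""
    best = INF
    neighbours = []
    for value in values:
        if value[0] == "node":
            neighbours = value[2]
        best = min(best, value[1])
    return node, (best, neighbours)


def bfs_map_reduce(adjacency, source):
    """Run map/reduce passes until no distance changes (hypothetical driver)."""
    state = {
        node: (0 if node == source else INF, list(neighbours))
        for node, neighbours in adjacency.items()
    }
    while True:
        # "shuffle" step: group every emitted value by its key
        grouped = defaultdict(list)
        for node, record in state.items():
            for key, value in map_node(node, record):
                grouped[key].append(value)
        new_state = dict(reduce_node(key, values) for key, values in grouped.items())
        if new_state == state:
            return {node: distance for node, (distance, _) in new_state.items()}
        state = new_state
```

calling bfs_map_reduce({"a": ["b"], "b": []}, "a") gives each node's hop count from "a". on a real cluster every pass would be a separate map-reduce job, which is why the number of iterations (roughly the depth of the graph) matters.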

exploring apache log files using hive and hadoop

if you're exploring hive as a technology, and are looking to move beyond "hello, world", here's a little recipe for a simple but satisfying first task using hive and hadoop. we'll work through setting up a clustered installation of hive and hadoop, and then import an apache log file and query it using hive's SQL-like language.
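as a rough picture of what "import and query" means here, the sketch below does the same kind of job in plain python: split each apache access-log line into named fields, then count requests per status code, much as a GROUP BY query would. it is only a stand-in for illustration. the regular expression assumes the common log format, and the function names are made up, not taken from hive or from the full post.

```python
import re
from collections import Counter

# assumed layout: apache "common" log format
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def count_by_status(lines):
    """Count requests per HTTP status code, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        row = parse_line(line)
        if row is not None:
            counts[row["status"]] += 1
    return counts


sample = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /missing HTTP/1.0" 404 209',
    'not a log line',
]
status_counts = count_by_status(sample)
```

the point of hive is that the same question can be asked in a SQL-like language while hadoop spreads the work across the cluster.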

unless you happen to have three physical linux servers at your disposal, you may want to create your base debian linux servers using a virtualization technology such as xen. for a good guide on setting up xen, go here. for the remainder of this tutorial, i'll assume that you have three debian (lenny) servers at your disposal.

let's get started

log4drupal - an updated logging api for drupal 6

drupal 6 included an upgrade to the built-in logging functionality (watchdog). drupal 6 exposes a new hook, hook_watchdog, which modules may implement to log Drupal events to custom destinations. it also includes two implementations: the dblog module, which logs to the watchdog table, and the syslog module, which logs to syslog.

with these upgrades, log4drupal is a less critical addition to a drupal install, and i hesitated before providing a drupal 6 upgrade. however, eventually i decided that log4drupal is still a useful addition to a drupal development environment, as it provides the following features still not provided by the upgraded drupal 6 watchdog implementation :

  • a java-style stacktrace including file and line numbers, showing the path of execution
  • automatic recursive printing of all variables passed to the log methods
  • ability to change the logging level on the fly

in addition, the drupal 6 version of log4drupal includes the following upgrades from the drupal 5 version :

  • all messages sent to the watchdog method are also output via log4drupal
  • severity levels have been expanded to conform to RFC 3164
  • the log module is now loaded during the drupal bootstrap phase so that messages may be added within hook_boot implementations

you may download the drupal 6 version here. see below for general information on what this module is about and how it works.

easy-peasy-lemon-squeezy drupal 6 installation on debian linux

installing drupal is pretty easy, but it's even easier if you have a step by step guide. i've written one that will produce a basic working configuration with drupal6 on debian lenny with php5, mysql5 and apache2.

all commands that follow assume that you are the root user.

let's get started!

using google analytics advanced segments to separate direct and organic traffic

traffic to a website can be divided into four major sources : direct, paid, organic and referrals. unsurprisingly, google analytics segments the traffic sources reports accordingly.

there is, however, a small catch. the ever growing popularity of search engines has led to an odd use case : users who use a search engine to search for exactly your domain name, instead of simply typing www.mydomain.com into their web browser. these users have just reached your site via an "organic search" and google analytics will classify them accordingly.

technically this is correct, but semantically it's troubling. the users who have reached your site by typing "mydomain" into Google have far more in common with the users that entered www.mydomain.com into their URL bar and far less in common with those users that reached your site by typing "my optimized search term" into Google. and the population of these users is not small - on one of the commercial drupal sites that i maintain these "mydomain" Google searchers account for over one third of the supposedly organic traffic.

before the release of google analytics advanced segments, one could estimate the volume of "True Organic" pageviews by starting with the organic search volume, then using the keyword report to subtract all the "mydomain" keywords (mydomain, mydomain.com, and, my personal favorite, www.mydomain.com).

thankfully, advanced segments now gives us an easy way to create a "True Direct" and "True Organic" segment - in which all the "mydomain" organic searches have been removed from the organic segment, and stuck in the direct segment instead.
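the rule behind the two segments is simple enough to write down. the sketch below is an illustration in python, not how advanced segments are actually configured (that happens in the google analytics interface). the keyword set, the function name and the segment labels are assumptions. the medium values ("organic", "(none)", "cpc", "referral") are the ones google analytics normally reports.

```python
# assumed list of "navigational" keywords for the example domain
BRAND_KEYWORDS = {"mydomain", "mydomain.com", "www.mydomain.com"}


def true_segment(medium, keyword=None):
    """Classify one visit, moving brand-name searches from organic to direct."""
    medium = medium.strip().lower()
    keyword = (keyword or "").strip().lower()
    if medium == "organic":
        # a search for the domain name itself behaves like direct traffic
        return "true direct" if keyword in BRAND_KEYWORDS else "true organic"
    if medium in ("(none)", "direct"):
        return "true direct"
    if medium in ("cpc", "ppc"):
        return "paid"
    if medium == "referral":
        return "referral"
    return "other"
```

for example, true_segment("organic", "www.mydomain.com") lands in "true direct", while true_segment("organic", "my optimized search term") stays in "true organic".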

stokereport.com : drupal powered web 2.0 site for surfers

recently launched, stokereport.com is starting to make waves in the san francisco surfing community, as the first san francisco surf report website powered by user-generated content.

powered by drupal 5.3 under the hood, stokereport is web 2.0 to the core. all content is user-generated, and users may submit reports via SMS, Twitter, mobile web or a traditional web browser. users may post pictures with their report, and vote for their favourites. this feature has quickly led to a great collection of san francisco surf pics.

stokereport is also a bit of a "mash-up" - combining data from the national weather service, weather underground, noaa and other regional weather services to provide current and forecast conditions for swell, wind and temperature.

and finally, if you can't quite get motivated to get in the water yourself, but still like to dream, check out stokereport's user-submitted "rants" - a great collection of news, videos and offbeat fun from the world of surfing.

kindergarten search

okay, yes, ava is barely two, but yes, i admit it, i'm (vaguely) fretting about kindergarten already. today i attended a talk on the san francisco public school enrollment process put on by the parents for public schools organization. for those of you without preschool age children in san francisco, the reason for all the fretting about public schools in san francisco is that the district has a "parent choice" system of assigning kids to schools. you get no guarantee that you get to attend a school in your own neighborhood. you get to express some preference in the matter, but essentially it's just a lottery system that flings kids here and there randomly across the city. now let's be clear . . . san francisco is not a small city. it takes about 45 minutes to drive from the east side of town to the west and the city provides no school buses to ferry all these children to and fro.

one of the speakers at the talk, dr. adams dudley, was very amusing. obviously a very intelligent and analytical fellow, he examined all the "no child left behind" data kindly supplied by the bush administration, and decided that it matters not a whit where or how your child is educated (public vs private, "good school" vs "bad school", language immersion vs straight english) - the only thing that matters is the parents' own socio-economic status and educational background.

i don't think that there is any debate that the numbers are on his side, but what surprised me was the fact that he was also a supporter of the current "parent choice" system. seems like an obvious contradiction to me. if your child's chance of success is baked into their socio-economic-dna and has nothing to do with which school they attend, why not give up this failed experiment, restore our neighborhood schools and all save a little on gas?
