exploring apache log files using hive and hadoop

if you're exploring hive as a technology, and are looking to move beyond "hello, world", here's a little recipe for a simple but satisfying first task using hive and hadoop. we'll work through setting up a clustered installation of hive and hadoop, then import an apache log file and query it using hive's SQL-like language.

unless you happen to have three physical linux servers at your disposal, you may want to create your base debian linux servers using a virtualization technology such as xen. for a good guide on setting up xen, go here. for the remainder of this tutorial, i'll assume that you have three debian (lenny) servers to work with.

let's get started

setting up hadoop

first, we need to set up hadoop. we're going to install hadoop at /var/hadoop. execute the following commands as root :

# apt-get install sun-java6-jre
# cd /var
# wget http://www.bizdirusa.com/mirrors/apache/hadoop/core/stable/hadoop-0.18.3.tar.gz
# tar -xvf hadoop-0.18.3.tar.gz
# mv hadoop-0.18.3 hadoop
# rm hadoop-0.18.3.tar.gz

now, vi /var/hadoop/conf/hadoop-env.sh and set the JAVA_HOME variable appropriately. additionally, if you want to run hadoop as a user other than root, change the ownership of the hadoop directory appropriately. for example :

# chown -R cailin:cailin /var/hadoop
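(for reference, once you've set JAVA_HOME in conf/hadoop-env.sh it will look something like the line below — this path assumes debian's sun-java6 package location, the same one we'll put in /etc/profile later on :)

export JAVA_HOME=/usr/lib/jvm/java-6-sun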

repeat this section for all three of your hadoop servers.

configuring hadoop

first, edit /etc/hosts on all three servers and make sure that they are all aware of each other's existence. for example, my servers are named haddop1, haddop2 and haddop3 and my /etc/hosts looks like this :

192.168.1.20    haddop1
192.168.1.21    haddop2
192.168.1.22    haddop3

now make sure that you have password-free SSH access between all servers.
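a minimal sketch of one way to do that, run from haddop1 and repeated for each pair of servers (this assumes you're using the same non-root user, cailin in my case, everywhere, and that you accept an empty passphrase) :

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub | ssh cailin@haddop3 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
$ ssh haddop3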

finally, on all three servers, modify /var/hadoop/conf/hadoop-site.xml to contain the following

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://haddop1:9000</value>
  </property>

  <property>
    <name>dfs.datanode.address</name>
    <value>haddop3:50090</value>
  </property>

  <property>
    <name>dfs.datanode.http.address</name>
    <value>haddop3:50075</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://haddop2:9001</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

and /var/hadoop/conf/slaves to contain

haddop3

starting up hadoop

in the previous configuration, we specified that haddop1 is our NameNode, haddop2 is our JobTracker and haddop3 is both a DataNode and a TaskTracker.

on the NameNode, from the /var/hadoop directory, issue the command to format a new distributed file system (HDFS) and then start up HDFS.

$ bin/hadoop namenode -format
$ bin/start-dfs.sh

on the JobTracker, start up MapReduce :

$ bin/start-mapred.sh
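if you want a quick sanity check that the daemons actually came up, run jps (it ships with the sun jdk) on each box : you should see a NameNode process on haddop1, a JobTracker on haddop2, and a DataNode and TaskTracker on haddop3 (possibly a SecondaryNameNode too, depending on your conf/masters file).

$ jps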

testing hadoop (optional)

if you're like me, even though our ultimate goal here is hive . . . you can't proceed until you've double-checked that everything works so far. okay, fine, we'll make a short stop to test hadoop. all commands should be executed as your non-root user.

first, make yourself some test data. create a directory in your home directory and put in two or more text files containing a few lines of text each. i created mine at /home/cailin/input/wordcount
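for example, something along these lines (the file names and contents are arbitrary) :

$ mkdir -p /home/cailin/input/wordcount
$ echo "hello hadoop hello hive" > /home/cailin/input/wordcount/file01
$ echo "goodbye hadoop hello world" > /home/cailin/input/wordcount/file02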

on your JobTracker node, copy the test data into HDFS, giving the HDFS directory the name input-wordcount :

$ bin/hadoop dfs -copyFromLocal /home/cailin/input/wordcount input-wordcount

now, kick off a job to count the number of instances of each word in your input files. put the output of the job in an HDFS directory named output-wordcount :

$ bin/hadoop jar hadoop-0.18.3-examples.jar wordcount input-wordcount output-wordcount

now, copy the output out of HDFS to your local filesystem and take a look

$ bin/hadoop dfs -get output-wordcount /home/cailin/output/wordcount
$ cat /home/cailin/output/wordcount/*

if you want to run the example again, you need to clean up after your first run. delete the output directory from HDFS and your local copy too. also, delete the input HDFS directory.

$ bin/hadoop dfs -rmr output-wordcount
$ rm -r /home/cailin/output/wordcount
$ bin/hadoop dfs -rmr input-wordcount

shutting down hadoop

it's probably best if you shut down hadoop while we're setting up hive. on the NameNode

$ bin/stop-dfs.sh

on the JobTracker

$ bin/stop-mapred.sh

installing hive

on each of the three servers, execute the following as root

# mkdir /var/hive
# cd /tmp
# svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
# cd hive
# ant -Dhadoop.version="0.18.3" package
# cd build/dist
# mv * /var/hive/.

and, if you want to run hive as somebody other than root, throw in something like the following :

# chown -R cailin:cailin /var/hive

now, vi /etc/profile and make it aware of the following three environment variables, changing the value of JAVA_HOME as appropriate.

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/var/hadoop
export HIVE_HOME=/var/hive

log out and log back in on all three servers to "activate" the changes in /etc/profile
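(or, if you'd rather not log out and back in, sourcing the file in each open shell should have the same effect :)

$ . /etc/profile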

finally, on the NameNode, as your non-root user, execute the following commands from /var/hadoop to create some directories that hive needs. (if you get an error indicating that a directory already exists, that's okay.)

$ bin/hadoop fs -mkdir /tmp
$ bin/hadoop fs -mkdir /user/hive/warehouse
$ bin/hadoop fs -chmod g+w /tmp
$ bin/hadoop fs -chmod g+w /user/hive/warehouse

testing hive

first, start up hadoop again, following the instructions above.

now, as the non-root user on your JobTracker, start up the hive command line interface (cli)

$ cd /var/hive
$ bin/hive

now, in the hive cli, execute the following trivial series of commands, just to make sure everything is in working order

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> SHOW TABLES;
hive> DROP TABLE pokes;

exploring an apache log file using hive

finally, we're able to get to the point!

first, copy an apache log file to /tmp/apache.log on the JobTracker
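for example, something like this, run on the web server itself (assuming your log lives at the default debian location, /var/log/apache2/access.log, and that haddop2 is your JobTracker as above) :

$ scp /var/log/apache2/access.log cailin@haddop2:/tmp/apache.log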

still on your JobTracker in the hive cli, create the table and load in the data

hive> CREATE TABLE apachelog (
        ipaddress STRING, identd STRING, user STRING, finishtime STRING,
        requestline STRING, returncode INT, size INT)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
      WITH SERDEPROPERTIES (
        'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
        'quote.delim'='("|\\[|\\])',
        'field.delim'=' ',
        'serialization.null.format'='-')
      STORED AS TEXTFILE;

hive> LOAD DATA LOCAL INPATH '/tmp/apache.log' OVERWRITE INTO TABLE apachelog;

yee hoo! your apache log file is now available to query using Hive QL. try a few simple queries like :

hive> SELECT * FROM apachelog WHERE ipaddress = '216.211.123.184';

now, suppose we want to execute a simple task such as determining the biggest apache offender (i.e. the ip address with the most requests). in MySQL we would write something like SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress ORDER BY numrequest DESC LIMIT 1. the closest approximation to this in Hive QL is

hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;

however, this may not give you the answer you were expecting! in hive, SORT BY only guarantees that the data is sorted within each reducer. so, to enforce a global ordering over all the data, you would have to set the number of reducers to 1, which may not be realistic for a large data set. to see what happens with more than one reducer, force the number of reducers to 2.

hive> set mapred.reduce.tasks=2;
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;
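(for a small data set you can, of course, just drop back down to a single reducer, which does give the globally correct answer :)

hive> set mapred.reduce.tasks=1;
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;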

to get the right answer with more than one reducer, it is necessary to first create a temporary table and populate it with the number of requests per ip address.

hive> CREATE TABLE ipsummary (ipaddress STRING, numrequest INT);
hive> INSERT OVERWRITE TABLE ipsummary SELECT ipaddress, COUNT(1) FROM apachelog GROUP BY ipaddress;

now, with the help of a Hive QL subquery, we can extract the information we seek, even with more than one reducer.

hive> SELECT ipsummary.ipaddress, ipsummary.numrequest
      FROM (SELECT MAX(numrequest) AS themax FROM ipsummary) ipsummarymax
      JOIN ipsummary ON ipsummarymax.themax = ipsummary.numrequest;

tee hee. that was a lot of work. next time, just use google analytics.

one important note from the comments on this post : DynamicSerDe has been deprecated, and folks have been experiencing problems using TCTLSeparatedProtocol to process weblogs (see http://osdir.com/ml/hive-user-hadoop-apache/2009-10/msg00062.html). use RegexSerDe instead. see HIVE-662 for more information: http://issues.apache.org/jira/browse/HIVE-662
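for reference, a regex-based table definition for a standard combined-format apache log looks roughly like the sketch below. this is adapted from the apache weblog example on the hive wiki; the exact serde class name, and whether you first need to add the hive contrib jar to your session, depend on your hive build, and the table name here is just illustrative.

hive> CREATE TABLE apachelog_regex (
        host STRING, identity STRING, user STRING, time STRING, request STRING,
        status STRING, size STRING, referer STRING, agent STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
        "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
      STORED AS TEXTFILE;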
