scaling drupal step four - database segmentation using mysql proxy

if you've set up a clustered drupal deployment (see scaling drupal step three - using heartbeat to implement a redundant load balancer), a good next step is to scale your database tier.

in this article i discuss scaling the database tier up and out. i compare database optimization and different database clustering techniques. i go on to explore the idea of database segmentation as a possibility for moderate drupal scaling. as usual, my examples are for apache2, mysql5 and drupal5 on debian etch. see the scalability overview for related articles.

deployment overview

this table summarizes the characteristics of this deployment choice:
scalability: good
redundancy: fair
ease of setup: poor

servers

in this example, i use:

web server             drupal-lb1.mydomain.com            192.168.1.24
data server            drupal-data-server1.mydomain.com   192.168.1.26
data server            drupal-data-server2.mydomain.com   192.168.1.27
data server            drupal-data-server3.mydomain.com   192.168.1.28
mysql load balancer    mysql-balance-1.mydomain.com       192.168.1.94

first steps first - optimizing your database and application

the first step to scaling your database tier should include identifying problem queries (those taking most of the resources), and optimizing them. optimizing may mean reducing the volume of the queries by modifying your application, or increasing their performance using standard database optimization techniques such as building appropriate indexes. the devel module is a great way to find problem queries and functions.
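
for example, once a problem query is identified, EXPLAIN will show whether an index is being used, and one can be added if not (a hypothetical query and index, purely to illustrate the technique):
EXPLAIN SELECT nid, title FROM node WHERE type = 'story' AND status = 1;
CREATE INDEX node_type_status ON node (type, status);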

another important consideration is the optimization of the database itself, by enabling and optimizing the query cache, tuning database parameters such as the maximum number of connections etc. using appropriate hardware for your database is also a huge factor in database performance, especially the disk io system. a large raid 1+0 array for example, may do wonders for your throughput, especially combined with a generous amount of system memory available for disk caching. for more on mysql optimization, take a look at the great o'reilly book by jeremy zawodny and derek balling on high performance mysql.
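
as a purely illustrative sketch (the right values depend entirely on your hardware and workload), the relevant my.cnf entries might look like:
# /etc/mysql/my.cnf - illustrative values only, tune for your workload
query_cache_type = 1      # enable the query cache
query_cache_size = 64M    # memory reserved for cached result sets
max_connections  = 300    # headroom for all apache children combined
key_buffer       = 256M   # myisam index cache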

when it's time to scale out rather than up

you can only (and should only) go so far scaling up. at some point you need to scale out. ideally, you want a database clustering solution that allows you to do exactly that. that is, add nodes to your database tier, completely transparently to your application, giving you linear scalability gains with each additional node. mysql cluster promises exactly this. it doesn't offer full transparency however, due to limitations introduced by the ndb storage engine required by mysql cluster. having said that, the technology looks extremely promising and i'm interested to hear if anyone has got a drupal application running successfully on this platform. you can read more on mysql clustering on the mysql cluster website or in the mysql clustering book by alex davies and harrison fisk.

less glamorous alternatives to mysql cluster

without the magic of mysql cluster, we've still got some, admittedly less glamorous, alternatives. one is to use a traditional mysql database cluster, where all writes go to a single master and reads are distributed across several read-only nodes. the master updates the read-only nodes using replication.

an alternative is to segment read and write requests by role, thereby partitioning the data into segments, each one resident on a dedicated database.

these two approaches are illustrated below:

there are some significant pitfalls to both approaches:

  • the traditional clustering approach introduces replication lag i.e. it takes a non-trivial amount of time, especially under load, for writes to make it back to the read-only nodes. this may not be problematic for very specific applications, but it is in the general case.
  • the traditional clustering approach scales only reads, not writes, since each write has to be made to each node.
  • in traditional clustering the total effective size of your memory cache is the size of a single node (since the same data is cached on each node), whereas with segmentation it's the sum across the nodes.
  • in traditional clustering each node has the same hardware optimization pattern, whereas with segmentation, each node can be optimized according to the role it's playing.
  • the segmentation approach reduces the redundancy of the system, since theoretically a failure of any one of the nodes takes your "database" offline. in practice, you may have segments that are non-essential e.g. logging. you can, of course, cluster your segments, but this reintroduces the replication lag issue.
  • the segmentation approach relies on a thorough understanding of the application, and of the relative projected load on each segment, to do properly.
  • the segmentation approach is fundamentally limited, since there are only a limited number of segments in a typical application.

more thoughts on database segmentation

from one perspective, the use of memcache is a database segmentation technique i.e. it takes part of the load on the database (from caching) and segments this into a specialized and optionally distributed caching "database". there is a detailed step-by-step guide on lullabot for doing this on debian etch with the drupal memcache module.

you can continue this approach in other areas of your database, dedicating several databases to different roles. for example, if one of the functions of your database is to serve as a log, why not segment all log activity onto a dedicated database? clearly, it's important that your segments are distinct i.e. that applications don't need joins or transactions between segments. you may have auxiliary applications that do need complex joins between segments e.g. reporting. this can be easily solved by warehousing the data back into a single database dedicated to serving that auxiliary application.

while i'm not suggesting that the next step in your scaling exercise should necessarily be segmentation (that clearly depends on your application and preferences), we're going to explore the idea anyway. it's my blog after all :)

what segmentation technologies to use?

there are several open source tools that you can use to build a segmentation infrastructure. sqlrelay is a popular database-agnostic proxying tool that can be used for this purpose. mysql proxy is, as the name suggests, a mysql-specific proxying tool.

in this article i focus on mysql proxy. sqlrelay (partly due to its more general-purpose nature) is somewhat difficult to configure, and inherently less flexible than mysql proxy. mysql proxy, on the other hand, is quick to set up and use. it has a simple, elegant and flexible architecture that allows for a full range of proxying applications, from trivial to uber-complex.

more on mysql proxy

jan kneschke's brainchild, mysql proxy is a lightweight daemon that sits between your client application (apache/modphp/drupal in our case) and the database. the proxy allows you to perform just about any transformation on the traffic, including segmentation. the proxy lets you hook into three actions: connect, query and result. you can do whatever you want in these steps, manipulating data and performing actions using lua scripts. lua is a fully featured scripting language, designed for high performance, clearly a key consideration in this application. don't worry too much about having to pick up another scripting language. lua is easy to learn, powerful and intuitive.
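
as a skeletal sketch, using the hook names from the 0.6 lua api (check the docs for your version):
-- skeleton.lua: the three hook points exposed by mysql proxy (0.6 api)

-- connect: called when a client connection needs a backend;
-- returning nothing lets the proxy pick a backend itself
function connect_server()
end

-- query: called for every packet the client sends;
-- a COM_QUERY packet carries the query text from byte 2 onwards
function read_query(packet)
   if packet:byte() == proxy.COM_QUERY then
      -- inspect, rewrite or re-route the query here
   end
end

-- result: called for each result set that read_query asked to see
-- (by appending the query to proxy.queries)
function read_query_result(inj)
   -- inj.query_time and inj.resultset are available for timing/analysis
end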

even if you don't intend to segment your databases, you might consider a proxy configuration for other reasons, including logging, filtering, redundancy, timing and analysis, and query modification. for example, using mysql proxy to implement a hot standby (replicated) database would be trivial.
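
for instance, a minimal logging script (a hedged sketch in the spirit of the samples that ship with the proxy):
-- log-queries.lua: print every query with a timestamp
function read_query(packet)
   if packet:byte() == proxy.COM_QUERY then
      print(os.date("%Y-%m-%d %H:%M:%S") .. " " .. packet:sub(2))
   end
end
you'd start the proxy with something like ./mysql-proxy --proxy-backend-addresses=192.168.1.26:3306 --proxy-lua-script=log-queries.lua and watch stdout.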

the mysql site states clearly (as of 09Nov2007); "MySQL Proxy is currently an Alpha release and should not be used within production environments". Feeling lucky?

a word of warning

the techniques described below, including the overall method and the use of mysql proxy, are intended to stimulate discussion. they are not intended to represent a valid production configuration. i've explored this technique purely in an experimental manner. in my example below i segment cache queries to a specific database. i don't mean to imply that this is a better alternative to memcache. it isn't. anyway, i'd love to hear your thoughts on the general approach.

don't panic, you don't really need this many servers

before you get yourself into a panic over the number of boxes i've drawn in the diagram, please bear in mind that this is a canonical network. in reality you could use the same physical hardware for both load balancers, or, even better, you could use xen to create this canonical layout and, over time, deploy virtual servers on physical hardware as load necessitates.

down to business - set up and test a basic mysql proxy

o.k., enough of the chatter. let's get down to business and set up a mysql proxy server. first, download and install the latest version of mysql proxy from http://dev.mysql.com/downloads/mysql-proxy/index.html.
tar xvfz mysql-proxy-0.6.0-linux-debian3.1-x86.tar.gz
make sure that your mysql load balancer can access the database on your data server i.e. on your data server, run mysql and enter:
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, INDEX, ALTER, CREATE
TEMPORARY TABLES, LOCK TABLES
ON drupaldb.*
TO drupal@'192.168.1.94' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
check that your load balancer can access the database on your data server i.e. on your load balancer do:
# mysql -e "select * from users limit 1" --host=192.168.1.26 --user=drupal --password=password drupaldb
now do a quick test of the proxy. run the proxy server, pointing it at your drupal database server:
./mysql-proxy --proxy-backend-addresses=192.168.1.26 &
and test the proxy:
echo "select * from users" |  mysql --host=127.0.0.1 --port=4040 --user=drupal --password=password drupaldb
now change your drupal install to point at the load balancer rather than at your data server directly i.e. edit settings.php on your webserver(s):
$db_url = 'mysql://drupal:password@192.168.1.94:4040/drupaldb';

asking mysql proxy to segment your database traffic

the best way to segment a drupal database depends on many factors, including the modules you use and the custom extensions that you have. it's beyond the scope of this exercise to discuss segmentation specifics but, as an example, i've segmented the database into three segments: a cache server, a log server and a general server (everything else).

to get started segmenting, create two additional database instances (drupal-data-server2, drupal-data-server3), each with a copy of the data from drupal-data-server1. make sure that you GRANT the mysql load balancer permission to access each database as described above.
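
a minimal sketch of the copy, run from a host that can reach all three servers (e.g. the load balancer); this assumes the drupal account has the rights granted earlier and that an empty drupaldb already exists on the new instances:
# copy the data from drupal-data-server1 to the two new instances
mysqldump --host=192.168.1.26 --user=drupal --password=password drupaldb \
  | mysql --host=192.168.1.27 --user=drupal --password=password drupaldb
mysqldump --host=192.168.1.26 --user=drupal --password=password drupaldb \
  | mysql --host=192.168.1.28 --user=drupal --password=password drupaldb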

you'll now want to start up your proxy server, pointing to these instances. below, i give an example of a bash script that does this. it starts up the cluster and executes several sql statements, each one bound for a different member of the cluster, to ensure that the whole cluster has started properly. note that you'd also want to build something similar as a health check, to ensure that the servers keep functioning properly, stopping the cluster (proxy) as soon as a problem is detected.

here's the source for runProxy.sh:

#!/bin/bash
BASE_DIR=/home/john
BIN_DIR=${BASE_DIR}/mysql-proxy/sbin

# kill the server if it's running
pkill -f mysql-proxy

# make sure any old proxy instance is dead before firing up the new one
sleep 1

# run the proxy server in the background
${BIN_DIR}/mysql-proxy \
--proxy-backend-addresses=192.168.1.26:3306 \
--proxy-backend-addresses=192.168.1.27:3306 \
--proxy-backend-addresses=192.168.1.28:3306 \
--proxy-lua-script=${BASE_DIR}/databaseSegment.lua &

# give the server a chance to start
sleep 1

# prime the pumps!
# execute some sql statements to make sure that the proxy is running properly
# i.e. that it can establish a connection to the range of servers in question
# and bail if anything fails
for sqlStatement in \
   "select cid FROM cache limit 1" \
   "select nid FROM history limit 1" \
   "select name FROM variable limit 1"
do
   echo "testing query: ${sqlStatement}"
   echo ${sqlStatement} |  mysql --host=127.0.0.1 --port=4040 \
       --user=drupal --password=password drupaldb || { echo "${sqlStatement}: failed (is that server up?)"; exit 1; }
done
you'll notice that this script references databaseSegment.lua, a script that uses a little regex magic to map queries to servers. again, the actual queries being mapped are examples to illustrate the point, but you'll get the idea. jan has a nice r/w splitting example that can easily be modified to create databaseSegment.lua.

most of the complexity in jan's code is around load balancing (least connections) and connection pooling within the proxy itself. jan points out (and i agree) that this functionality should be made available in a generic load-balancing lua module. i really like the idea of having this in lua scripts so that others can easily extend it, for example by adding a round-robin alternative. keep an eye on his blog for developments. anyway, for now, let's modify his example and add some defines and a function to do the mapping:

local CACHE_SERVER = 1
local LOG_SERVER = 2
local GENERAL_SERVER = 3

-- select a server to use based on the query text, this will return one of
-- CACHE_SERVER, LOG_SERVER or GENERAL_SERVER
function choose_server(query_text)
   local cache_server_strings = { "FROM cache", "UPDATE cache",
                                  "INTO cache", "LOCK TABLES cache"}
   local log_server_strings =   { "FROM history", "UPDATE history",
                                  "INTO history" , "LOCK TABLES history",
                                  "FROM watchdog", "UPDATE watchdog",
                                  "INTO watchdog", "LOCK TABLES watchdog" }

   local server_table = { [CACHE_SERVER] = cache_server_strings,
                          [LOG_SERVER] = log_server_strings }

   -- find a server registered for this query_text in the server_table;
   -- return on the first match. plain-text matching (the final argument
   -- to string.find) avoids surprises with lua pattern magic characters
   for i = 1, #server_table do
      for j = 1, #server_table[i] do
         if string.find(query_text, server_table[i][j], 1, true)
         then
            return i
         end
      end
   end

   -- no match: default to the general server
   return GENERAL_SERVER
end
and then call this in read_query:
-- pick a server to use
proxy.connection.backend_ndx = choose_server(query_text)
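
pieced together, a minimal read_query looks something like this (a sketch; jan's original layers least-connections balancing and connection pooling on top of it):
function read_query(packet)
   -- only route real queries; other packet types take the default backend
   if packet:byte() == proxy.COM_QUERY then
      local query_text = packet:sub(2)
      -- pick a server to use
      proxy.connection.backend_ndx = choose_server(query_text)
   end
end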

test your application

now test your application. a good way to see the queries hitting your database servers is to (temporarily) enable full logging on each of them and watch the log. edit /etc/mysql/my.cnf and set:
# Be aware that this log type is a performance killer.
log             = /var/log/mysql/mysql.log
and then:
# tail -f /var/log/mysql/mysql.log

further work

to develop this idea further:
  • someone with better drupal knowledge than me could define a good segmentation structure for a typical drupal application, with the query fragments associated with each segment.
  • additionally, the scripts could handle exceptional situations better e.g. with a regular health check for the proxy.
  • clearly we've introduced another single-point-of-failure in the database load balancer. the earlier discussion of heartbeat applies here.
  • it would be wonderful to bypass all this nonsense and get drupal running on a mysql cluster. i'd love to hear if you've tried it and how it went.

references and documentation

tech blog

if you found this article useful, and you are interested in other articles on linux, drupal, scaling, performance and LAMP applications, consider subscribing to my technical blog.

Thanks for the great

Thanks for the great document. I'm very curious about how you modified the lua script to route the different r/w queries to the different backend mysql servers. Would you please kindly post it or send it to my email? We tested with the latest mysql proxy 0.7.0, but it was said that sending different queries to different backend servers is not supported in mysql proxy.

We are facing the same

We are facing the same problem.

We have been testing MySQL Proxy v0.7.
We made it work properly with 1 master database server but as soon as we added another database (either master or slave) we could not make it work as pages were returned with missing data.

In other words it doesn't seem to work when we try to balance traffic across several databases.

And we are using Drupal v6.x.

Any help will be really appreciated.
Best regards.
Herve

hi.. any updates please on

hi.. any updates please on whether mysql cluster and drupal can work together, in perfect unison?

raskal

Tried installing Drupal 6.3

Tried installing Drupal 6.3 against MySQL Cluster 6.2.15 but soon bumped up against the 8052 byte row length limit (not including TEXT and BLOBs). I think Cluster needs to evolve significantly before it's ready to handle applications like Drupal.

alpha tho' it may be, mysql

alpha tho' it may be, mysql proxy is the correct response to this issue. sqlrelay had the right idea but did not work transparently w/ the mysql client protocol, and the drop-ins weren't sufficient (e.g. retooling necessary for things like native mysql-ruby + sqlrelay... can be done, but a big PITA).

/me too doesn't think MySQL

/me too doesn't think MySQL cluster is a good fit for Drupal.
I'll put up some more thoughts about it some day :)

yea, i'd love to hear them.

yea, i'd love to hear them. do you think that there are reasonably easy modifications to drupal that might make it possible, or would it involve re-writing significant amounts of code?

I'm so glad to see you

I'm so glad to see you tackling this topic! I've thought for a long time that there was a lot of potential here, but have never had the right conditions to explore it. I once looked through the code and came up with the following list of possible segmentation targets, which I'll post to prompt further discussion:

These tables are the "core" content and user tables. They belong together:

* access_log
* access
* book
* comments
* files
* file_revisions
* forum
* history
* node
* node_revisions
* node_comment_statistics
* node_counter
* poll
* poll_choices
* profile_values
* sessions
* term_data
* term_node
* users
* vocabulary
* vocabulary_node_types
* watchdog

It's unfortunate that these join against node and node_revisions. They belong in the group above but I wish they could be isolated:

* search_dataset
* search_index
* search_total

These three tables must appear in the same database together:

* role
* users_roles
* permission

These tables could theoretically each appear in their own database as they are never used in JOIN queries.

* authmap
* blocks
* boxes
* cache
* client*
* contact
* filters
* flood
* menu
* sequences
* system
* variable

ask and you shall receive!

ask and you shall receive! :) thanks robert for your breakdown. as you point out it's a shame that the search tables are intertwined with the node tables ... but it would still be possible to replicate these to a (possibly clustered) dedicated search database and direct all read-only search queries there, since the replication lag problem isn't a big issue for search.
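
to sketch what that could look like inside choose_server in databaseSegment.lua (SEARCH_SERVER and a fourth --proxy-backend-addresses entry are hypothetical additions; the general server moves to the fourth slot so the lookup table stays contiguous):
local SEARCH_SERVER = 3   -- hypothetical dedicated search backend
local GENERAL_SERVER = 4  -- the catch-all default moves to the fourth backend

-- register the read-only search queries as their own segment
local search_server_strings = { "FROM search_dataset", "FROM search_index",
                                "FROM search_total" }

local server_table = { [CACHE_SERVER] = cache_server_strings,
                       [LOG_SERVER] = log_server_strings,
                       [SEARCH_SERVER] = search_server_strings }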

I ran Drupal on MySQL

I ran Drupal on MySQL Cluster for a short time. With the addition of MySQL Proxy you in theory have a great scale-out strategy. But MySQL Cluster does have limitations. For one, it doesn't like the heavy use of joins that Drupal makes.
