Purpose of the entry
On Saturday June 13th 2009 I attended a talk by Eli White on Scalable web applications. Eli White previously worked at digg.com and now holds the position PHP Community Manager & DevZone Editor-in-Chief at Zend Technologies. When you hear him talk you immediately notice he has a high knowledge on programming, good practices and he is eager to distribute his knowledge and does it with passion. That explains his position at Zend. Eli gave for what I am concerned the best talk at the conference. Of course I didn’t attend all talks but I can hardly imagine a talk more interesting than this. He gave valuable tips on how to implement new projects with small tips. He showed you how you could already take some small steps that would lead to great benefit if your application would turn out to be mass consumed. The following entry is my interpretation of the talk itself.
Overview
- What is scalable application design
- Tip 1: load balancing the webserver
- Tip 2: scaling from a single DB server to a Master-Slave setup
- Tip 3: Partitioning, Vertical DB Scaling
- Tip 4: Partitioning, horizontal DB Scaling
- Tip 5: Application Level Partitioning
- Tip 6: Caching to get around your database
- Resources
- Closing notes
What is scalable application design
Scalable application design first means, there is a design for an application. So there is a start on the application itself. It is no use designing and thinking on how to create your killer application by trying to fit in as many good practices as possible, so that in the end you feel discouraged in even beginning to program. First step in creating an application is actually beginning. Do not overkill the project with great ideas which will take forever to implement or require you to study till you drop. Does this mean you do not have to study or apply good design principles? Of course not, you should on a permanent basis. It should become a habit, a second nature. It just means do not overdo it in the way you will never begin the project at all. To make it easy for you Eli gave some practical tips which you can apply immediately without any problems and which doesn’t require any real study or understanding of design principles.
Tip 1: load balancing the webserver
Load balancing is the act of dispatching a user his request towards one server that is part of a server farm. This way the load (generated by user requests) is distributed amongst several servers and your application will be able to respond quicker. There are some caveats to keep in mind when coding for a system that will eventually have load balancing.
Do not program depending on a cache that is written to the filesystem, do not rely on the filesystem at all. The reason why is straight forward. Because the requests are distributed amongst the servers the user his second request might end up on a different server where the cache is not available. So if you are using local caching (APC / Zend Server) avoid assuming exclusive/single cache. Most of us start with only having one server, so it might be a good idea to encapsulate the sessions, cache reliabilities. So if you switch to load balancing it is only a matter of changing the code in the encapsulation.
Tip 2: scaling from a single DB server to a Master-Slave setup
class DB {
private $master;
private $slave;
private static $cfg = array(
'write' => array('mysql:dbname=MyDB;host=213.136.52.29'),
'read' => array('mysql:dbname=MyDB;host=213.136.52.30',
'mysql:dbname=MyDB;host=213.136.52.31',
'mysql:dbname=MyDB;host=213.136.52.32');
);
public function __construct()
{
$this->master = $this->_getConnection('write');
$this->slave = $this->_getConnection('read');
}
public function query($query, $pool)
{
if('master' == $pool) {
return $this->master->query($query);
} else if('slave' == $pool) {
return $this->slave->query($query);
}
}
private function _getConnection($pool) {
$max = count(self::$cfg[$pool]) - 1;
$dsn = self::$cfg[$pool][mt_rand(0, $max)];
return new PDO($dsn, USER, PASS);
}
}
Adopting a Master-Slave setup is not available for the start-up developer. It is a costly thing, and most of us have a shared hosting account or one dedicated server. Does that mean you should not already implement it? Of Course not. You should already accommodate your code for this setup. And it can be done with little extra coding. Create an extra layer which can instantiate connections to different DB servers. As many as you want for Master and as many as you want for Slaves. Implement your layer that you have a query method which accepts an extra parameter. Everytime you query the database for a write action (insert, delete or update) you pass the extra param as ‘master’ and for all reads (select) ‘slave’. Then in your layer delegate the query to the specific server: master or slave. I only have one dedicated server so In my code I make sure both ‘master’ and ‘slave’ point to the same database connection. But as soon as my website gets heavy traffic I can add another DB server, and all I have to do is change one method and that is it. All my write queries go to the master which is replicated to the slave and all my select queries are done on the slave. You just made your application more scalable.
Is it really that simple? In fact it is.
$db->query('update article set comments += 1', 'master');
$db->query('select comments from articles', 'slave');
Of course you have to take several other issues into consideration. One of those is slave-lag.
Slave-lag is the time needed between an Insert, Delete, Update query on the master and the replication towards the slave. We are speaking miliseconds here, but it could be enough for you to mismatch the latest ID from your select statement. Write your code that it is not dependent on lag. Lag from any matter because if you really start to scale bigtime, you will scale over several data centers around the planet and then replicating one farm to another can take seconds, minutes…
Tip 3: Partitioning, Vertical DB Scaling
Sometimes tables can become to big when you talk about the number of columns they have. That will mean that with every action on the database the query will experience overhead. When a table of 15 columns has to be searched on one field and only 2 columns should be returned almost every time then it can make sense to split up the data in multiple tables. This will decrease the size of the tables and thus also the time and memory needed to query them. You can vertically partition your tables when one of the following situations presents itself:
- Columns that are rarely used can move to their own table
- Columns that are often empty
- And columns that are not used in the where clause
Tip 4: Partitioning, horizontal DB Scaling
A table can have too much columns, but a table can also reach a point that it can have to much rows. In this case it might be wise to split the table in rowsets.
The splitting can be done on several ways. Here are the more common ones.
- Range Based
- Date Based
- Interlaced
- User Based
Depending on your application you can use any of the methologies described above. For a news site you might want to give users access to the archive with articles of last year. Mostly we only access the ones from this year. Date based would be a good way to go for such an application. Range based is when you have for instance 8 million users in your database and want to split it up per one million rowsets based on the userId or any other primary key. Interlaced means every row is switched between tables. First row goes to the first table, second to the second table, third to the third, fourth back to the first table and so on. And last but not least is user based. You can split upon username alphabet or any other characteristic, but remember all this horizontal scaling must be retrieved from your application code. So think hard on how you want to split. It is to no use if you split it up and you still have to query each and every table to find the data you are looking for. So put some thought in it, you need to be able to map it later on from your code.
Tip 5: Application Level Partitioning
Application level partitioning involves moving various tables of your DB onto different servers. These can be single tables or related tables so you still are able to join them in your query.
Tip 6: Caching to get around your database
Your performance can get a real boost with the right DB result caching. Always try to Write through cache in as close to the final processed form possible but choose small, discrete, reusable data units. Of course you should not be storing data you can’t recreate. Take special note on the discrete, reusable data units. These are units that can be re-used by other parts of your application or by the same part but on different requests. This way caching can be shared and no (extra) calls need to be made to the database.
For example Facebook has millions of users and those millions of users can keep track of hundreds of friends. If every status update would have to be queried from a database the application would be painfully slow. Facebook solved this issue with reusable data units. When you have 100 friends in your list facebook creates 100 small cached units for every friend individually. If one of your friends also has 100 frends and shares 50, Facebook can reuse 50 units and get the rest from the database. Another user might have 75 friends, 40 the same as you and 25 the same as your friend. 65 data units can be reused and only 10 have to be extracted from the database. AS you can see this way you can avoid a lot of calls. What happens when a user makes a status update? Well the status update is written towards the master Db and to the cached data unit of that specific user. No query is needed to retrieve the data. All application parts that use that unit are updated automatically. Only when the cache would turn out invalid a select query is done. I hope you see the real power in this kind of caching.
Resources
Closing notes
Most bottlenecks in the next hype is at database level. Distributing your data in databases accross multiple database servers is often the best and only solution. The problem is if you have to rewrite all your code to accommodate these changes across multiple servers it might take a lot of budget. Which can be easily prevented.
Of course you do not yet have these scaling problems. Hopefully you eventually will and that means you should already encapsulate your code that will handle the queries. By encapsulating you have a single spot where you will have to make the necessary changes (if any, because you could start developing with all these tips and tricks already in mind and implement them). For small applications the overhead that you create with for instance vertical partitioning is small. But the benefit is huge when your application is the next big thing. The caching is something you could already implement to the full extend.
Now it is up to you!
Have fun.
Nick Belhomme