"I can accept failure, but I can't accept not trying."
"I can accept failure, but I can't accept not trying."
If you build quality sites that attract a large number of visitors and interaction there eventually will come a point when you have to start looking for ways to offload your files and bring down your server overhead. I have been looking into the CDN issue off and on for the past 6 months. Recently I decided it was time to get something dialed in and move forward. I wanted something that required the least number of hacks and was easy and scalable. This post isn't meant to be an end all to Drupal and CDNs, but rather just some insight into the way I have tackled this issue for the time being.
There are a number of options to choose from and a lot of different ways to go about it. You could get a new server locally and load balance your stuff, you could get a new server locally and use it as a static file server, you could team up with a big time CDN like Akamai or Limelight and go that route, or you could go the less expensive CDN route with something like Amazon S3.
Note: This article doesn't attempt to explain every little detail of what's going on, but rather act as a guide for a developer to work off.
Being somebody that is cost conscience and wanting to always try to get the most bang for my buck the idea of a CDN like Limelight didn't quite seem like an appropriate fit for me. I wanted something a little bit less serious that I could ease into. I don't care about having servers all across the world and getting content to people a few milliseconds faster than other solutions. I just want to take a huge load off my local network and put that load somewhere else. If my site is up and the users are happy, I'm happy.
I also don't care if 100% of my site files are not offloaded to the CDN. If I can take care of 95% of my load and leave 5% on my main box it really doesn't bother me. My main goal is to get rid of the majority of the overhead and keep everything scalable and dynamic with a relatively small code footprint.
Take my site www.gunslot.com for example. If you hit the homepage you will see that there are a lot of thumbnails and image requests. Each page probably has at least 20 image requests, some as much as 50. So I would rather take care of these requests first and then I can worry about the 5 or so CSS and JS requests later if I have to.
And then I have the video content which can be large in filesize and taxing on the server. This also had to go.
Another thing to keep in mind is that I don't want to make my CDN act as my local file system. I don't want people to upload straight to this and only have this as my main system, I just see too many errors and bugs going this route. I would rather just have everything work normally and smoothly through the default local structure and just copy stuff over to the CDN, which brings up an important point.
Just how should you copy your content over to the CDN? Big time CDNs like Limelight allow HTTP synchronization (which Ted Serbinski talks about in his article) which basically copies your files over automatically. S3 does not offer this type of functionality so you will need to go another route.
You could simply copy them all at once programatically and call it a day. You could also maybe set up a cron and copy any new files every few minutes. But how do you know which files are new? How do you know which files get updated and which don't? You could run some PHP scripts like a
file_get_contents to check the file's last modified time, but this has some big overhead (I tried). So when is the best time to copy the file and how should you do it? You most likely want to get the file over to the CDN as quick as possible, but at the same time you want to make sure if the user updates the file or something is changed your CDN reflects that.
I chose to go with a hybrid type approach. I basically send a file over to the S3 every time it is requested, if it is not already there, or if it is newer than the current file.
I have one main routing function that when called will run through the flowchart below and figure out what path to return for any given file, either:
So basically my flowchart goes something like:
So if any of you have messed with S3 you know that you are going to need a PHP class first to put your stuff there. I found a decent class off their forums and went with that (it requires PEAR, unfortunately). At one point I'll probably change this to something better if I find one, but it gets the job done for now.
This is just a quick little function that basically checks if the filetype is listed in an array of allowed filetypes that I have chosen, like: jpg, jpeg, gif, png, flv, mp3, etc.
So this is one of the most important steps that I took to make this possible. Rather than actually checking if the file exists on S3 with one of the S3 class functions (slow) or via something like
file_get_contents (slow) I went with a database table on my own server that keeps track of what is on the S3 server (fast). This table keeps track of every filepath and when it was created and changed on S3. I am able to use the changed timestamp from my table to check it against the local file's timestamp (
filemtime) to know when the file needs to be updated on S3.
Originally I didn't queue the file at all and just tried to put the file to S3 every time it was requested - mistake. It obviously took waaaaaaay to long to render my pages with this overhead so I opted for the queue method. I created another database table that keeps track of every file that needs to be put to S3. Basically every time a file doesn't exist on S3 I return the local path for the time being and add this file to the queue.
As for putting the file to S3 there are a lot of ways to do this and it probably depends a bit on your situation. You could run a cron or some type of rsync if you really wanted to dial it in, but for the time being I am going for a bit more simpler method. I simply run 1 put operation at the end of each page request by the users. This seems to work really well right now and gets the files uploaded pretty much within seconds of when they are queued. At any time I have less than 20 or so files in the queue, and obviously once most of my files are on S3 this doesn't need to run anymore. I run my put function using
hook_exit and I pick 1 file each time and get it over to S3.
So once you have all this ready to go how do you actually get Drupal to replace the current local paths with the S3 paths? Well, there are a number of ways to do this. If you want to replace all your images, CSS, logos and all that stuff you can patch
theme.inc (maybe more) to run through your routing function. Since I don't care too much about this stuff I skipped this part and decided just to replace imagecache and my videos (for now).
Imagecache comes with a sweet
theme_imagecache function that allows you to simply over-ride it in your
template.php. You basically just need to tweak the
$imagecache_url to run through your routing function and decide which path to return.
If you are using Imagecache for profile pics or nodes you may need to flush the imagecache when these are updated so that your new file will run through the queue and be uploaded to S3. This can be done via some
hook_nodeapi calls that run the
My videos are done through a custom module so I can't really offer any help here. I just plugged my routing function into it.
As you can see for each file request that is going through the routing function it will always check to make sure the newest file is on S3. If it's not it just pulls the local file for the time being. As soon as the S3 file is ready it will start pulling that one. If you do an update it will pull the new local file until the new S3 file is ready again.
That's basically how I'm doing it right now and it's working really well. My server is thanking me and even with fairly high traffic the S3 costs are very reasonable. If anyone has any feedback or tips on how I could make this better lemme know!Filed under: Internet / Tech, Drupal, S3, File Server, CDN