Drupal CDN & Static File Server - The Amazon S3 Way

by Quinton Figueroa on December 1st, 2008

If you build quality sites that attract a large number of visitors and a lot of interaction, there will eventually come a point when you have to start looking for ways to offload your files and bring down your server overhead. I have been looking into the CDN issue off and on for the past 6 months. Recently I decided it was time to get something dialed in and move forward. I wanted something that required the least number of hacks and was easy and scalable. This post isn't meant to be the be-all and end-all of Drupal and CDNs, but rather just some insight into the way I have tackled this issue for the time being.

There are a number of options to choose from and a lot of different ways to go about it. You could get a new server locally and load balance your stuff, you could get a new server locally and use it as a static file server, you could team up with a big-time CDN like Akamai or Limelight and go that route, or you could go the less expensive CDN route with something like Amazon S3.

Note: This article doesn't attempt to explain every little detail of what's going on, but rather acts as a guide for a developer to work from.

My Goals

Being cost-conscious and always wanting the most bang for my buck, I didn't think a CDN like Limelight was quite the right fit for me. I wanted something a little bit less serious that I could ease into. I don't care about having servers all across the world and getting content to people a few milliseconds faster than other solutions. I just want to take a huge load off my local network and put that load somewhere else. If my site is up and the users are happy, I'm happy.

I also don't care whether 100% of my site files are offloaded to the CDN. If I can take care of 95% of my load and leave 5% on my main box it really doesn't bother me. My main goal is to get rid of the majority of the overhead and keep everything scalable and dynamic with a relatively small code footprint.

Which files to offload

With these things in mind I decided that a good route for me was to take care of my two main sources of files: images and videos. If I really wanted to get hardcore I could also move my CSS and JavaScript, but I don't see this as being as important as taking care of the major problems first.

Take my site www.gunslot.com for example. If you hit the homepage you will see that there are a lot of thumbnails and image requests. Each page probably has at least 20 image requests, some as many as 50. So I would rather take care of these requests first and then I can worry about the 5 or so CSS and JS requests later if I have to.

And then I have the video content which can be large in filesize and taxing on the server. This also had to go.

Local file system

Another thing to keep in mind is that I don't want my CDN to act as my local file system. I don't want people to upload straight to it and have it as my only system; I just see too many errors and bugs going this route. I would rather have everything work normally and smoothly through the default local structure and just copy stuff over to the CDN, which brings up an important point.

Synching your content

Just how should you copy your content over to the CDN? Big-time CDNs like Limelight allow HTTP synchronization (which Ted Serbinski talks about in his article), which basically copies your files over automatically. S3 does not offer this type of functionality, so you will need to go another route.

You could simply copy them all at once programmatically and call it a day. You could also set up a cron job and copy any new files every few minutes. But how do you know which files are new? How do you know which files get updated and which don't? You could run some PHP scripts using something like cURL or file_get_contents to check the file's last modified time, but this has some big overhead (I tried). So when is the best time to copy the file and how should you do it? You most likely want to get the file over to the CDN as quickly as possible, but at the same time you want to make sure that if the user updates the file or something is changed, your CDN reflects that.

My S3 Solution

I chose to go with a hybrid approach. I basically send a file over to S3 when it is requested, if it is not already there or if the local copy is newer than the one on S3.

I have one main routing function that when called will run through the flowchart below and figure out what path to return for any given file, either:

Local: http://www.domain.com/files/full/path/myfile.jpg

or

S3: http://s3domain.com/files/full/path/myfile.jpg

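So basically my flowchart goes something like this: check the filetype, check whether an up-to-date copy of the file is already on S3, return the S3 path if it is, and otherwise queue the file for S3 and return the local path.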
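The decision the routing function makes can be sketched in a few lines. This is only an illustration in Python (the site itself is PHP/Drupal), not the actual function: the name choose_url and its parameters are hypothetical, the two hosts are the placeholders used above, and the rule that a copy is good when its S3 timestamp is at least as new as the local one is my assumption.

```python
from typing import Optional

LOCAL_BASE = "http://www.domain.com"  # placeholder host from the article
S3_BASE = "http://s3domain.com"       # placeholder host from the article


def choose_url(path: str, allowed: bool,
               s3_changed: Optional[int], local_mtime: int) -> str:
    """Pick the URL to hand back for one file (hypothetical helper).

    s3_changed is the time the copy was last put to S3, or None when no
    copy is known. local_mtime is the local file's modification time.
    """
    # Assumption: the S3 copy is usable when it is at least as new as the
    # local file.
    if allowed and s3_changed is not None and s3_changed >= local_mtime:
        return f"{S3_BASE}/{path}"
    # Disallowed type, missing copy, or stale copy: keep serving locally.
    return f"{LOCAL_BASE}/{path}"
```

A file whose S3 copy is missing or older than the local file simply keeps its local URL, so pages never point at something that isn't there yet.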

The Technical

S3 Interfacing

So if any of you have messed with S3 you know that you are going to need a PHP class first to put your stuff there. I found a decent class off their forums and went with that (it requires PEAR, unfortunately). At some point I'll probably change this to something better if I find one, but it gets the job done for now.

Filetype check

This is just a quick little function that basically checks if the filetype is listed in an array of allowed filetypes that I have chosen, like: jpg, jpeg, gif, png, flv, mp3, etc.
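A check like this might look roughly as follows. It is a Python sketch rather than the PHP function from the site; the name is_allowed_filetype is made up, and the set holds only the extensions listed above (the real list is longer).

```python
import os

# Extensions named in the article; the real list continues ("etc.").
ALLOWED_EXTENSIONS = {"jpg", "jpeg", "gif", "png", "flv", "mp3"}


def is_allowed_filetype(path: str) -> bool:
    """Return True when the file's extension is in the allowed set."""
    # splitext gives ".jpg"; drop the dot and ignore case.
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    return extension in ALLOWED_EXTENSIONS
```

Files without an extension, or with one that isn't listed (CSS, for example), fall through and stay on the local server.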

If file exists

So this is one of the most important steps that I took to make this possible. Rather than actually checking if the file exists on S3 with one of the S3 class functions (slow) or via something like cURL or file_get_contents (slow) I went with a database table on my own server that keeps track of what is on the S3 server (fast). This table keeps track of every filepath and when it was created and changed on S3. I am able to use the changed timestamp from my table to check it against the local file's timestamp (filemtime) to know when the file needs to be updated on S3.
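To make the idea concrete, here is a small sketch of such a tracking table using Python's built-in sqlite3 module in place of the site's real database. The table name s3_files, its column names, and the helper functions are all hypothetical; only the idea of storing the full path plus created and changed timestamps comes from the setup described above.

```python
import sqlite3


def open_tracking_db() -> sqlite3.Connection:
    """Create an in-memory table recording what has been put to S3."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE s3_files ("
        " filepath TEXT PRIMARY KEY,"   # full path, not just the filename
        " created INTEGER NOT NULL,"    # first time the file went to S3
        " changed INTEGER NOT NULL)"    # most recent put to S3
    )
    return conn


def record_put(conn: sqlite3.Connection, filepath: str, now: int) -> None:
    """Note that filepath was put to S3 at time now."""
    # Insert on the first put; later puts only move the changed timestamp.
    conn.execute(
        "INSERT OR IGNORE INTO s3_files (filepath, created, changed)"
        " VALUES (?, ?, ?)", (filepath, now, now))
    conn.execute(
        "UPDATE s3_files SET changed = ? WHERE filepath = ?", (now, filepath))


def is_current_on_s3(conn: sqlite3.Connection, filepath: str,
                     local_mtime: int) -> bool:
    """True when the table shows an S3 copy at least as new as the local file."""
    row = conn.execute(
        "SELECT changed FROM s3_files WHERE filepath = ?",
        (filepath,)).fetchone()
    return row is not None and row[0] >= local_mtime
```

The lookup is a single indexed query against a local table, which is why it is so much cheaper than asking S3 over HTTP on every request.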

Queuing the file for S3

Originally I didn't queue the file at all and just tried to put the file to S3 every time it was requested. Mistake. It obviously took waaaaaaay too long to render my pages with this overhead, so I opted for the queue method. I created another database table that keeps track of every file that needs to be put to S3. Basically every time a file doesn't exist on S3 I return the local path for the time being and add this file to the queue.
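The queue step could be sketched like this, again with sqlite3 standing in for the real database. The table name s3_queue, its columns, and the function names are assumptions for illustration; making the path the primary key is one simple way to stop a popular file from being queued once per page view.

```python
import sqlite3

LOCAL_BASE = "http://www.domain.com"  # placeholder host from the article


def open_queue_db() -> sqlite3.Connection:
    """Create an in-memory table of files waiting to be put to S3."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE s3_queue ("
        " filepath TEXT PRIMARY KEY,"  # one row per file, however often requested
        " queued INTEGER NOT NULL)"
    )
    return conn


def queue_and_serve_local(conn: sqlite3.Connection, filepath: str,
                          now: int) -> str:
    """Queue filepath for S3 and return the local URL for the time being."""
    # INSERT OR IGNORE keeps the first queue time and skips duplicates.
    conn.execute(
        "INSERT OR IGNORE INTO s3_queue (filepath, queued) VALUES (?, ?)",
        (filepath, now))
    return f"{LOCAL_BASE}/{filepath}"
```

Queuing is just one cheap insert, so the page render no longer waits on a network transfer.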

Putting to S3

As for putting the file to S3, there are a lot of ways to do this and it probably depends a bit on your situation. You could run a cron job or some type of rsync if you really wanted to dial it in, but for the time being I am going with a simpler method. I simply run 1 put operation at the end of each page request by the users. This seems to work really well right now and gets the files uploaded pretty much within seconds of when they are queued. At any given time I have fewer than 20 or so files in the queue, and obviously once most of my files are on S3 this doesn't need to run anymore. I run my put function using hook_exit and I pick 1 file each time and get it over to S3.
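The "one put per page request" idea, sketched in Python with plain in-memory structures rather than Drupal's hook_exit and database tables. The function put_one and its arguments are hypothetical, and upload is a stand-in for whatever S3 class does the actual transfer. Leaving a file in the queue when the upload fails is my assumption about sensible behaviour, not something stated above.

```python
from collections import deque
from typing import Callable, Deque, Dict


def put_one(queue: Deque[str], uploaded: Dict[str, int],
            upload: Callable[[str], None], now: int) -> bool:
    """Upload at most one queued file; return True if one was uploaded.

    Meant to be called once at the end of a page request.
    """
    if not queue:
        return False  # nothing waiting, so the request ends with no extra work
    filepath = queue[0]
    # If upload raises, the file stays at the front of the queue and is
    # retried on a later request (assumed behaviour).
    upload(filepath)
    queue.popleft()
    uploaded[filepath] = now  # remember when the copy landed on S3
    return True
```

Capping the work at one file keeps the cost added to any single request small and predictable, and the queue drains at the same rate the site gets traffic.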

Routing the files with Drupal

So once you have all this ready to go how do you actually get Drupal to replace the current local paths with the S3 paths? Well, there are a number of ways to do this. If you want to replace all your images, CSS, logos and all that stuff you can patch common.inc, file.inc and theme.inc (maybe more) to run through your routing function. Since I don't care too much about this stuff I skipped this part and decided just to replace imagecache and my videos (for now).

Imagecache

Imagecache comes with a sweet theme_imagecache function that you can simply override in your template.php. You basically just need to tweak the $imagecache_url to run through your routing function and decide which path to return.

Notes

If you are using Imagecache for profile pics or nodes you may need to flush the imagecache when these are updated so that your new file will run through the queue and be uploaded to S3. This can be done via some hook_form_alter or hook_nodeapi calls that run the imagecache_image_flush function.

Videos

My videos are done through a custom module so I can't really offer any help here. I just plugged my routing function into it.

So that's it!

As you can see, each file request that goes through the routing function is checked to make sure the newest file is on S3. If it's not, it just pulls the local file for the time being. As soon as the S3 file is ready it will start pulling that one. If you do an update it will pull the new local file until the new S3 file is ready again.

That's basically how I'm doing it right now and it's working really well. My server is thanking me and even with fairly high traffic the S3 costs are very reasonable. If anyone has any feedback or tips on how I could make this better lemme know!

 Filed under: Internet / Tech, Drupal, S3, File Server, CDN

About The Author

Quinton Figueroa

El Paso, Texas

I am an entrepreneur at heart. Throughout my whole life I have enjoyed building real businesses by solving real problems. Business is life itself. My goal with businesses is to help move the human ...

16 Comments

shaal: cloudfront

Amazon recently announced Cloudfront - their own CDN based on the S3 service.

Are there any benefits to working directly with S3, as you described, rather than working with Cloudfront?

And would you know if there are any modules that work automatically with S3 or Cloudfront?

Quinton Figueroa: I have not experimented with
@shaal

I have not experimented with Cloudfront yet, although I may in the future. From my understanding the only real difference is that it acts more like a true CDN, as it delivers content from 14 locations throughout the world rather than the 3 locations that S3 provides. It works entirely off S3, and all you have to do is mark your content to be distributed from Cloudfront rather than S3. I think it's a great option if you are looking to cut down on latency and get your content out even faster than S3.

There are a few modules out for S3 but I'm not sure how complete they are. I think I checked them out a few months back and decided to write my own functionality, but they may work for what you need.

Jason: Nice overview

Thanks for the nice clear overview of your approach and decision making. Very helpful. I'm glad I spotted it on the Planet Drupal feed. And I kind of want a Glock 19 now too...

Quinton Figueroa: Thank you. Good choice, the
@Jason

Thank you. Good choice, the Glock 19 is one of my favorite handguns :).

Felisite: Hi, thanks for the very nice

Hi, thanks for the very nice article!

I'm currently working on a configuration similar to yours and I was wondering: aren't you afraid that you'll still hit a bottleneck with this solution eventually?

I get it that you don't want S3 to fully replace your local file system, but then you end up with 2 copies of each file: one on your local file system, and another one on S3.

Now say you have millions of files on your site (let's hope you have someday ;-)), that means millions of files on your local webserver and millions of rows in the Drupal "file" table. This could be a problem even though those files are not requested directly (i.e. a single directory containing millions of files will be slow to read, a MySQL table with millions of rows will be slow to read...).

It's certainly convenient to have the data (binary + metadata) locally in Drupal when editing the nodes, but wouldn't it be even more scalable to have everything on AWS (e.g. S3 + SimpleDB).

It'd certainly be much more difficult to implement but I'd be curious to have your opinion. Thanks!

Quinton Figueroa: Thank you for the comment. I
@Felisite

Thank you for the comment.

I am not too familiar with SimpleDB so I can't really speak for that.

As for having the files on both servers, I definitely see how this could be an issue if you had a lot of files, and at that point it may be worth changing the process around a bit, since my method probably wouldn't fit this model as well. However, I know there are some fairly large sites that do keep files both locally and on the CDN. So I don't see it as a super big deal, even with millions of files. Right now I have about 40,000 files on S3 and have no problem moving that number close to a million. I don't really view the CDN as a permanent storage location but rather a temporary server of files. At any time I may delete all the files on S3 and rebuild the list. That's just my view on it.

And to address your single file directory concern I agree with you, however, my method is independent of your file directory structure. So you can split a million files between as many directories as you like and it will mirror that same structure over to S3.

So you could have:

files/images/34.jpg
files/imagecache/thumbnails/cool_file.jpg
files/videos/myvideo.flv

which will copy over no problem since the database stores the entire full path and not just the filename. This is also super helpful if you have multiple sites pulling from different file directories (which I do).

As for the DB issue with lots of records that is mainly just up to whatever works best for you. This file method should work with however you want to store your files in a DB. Hopefully if you reach a million files you will have the funds to scale your DB to multiple servers or something smart.

Hope that helps.

Jo Wouters: Make sure you rebuild the internal table

Since S3 (only) guarantees a 99.9% uptime (http://aws.amazon.com/s3-sla/) it might be useful to add a way to 'recover' your internal S3-locations table from an S3 failure (or a failing transfer).

Once in a while the script could send out a "GET bucket" request, receiving up-to-date info about all the S3 files in order to rebuild the internal table with up-to-date information.

You could also use this when you receive a lot of files (or when you have to recover from an S3 failure): first you do an s3sync (http://s3.amazonaws.com/ServEdge_pub/s3sync/README.txt) and then you run the rebuild script.

J-P: File on S3 / Rewriting all URLs?

Hi,

Great article. It sounds like you've got a fairly sturdy stack there, even if bits of it are a bit bleeding edge. Do you have any ways of monitoring how successful it is e.g. how many false positive and negative answers to the question "is this on S3?"

Also:

1. Between "If file exists" and "Queuing the file for S3", you seem to have a bootstrapping problem. How does a file get into the table in the first place? Do you check just once to see if it's on S3, then stick it in the table and on the queue?

2. For URL rewriting, any reason not to use custom_url_rewrite_outbound, and rewrite across all (well behaved) theming?

content writing: Great Post

Really I am impressed by this post. The person who created this post is a great human. Thanks for sharing this with us. I found this an informative and interesting blog, so I think it's very useful and knowledgeable. I would like to thank you for the efforts you have made in writing this article. I am hoping for the same best work from you in the future as well. In fact your creative writing abilities have inspired me.

Ashish Kumar: Nice post

Great post - I love Drupal and we have also tried this, this is a great post for us to point to the customers for better explanations of what we are trying to do and achieve.

Heshan: How about the

How about the stream_wrapper? would that be the best solution?

Kevin: MaxCDN Drupal CDN with plugin

Very nice write-up on using Drupal and S3 with your CDN. Another good option to check out is MaxCDN's Drupal CDN. With the Drupal plugin, the cdn integrates quickly and easily and works seamlessly with S3 as well.

Todd Shaffer: Nice rundown on Drupal + Amazon S3 CDN

Good to see someone posting information in plain English. Well done sir!

trieskaQ: I would welcome more details

I would welcome more details on this issue.

ali: good

nice
