Friday, November 21, 2008

HTTP cache optimization

You know what else Twitter is good for? Quotations. As good friend and colleague, Piyush Ranjan said:

Recession is [a] good time for innovation. Best innovation happens in crunch situations. Otherwise people just throw money at the problem!

At Cleartrip, as with any other company, we're always looking for ways to reduce our costs. Now with the travel industry in India hit hard, and the general economic slowdown, it's very interesting for us techies to innovate in ways that will help us cut costs.

One the recent things we've been working on, and the stuff I want to talk about here, is the area of bandwidth savings. The benefits of bandwidth savings are usually twofold. Firstly and obviously, it reduces our bills. Secondly, it usually manifests itself as a better experience for the user, since he doesn't have to wait as long for stuff to transfer from our servers to his browser every time he moves between pages.

Now, our site is a fairly dynamic site. Almost every page in the site (well, the significant ones at least), are so dynamic, they cannot even be cached for a couple of minutes. So, caching HTML doesn't suit us. What we can cache is everything else - images, CSS and JavaScripts. Also, since we try to adhere to web standards as much as possible while being as Web 2.0 as possible (whatever that means), our use of CSS and JavaScripts is pretty heavy.

Phase 1

So, a couple of months ago, we made our first set of optimizations. We noticed that our images change very rarely - some of them hadn't changed since we first made them. There was no point sending down copies of the images to our users frequently. These could be cached pretty aggressively. We decided on an arbitrary time of 1 month for the caching of images.

Next came the JavaScript and the CSS. This posed a considerable problem. We keep doing JS and CSS changes very frequently on our site - as frequently as a couple of times a day. We couldn't afford to have them aggressively cached at the client end. We needed the agility to update our users' cache with the new versions of our code. Again, arbitrarily, we decided that the cache time for JS and CSS files would be 2 hours. That way, it wouldn't be in users' cache very aggressively, while remaining in his cache for about the length of his usage session. At the same time, it would take at most 2 hours for our changes to propagate to all users.

So, with this configuration change rolled out (it's not at all hard to do if you understand caches well) we sat back and monitored our savings. Turns out, the savings were pretty amazing. Hrush, our founder, even blogged about how our data center providers were surprised to the extent that they said that it was "just not possible" that we have such low bandwidth consumption for the traffic we get!

The problems

How ironic our data center guys said that, because just around that time, we were looking at how we can optimize bandwidth usage even more. And why did we decide to optimize further?

  • Because you can
  • Some of our files never changed at all - core JS libraries, for example. Why bother downloading them again after 2 hours?
  • Because this setup was hard to work with. Remember I said we patch stuff to production very frequently? That it takes two hours to propagate meant that if it was a bug fix, it might be broken for two hours even after we've put up the patch.
  • Because it gets very hard to make changes that have dependencies with other resources. For example, if the JS needs certain CSS changes, and certain markup changes, in what order do you put the changes up on the server?

This usually meant that we had to plan in advance and we would still have some backwards-compatibility cruft in our JS to manage this 2 hour transition. Sometimes it would take an entire day to make even a minor patch since it had to be batched with breaks of two hours. One quick-fix would be to rename the file and all references to it, but it's easy to see how painful that is to do on a file-by-file basis, not to mention that we'd run out of file names pretty soon!

Phase 2

One thing for sure, though. If we could somehow manage the file name issue, we can solve all the problems I listed out. That's because every time we made a patch with a different file name, it would instantly be available to all users, irrespective of their cache status. This gives rise to another interesting possibility: we didn't have to restrict ourselves to the 2 hour window anymore. We could potentially increase our cache expires time to 20 years for all that matters. That should give us MUCH more saving than we get currently, and our users will download as little as necessary.

Except, we didn't know yet how to solve the file name problem gracefully. Ideally, there shouldn't be a human being deciding which files have changed, what the file names are, and going all over the place and changing them. That would make it tedious and error prone, not to mention boring. Then of course you have to think about how these file name modifications would work with source control. You don't want to mess up your clean source code management history.

The solution to the source control mess thing was rather easy. We fake the file name. Here's what we do. We take a file name and append a random number to it. This will make the client believe that the file name has changed, and will negotiate with the server for the file. Meanwhile, at the server, we could have a rewrite rule that transforms the file name to something that maps to a real file on our server. Sounds simple enough. Tried it, and it worked like a charm.

Now, to crack the real problem - how do we generate these numbers in a sensible way. The number essentially had to be such that it would never repeat (at least for any given file), it would be global in that two pages shouldn't be using different numbers for the same resource, and when the number changes it should instantly reflect site wide. Now that we knew how the number should behave, the quest was on to come up with a mechanism to generate these numbers.

The solution

After a lot of thinking, we had a shameful duh moment, and it suddenly all made sense. We didn't need to invent these numbers at all. We just needed to use our source-control revision numbers! The revision numbers match all the characteristics of the number we want. Why bother with complex systems to generate and track numbers when it was already available, even if very disguised.

I'll save you the implementation details about how we made this available site wide, and how we made it possible to have instant global changes to this number. That definitely wasn't the tough part, and I'm sure you'll figure out the details. Who knows, maybe Piyush might just release a plugin for Rails to do it automatically for you. However what surprises me is that it's very hard to find such gems of knowledge on the net. I'm now beginning to think that maybe this design pattern should be used for distribution of all static resources on the web. We're definitely not the first to invent this pattern - why is no one else talking about it?

We've only just rolled this out on cleartrip.com and are still to get a decent sample of data to see how it has impacted our bandwidth consumption. But any fool could guess that our bills should reduce significantly with this change.

2 comments:

Enygmatic said...

This is over an year late, but nevertheless couldn't you have also used the file change timestamp to get unique file names ?

Rakesh Pai said...

Enygmatic,

I believe timestamps is the approach that Rails takes.

It can be done with timestamps just fine, but some thought will need to be put into it. For example, if your deployment mechanism destroys the old files and puts the new ones there (like a delete/copy or a overwrite), the timestamps will not reflect the actual last-modififed time. This means that all files will be marked as new, and the client will need to re-download them. To mitigate this, you'll need to do incremental copies, which is difficult because of all the book-keeping.

So, it works, yes. But one will need to be sure s/he knows s/he's doing.

ShareThis