Monthly Archives: March 2010

7 Methods to cache web applications

The best web caching system is the one that allows visitors to use your site or web application without fetching anything from your server ... well almost anything.

By fetching as little as possible your server gets less hits so it minimizes the load and the need to acquire new hardware and complicated setups but this also improves user's experience a lot because the web application will load a lot faster since most files ( scripts, css, images ) are already on her/his disk.

The idea is to set such a high cache expiry time or ( max-age and other parameters ) that they ( browsers ) would not even want to look for newer versions for a long time ( like a year or more)

Here's what I learned recently when trying to optimize a big web application built on javascript  and php:

0) Page analysis

Before you get started get Page Speed or Yslow and do an analysis on your app/site then come back here and see how you can solve the caching problems listed there.

1) High cache age is good but what do you do when your site changes?

You definitely want to force the changes to your users right ?

Answer: version everything.

You may have noticed the way a lot of sites include scripts and css files with a version at the end like : jquery.js?ver=1232442

Here's how this works: The main page that includes this script is not cached so the visitor will load it every time but the browser caches the jquery.js?ver=1232442 url ( because you said so in your web server config ).

Now if you update jquery to a new version all you have to do is modify the url like jquery.js?ver=1232443 in the main page and the browser will know it has to fetch the jquery.js file again because from it's point of view it's a totally different file.

If you can use php in the template that outputs the page you could even do something like:
<script src="jquery.js?ver=<?=filemtime('jquery.js')?> . By doing this you don't have to worry to update the main page when you update jquery.js.

2) CSS/HTML rewriting.

So you do this versioning thing for javascripts and maybe css files but what do you do about images? How do you cache them and still make sure your visitors will always see the latest version?

Your images are referenced from the css files or HTML content. You probably already serve your HTML content through a script ( cms ? ) so you'll have to modify this script to automatically add the versioning string to each image or other static file you want cached.

For css do the same, serve it through a script and before you output modify the image paths or even better, especially if you use multiple css files, write a script that generates one file from all ( it loads faster this way ), does the rewrite, then minifies it, save it, then it compresses it and stores the compressed version so you can serve that when possible. You would have to run this script every time you change something to your css code.

3) HTTP proxies cache differently.

It is believed that most will not cache URLs with query strings in them like jquery.js?ver=122323

A HTTP proxy can minimize the hits on your server by fetching only once and distributing to more then one user but if you want to take advantage of that you have to use a different versioning scheme.

An idea is to insert the version before the file extension like: jquery-122323.js so the URLs would no look "dynamic" anymore.

If you do this and you don't actually want to rename all the files you could use some mod_rewrite rules to redirect anything matching that pattern to the actual files.

4) HTTPS is a different animal

Yeah browsers will not cache content that comes over https because it's considered a security issue. Imagine your app generates a pdf or image with sensitive user info and "says" yeah you can cache it for a year, and the user downloads it in a publicly available computer. The next user will get the same file. Of course this would happen over HTTP too so be careful with what you allow to be cached. The only difference with HTTPS is that the browser will disregard normal caching instructions if the file is served over HTTPS.

Now you would say "why would you even want to send generic scripts, css or images over https?" ...right ... Well you do because if you allow HTTPS access to your app and you don't send everything over HTTPS then the browsers would warn the user that not everything on the page is encrypted. Now some users wouldn't care especially if they know what the warning means or how to check what's not encrypted, but other's might freak out about it.

So if you want to send everything over HTTPS and you want the browser to cache the files you have to set  the header "Cache-control: public" but again....make sure you only set this for static files that are generic for all users.

And if you set Cache-control add the max-age to it otherwise if you only set "public" it might invalidate any other "max-age" set in other headers like "Expires". So the header should look like: Cache-control: public, max-age=31536000 ( cache even for HTTPS and authenticated (HTTP authentication) users for a year )

5) Gzip caching

If you're using apache then it's probably already using mod_deflate to compress static files when talking to browsers that accept deflate as the Content-encoding. This is good as it speeds up page loading a bit but this means that apache it's compressing the same content over and over for each visitor consuming your CPU time. And even if you do caching as mentioned above it will still compress for new visitors. So why not cache the compressed content once and server it to everybody?

To do that you'll have to use mod_gzip . This apache module will negotiate Content-encoding with browsers and if the browser supports it then it will send the compressed file instead of the non compressed one. mod_gzip will do even more, it will pre-compress the files so you don't have to do it yourself and it can figure out by itself when you updated the original file and it will regenerate the compressed version. mod_gzip can really save a lot of cpu time for your server.

6) Caching Dynamic content

This basically means generate static content from your dynamic one and save it on disk ( plain and compressed ... see #5 ) so apache or a script can serve it directly without having to go to the database or compute the results . Wp-super-cache does something like this for wordpress.

Since dynamic content is more likely to change often and it's most likely not referenced from other non cachable pages like images,css and JavaScript you can't set a high cache max age for it so you can't reduce the hits so much.

But if you serve it through a script that can easily ( cheaply ) determine that the content has not changed then that script can issue a "304 Not Modified" response and the browser will know that it already has the content. This may be a lot faster then actually regenerating the dynamic content and sending it to the client.

Here's how to do dynamic content caching in PHP

There's also a lot of caching that can be done at the database server level or before/after talking to the database server ( memcached ) but this is totally different topic.

What else ?

Did I miss anything ? If you know other techniques I'd love to read about them so feel free to hit the comments but not too hard as this blog doesn't do much of the caching discussed here :)

BTW: that big web app I mentioned at the beginning of this post is an email marketing service that I just launched in beta. If you run a blog and you think about sending a newsletter you might want to try it. Beta testers get some nice benefits.