Tag Archives: Web

7 Methods to cache web applications

The best web caching system is the one that allows visitors to use your site or web application without fetching anything from your server ... well almost anything.

By fetching as little as possible your server gets less hits so it minimizes the load and the need to acquire new hardware and complicated setups but this also improves user's experience a lot because the web application will load a lot faster since most files ( scripts, css, images ) are already on her/his disk.

The idea is to set such a high cache expiry time or ( max-age and other parameters ) that they ( browsers ) would not even want to look for newer versions for a long time ( like a year or more)

Here's what I learned recently when trying to optimize a big web application built on javascript  and php:

0) Page analysis

Before you get started get Page Speed or Yslow and do an analysis on your app/site then come back here and see how you can solve the caching problems listed there.

1) High cache age is good but what do you do when your site changes?

You definitely want to force the changes to your users right ?

Answer: version everything.

You may have noticed the way a lot of sites include scripts and css files with a version at the end like : jquery.js?ver=1232442

Here's how this works: The main page that includes this script is not cached so the visitor will load it every time but the browser caches the jquery.js?ver=1232442 url ( because you said so in your web server config ).

Now if you update jquery to a new version all you have to do is modify the url like jquery.js?ver=1232443 in the main page and the browser will know it has to fetch the jquery.js file again because from it's point of view it's a totally different file.

If you can use php in the template that outputs the page you could even do something like:
<script src="jquery.js?ver=<?=filemtime('jquery.js')?> . By doing this you don't have to worry to update the main page when you update jquery.js.

2) CSS/HTML rewriting.

So you do this versioning thing for javascripts and maybe css files but what do you do about images? How do you cache them and still make sure your visitors will always see the latest version?

Your images are referenced from the css files or HTML content. You probably already serve your HTML content through a script ( cms ? ) so you'll have to modify this script to automatically add the versioning string to each image or other static file you want cached.

For css do the same, serve it through a script and before you output modify the image paths or even better, especially if you use multiple css files, write a script that generates one file from all ( it loads faster this way ), does the rewrite, then minifies it, save it, then it compresses it and stores the compressed version so you can serve that when possible. You would have to run this script every time you change something to your css code.

3) HTTP proxies cache differently.

It is believed that most will not cache URLs with query strings in them like jquery.js?ver=122323

A HTTP proxy can minimize the hits on your server by fetching only once and distributing to more then one user but if you want to take advantage of that you have to use a different versioning scheme.

An idea is to insert the version before the file extension like: jquery-122323.js so the URLs would no look "dynamic" anymore.

If you do this and you don't actually want to rename all the files you could use some mod_rewrite rules to redirect anything matching that pattern to the actual files.

4) HTTPS is a different animal

Yeah browsers will not cache content that comes over https because it's considered a security issue. Imagine your app generates a pdf or image with sensitive user info and "says" yeah you can cache it for a year, and the user downloads it in a publicly available computer. The next user will get the same file. Of course this would happen over HTTP too so be careful with what you allow to be cached. The only difference with HTTPS is that the browser will disregard normal caching instructions if the file is served over HTTPS.

Now you would say "why would you even want to send generic scripts, css or images over https?" ...right ... Well you do because if you allow HTTPS access to your app and you don't send everything over HTTPS then the browsers would warn the user that not everything on the page is encrypted. Now some users wouldn't care especially if they know what the warning means or how to check what's not encrypted, but other's might freak out about it.

So if you want to send everything over HTTPS and you want the browser to cache the files you have to set  the header "Cache-control: public" but again....make sure you only set this for static files that are generic for all users.

And if you set Cache-control add the max-age to it otherwise if you only set "public" it might invalidate any other "max-age" set in other headers like "Expires". So the header should look like: Cache-control: public, max-age=31536000 ( cache even for HTTPS and authenticated (HTTP authentication) users for a year )

5) Gzip caching

If you're using apache then it's probably already using mod_deflate to compress static files when talking to browsers that accept deflate as the Content-encoding. This is good as it speeds up page loading a bit but this means that apache it's compressing the same content over and over for each visitor consuming your CPU time. And even if you do caching as mentioned above it will still compress for new visitors. So why not cache the compressed content once and server it to everybody?

To do that you'll have to use mod_gzip . This apache module will negotiate Content-encoding with browsers and if the browser supports it then it will send the compressed file instead of the non compressed one. mod_gzip will do even more, it will pre-compress the files so you don't have to do it yourself and it can figure out by itself when you updated the original file and it will regenerate the compressed version. mod_gzip can really save a lot of cpu time for your server.

6) Caching Dynamic content

This basically means generate static content from your dynamic one and save it on disk ( plain and compressed ... see #5 ) so apache or a script can serve it directly without having to go to the database or compute the results . Wp-super-cache does something like this for wordpress.

Since dynamic content is more likely to change often and it's most likely not referenced from other non cachable pages like images,css and JavaScript you can't set a high cache max age for it so you can't reduce the hits so much.

But if you serve it through a script that can easily ( cheaply ) determine that the content has not changed then that script can issue a "304 Not Modified" response and the browser will know that it already has the content. This may be a lot faster then actually regenerating the dynamic content and sending it to the client.

Here's how to do dynamic content caching in PHP

There's also a lot of caching that can be done at the database server level or before/after talking to the database server ( memcached ) but this is totally different topic.

What else ?

Did I miss anything ? If you know other techniques I'd love to read about them so feel free to hit the comments but not too hard as this blog doesn't do much of the caching discussed here 🙂

BTW: that big web app I mentioned at the beginning of this post is an email marketing service that I just launched in beta. If you run a blog and you think about sending a newsletter you might want to try it. Beta testers get some nice benefits.

XML Sitemaps for Pligg

Update: There is be a new version of this module. Click here to get it.

I created a module that generates XML Sitemaps for Pligg ( the well known open source cms used for creating sites similar to digg.com ).

The module generates a sitemap index and sitemaps with all the stories in the database dynamically so nothing is stored on disk and you don't have to set a cron job to generate it.

The sitemaps are updated automatically when a new story is submitted. Because of the structure of the sitemap index and because it contains "lastmod" info, the search engines should only download the latest entries in the index so you shouldn't worry about the module putting too much load on your system.

There is also a "ping" function that will announce google, yahoo and ask.com every time a new story is submitted so that they know they have to download the sitemap. The ping function required a little patching to pligg source code to add some hooks ( only if you use 9.6, 9.7 should already have those hooks ). Here is the diff file in case you use pligg 0.9.6 : pligg submit hooks diff

The module was only tested on pligg 0.9.6, I haven't upgraded to 0.9.7 yet, so if you try this on 0.9.7 let me know how it works, any feedback is appreciated.

Download:

You can download Xml_Sitemaps module from here: xml_sitemaps-0.1.tar.gz and in case you want to verify it here is the md5 sum and the sha256 sum

the code is released under the same license as pligg, so feel free to use it, modify and share.

Installation:

This is pretty straight forward, you have to install this like any other pligg module, just put the .tar.gz file in the modules, un-archive it then go into pligg admin and activate it. If you use pligg 0.9.6 and you want to be able to ping the search engines don't forget to apply the submit hooks patch .

Configuration:

After installation you should be able to access the sitemap index like this : http://yourdomain.com/module.php?module=xml_sitemaps_show_sitemap or if you want the sitemap to look friendly ( btw ask.com will only accept a friendly sitemap ending in .xml ) , you just have to go into Admin->Configuration->XmlSitemaps and enable "Sitemap Friendly URL", and if you do that then you have to put the following lines in your .htaccess somewhere before the line "##### URL Method 2 ("Clean" URLs) Begin #####" :

  1. RewriteRule ^sitemapindex.xml module.php?module=xml_sitemaps_show_sitemap [L]
  2. RewriteRule ^sitemap-([a-zA-Z0-9]+).xml module.php?module=xml_sitemaps_show_sitemap&i=$1 [L]

Here is how the index looks on a site with sitemap friendly urls enabled: http://sapa.ro/sitemapindex.xml

There are other configuration options in there, you can set the maximum number of stories to put in a sitemap, and you can chose whether to ping any of the three search engines supported. You can also set your yahoo.com key in there if you want to ping yahoo.

That's it! Happy Sitemapping! and as always ... let me know how it works in the comments.

No browser supporting socks5 authentication?

If you're trying to use a socks server with Internet Explorer , Firefox, Opera or Safari everything will work just fine, except for authentication.

From my point of view this is a big problem. Who in the world would leave such a proxy server unprotected? Yeah of course you can always limit access to a proxy server based on ip address, but in some cases ( see NAT ) this is just not going to work.

Internet explorer supports only the socks4 protocol which doesn't even support full password authentication ( only username and it defaults to the current logged in username ) .

Firefox supports socks5 but no authentication mechanism so supporting socks5 is pretty much useless. I think I saw some ticket in bugzilla about this but no one managed to commit a fix yet.

Opera doesn't even support socks protocol but I thought I should mention all major browsers 🙂

Safari supports SOCKS5 and even allows you to set a username and password to access the SOCKS server but it does not use them.

I tried Konqueror, but I was unable to specify the Socks server, I guess this is because it was not compiled with a socks library.  Has anyone had any success with Konqueror and Socks ?

css class names in IE 6

I just realized today that IE 6 class names must not start with something else other then [a-zA-Z]. So don't you dare name your classes  like _class1 cause it will not work in IE6.

It works well in Firefox 2.0, IE7, opera and safari and that will just make it harder when you'll try to discover the problem.

I know this seems like a lame mistake and something any designer should know, but well I'm not really a designer and some things you never know unless they hit you.

TLS for HTTP

On my previous post about wildcard ssl I was complaining that you have to use a different ip for each domain that needs ssl/https and I wondered why there is no TLS feature like there is in SMTPS where you have STARTTLS. Well it seems I was wrong. There is such a feature, actually there are two different features one is described in RFC2817 and the other in RFC3546. Rfc 2817 specifies how a plain text connection can be "upgraded" to a secured connection over SSL:

This allows unsecured and secured HTTP traffic to share the same well known port (in this case, http: at 80 rather than https: at 443). It also enables "virtual hosting", so a single HTTP + TLS server can disambiguate traffic intended for several hostnames at a single IP address

RFC 3546 various extensions to TLS and one of them is an extension for server name indication . This extension will allow a client to tell the server which domain is contacting.
That's just great, but there's one problem. Not only that few web server software implement any of the two rfcs but also few web browsers support them.

Apache implements rfc 2817 in mod_ssl since version 2.1 and mod_gnutls implements the server name indication extension in TLS described in rfc 3546.
It seems that IE7 has support for RFC 3546 and firefox may have support for rfc 2817.

apache and wildcard ssl

Today I had a client that wanted me to install a wildcard certificate on his new server. A small job, few minutes I said. Only that it was not really like that. The client had this situation. He had one domain foo.tld and the certificate was for *.foo.tld and a lot of subdomains bar.foo.tld apple.foo.tld and others like that. On each domain there was a different site and the client wanted each domain on to be available over SSL because he had a wildcard certificate for *.foo.com. But the problem was that he only had one ip on that server. From what I read in the apache documentation such thing would be impossible . It turns out it's not impossible if you have a wildcard certificate. The ssl FAQ specified two workarounds, #1 use different ports ( not really an option in most cases if you're thinking about serious web business ), and #2 use different ips for each vhost, this may be expensive, hard to get from some ISP, or hard to manage if you have hundreds of domains. I think there should be a line there saying that if you have a windcard ssl you will not need different ips or ports.

Continue reading

Hello world!

This is my first post. Wordress is great.

I decided to start a blog about programming, *nix and other things related to web and internet.

Why patchlog?

Because I will be posting and commenting patches to various open source software from time to time, or because i will be expressing my opinion about how some open source software should work and maybe provide patched to make it work like that.