CloudFlare alters conditional requests from web crawlers

I’ve been using CloudFlare on this site for quite a few months. On the whole I was happy with it: it certainly filtered out a large chunk of visits from spambots and other nasties. There are some quirks, however. Here are two conditional requests I made to my home page, along with the response headers (I’ve removed some headers that are not relevant):

Request 1:

GET / HTTP/1.1
Host: rayofsolaris.net
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
If-Modified-Since: Thu, 02 Aug 2012 05:50:55 GMT
Cache-Control: max-age=0

Response 1:

HTTP/1.1 304 Not Modified
Server: cloudflare-nginx
Date: Fri, 17 Aug 2012 07:18:50 GMT
Cache-Control: max-age=604800
Expires: Thu, 09 Aug 2012 05:50:55 GMT
Last-Modified: Thu, 02 Aug 2012 05:50:55 GMT

* * *

Request 2:

GET / HTTP/1.1
Host: rayofsolaris.net
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1
If-Modified-Since: Thu, 02 Aug 2012 05:50:55 GMT
Cache-Control: max-age=0

Response 2:

HTTP/1.1 304 Not Modified
Server: cloudflare-nginx
Date: Fri, 17 Aug 2012 07:20:13 GMT
Cache-Control: max-age=604800
Expires: Thu, 09 Aug 2012 05:50:55 GMT

The only difference between the two requests was the User-Agent header. In the first request I spoofed Googlebot’s user agent.
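For anyone who wants to reproduce the comparison, here is a rough Python sketch of how the two requests can be made. The hostname and If-Modified-Since date are the ones from the requests above, and the responses you get back will of course depend on what is in CloudFlare’s cache at the time:

import http.client

def conditional_get(user_agent):
    # Send a conditional GET and print the status line plus the
    # caching-related response headers.
    conn = http.client.HTTPConnection("rayofsolaris.net")
    conn.request("GET", "/", headers={
        "User-Agent": user_agent,
        "If-Modified-Since": "Thu, 02 Aug 2012 05:50:55 GMT",
        "Cache-Control": "max-age=0",
    })
    resp = conn.getresponse()
    resp.read()  # drain any body so the connection closes cleanly
    print(resp.status, resp.reason)
    for name in ("Server", "Cache-Control", "Expires", "Last-Modified"):
        if resp.getheader(name):
            print(name + ":", resp.getheader(name))
    conn.close()

# Request 1: Googlebot's user agent
conditional_get("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
# Request 2: an ordinary browser user agent
conditional_get("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1")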

Nothing out of the ordinary here. Both requests received identical 304 Not Modified responses from the CloudFlare proxy. But the weird bit is that my web server (the one that the CloudFlare proxy connects to on behalf of the client) saw very different requests. Here are the log entries for the two requests:

Request 1:

[17/Aug/2012:08:18:50 +0100] "GET / HTTP/1.1" 200 1638 "-" 
	"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Request 2:

[17/Aug/2012:08:20:13 +0100] "GET / HTTP/1.1" 304 0 "-" 
	"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1"

Notice that with the first request, my web server returned a 200 response, i.e. it delivered the entire content of the web page. With the second request, it delivered a 304 response (and no content). In both cases, the client received a 304 response. So the CloudFlare proxy is doing something funky in between.

After some investigation, I found that when it sees a Googlebot (or other crawler) user agent, CloudFlare strips the If-Modified-Since header from the request before passing it on to the web server. So my server always receives a non-conditional request. This means that all crawler visits are given a full 200 response by my server (even though that may not be what the crawler ultimately receives from the CloudFlare proxy), and that my server is sending out a whole lot more data than it needs to. This is particularly wasteful for items that are fetched frequently, such as feeds being retrieved by Feedfetcher.
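To see why stripping the header matters, here is a rough sketch of the decision a typical origin server makes when handling a conditional GET. This is only an illustration of the usual If-Modified-Since logic, not my server’s actual code or CloudFlare’s:

from email.utils import parsedate_to_datetime

def respond(request_headers, last_modified, body):
    # A typical origin compares If-Modified-Since against the resource's
    # Last-Modified time and only sends the body if the resource is newer.
    ims = request_headers.get("If-Modified-Since")
    if ims:
        try:
            if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(ims):
                return 304, b""  # not modified: headers only, no body
        except (TypeError, ValueError):
            pass  # unparseable date: fall back to a full response
    return 200, body  # no usable validator: send the whole page

# With If-Modified-Since stripped from the request, the 304 branch can
# never be taken, so every crawler visit costs a full 200 with the body.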

I’m not sure why CloudFlare would be intercepting and modifying requests from crawlers in this way. I’ve asked, and will update this post with their response.

In any case, the point to note is that CloudFlare isn’t necessarily cutting the load on your server — it might actually be increasing it! In my own case, this behaviour actually wiped out any benefit gained from static content caching (in terms of data transfer and processor usage) in the last month, and I’ve stopped using CloudFlare because of that.

Edit (5th September 2012): No response from CloudFlare — the support ticket that I raised has been closed, twice, without anyone addressing it.

Edit (12th September 2012): After reopening the ticket, I’ve been told that the issue has been passed on to the engineering team to investigate, but I have not been given an explanation of why this is happening, or whether it is supposed to happen (I’m pretty sure it’s a bug).