HTTP Cache Validation
- posted: 2025-04-18
I recently worked on getting this website - hosted on sdf.org - indexed by Google, and noticed that the Google crawler properly respects HTTP caching rules [1]:
Google's crawling infrastructure supports heuristic HTTP caching as defined by the HTTP caching standard, specifically through the ETag response and If-None-Match request header, and the Last-Modified response and If-Modified-Since request header.
If both ETag and Last-Modified response header fields are present in the HTTP response, Google's crawlers use the ETag value as required by the HTTP standard. For Google's crawlers specifically, we recommend using ETag instead of the Last-Modified header to indicate caching preference as ETag doesn't have date formatting issues.
In short, setting the ETag response header or the Last-Modified response header allows Google crawlers to determine whether the content of a specific web page has been changed. But what are these two response headers and how do they work?
When a client requests a given URL for the very first time, the web server responds with the requested resource along with an ETag header, a Last-Modified header, or both. For example:
    HTTP/1.1 200 OK
    ...
    ETag: "33a64df5"
    Last-Modified: Tue, 22 Feb 2022 22:00:00 GMT
On subsequent visits to the same URL, the client includes the previously received ETag in the If-None-Match request header, or the previously received Last-Modified in the If-Modified-Since request header, or both of them. For example:
    GET /index.html HTTP/1.1
    ...
    If-None-Match: "33a64df5"
    If-Modified-Since: Tue, 22 Feb 2022 22:00:00 GMT
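To make the revalidation request concrete, here is a minimal Python sketch of how a client could attach the cached validators. The helper name build_conditional_request and the example.org URL are my own placeholders, not anything from sdf.org or Google; the request is only constructed, never sent.

```python
from urllib.request import Request


def build_conditional_request(url, etag=None, last_modified=None):
    """Return a GET request carrying whichever validators were cached."""
    headers = {}
    if etag is not None:
        # Echo the ETag exactly as received, including the double quotes.
        headers["If-None-Match"] = etag
    if last_modified is not None:
        # Echo the Last-Modified value verbatim; no need to re-format the date.
        headers["If-Modified-Since"] = last_modified
    return Request(url, headers=headers, method="GET")


# Hypothetical URL; the validators are the ones from the example above.
req = build_conditional_request(
    "http://example.org/index.html",
    etag='"33a64df5"',
    last_modified="Tue, 22 Feb 2022 22:00:00 GMT",
)
print(req.header_items())
```

One thing to watch for if you actually send this with urllib.request.urlopen: a 304 response surfaces as an HTTPError with code 304 rather than as a normal response object, so the caller has to catch it and fall back to the cached copy.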
The server then determines whether to allow the client to reuse the cached resource, by validating the If-None-Match and If-Modified-Since request headers. Specifically:
- It compares the If-None-Match value in the request with the ETag it currently computes for the requested resource; if they are the same, the resource has not changed.
- It checks the date in the If-Modified-Since request header; if that date is not older than the last-modified date of the requested resource, the resource has not changed.
If the request carries both headers, If-None-Match takes precedence and If-Modified-Since is ignored.
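The two checks above can be sketched in Python roughly as follows. This is a simplified illustration under my own assumptions, not how Apache implements it: the function name is_not_modified is hypothetical, request headers are passed as a plain dict, and the If-None-Match list is split naively on commas.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _opaque(tag):
    """Drop the weak-validator prefix so W/"x" and "x" compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, current_etag, last_modified):
    """Return True when the client's cached copy can be reused (a 304).

    current_etag is the ETag the server computes right now;
    last_modified is a timezone-aware datetime of the last change.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When If-None-Match is present, If-Modified-Since is ignored.
        candidates = {_opaque(t) for t in if_none_match.split(",")}
        return "*" in candidates or _opaque(current_etag) in candidates

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return False  # unparseable date: treat the header as absent
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop sub-second precision.
        return last_modified.replace(microsecond=0) <= since

    return False  # no validators sent: always serve the full resource


last_changed = datetime(2022, 2, 22, 22, 0, 0, tzinfo=timezone.utc)
print(is_not_modified({"If-None-Match": '"33a64df5"'}, '"33a64df5"', last_changed))
```

The date-formatting pitfalls that Google mentions show up in the second branch: a malformed If-Modified-Since value cannot be validated at all, whereas an ETag is just an opaque string comparison.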
If the resource has not changed, the server sends back a status "304 Not Modified", without a body, which informs the client that the cached version of the response is still valid and can be reused. For example:
    HTTP/1.1 304 Not Modified
    ...
    ETag: "33a64df5"
    Last-Modified: Tue, 22 Feb 2022 22:00:00 GMT
Otherwise, the server sends back a status "200 OK", with the latest version of the resource as its body and the validators of that new version. For example (the new values here are only illustrative):
    HTTP/1.1 200 OK
    ...
    ETag: "7c1f09b2"
    Last-Modified: Wed, 23 Feb 2022 09:30:00 GMT
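On the client side, the two outcomes can be handled along these lines. This is a small sketch assuming a dict-based cache keyed by URL; the function name apply_response and the cache layout are hypothetical and only meant to show the difference between reusing and replacing a cached entry.

```python
def apply_response(cache, url, status, headers, body):
    """Update the cache from a (re)validation response; return the body to use."""
    if status == 304:
        # Not modified: the response has no body, so reuse the cached one.
        entry = cache[url]
        # A 304 may repeat the validators; keep the stored ones current.
        entry["etag"] = headers.get("ETag", entry["etag"])
        entry["last_modified"] = headers.get("Last-Modified", entry["last_modified"])
        return entry["body"]
    if status == 200:
        # Changed (or first visit): store the new body and its validators.
        cache[url] = {
            "etag": headers.get("ETag"),
            "last_modified": headers.get("Last-Modified"),
            "body": body,
        }
        return body
    raise ValueError(f"unexpected status for cache validation: {status}")


cache = {}
first = apply_response(cache, "/index.html", 200, {"ETag": '"33a64df5"'}, b"v1")
again = apply_response(cache, "/index.html", 304, {"ETag": '"33a64df5"'}, b"")
```

After the second call, the client still holds the body from the first response without having downloaded it again, which is the whole point of validation.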
As of this writing, this website runs on Apache httpd 2.4.63, the default and only web server provided by sdf.org. It turned out I didn't need to do anything to use the ETag response header for this website: it is enabled by default through the setting FileETag MTime Size [2], which derives the ETag of a static file from its last-modified time and its size.
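To illustrate the idea behind FileETag MTime Size, here is a rough Python sketch that builds a validator from a file's size and modification time. It is an approximation I wrote for illustration: the function name mtime_size_etag is hypothetical, and I am assuming a hex "size-mtime" layout with the mtime in microseconds rather than reproducing Apache's actual code.

```python
import os
import tempfile


def mtime_size_etag(path):
    """Build a quoted validator from the file's size and mtime (assumed layout)."""
    st = os.stat(path)
    mtime_us = st.st_mtime_ns // 1000  # nanoseconds to microseconds
    return f'"{st.st_size:x}-{mtime_us:x}"'


# Usage: the validator changes as soon as the file's size (or mtime) changes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "wb") as f:
        f.write(b"a" * 45)
    # Pin the mtime to a fixed value so the result is reproducible.
    os.utime(path, ns=(0, 1_000_000_000_000))
    first = mtime_size_etag(path)

    with open(path, "ab") as f:
        f.write(b"more")
    second = mtime_size_etag(path)

print(first, second)
```

The nice property of this scheme is that the server never has to read the file's contents to compute the ETag; a stat call is enough. The trade-off is that touching a file without changing its content still produces a new ETag.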
Thanks for reading :)