Understanding caching
<- Previous: Explanation of Configuration Files
This document explains how Dispatcher caching works and how it can be configured.
Caching Directories
We use the following default cache directories in our baseline installations:
- Author
/mnt/var/www/author
- Publisher
/mnt/var/www/html
As each request traverses the Dispatcher, it follows the configured rules to determine whether a locally cached copy of the response can be kept and served for eligible items.
Configuration Files
The Dispatcher controls what qualifies as cacheable in the /cache section of each farm file.
In the AMS baseline configuration farms, you’ll find includes like the one shown below:
/cache {
  /rules {
    $include "/etc/httpd/conf.dispatcher.d/cache/ams_author_cache.any"
  }
}
When creating rules for what to cache and what not to cache, please refer to the documentation here
Caching author
In many implementations we’ve seen, people don’t cache author content at all.
They are missing out on a huge improvement in performance and responsiveness for their authors.
Let’s talk through the strategy used to configure our author farm to cache properly.
Here is the base /cache section of our author farm file:
/cache {
  /docroot "/mnt/var/www/author"
  /statfileslevel "2"
  /allowAuthorized "1"
  /rules {
    $include "/etc/httpd/conf.dispatcher.d/cache/ams_author_cache.any"
  }
  /invalidate {
    /0000 {
      /glob "*"
      /type "allow"
    }
  }
  /allowedClients {
    /0000 {
      /glob "*.*.*.*"
      /type "deny"
    }
    $include "/etc/httpd/conf.dispatcher.d/cache/ams_author_invalidate_allowed.any"
  }
}
The important things to note here are that /docroot is set to the cache directory for author, and that the DocumentRoot in the author’s .vhost file matches the farm’s /docroot parameter. The cache rules $include statement pulls in the file /etc/httpd/conf.dispatcher.d/cache/ams_author_cache.any, which contains these rules:
/0000 {
  /glob "*"
  /type "deny"
}
/0001 {
  /glob "/libs/*"
  /type "allow"
}
/0002 {
  /glob "/libs/*.html"
  /type "deny"
}
/0003 {
  /glob "/libs/granite/csrf/token.json"
  /type "deny"
}
/0004 {
  /glob "/apps/*"
  /type "allow"
}
/0005 {
  /glob "/apps/*.html"
  /type "deny"
}
/0006 {
  /glob "/libs/cq/core/content/welcome.*"
  /type "deny"
}
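To make the ordering concrete, here is a hypothetical Python sketch of how these glob rules combine. It assumes the Dispatcher's rule semantics (entries are evaluated top to bottom and the last matching entry wins) and uses Python's fnmatch as a stand-in for the Dispatcher's glob matcher:

```python
from fnmatch import fnmatchcase

# The rules from ams_author_cache.any, in declaration order.
RULES = [
    ("*", "deny"),
    ("/libs/*", "allow"),
    ("/libs/*.html", "deny"),
    ("/libs/granite/csrf/token.json", "deny"),
    ("/apps/*", "allow"),
    ("/apps/*.html", "deny"),
    ("/libs/cq/core/content/welcome.*", "deny"),
]

def is_cacheable(path: str) -> bool:
    decision = "deny"  # default when no rule matches
    for glob, rule_type in RULES:
        if fnmatchcase(path, glob):
            decision = rule_type  # last matching rule wins
    return decision == "allow"

print(is_cacheable("/libs/clientlibs/granite/utils.js"))  # True
print(is_cacheable("/libs/granite/csrf/token.json"))      # False
print(is_cacheable("/content/we-retail/en/home.html"))    # False
```

Running a few paths through this model shows why rule order matters: a clientlib under /libs is cached, while the CSRF token and regular content pages are not.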
In an author scenario, content changes all the time, and on purpose. You only want to cache items that are not going to change frequently.
We have rules to cache /libs
because it is part of the baseline AEM install and won’t change until you install a Service Pack, Cumulative Fix Pack, Upgrade, or Hotfix. Caching these elements makes a ton of sense and delivers huge benefits to the authoring experience of end users.
/apps
is where custom application code lives. If you’re developing code directly on this instance, it can be very confusing to save a file and not see the change reflected in the UI because a cached copy is being served. The intention here is that deployments of your code into AEM are also infrequent, and clearing the author cache should be part of your deployment steps. Again, the benefit is huge: your cacheable code runs faster for end users.
ServeOnStale (AKA Serve on Stale / SOS)
This is one of those gems of a Dispatcher feature. If the publisher is under load or has become unresponsive, it will typically throw a 502 or 503 HTTP response code. If that happens and this feature is enabled, the Dispatcher will still serve whatever content is in the cache as a best effort, even if it’s not a fresh copy. It’s better to serve something if you’ve got it rather than an error message that offers no functionality.
This setting can be applied in any farm but only makes sense in the publish farm files. Here is a syntax example of the feature enabled in a farm file:
/cache {
  /serveStaleOnError "1"
}
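To illustrate the behavior, here is a hypothetical Python sketch (not the actual Dispatcher code) of the serve-stale decision, with the backend simulated as always returning 503:

```python
# In-memory stand-in for the Dispatcher's file cache.
CACHE = {"/content/home.html": "<html>cached home</html>"}

def fetch_from_backend(path: str) -> tuple[int, str]:
    # Simulate an overloaded/unresponsive publisher.
    return 503, "Service Unavailable"

def handle_request(path: str, serve_stale_on_error: bool) -> tuple[int, str]:
    status, body = fetch_from_backend(path)
    if status in (502, 503, 504) and serve_stale_on_error and path in CACHE:
        # Best effort: serve the stale cached copy instead of the error.
        return 200, CACHE[path]
    return status, body

print(handle_request("/content/home.html", serve_stale_on_error=True))
# → (200, '<html>cached home</html>')
```

With the flag off, or for a path that was never cached, the 503 passes straight through to the visitor.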
Caching pages with Query params / Arguments
By default, when the Dispatcher sees a request with query parameters (e.g. /content/page.html?myquery=value), it will skip caching the file and go directly to the AEM instance, treating the request as a dynamic page that shouldn’t be cached. This can seriously hurt cache efficiency. See this article showing how query parameters can affect your site performance.
As a baseline, you want to set the ignoreUrlParams
rules to allow *
, meaning that all query parameters are ignored and all pages are cached regardless of the parameters used.
Here is an example where someone has built a social media deep-link reference mechanism that uses a query parameter in the URI to know where the visitor came from.
Ignorable Example:
- https://www.we-retail.com/home.html?reference=android
- https://www.we-retail.com/home.html?reference=facebook
The page is 100% cacheable but doesn’t get cached because the arguments are present.
Configuring your ignoreUrlParams
as an allow list will fix this issue:
/cache {
  /ignoreUrlParams {
    /0001 { /glob "*" /type "allow" }
  }
}
Now when the Dispatcher sees the request, it will ignore the fact that the request carries the ?reference query parameter and still cache the page.
Dynamic Example:
- https://www.we-retail.com/search.html?q=fruit
- https://www.we-retail.com/search.html?q=vegetables
Keep in mind that if you do have query parameters that change a page’s rendered output, you’ll need to exempt them from your ignore list and make the page un-cacheable again. For example, a search page whose query parameter changes the raw HTML rendered.
So here is the html source of each search:
/search.html?q=fruit:
<html>
...SNIP...
<div id='results'>
<div class='result'>
Orange
</div>
<div class='result'>
Apple
</div>
<div class='result'>
Strawberry
</div>
</div>
</html>
/search.html?q=vegetables:
<html>
...SNIP...
<div id='results'>
<div class='result'>
Carrot
</div>
<div class='result'>
Cucumber
</div>
<div class='result'>
Celery
</div>
</div>
</html>
If you visited /search.html?q=fruit
first, the Dispatcher would cache the HTML showing the fruit results.
If you then visited /search.html?q=vegetables
, you would still see the fruit results.
This is because the query parameter q
is being ignored for caching purposes. To avoid this issue, take note of pages that render different HTML based on query parameters and deny caching for those parameters.
Example:
/cache {
  /ignoreUrlParams {
    /0001 { /glob "*" /type "allow" }
    /0002 { /glob "q" /type "deny" }
  }
}
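As a sketch of the combined effect of these two entries, here is a hypothetical Python model of the decision. It assumes last-match-wins semantics for the globs and that a page is served from cache only when every query parameter present can be ignored:

```python
from fnmatch import fnmatchcase
from urllib.parse import parse_qsl, urlsplit

# The ignoreUrlParams entries from the farm file, in order:
# ignore every parameter, except q (which changes the rendered page).
IGNORE_RULES = [("*", "allow"), ("q", "deny")]

def param_ignored(name: str) -> bool:
    decision = "deny"  # default: the parameter matters
    for glob, rule_type in IGNORE_RULES:
        if fnmatchcase(name, glob):
            decision = rule_type  # last matching rule wins
    return decision == "allow"

def is_cacheable(url: str) -> bool:
    # Cacheable only if every query parameter can be ignored.
    query = urlsplit(url).query
    return all(param_ignored(name) for name, _ in parse_qsl(query))

print(is_cacheable("https://www.we-retail.com/home.html?reference=android"))  # True
print(is_cacheable("https://www.we-retail.com/search.html?q=fruit"))          # False
```

The deep-link page caches despite its reference parameter, while the search page correctly bypasses the cache.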
Pages that consume query parameters via JavaScript will still fully function with those parameters ignored in this setting, because they don’t change the HTML file at rest; JavaScript updates the browser’s DOM in real time on the client. This means that if you consume the query parameters with JavaScript, it’s highly likely you can ignore those parameters for page caching. Allow that page to cache and enjoy the performance gain!
Caching response headers
It’s pretty obvious that the Dispatcher caches .html
pages and clientlibs (i.e. .js
, .css
), but did you know it can also cache particular response headers alongside the content, in a file with the same name and an additional .h
file extension? This allows the next response to be served from cache with not only the content but also the response headers that should go with it.
Sometimes items have special headers that help control cache TTLs, encoding details, and last-modified timestamps.
These values are stripped by default when cached, and the Apache httpd web server does its own job of processing the asset with its normal file-handling methods, which is usually limited to MIME-type guessing based on file extensions.
If you have the Dispatcher cache the asset together with the desired headers, you can preserve the proper experience and ensure all the details make it to the client’s browser.
Here is an example of a farm with the headers to cache specified:
/cache {
  /headers {
    "Cache-Control"
    "Content-Disposition"
    "Content-Type"
    "Expires"
    "Last-Modified"
    "X-Content-Type-Options"
  }
}
In this example, AEM has been configured to serve headers that the CDN looks for to know when to invalidate its cache, meaning AEM can properly dictate which files get invalidated based on headers.
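As an illustration of the idea, here is a hypothetical Python sketch of storing an allow-listed set of headers in a sidecar file next to the cached content. (The Dispatcher's actual .h file format is internal to the module; this sketch simply writes one "Name: value" header per line.)

```python
import tempfile
from pathlib import Path

# Only these headers are persisted alongside the cached content,
# mirroring the /headers allow list in the farm file above.
CACHED_HEADERS = ["Cache-Control", "Content-Disposition", "Content-Type",
                  "Expires", "Last-Modified", "X-Content-Type-Options"]

def write_cache_entry(docroot: Path, path: str, body: bytes, headers: dict) -> None:
    target = docroot / path.lstrip("/")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(body)  # the cached response body
    # Persist only the allow-listed headers in a ".h" sidecar file.
    kept = {k: v for k, v in headers.items() if k in CACHED_HEADERS}
    lines = "".join(f"{k}: {v}\n" for k, v in kept.items())
    target.with_name(target.name + ".h").write_text(lines)

docroot = Path(tempfile.mkdtemp())
write_cache_entry(docroot, "/content/report.pdf", b"%PDF-1.4 ...",
                  {"Content-Type": "application/pdf", "X-Internal": "secret"})
print((docroot / "content/report.pdf.h").read_text())
# Content-Type: application/pdf
```

Note how the unlisted X-Internal header never reaches the sidecar file, so it can't leak to later cache hits.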
Auto-Invalidate Grace Period
On AEM systems with a lot of author activity and frequent page activations, you can hit a race condition where repeated invalidations occur. Heavily repeated flush requests are unnecessary, and you can build in some tolerance so a flush is not repeated until the grace period has passed.
Example of how this works: suppose 5 requests to invalidate /content/exampleco/en/
all happen within a 3-second period.
- With this feature off, you’d invalidate the cache directory /content/exampleco/en/
5 times
- With this feature on and set to 5 seconds, it would invalidate the cache directory /content/exampleco/en/
once
Here is an example of this feature configured with a 5-second grace period:
/cache {
  /gracePeriod "5"
}
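The debounce behavior described above can be sketched as follows. This is a hypothetical Python model, not the Dispatcher's implementation, and it assumes the grace period is measured from the last flush that was actually performed:

```python
class GracePeriodCache:
    """Skips repeat invalidations of the same path within the grace period."""

    def __init__(self, grace_seconds: int):
        self.grace = grace_seconds
        self.last_flush = {}   # path -> timestamp of last performed flush
        self.flush_count = 0   # how many real invalidations happened

    def invalidate(self, path: str, now: float) -> bool:
        last = self.last_flush.get(path)
        if last is not None and now - last < self.grace:
            return False       # within grace period: skip the repeat flush
        self.last_flush[path] = now
        self.flush_count += 1  # actually invalidate the cache directory
        return True

cache = GracePeriodCache(grace_seconds=5)
# Five invalidation requests for the same path within three seconds...
results = [cache.invalidate("/content/exampleco/en/", t)
           for t in (0.0, 0.5, 1.2, 2.0, 3.0)]
print(results)            # [True, False, False, False, False]
print(cache.flush_count)  # 1
```

Only the first request triggers a real flush; the other four land inside the 5-second window and are absorbed.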
TTL Based Invalidation
A newer feature of the Dispatcher module is Time To Live (TTL)
based invalidation for cached items. When an item gets cached, the Dispatcher looks for cache-control headers and generates a file in the cache directory with the same name and a .ttl
extension.
Here is an example of the feature being configured in the farm configuration file:
/cache {
  /enableTTL "1"
}
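As a rough model of the idea (the real .ttl bookkeeping is internal to the Dispatcher), here is a hypothetical Python sketch that derives an expiry timestamp from a Cache-Control max-age directive and checks it on later requests:

```python
import re
import time
from typing import Optional

def ttl_expiry(cache_control: str, now: float) -> Optional[float]:
    # Derive an absolute expiry timestamp from Cache-Control max-age.
    match = re.search(r"max-age=(\d+)", cache_control)
    if match:
        return now + int(match.group(1))
    return None  # no max-age: no TTL-based expiry recorded

def is_expired(expiry: Optional[float], now: float) -> bool:
    # Entries without a recorded TTL never auto-expire in this model.
    return expiry is not None and now >= expiry

now = time.time()
expiry = ttl_expiry("max-age=300", now)
print(is_expired(expiry, now + 10))   # False: still fresh
print(is_expired(expiry, now + 400))  # True: past the 300-second TTL
```

Once an entry is past its TTL, the Dispatcher treats it as invalid and fetches a fresh copy from the publisher on the next request.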
Cache Filter Rules
Here is an example of a baseline configuration for which elements to cache on a publisher:
/cache {
  /rules {
    /0000 {
      /glob "*"
      /type "allow"
    }
    /0001 {
      /glob "/libs/granite/csrf/token.json"
      /type "deny"
    }
  }
}
We want to make our published site as greedy as possible and cache everything.
If there are elements that break the experience when cached, you can add rules to exclude them from caching. As you can see in the example above, CSRF tokens should never be cached and have been excluded. Further details on writing these rules can be found here
Next -> Using and Understanding Variables