Chained Caching
Overview
Data Flow
Delivering a page from a server to a client’s browser crosses a multitude of systems and subsystems. If you look carefully, there are a number of hops the data needs to take from its source to its destination, each of which is a potential candidate for caching.
Data flow of a typical CMS application
Let’s start our journey with a piece of data that sits on a hard disk and that needs to be displayed in a browser.
Hardware and Operating System
First, the hard disk drive (HDD) itself has some built-in cache in the hardware. Second, the operating system that mounts the hard disk uses free memory to cache frequently accessed blocks to speed up access.
Content Repository
The next level is CRX or Oak – the document database used by AEM. CRX and Oak divide the data into segments that can be cached in memory as well to avoid slower access to the HDD.
Third Party Data
Most larger web installations have third-party data as well: data coming from a product information system, a customer relationship management system, a legacy database or any other arbitrary web service. This data does not need to be pulled from the source every time it is needed – especially when it is known to change infrequently. So it can be cached if it is not synchronized into the CRX database.
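To make this a bit more concrete, here is a minimal sketch of such a cache using only the JDK. The names ProductInfoClient and Product are hypothetical; in a real AEM project you would typically wrap something like this in an OSGi service:

```java
// Minimal sketch of a TTL cache for third-party data, using only the JDK.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class TtlCache<T> {

    private static class Entry<T> {
        final T value;
        final long expiresAt;
        Entry(T value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<String, Entry<T>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    /** Returns the cached value or loads it from the (slow) source if missing or stale. */
    public T get(String key, Supplier<T> loader) {
        Entry<T> entry = entries.get(key);
        if (entry == null || entry.expiresAt < System.currentTimeMillis()) {
            T value = loader.get();                       // e.g., a remote web service call
            entry = new Entry<>(value, System.currentTimeMillis() + ttlMillis);
            entries.put(key, entry);
        }
        return entry.value;
    }
}

// Usage: cache product data from a hypothetical PIM web service for 10 minutes.
// TtlCache<Product> productCache = new TtlCache<>(10 * 60 * 1000);
// Product p = productCache.get("SKU-123", () -> productInfoClient.fetch("SKU-123"));
```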
Business Layer – App / Model
Usually your template scripts do not render the raw content coming from CRX via the JCR API. Most likely you have a business layer in between that merges, calculates and/or transforms data into business domain objects. Guess what – if these operations are expensive, you should consider caching them.
Markup Fragments
The model is now the basis for rendering the markup of a component. Why not cache the rendered markup as well?
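A minimal sketch of such an in-memory cache for computed business objects or rendered fragments could look like this – keyed by content path, with an explicit invalidation hook that you would wire to a resource change or replication event. The renderNavigation call in the usage comment is hypothetical:

```java
// Minimal sketch of an in-memory cache for rendered markup fragments (or business
// objects), keyed by the content path. Invalidation is event based: call
// invalidate() when the underlying content changes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class FragmentCache {

    private final Map<String, String> htmlByPath = new ConcurrentHashMap<>();

    /** Returns the cached markup for a path, rendering it on the first request. */
    public String getHtml(String path, Function<String, String> renderer) {
        return htmlByPath.computeIfAbsent(path, renderer);   // e.g., path -> renderNavigation(path)
    }

    /** Drops all entries at or below the given path when the content changes. */
    public void invalidate(String changedPath) {
        htmlByPath.keySet().removeIf(key -> key.startsWith(changedPath));
    }
}
```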
Dispatcher, CDN and other Proxies
Off goes the rendered HTML page to the Dispatcher. We already discussed that the main purpose of the Dispatcher is to cache HTML pages and other web resources (despite its name). Before a resource reaches the browser, it may pass a reverse proxy – which can cache – and a CDN – which is also used for caching. The client may sit in an office that grants web access only via a proxy, and that proxy might decide to cache as well to save traffic.
Browser Cache
Last but not least – the browser caches too. This is an easily overlooked asset. But it is the closest and fastest cache you have in the caching chain. Unfortunately, it is not shared between users – but it is still shared between different requests of one user.
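The browser cache is controlled via HTTP response headers. As a rough sketch – the concrete values are examples, not recommendations for your project – you could set them from a servlet or filter on the Publish tier:

```java
// Minimal sketch of instructing the browser cache via HTTP response headers.
import javax.servlet.http.HttpServletResponse;

public class BrowserCacheHeaders {

    /** Allow the browser (and only the browser) to reuse the response for 5 minutes. */
    public static void setPrivateCaching(HttpServletResponse response) {
        response.setHeader("Cache-Control", "private, max-age=300");
    }

    /** Static, versioned assets (e.g., clientlibs with a hash in the URL) can be cached much longer. */
    public static void setLongLivedCaching(HttpServletResponse response) {
        response.setHeader("Cache-Control", "public, max-age=31536000, immutable");
    }
}
```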
Where to Cache and Why
That is a long chain of potential caches. And we all have faced issues where we have seen outdated content. But taking into account how many stages there are, it’s a miracle that it works at all most of the time.
But where in that chain does it make sense to cache at all? At the beginning? At the end? Everywhere? It depends… and it depends on a huge number of factors. Even two resources in the same website might warrant different answers to that question.
To give you a rough idea of what factors you might take into consideration:
- Time to live – If objects have a short inherent lifetime (traffic data might have a shorter life than weather data), it might not be worth caching them.
- Production Cost – How expensive (in terms of CPU cycles and I/O) are the re-production and delivery of an object? If it’s cheap, caching might not be necessary.
- Size – Large objects require more resources to be cached. That could be a limiting factor and must be balanced against the benefit.
- Access frequency – If objects are accessed rarely, caching might not be effective. They would just go stale or be invalidated before they are accessed a second time from the cache. Such items would just block memory resources.
- Shared access – Data that is used by more than one entity should be cached further up the chain. Actually, the caching chain is not a chain but a tree. One piece of data in the repository might be used by more than one model. These models in turn can be used by more than one render script to generate HTML fragments. These fragments are included in multiple pages, which are distributed to multiple users with their private caches in the browser. So “sharing” does not mean sharing between people only, but also between pieces of software. If you want to find a potential “shared” cache, just track the tree back to its root and find a common ancestor – that is where you should cache.
- Geospatial distribution – If your users are distributed over the world, using a distributed network of caches might help reduce latency.
- Network bandwidth and latency – Speaking of latency, who are your customers and what kind of network are they using? Maybe your customers are mobile customers in an under-developed country using 3G connections on older-generation smartphones? Consider creating smaller objects and caching them in the browser cache.
This list is by far not comprehensive, but we think you get the idea by now.
Basic Rules for Chained Caching
Again – caching is hard. Let us share some ground rules that we have extracted from previous projects and that can help you avoid issues in your project.
Avoid Double Caching
Each of the layers introduced in the last chapter provides some value in the caching chain, either by saving computing cycles or by bringing data closer to the consumer. It is not wrong to cache a piece of data in multiple stages of the chain – but you should always consider what the benefits and costs of the next stage are. Caching a full page in the Publish system usually does not provide any benefit, as this is already done in the Dispatcher.
Mixing Invalidation Strategies
There are three basic invalidation strategies:
- TTL, Time to Live: An object expires after a fixed amount of time (e.g., “2 hours from now”)
- Expiration Date: The object expires at a defined time in the future (e.g., “5:00 PM on June 10, 2019”)
- Event based: The object is invalidated explicitly by an event that happened in the platform (e.g., when a page is changed and activated)
Now you can use different strategies on different cache layers, but there are a few “toxic” combinations.
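For orientation, the first two strategies map directly to standard HTTP response headers, while event-based invalidation is always an explicit call into the cache (for example, a Dispatcher flush request or an invalidate() call in your own code). A minimal, illustrative sketch – the values are examples only:

```java
// TTL and expiration date expressed as HTTP response headers.
import java.time.ZonedDateTime;
import javax.servlet.http.HttpServletResponse;

public class InvalidationHeaders {

    /** TTL: the object expires a fixed amount of time after it was delivered. */
    public static void ttl(HttpServletResponse response) {
        response.setHeader("Cache-Control", "max-age=7200");   // "2 hours from now"
    }

    /** Expiration date: the object expires at a defined point in time. */
    public static void expirationDate(HttpServletResponse response, ZonedDateTime expires) {
        response.setDateHeader("Expires", expires.toInstant().toEpochMilli());
    }
}
```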
Event Based Invalidation
Pure Event based invalidation: Invalidate from the inner cache to the outer layer
Pure event-based invalidation is the easiest to comprehend, the easiest to get theoretically right, and the most accurate.
Simply put, the caches are invalidated one by one after the object has changed.
You just need to keep one rule in mind:
Always invalidate from the inside to the outside cache. If you invalidate an outer cache first, it might re-cache stale content from an inner one. Don’t make any assumptions about when a cache is fresh again – make sure of it, ideally by triggering the invalidation of the outer cache only after the inner one has been invalidated.
Now, that’s the theory. But in practice there are a number of gotchas: the events must be distributed, potentially over a network, which makes this the most difficult invalidation scheme to implement.
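A minimal sketch of the inside-out order, reusing the fragment cache sketched earlier and a hypothetical DispatcherFlushClient for the outer layer:

```java
// Invalidate from the inside out: clear the in-memory cache first, then tell the
// outer cache (here: a hypothetical Dispatcher flush endpoint) to drop its copy,
// so it cannot re-cache stale content from the inner layer.
public class InsideOutInvalidation {

    /** Hypothetical client that asks the Dispatcher flush agent to drop a path. */
    public interface DispatcherFlushClient {
        void flush(String path);
    }

    private final FragmentCache fragmentCache;        // the in-memory cache sketched earlier
    private final DispatcherFlushClient dispatcherFlush;

    public InsideOutInvalidation(FragmentCache fragmentCache, DispatcherFlushClient dispatcherFlush) {
        this.fragmentCache = fragmentCache;
        this.dispatcherFlush = dispatcherFlush;
    }

    /** Called when a page below 'path' has been changed and activated. */
    public void onContentChanged(String path) {
        fragmentCache.invalidate(path);     // 1. inner, in-memory cache
        dispatcherFlush.flush(path);        // 2. outer cache, only after the inner one is clean
    }
}
```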
Auto-Healing
With event-based invalidation, you should have a contingency plan. What if an invalidation event is missed? A simple strategy is to invalidate or purge after a certain amount of time anyway. You might have missed an event and now serve stale content, but your objects also have an implicit TTL of only a few hours (or days), so eventually the system heals itself.
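A minimal sketch of such a safety net – an event-invalidated cache whose entries additionally carry a generous fallback TTL (the 24 hours are just an example):

```java
// Event-invalidated cache with an auto-healing fallback TTL: if an invalidation
// event is missed, stale content survives at most one day.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class SelfHealingCache<T> {

    private static final long SAFETY_TTL_MILLIS = 24L * 60 * 60 * 1000L;

    private static class Entry<T> {
        final T value;
        final long storedAt;
        Entry(T value, long storedAt) { this.value = value; this.storedAt = storedAt; }
    }

    private final Map<String, Entry<T>> entries = new ConcurrentHashMap<>();

    public T get(String key, Supplier<T> loader) {
        Entry<T> entry = entries.get(key);
        if (entry == null || entry.storedAt + SAFETY_TTL_MILLIS < System.currentTimeMillis()) {
            entry = new Entry<>(loader.get(), System.currentTimeMillis());
            entries.put(key, entry);
        }
        return entry.value;
    }

    /** Primary invalidation path: explicit and event based. */
    public void invalidate(String key) {
        entries.remove(key);
    }
}
```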
Pure TTL-based invalidation
Unsynchronized TTL based invalidation
This is also quite a common scheme: you stack several layers of caches, each one entitled to serve an object for a certain amount of time.
It’s easy to implement. Unfortunately, it’s hard to predict the effective lifespan of a piece of data.
Outer cache prolonging the life span of an inner object
Consider the illustration above. Each caching layer introduces a TTL of 2 minutes. Now, the overall TTL must be 2 minutes too, right? Not quite. If the outer layer fetches the object just before it would go stale, the outer layer actually prolongs the effective lifetime of the object. The effective lifetime can be anywhere between 2 and 4 minutes in that case – in the worst case, the sum of the TTLs of all layers. Suppose you agreed with your business department that one day of staleness is tolerable, and you have four layers of caches: the TTL on each layer must then not be longer than six hours – which in turn increases the cache miss rate.
We are not saying it is a bad scheme – you just should know its limits. It is a nice and easy strategy to start with; only when your site’s traffic increases might you need to consider a more accurate one.
Synchronizing Invalidation time by setting a specific date
Expiration Date Based Invalidation
You get a more predictable effective lifetime if you set a specific expiration date on the inner object and propagate it to the outside caches.
Synchronizing expiration dates
However, not all caches are able to propagate expiration dates. And it can become nasty when the outer cache aggregates two inner objects with different expiration dates.
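The safe rule when aggregating is that the composed object may only live as long as its shortest-lived part, so the earlier expiration date wins. A minimal sketch:

```java
// When an outer cache aggregates two inner objects with different expiration dates,
// the aggregate must expire with its shortest-lived part.
import java.time.ZonedDateTime;

public class ExpirationDates {

    /** Returns the expiration date of a page composed of two fragments. */
    public static ZonedDateTime aggregateExpiry(ZonedDateTime fragmentA, ZonedDateTime fragmentB) {
        return fragmentA.isBefore(fragmentB) ? fragmentA : fragmentB;
    }
}
```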
Mixing Event-based and TTL-based invalidation
Mixing event-based and TTL-based strategies
Another common scheme in the AEM world is to use event-based invalidation for the inner caches (e.g., in-memory caches, where events can be processed in near real time) and TTL-based caches on the outside – where you may not have access to explicit invalidation.
In the AEM world you would have an in-memory cache for business objects and HTML fragments on the Publish systems that is invalidated when the underlying resources change, and you would propagate this change event to the Dispatcher, which also works event-based. In front of that you would have, for example, a TTL-based CDN.
Having a layer of (short) TTL-based caching in front of the Dispatcher can effectively soften the spike that would usually occur after an auto-invalidation.
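One way to express this setup – assuming your CDN honors standard caching headers – is to give shared caches a short TTL while the inner layers stay event-based. The values below are illustrative only:

```java
// Mixed setup: Publish/Dispatcher are invalidated by events, the CDN in front works
// purely TTL based. A short s-maxage keeps the CDN's copy short-lived.
import javax.servlet.http.HttpServletResponse;

public class CdnTtlHeaders {

    public static void setCdnTtl(HttpServletResponse response) {
        // Shared caches (CDN, proxies) may keep the object for 5 minutes, browsers for 1 minute.
        response.setHeader("Cache-Control", "public, max-age=60, s-maxage=300");
    }
}
```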
Mixing TTL- and Event-Based Invalidation
Toxic: Mixing TTL- and event-based invalidation
This combination is toxic. Never place an event-based cache behind a TTL- or expiration-based cache. Remember the spill-over effect we had in the “pure TTL” strategy? The same effect can be observed here – only worse: the invalidation event of the outer cache has already happened and might not happen again, ever. This can extend the lifespan of your cached object to infinity.
TTL-based and event-based combined: Spill-over to infinity
Partial Caching and In-Memory Caching
You can hook into several stages of the rendering process to add caching layers – from fetching remote data transfer objects or creating local business objects to caching the rendered markup of a single component. We will leave concrete implementations to a later tutorial. But maybe you have already implemented a few of these caching layers yourself or plan to do so. So the least we can do here is to introduce the basic principles – and gotchas.
Words of Warning
Respect Access Control
The techniques described here are quite powerful and a must-have in every AEM developer’s toolbox. But don’t get too excited – use them wisely. Storing an object in a cache and sharing it with other users in follow-up requests actually means circumventing access control. That usually is not an issue on public-facing websites, but it can be one when a user needs to log in before getting access.
Consider storing a site’s main menu’s HTML markup in an in-memory cache to share it between various pages. That is actually a perfect example of storing partially rendered HTML, as creating a navigation is usually expensive: it requires traversing a lot of pages.
You are sharing that same menu structure not only between all pages but also with all users, which makes it even more efficient. But wait – maybe there are some items in the menu that are reserved for certain groups of users only. In that case, caching can get a bit more complex.
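One common approach is to make the cache key access-control aware, for example by including the user’s (sorted) group memberships, so that users who see different menu items never share a cache entry. A minimal sketch with hypothetical names:

```java
// Access-control aware cache key: users with different group memberships get
// different cache entries for the same menu path.
import java.util.Set;
import java.util.TreeSet;

public class MenuCacheKey {

    /** Builds a key like "/content/site/en|[contributors, everyone]". */
    public static String of(String menuPath, Set<String> userGroups) {
        return menuPath + "|" + new TreeSet<>(userGroups);
    }
}

// Usage with the fragment cache sketched earlier (resolveGroups and
// renderNavigation are hypothetical helpers):
// String key = MenuCacheKey.of("/content/site/en", resolveGroups(request));
// String html = fragmentCache.getHtml(key, k -> renderNavigation(request));
```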