How to Run AEM with TarMK Cold Standby how-to-run-aem-with-tarmk-cold-standby
Introduction introduction
The Cold Standby capability of the Tar Micro Kernel allows one or more standby AEM instances to connect to a primary instance. The sync process is one-way only: data flows from the primary to the standby instances, never in the other direction.
The purpose of the standby instances is to guarantee a live data copy of the primary repository and to ensure a quick switch without data loss in case the primary is unavailable for any reason.
Content is synced linearly between the primary instance and the standby instances without any integrity checks for file or repository corruption. Because of this design, standby instances are exact copies of the primary instance and cannot help to mitigate inconsistencies on primary instances.
How it works how-it-works
On the primary AEM instance, a TCP port is opened and listens for incoming messages. Currently, there are two types of messages that the standby instances send to the primary:
- a message requesting the segment ID of the current head
- a message requesting segment data with a specified ID
The standby periodically requests the segment ID of the current head of the primary. If the segment is not known locally, it is retrieved. If it is already present, the segments are compared and referenced segments are requested too, if necessary.
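The request pattern described above can be illustrated with a toy model. The following is a minimal sketch, not the Oak implementation: it assumes a made-up on-disk layout (a HEAD file holding the head segment ID, and one file per segment whose words are the IDs it references), the function name sync_once is hypothetical, and it simplifies the comparison step by skipping any segment that is already present locally.

```shell
#!/bin/sh
# Toy model only, not the Oak implementation.
# Layout assumed by this sketch:
#   <dir>/HEAD   contains the ID of the current head segment
#   <dir>/<id>   one file per segment; its words are the IDs it references

# Pull the head and every missing referenced segment from primary to standby.
# Prints one "fetched <id>" line per segment copied.
sync_once() {
    primary="$1"
    standby="$2"

    # Message 1: ask the primary for the segment ID of the current head.
    queue=$(cat "$primary/HEAD") || return 1

    while [ -n "$queue" ]; do
        # Take the first ID off the space-separated work queue.
        id=${queue%% *}
        case "$queue" in
            *" "*) queue=${queue#* } ;;
            *)     queue="" ;;
        esac

        # Segments that are already present locally are not requested again.
        [ -f "$standby/$id" ] && continue

        # Message 2: request the segment data for this ID.
        cp "$primary/$id" "$standby/$id" || return 1
        echo "fetched $id"

        # Queue the segments referenced by the one just received.
        for ref in $(cat "$standby/$id"); do
            queue="${queue:+$queue }$ref"
        done
    done

    # Remember which head the standby has caught up to.
    cp "$primary/HEAD" "$standby/HEAD"
}
```

Calling sync_once repeatedly mirrors the periodic polling: a run against an unchanged primary copies nothing, while a new head only pulls the segments the standby does not have yet.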
A typical TarMK Cold Standby deployment:
Other characteristics other-characteristics
Robustness robustness
The data flow is designed to detect and handle connection and network-related problems automatically. All packets are bundled with checksums, and as soon as problems with the connection or damaged packets occur, retry mechanisms are triggered.
Performance performance
Enabling TarMK Cold Standby on the primary instance has almost no measurable impact on performance. The additional CPU consumption is very low, and the extra hard disk and network IO should not produce any performance issues.
On the standby, you can expect high CPU consumption during the sync process. Because the procedure is not multithreaded, it cannot be sped up by using multiple cores. If no data is changed or transferred, there will be no measurable activity. The connection speed will vary depending on the hardware and network environment, but it does not depend on the size of the repository or SSL use. Keep this in mind when estimating the time needed for an initial sync, or for a sync after a large amount of data has changed on the primary node.
Security security
Assuming that all the instances run in the same intranet security zone, the risk of a security breach is greatly reduced. Nevertheless, you can add an extra security layer by enabling SSL connections between the standby instances and the primary. Doing so reduces the possibility that the data is compromised by a man-in-the-middle.
Furthermore, you can specify the standby instances that are allowed to connect by restricting the IP address of incoming requests. This should help to guarantee that no one in the intranet can copy the repository.
Creating an AEM TarMK Cold Standby setup creating-an-aem-tarmk-cold-standby-setup
Note that the PIDs of the following services have changed compared to earlier versions:
- from org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService to org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService
- from org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStoreService to org.apache.jackrabbit.oak.segment.SegmentNodeStoreService
In order to create a TarMK cold standby setup, you first need to create the standby instances by performing a file system copy of the entire installation folder of the primary to a new location. You can then start each instance with a runmode that will specify its role (primary or standby).
Below is the procedure that needs to be followed in order to create a setup with one primary and one standby instance:
-
Install AEM.
-
Shut down your instance, and copy its installation folder to the location where the cold standby instance will run from. Even if run from different machines, make sure to give each folder a descriptive name (like aem-primary or aem-standby) to differentiate between the instances.
-
Go to the installation folder of the primary instance and:
- Check for and delete any previous OSGi configurations you might have under aem-primary/crx-quickstart/install
- Create a folder called install.primary under aem-primary/crx-quickstart/install
- Create the required configurations for the preferred node store and data store under aem-primary/crx-quickstart/install/install.primary
- Create a file called org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config in the same location and configure it accordingly. For more information on the configuration options, see Configuration.
- If you are using an AEM TarMK instance with an external data store, create a folder named crx3 under aem-primary/crx-quickstart/install
- Place the data store configuration file in the crx3 folder.
If, for example, you are running an AEM TarMK instance with an external File Data Store, you need these configuration files:
aem-primary/crx-quickstart/install/install.primary/org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config
aem-primary/crx-quickstart/install/install.primary/org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config
aem-primary/crx-quickstart/install/crx3/org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config
Below you’ll find sample configurations for the primary instance:
Sample of org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config
org.apache.sling.installer.configuration.persist=B"false"
customBlobStore=B"true"
standby=B"false"
Sample of org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config
org.apache.sling.installer.configuration.persist=B"false"
mode="primary"
port=I"8023"
Sample of org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config
org.apache.sling.installer.configuration.persist=B"false"
path="./crx-quickstart/repository/datastore"
minRecordLength=I"16384"
-
Start the primary making sure you specify the primary runmode:
java -jar quickstart.jar -r primary,crx3,crx3tar
-
Create a new Apache Sling Logging Logger for the org.apache.jackrabbit.oak.segment package. Set log level to “Debug” and point its log output to a separate logfile, like /logs/tarmk-coldstandby.log. For more information, see Logging.
-
Go to the location of the standby instance and start it by running the jar.
-
Create the same logging configuration as for the primary. Then, stop the instance.
-
Next, prepare the standby instance. You can do this by performing the same steps as for the primary instance:
- Delete any files you might have under aem-standby/crx-quickstart/install.
- Create a new folder called install.standby under aem-standby/crx-quickstart/install
- Create two configuration files called:
org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config
org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config
- Create a new folder called crx3 under aem-standby/crx-quickstart/install
- Create the data store configuration and place it under aem-standby/crx-quickstart/install/crx3. For this example, the file you need to create is org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config
- Edit the files and create the necessary configurations.
Below are sample configuration files for a typical standby instance:
Sample of org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config
org.apache.sling.installer.configuration.persist=B"false"
name="Oak-Tar"
service.ranking=I"100"
standby=B"true"
customBlobStore=B"true"
Sample of org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config
org.apache.sling.installer.configuration.persist=B"false"
mode="standby"
primary.host="127.0.0.1"
port=I"8023"
secure=B"false"
interval=I"5"
standby.autoclean=B"true"
Sample of org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config
org.apache.sling.installer.configuration.persist=B"false"
path="./crx-quickstart/repository/datastore"
minRecordLength=I"16384"
-
Start the standby instance by using the standby runmode:
java -jar quickstart.jar -r standby,crx3,crx3tar
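The folder and file steps for the primary can also be scripted. The following is a minimal sketch, assuming a POSIX shell: the function name scaffold_primary is hypothetical, the file contents simply repeat the sample values from the primary configuration above, and the script does not delete any previous configurations, so that step still has to be done by hand.

```shell
#!/bin/sh
# Hypothetical helper: lays out the primary's OSGi configuration folders
# under <aem-dir>/crx-quickstart/install, using the sample values from this page.
scaffold_primary() {
    aem_dir="$1"
    install_dir="$aem_dir/crx-quickstart/install"

    # install.primary holds the node store and standby service configs;
    # crx3 holds the external data store config.
    mkdir -p "$install_dir/install.primary" "$install_dir/crx3" || return 1

    cat > "$install_dir/install.primary/org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config" <<'EOF'
org.apache.sling.installer.configuration.persist=B"false"
customBlobStore=B"true"
standby=B"false"
EOF

    cat > "$install_dir/install.primary/org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config" <<'EOF'
org.apache.sling.installer.configuration.persist=B"false"
mode="primary"
port=I"8023"
EOF

    cat > "$install_dir/crx3/org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config" <<'EOF'
org.apache.sling.installer.configuration.persist=B"false"
path="./crx-quickstart/repository/datastore"
minRecordLength=I"16384"
EOF
}
```

For example, scaffold_primary /path/to/aem-primary creates the three files listed for the File Data Store example. A standby variant would write to install.standby and use the standby sample values instead.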
The service can also be configured via the Web Console, by:
- Going to the Web Console at:
https://serveraddress:serverport/system/console/configMgr
- Looking for a service called Apache Jackrabbit Oak Segment Tar Cold Standby Service and double-clicking it to edit the settings.
- Saving the settings, and restarting the instances so the new settings can take effect.
First time synchronization first-time-synchronization
After the preparation is complete and the standby is started for the first time, there will be heavy network traffic between the instances as the standby catches up to the primary. You can consult the logs to observe the status of the synchronization.
In the standby tarmk-coldstandby.log, you will see entries such as these:
*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.segment.standby.store.StandbyStore trying to read segment ec1f739c-0e3c-41b8-be2e-5417efc05266
*DEBUG* [nioEventLoopGroup-3-1] org.apache.jackrabbit.oak.segment.standby.codec.SegmentDecoder received type 1 with id ec1f739c-0e3c-41b8-be2e-5417efc05266 and size 262144
*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.segment.standby.store.StandbyStore got segment ec1f739c-0e3c-41b8-be2e-5417efc05266 with size 262144
*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.segment.file.TarWriter Writing segment ec1f739c-0e3c-41b8-be2e-5417efc05266 to /mnt/crx/author/crx-quickstart/repository/segmentstore/data00016a.tar
In the standby’s error.log, you should see an entry such as this:
*INFO* [FelixStartLevel] org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService started standby sync with 10.20.30.40:8023 at 5 sec.
In the above log snippet, 10.20.30.40 is the IP address of the primary.
In the primary tarmk-coldstandby.log, you will see entries such as these:
*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.segment.standby.store.CommunicationObserver got message ‘s.d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd’ from client c7a7ce9b-1e16-488a-976e-627100ddd8cd
*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.segment.standby.server.StandbyServerHandler request segment id d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd
*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.segment.standby.server.StandbyServerHandler sending segment d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd to /10.20.30.40:34998
*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.segment.standby.store.CommunicationObserver did send segment with 262144 bytes to client c7a7ce9b-1e16-488a-976e-627100ddd8cd
In this case, the “client” mentioned in the log is the standby instance.
Once these entries stop appearing in the log, you can safely assume that the syncing process is complete.
While the above entries show that the polling mechanism is functioning properly, it is often useful to understand if there is any data being synchronized as polling is occurring. To do so, look for entries like the following:
*DEBUG* [defaultEventExecutorGroup-156-1] org.apache.jackrabbit.oak.segment.file.TarWriter Writing segment 3a03fafc-d1f9-4a8f-a67a-d0849d5a36d5 to /<<CQROOTDIRECTORY>>/crx-quickstart/repository/segmentstore/data00014a.tar
Additionally, when running with a non-shared FileDataStore, messages like the following will confirm that the binary files are being properly transmitted:
*DEBUG* [nioEventLoopGroup-228-1] org.apache.jackrabbit.oak.segment.standby.codec.ReplyDecoder received blob with id eb26faeaca7f6f5b636f0ececc592f1fd97ea1a9#169102 and size 169102
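To get a quick impression of how much data a sync has moved, the log entries shown above can be counted. This is a minimal sketch, assuming the debug logger writes to a dedicated file such as tarmk-coldstandby.log; the function names are hypothetical, and the match strings are taken from the sample log lines on this page.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a cold standby debug log.

# Print the number of "Writing segment" entries (segments persisted by TarWriter).
count_segment_writes() {
    grep -c 'TarWriter Writing segment' "$1"
}

# Print the number of "received blob" entries (binaries transferred
# when the data store is not shared).
count_blob_transfers() {
    grep -c 'ReplyDecoder received blob' "$1"
}
```

Running count_segment_writes against the standby log twice, a few sync intervals apart, shows whether segments are still arriving: if the number no longer grows, no new segment data is being written.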
Configuration configuration
The following OSGi settings are available for the Cold Standby service:
- Persist Configuration: if enabled, this will store the configuration in the repository instead of the traditional OSGi configuration files. It is recommended to keep this setting disabled on production systems so that the primary configuration will not be pulled by the standby.
- Mode (mode): this will choose the runmode of the instance (primary or standby).
- Port (port): the port to use for communication. The default is 8023.
- Primary host (primary.host): the host of the primary instance. This setting is only applicable for the standby.
- Sync interval (interval): this setting determines the interval between sync requests and is only applicable for the standby instance.
- Allowed IP-Ranges (primary.allowed-client-ip-ranges): the IP ranges that the primary will allow connections from.
- Secure (secure): enable SSL encryption. In order to make use of this setting, it must be enabled on all instances.
- Standby Read Timeout (standby.readtimeout): timeout for requests issued from the standby instance in milliseconds. The default value used is 60000 (one minute).
- Standby Automatic Cleanup (standby.autoclean): call the cleanup method if the size of the store increases on a sync cycle.
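Because some settings only apply to one role, it can help to sanity-check a configuration file before starting an instance. The following is a minimal sketch, assuming one property per line in the .config file; the function name check_standby_config is hypothetical, and the rule that a standby must name its primary explicitly is a convention of this sketch rather than something the service is known to enforce.

```shell
#!/bin/sh
# Hypothetical sanity check for a StandbyStoreService.config file.
# Assumes one property per line. Returns 0 if the file looks consistent.
check_standby_config() {
    cfg="$1"

    if grep -q '^mode="standby"' "$cfg"; then
        # In this sketch, a standby has to name its primary explicitly.
        grep -q '^primary\.host=' "$cfg" || {
            echo "standby mode: primary.host is not set" >&2
            return 1
        }
    elif grep -q '^mode="primary"' "$cfg"; then
        # primary.host and interval are only applicable to the standby.
        :
    else
        echo "mode must be \"primary\" or \"standby\"" >&2
        return 1
    fi
}
```

A call such as check_standby_config aem-standby/crx-quickstart/install/install.standby/org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config returns a non-zero status and prints a reason when the file fails the check.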
Failover procedures failover-procedures
In case the primary instance fails for any reason, you can set one of the standby instances to take the role of the primary by changing the start runmode as detailed below:
-
Go to the location where the standby instance is installed, and stop it.
-
In case you have a load balancer configured with the setup, you can remove the primary from the load balancer’s configuration at this point.
-
Back up the crx-quickstart folder from the standby installation folder. It can be used as a starting point when setting up a new standby.
-
Restart the instance using the primary runmode:
java -jar quickstart.jar -r primary,crx3,crx3tar
-
Add the new primary to the load balancer.
-
Create and start a new standby instance. For more info, see the procedure above on Creating an AEM TarMK Cold Standby Setup.
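The backup step of the failover procedure can be sketched as a small helper. This is a minimal sketch, assuming the standby instance has already been stopped; the function name backup_standby and the timestamped folder name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper for the backup step of a failover.
# Copies <standby-dir>/crx-quickstart to a timestamped sibling folder
# and prints the path of the backup.
backup_standby() {
    standby_dir="$1"
    backup="$standby_dir/crx-quickstart.backup-$(date +%Y%m%d%H%M%S)"

    # -R copies recursively, -p preserves timestamps and permissions.
    cp -Rp "$standby_dir/crx-quickstart" "$backup" || return 1
    echo "$backup"
}
```

After backup_standby /path/to/aem-standby has completed, the instance can be restarted from that folder with the primary runmode as shown in the procedure.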
Applying Hotfixes to a Cold Standby Setup applying-hotfixes-to-a-cold-standby-setup
The recommended way to apply hotfixes to a cold standby setup is by installing them to the primary instance and then cloning it into a new cold standby instance with the hotfixes installed.
You can do this by following the steps outlined below:
- Stop the synchronization process on the cold standby instance by going to the JMX Console and using the org.apache.jackrabbit.oak: Status (“Standby”) bean. For more information on how to do this, see the section on Monitoring.
- Stop the cold standby instance.
- Install the hotfix on the primary instance. For more details on how to install a hotfix, see How to Work With Packages.
- Test the instance for issues after the installation.
- Remove the cold standby instance by deleting its installation folder.
- Stop the primary instance and clone it by performing a file system copy of its entire installation folder to the location of the cold standby.
- Reconfigure the newly created clone to act as a cold standby instance. For additional details, see Creating an AEM TarMK Cold Standby Setup.
- Start both the primary and the cold standby instances.
Monitoring monitoring
The feature exposes information through JMX MBeans, so you can inspect the current state of the standby and the primary using the JMX console. The information can be found in an MBean of type org.apache.jackrabbit.oak:type="Standby" named Status.
Standby
When observing a standby instance, one node is exposed. The ID is usually a generic UUID.
This node has five read-only attributes:
- Running: boolean value indicating whether the sync process is running or not.
- Mode: Client: followed by the UUID used to identify the instance. Note that this UUID will change every time the configuration is updated.
- Status: a textual representation of the current state (like running or stopped).
- FailedRequests: the number of consecutive errors.
- SecondsSinceLastSuccess: the number of seconds since the last successful communication with the server. It will display -1 if no successful communication has been made.
There are also three invokable methods:
- start(): starts the sync process.
- stop(): stops the sync process.
- cleanup(): runs the cleanup operation on the standby.
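The FailedRequests and SecondsSinceLastSuccess attributes lend themselves to a simple health rule. The following is a minimal sketch, assuming the two values have already been read from the MBean by some other means; the function name standby_health, the status words, and the maximum-age threshold are all hypothetical.

```shell
#!/bin/sh
# Hypothetical check based on two attributes of the standby Status MBean.
# Arguments: FailedRequests, SecondsSinceLastSuccess, max allowed age in seconds.
# Prints "ok" or a short reason; returns 0 only when healthy.
standby_health() {
    failed="$1"
    since="$2"
    max_age="$3"

    # -1 means no successful communication has been made yet.
    if [ "$since" -eq -1 ]; then
        echo "never synced"
        return 1
    fi
    # FailedRequests counts consecutive errors, so any value above 0
    # means the most recent request failed.
    if [ "$failed" -gt 0 ]; then
        echo "failing"
        return 1
    fi
    if [ "$since" -gt "$max_age" ]; then
        echo "stale"
        return 1
    fi
    echo "ok"
}
```

With a sync interval of 5 seconds, a threshold of a minute or so leaves room for a few missed cycles; the right value depends on how quickly a stalled standby needs to be noticed.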
Primary
Observing the primary exposes some general information via an MBean whose ID value is the port number the TarMK standby service is using (8023 by default). Most of the methods and attributes are the same as for the standby, but some differ:
- Mode: will always show the value primary.
Furthermore, information for up to 10 clients (standby instances) that are connected to the primary can be retrieved. The MBean ID is the UUID of the instance. There are no invokable methods for these MBeans, but there are some very useful read-only attributes:
- Name: the ID of the client.
- LastSeenTimestamp: the timestamp of the last request in a textual representation.
- LastRequest: the last request of the client.
- RemoteAddress: the IP address of the client.
- RemotePort: the port the client used for the last request.
- TransferredSegments: the total number of segments transferred to this client.
- TransferredSegmentBytes: the total number of bytes transferred to this client.
Cold Standby Repository Maintenance cold-standby-repository-maintenance
Revision Cleanup revision-clean
When standby.autoclean is enabled, the cleanup() operation on the standby instance will be performed automatically.
Adobe recommends running maintenance on a regular basis to prevent excessive repository growth over time. To manually perform cold standby repository maintenance, follow the steps below:
-
Stop the standby process on the standby instance by going to the JMX Console and using the org.apache.jackrabbit.oak: Status (“Standby”) bean. For more info on how to do this, see the above section on Monitoring.
-
Stop the primary AEM instance.
-
Run the oak compaction tool on the primary instance. For more details, see Maintaining the Repository.
-
Start the primary instance.
-
Start the standby process on the standby instance using the same JMX bean as described in the first step.
-
Watch the logs and wait for synchronization to complete. It is possible that substantial growth in the standby repository will be seen at this time.
-
Run the cleanup() operation on the standby instance, using the same JMX bean as described in the first step.
It may take longer than usual for the standby instance to complete synchronization with the primary as offline compaction effectively rewrites the repository history, thus making computation of the changes in the repositories take more time. It should also be noted that once this process completes, the size of the repository on the standby will be roughly the same size as the repository on the primary.
As an alternative, the primary repository can be copied over to the standby manually after running compaction on the primary, essentially rebuilding the standby each time compaction runs.
Data Store Garbage Collection data-store-garbage-collection
It is important to run garbage collection on file data store instances from time to time; otherwise, deleted binaries will remain on the filesystem, eventually filling up the drive. To run garbage collection, follow the procedure below:
-
Run cold standby repository maintenance as described in the section above.
-
After the maintenance process has completed and the instances have been restarted:
- On the primary, run the data store garbage collection via the relevant JMX bean as described in this article.
- On the standby, the data store garbage collection is available only via the BlobGarbageCollection MBean - startBlobGC(). The RepositoryManagement MBean is not available on the standby.
NOTE: In case you are not using a shared data store, garbage collection will first have to be run on the primary and then on the standby.