Revision Cleanup
- Applies to:
- Experience Manager 6.5
- Topics:
- Administering
CREATED FOR:
- Admin
Introduction
Each update to the repository creates a content revision. As a result, with each update, the size of the repository grows. Old revisions must be cleaned up to free disk resources - this is important to avoid uncontrolled repository growth. This maintenance functionality is called Revision Cleanup. It has been available as an offline routine since Adobe Experience Manager (AEM) 6.0.
With AEM 6.3 and higher, an online version of this functionality called Online Revision Cleanup was introduced. Compared to Offline Revision Cleanup where the AEM instance has to be shut down, Online Revision Cleanup can be run while the AEM instance is online. Online Revision Cleanup is turned on by default and it is the recommended way of performing a revision cleanup.
Note: See the Video for an introduction and how to use Online Revision Cleanup.
The revision cleanup process consists of three phases: estimation, compaction, and cleanup. Estimation determines whether to run the next phase (compaction) based on how much garbage might be collected. During the compaction phase, segments and tar files are rewritten, leaving out any unused content. The cleanup phase then removes the old segments, including any garbage that they may contain. The offline mode can usually reclaim more space because the online mode must account for AEM’s working set, which prevents additional segments from being collected.
For more details regarding Revision Cleanup, read the official Oak documentation.
When to use Online Revision Cleanup as opposed to Offline Revision Cleanup?
Online Revision Cleanup is the recommended way of performing revision cleanup. Offline Revision Cleanup should be used only on an exceptional basis - for example, before migrating to the new storage format or if you are requested by Adobe Customer Care to do so.
How to Run Online Revision Cleanup
Online Revision Cleanup is configured by default to automatically run once a day on both AEM Author and Publish instances. All you need to do is define the maintenance window during a period with the least user activity. You can configure the Online Revision Cleanup task as follows:
- In the main AEM window, go to Tools - Operations - Dashboard - Maintenance or point your browser to: https://serveraddress:serverport/libs/granite/operations/content/maintenance.html
- Hover over Daily Maintenance Window and click the Settings icon.
- Enter the desired values (recurrence, start time, end time) and click Save.
Alternatively, if you want to run the revision cleanup task manually, you can:
- Go to Tools - Operations - Dashboard - Maintenance or browse directly to https://serveraddress:serverport/libs/granite/operations/content/maintenance.html
- Click the Daily Maintenance Window.
- Hover over the Revision Cleanup icon.
- Click Run.
Running Online Revision Cleanup After Offline Revision Cleanup
The revision cleanup process reclaims old revisions by generations. This means that each time you run revision cleanup a new generation is created and kept on the disk. There is a difference however between the two types of revision cleanup: offline revision cleanup keeps one generation while online revision cleanup keeps two generations. So, when you run online revision cleanup after offline revision cleanup the following happens:
- After the first online revision cleanup run, the repository size doubles. This happens because there are now two generations that are kept on disk.
- During the subsequent runs, the repository will temporarily grow while the new generation is created and then stabilize back to the size it had after the first run, as the online revision cleanup process reclaims the previous generation.
Also, keep in mind that depending on the type and number of commits, each generation can vary in size compared to the previous one, so the final size can vary from one run to the other.
Due to this fact, it is recommended to size the disk at least two or three times larger than the initially estimated repository size.
Full And Tail Compaction Modes
AEM 6.5 introduces two new modes for the compaction phase of the Online Revision Cleanup process:
- The full compaction mode rewrites all the segments and tar files in the whole repository. The subsequent cleanup phase can thus remove the maximum amount of garbage across the repository. Because full compaction affects the whole repository, it requires a considerable amount of system resources and time to complete. Full compaction corresponds to the compaction phase in AEM 6.3.
- The tail compaction mode rewrites only the most recent segments and tar files in the repository. The most recent segments and tar files are those that have been added since the last time either full or tail compaction ran. The subsequent cleanup phase can thus only remove the garbage contained in the recent part of the repository. Because tail compaction only affects a part of the repository, it requires considerably less system resources and time to complete than full compaction.
These compaction modes constitute a trade-off between efficiency and resource consumption: while tail compaction is less effective it also has less impact on normal system operation. In contrast, full compaction is more effective but has a bigger impact on normal system operation.
AEM 6.5 also introduces a more efficient content deduplication mechanism during compaction, which further reduces the on-disk footprint of the repository.
The two charts below present results from internal laboratory testing that illustrate the reduction of average execution times and the average footprint on disk in AEM 6.5 compared to AEM 6.3:
How To Configure Full and Tail Compaction
The default configuration runs tail compaction on weekdays and full compaction on Sundays. The default configuration can be changed by using the new configuration value full.gc.days of the RevisionCleanupTask maintenance task.
When you configure the full.gc.days value, full compaction runs on the days defined in the value and tail compaction runs on the remaining days. For example, if you configure full compaction to run on Sunday, then tail compaction runs Monday to Saturday. Likewise, if you configure full compaction to run every day of the week, then tail compaction does not run at all.
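The day-based selection described above can be sketched in a few lines of Python. This is only an illustration of the rule, not AEM code: the function name and the representation of the configured days as a set of weekday names are assumptions, and the sketch says nothing about the actual syntax of the full.gc.days value.

```python
from datetime import date

# Weekday names indexed by date.weekday() (Monday == 0).
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Assumption: the configured full-compaction days are modeled as a set of
# weekday names. Sunday only mirrors the default described in the text.
DEFAULT_FULL_GC_DAYS = frozenset({"Sunday"})


def compaction_mode(day: date, full_gc_days=DEFAULT_FULL_GC_DAYS) -> str:
    """Return "full" on configured days and "tail" on every other day."""
    return "full" if WEEKDAYS[day.weekday()] in full_gc_days else "tail"
```

With the default set, a Sunday yields "full" and any other day yields "tail"; passing all seven weekday names makes every day a full compaction day.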
Also, consider that:
- Tail compaction is less effective and it has less impact on normal system operations. It is thus intended to be run during business days.
- Full compaction is more effective but also has a bigger impact on normal system operations. It is thus intended to be used off business days.
- Both tail compaction and full compaction should be scheduled to run during off-peak hours.
Troubleshooting
When using the new compaction modes, keep in mind the following:
- You can monitor the input/output (I/O) activity, for example: I/O operations, CPU waiting for IO, commit queue size. This helps determine whether the system is becoming I/O bound and requires upsizing.
- The RevisionCleanupTaskHealthCheck indicates the overall health status of the Online Revision Cleanup. It works the same way as in AEM 6.3 and does not distinguish between full and tail compaction.
- The log messages carry relevant information about the compaction modes. For example, when Online Revision Cleanup starts, the corresponding log messages indicate the compaction mode. Also, in some corner cases, the system reverts to full compaction when it was scheduled to run a tail compaction and the log messages indicate this change. The log samples below indicate the compaction mode and the change from tail to full compaction:
TarMK GC: running tail compaction
TarMK GC: no base state available, running full compaction instead
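As a rough illustration, the two log samples above can be used to detect which mode ran and whether the system fell back from tail to full compaction. The sketch below matches only those two message texts; the function name and return shape are assumptions made for this example.

```python
# Message texts copied from the two log samples above.
TAIL = "TarMK GC: running tail compaction"
FALLBACK = "TarMK GC: no base state available, running full compaction instead"


def detect_compaction_mode(log_lines):
    """Return (mode, fell_back) for the given log lines.

    mode is "tail", "full", or None when neither message is present.
    fell_back is True when the fallback message was seen.
    """
    mode, fell_back = None, False
    for line in log_lines:
        if FALLBACK in line:
            mode, fell_back = "full", True
        elif TAIL in line and not fell_back:
            mode = "tail"
    return mode, fell_back
```

A scheduled full compaction that starts normally is not covered, because its log message does not appear in the samples above.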
Known Limitations
Sometimes, alternating between the tail and full compaction modes delays the cleanup process. More precisely, the repository will grow after a full compaction (it doubles in size). The extra space is reclaimed in the subsequent tail compaction, when the repository drops below the pre-full compaction size. Parallel maintenance task executions should also be avoided.
It is recommended to size the disk at least two or three times larger than the initially estimated repository size.
Online Revision Cleanup Frequently Asked Questions
AEM 6.5 Upgrade Considerations
The persistence format of TarMK changes with AEM 6.5. These changes do not require a proactive migration step. Existing repositories go through a rolling migration, which is transparent to the user. The migration process is initiated the first time AEM 6.5 (or a related tool) accesses the repository.
Once the migration to the AEM 6.5 persistence format has been initiated, the repository cannot be reverted to the previous AEM 6.3 persistence format.
Migrating to Oak Segment Tar
In AEM 6.3, changes to the storage format were needed, especially for improving the performance and efficacy of Online Revision Cleanup. These changes are not backwards compatible, and repositories created with the old Oak Segment (AEM 6.2 and previous) must be migrated.
Additional benefits of changing the storage format:
- Better scalability (optimized segment size).
- Faster Data Store Garbage Collection.
- Ground work for future enhancements.
Running Online Revision Cleanup
Offline Revision Cleanup reclaims everything except the latest generation, whereas Online Revision Cleanup keeps the latest two generations. On a fresh repository, Online Revision Cleanup does not reclaim any space when executed for the first time after Offline Revision Cleanup, because there is no generation old enough to be reclaimed.
Also, read the "Running Online Revision Cleanup after Offline Revision Cleanup" section of this chapter.
The factors that determine the duration of Online Revision Cleanup are:
- Repository size
- Load on the system (requests per minute, specifically write operations)
- Activity pattern (reads versus writes)
- Hardware specifications (CPU performance, Memory, IOPS)
Disk space is continuously monitored during Online Revision Cleanup. Should the available disk space drop below a critical value, the process is canceled. The critical value is 25% of the current disk footprint of the repository and it is not configurable.
Adobe recommends you size the disk at least two or three times larger than the initially estimated repository size.
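The two disk-space rules above amount to simple arithmetic, sketched here in Python. The function names are hypothetical and the sketch only restates the 25% critical value and the "two or three times" sizing guidance; it does not reproduce how AEM measures free space.

```python
GIB = 1024 ** 3

# From the text: the critical value is 25% of the repository's current disk footprint.
CRITICAL_FRACTION = 0.25


def cleanup_would_be_canceled(free_bytes: int, repo_bytes: int) -> bool:
    """True when free disk space has dropped below the critical value."""
    return free_bytes < CRITICAL_FRACTION * repo_bytes


def recommended_disk_bytes(estimated_repo_bytes: int, factor: int = 2) -> int:
    """Disk size following the 'two or three times larger' guidance."""
    if factor not in (2, 3):
        raise ValueError("the guidance in the text covers factors 2 and 3 only")
    return factor * estimated_repo_bytes
```

For a 100 GiB repository, the process would be canceled once less than 25 GiB is free, and the guidance suggests provisioning 200 to 300 GiB.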
Free heap space is continuously monitored during the cleanup process. Should the free heap space drop below a critical value, the process is canceled. The critical value is configured through org.apache.jackrabbit.oak.segment.SegmentNodeStoreService#MEMORY_THRESHOLD. The default value is 15%.
Recommendations for minimum compaction heap sizing are not separated from the AEM memory sizing recommendations. As a general rule, if an AEM instance is sized adequately for its use cases and expected load, the cleanup process obtains enough memory.
- Ensure that it is executed daily.
- Ensure that it is executed during minimal repository activities by configuring the maintenance windows in Operations Dashboard accordingly.
- Scale up system resources (CPU, Memory, I/O).
Revision Cleanup relies on an estimation phase to decide if there is enough garbage to be cleaned. The estimator compares the current size against the size of the repository after it was last compacted. If the size exceeds the configured delta, cleanup runs. The size delta is set at 1 GB. This effectively means that if the repository size did not grow by 1 GB since the last cleanup run, the new revision cleanup iteration is skipped.
Below are the relevant log entries for the estimation phase:
- Revision GC runs: Size delta is N% or N/N (N/N bytes), so running compaction
- Revision GC does not run: Size delta is N% or N/N (N/N bytes), so skipping compaction for now
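The estimation rule described above can be written as a single comparison. The following is a minimal sketch, assuming that "1 GB" means 2**30 bytes and that sizes are compared in bytes; the function name is hypothetical and this is not the Oak estimator itself.

```python
# Assumption: the 1 GB size delta is interpreted as 2**30 bytes.
SIZE_DELTA_BYTES = 1024 ** 3


def should_run_compaction(current_size: int,
                          size_after_last_compaction: int,
                          delta: int = SIZE_DELTA_BYTES) -> bool:
    """True when the repository grew by more than the delta since the last compaction."""
    return current_size - size_after_last_compaction > delta
```

A repository that grew from 10 GiB to 10.5 GiB since the last run would skip compaction under this rule, while growth to 12 GiB would trigger it.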
If there's write concurrency on the system, online revision cleanup might require exclusive write access to be able to commit the changes at the end of a compaction cycle. The system goes into forceCompact mode, as explained in more detail in the Oak documentation. During force compact, an exclusive write lock is acquired to finally commit the changes without any concurrent writes interfering. To limit the impact on response times, a time-out value can be defined. This value is set to one minute by default, which means that if force compact does not complete within one minute, the compaction process is aborted in favor of concurrent commits.
The duration of force compact depends on the following factors:
- hardware: specifically IOPS. The duration decreases with more IOPS.
- segment store size: duration increases with the size of the segment store.
In a cold standby setup, only the primary instance must be configured to run Online Revision Cleanup. On the standby instance, Online Revision Cleanup does not need to be scheduled specifically.
The corresponding operation on a standby instance is the Automatic Cleanup - this corresponds to the cleanup phase of the Online Revision Cleanup. The Automatic Cleanup is run on the standby instance after the execution of the Online Revision Cleanup on the primary instance.
Estimation and compaction phases will not be run on a standby instance.
Offline Revision Cleanup can immediately remove old revisions while Online Revision Cleanup must account for old revisions still being referenced by the application stack. The former can thus remove garbage more aggressively than the latter where the effect is amortized over the course of a few garbage collection cycles.
Also, read the "Running Online Revision Cleanup after Offline Revision Cleanup" section of this chapter.
- On Windows environments, regular file access is always enforced so memory mapped access is not used. As a general advice, all the available RAM should be allocated to the heap and the segmentCache size should be increased. You increase the segmentCache by adding the segmentCache.size option to the org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config (for example, segmentCache.size=20480). Remember to leave out some RAM for the operating system and other processes.
- On non-Windows environments, increase the size of the physical memory to improve the memory mapping of the repository.
Monitoring Online Revision Cleanup
- Disk space should be monitored when Online Revision Cleanup is enabled. The cleanup does not run or it terminates preemptively when there is insufficient disk space.
- Check the logs for the completion time of the Online Revision Cleanup. It should not take longer than 2 hours.
- Number of checkpoints. If there are more than 3 checkpoints when compaction runs, it is recommended to clean up the checkpoints.
You can check if the Online Revision Cleanup has completed successfully by checking the logs.
For example, "TarMK GC #{}: compaction completed in {} ({} ms), after {} cycles" means the compaction step completed successfully unless preceded by the message "TarMK GC #{}: compaction gave up compacting concurrent commits after {} cycles", which means there was too much concurrent load.
Correspondingly, there is a message "TarMK GC #{}: cleanup completed in {} ({} ms)" for the successful completion of the cleanup step.
Status, progress, and statistics are exposed via JMX (SegmentRevisionGarbageCollection MBean). For more details about the SegmentRevisionGarbageCollection MBean, read the following paragraph.
Progress can be tracked via the EstimatedRevisionGCCompletion attribute of the SegmentRevisionGarbageCollection MBean.
You can obtain a reference of the MBean using the ObjectName org.apache.jackrabbit.oak:name="Segment node store revision garbage collection",type="SegmentRevisionGarbageCollection".
The statistics are only available since the last system start. External monitoring tooling could be used to keep the data beyond AEM uptime.
- Online Revision Cleanup has started / stopped
  - Online Revision Cleanup is composed of three phases: estimation, compaction, and cleanup. Estimation can cause compaction and cleanup to be skipped if the repository does not contain enough garbage. In the latest version of AEM, the message "TarMK GC #{}: estimation started" marks the start of estimation, "TarMK GC #{}: compaction started, strategy={}" marks the start of compaction and "TarMK GC #{}: cleanup started. Current repository size is {} ({} bytes)" marks the start of cleanup.
- Disk space gained by the revision cleanup
  - Space is reclaimed only when the cleanup phase completes. The completion of the cleanup phase is marked by the log message "TarMK GC #{}: cleanup completed in {} ({} ms). Post cleanup size is {} ({} bytes) and space reclaimed {} ({} bytes). Compaction map weight/depth is {}/{} ({} bytes/{}).".
- A problem occurred during the revision cleanup
  - There are many failure conditions. All of them are marked by WARN or ERROR log messages starting with "TarMK GC".
Also, see the Troubleshooting Based on Error Messages section below.
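To track reclaimed space over time, the cleanup completion message can be parsed from the logs. The sketch below is based on the message template quoted above and assumes that the values in parentheses are printed as plain digits; the function name is hypothetical, and the sample line in the usage note uses made-up values.

```python
import re

# Pattern derived from the "cleanup completed" message template.
# Assumption: millisecond and byte counts appear as plain digits.
CLEANUP_DONE = re.compile(
    r"TarMK GC #(?P<run>\d+): cleanup completed in .*?\((?P<ms>\d+) ms\)\. "
    r"Post cleanup size is .*?\((?P<size>\d+) bytes\) "
    r"and space reclaimed .*?\((?P<reclaimed>\d+) bytes\)"
)


def parse_cleanup_completed(line: str):
    """Return run number, duration, size and reclaimed bytes, or None if no match."""
    match = CLEANUP_DONE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}
```

For a hypothetical line such as "TarMK GC #3: cleanup completed in 1.2 s (1200 ms). Post cleanup size is 10.0 GB (10000000000 bytes) and space reclaimed 2.0 GB (2000000000 bytes).", the function returns the run number 3 together with the three numeric values. Lines that do not contain the message return None.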
The log message "TarMK GC #3: cleanup completed" includes the size of the repository and the amount of reclaimed garbage.
A repository integrity check is not needed after the Online Revision Cleanup.
However, you can perform the following actions to check the repository status after cleanup:
- A repository traversal check
- Use the oak-run tool after the cleanup process has completed to check for inconsistencies. For further info on how to do this, check the Apache Documentation. You do not need to shut down AEM to run the tool.
The Revision Cleanup Health Check is part of the Operations Dashboard.
The status is GREEN if the last execution of the Online Revision Cleanup maintenance task has completed successfully.
It is YELLOW if the Online Revision Cleanup maintenance task was canceled once.
It is RED if the Online Revision Cleanup maintenance task was canceled three times in a row. In this case, manual interaction is required or Online Revision Cleanup is likely to fail again. For more information, read the Troubleshooting section below.
Also, the Health Check status is reset after a system restart. So, a freshly restarted instance shows a green status on the Revision Cleanup Health Check. External monitoring tooling could be used to keep the data beyond AEM uptime.
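The status rules above can be modeled as a small function over the task outcomes recorded since the last restart. This is a sketch of the described behavior only, not the Health Check implementation. The text does not say what status applies after exactly two consecutive cancellations; the sketch assumes YELLOW for that case.

```python
def health_status(outcomes):
    """Map task outcomes since the last restart to a status color.

    outcomes: booleans, oldest first; True means the run completed,
    False means it was canceled. An empty history (fresh restart) is GREEN.
    """
    consecutive_cancels = 0
    for completed in reversed(outcomes):
        if completed:
            break
        consecutive_cancels += 1
    if consecutive_cancels == 0:
        return "GREEN"
    if consecutive_cancels >= 3:
        return "RED"
    # Assumption: one or two cancellations in a row both report YELLOW.
    return "YELLOW"
```

Because the status is reset on restart, an external monitoring tool would need to persist the outcome history if it should survive AEM uptime.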
Status, progress, and statistics are exposed via JMX by using the SegmentRevisionGarbageCollection MBean. See also the following Oak documentation.
You can obtain a reference of the MBean by using the ObjectName org.apache.jackrabbit.oak:name="Segment node store revision garbage collection",type="SegmentRevisionGarbageCollection".
The statistics are available only since the last system start. External monitoring tooling could be used to keep the data beyond AEM uptime.
The log files can also be used to check the status, progress, and statistics of the Automatic Cleanup.
- Disk space should be monitored when the Automatic Cleanup is run.
- Completion time (via the logs) to ensure that 2 hours are not exceeded.
- Segmentstore size after the Automatic Cleanup has run. The size of the segmentstore on the standby instance should be approximately the same as the one on the primary instance.
Troubleshooting Online Revision Cleanup
You can take several steps to find and fix the issue:
- First, check the log entries.
- Depending on the information in the logs, take appropriate action:
  - If the logs show five missed compact cycles and a timeout on the forceCompact cycle, schedule the maintenance window to a quiet time when the amount of repository writes is low. You can check repository writes in the repository metrics monitoring tool at https://serveraddress:serverport/libs/granite/operations/content/monitoring/page.html
  - If the cleanup stopped at the end of the maintenance window, make sure the maintenance window configured in the Maintenance Tasks user interface is long enough.
  - If available heap memory is not sufficient, make sure that the instance has enough memory.
  - If there is a late reaction, the segmentstore might grow too much for Online Revision Cleanup to complete even within a longer maintenance window. For example, if no Online Revision Cleanup completed successfully in the last week, it is recommended to plan an offline maintenance and to run Offline Revision Cleanup to bring the segmentstore back to a manageable size.
What causes SegmentNotFoundException instances to be logged in the error.log and how can I recover?
A SegmentNotFoundException is logged by the TarMK when it tries to access a storage unit (a segment) that it cannot find. There are three scenarios that could cause this issue:
- An application that circumvents the recommended access mechanisms (like Sling and the JCR API) and uses a lower-level API/SPI to access the repository and then exceeds the retention time of a segment. That is, it keeps a reference to an entity longer than the retention time allowed by the Online Revision Cleanup (24 hours by default). This case is transient and does not lead to data corruption. To recover, the oak-run tool should be used to confirm the transient nature of the exception (the oak-run check should not report any errors). To do this, the instance must be taken offline and restarted afterwards.
- An external event caused the corruption of the data on the disk. This can be a disk failure, out of disk space or an accidental modification of the required data files. In this case, the instance must be taken offline and repaired using the oak-run check. For more details on how to perform the oak-run check, read the following Apache documentation.
- Address all other occurrences through Adobe Customer Care.
Troubleshooting Based On Error Messages
The error.log is verbose if there are incidents during the online revision cleanup process. The following matrix aims to explain the most common messages and to provide possible solutions:
How to Run Offline Revision Cleanup
Adobe provides a tool called Oak-run to perform revision cleanup. It can be downloaded at the following location:
https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/
The tool is a runnable jar that can be manually run to compact the repository. The process is called offline revision cleanup because the repository must be shut down to properly run the tool. Make sure to plan the cleanup in accordance with your maintenance window.
For tips on how to increase the performance of the cleanup process, see Increasing the Performance of Offline Revision Cleanup.
- Always make sure you have a recent backup of the AEM instance. Shut down AEM.
- (Optional) Use the tool to find old checkpoints:
  java -jar oak-run.jar checkpoints install-folder/crx-quickstart/repository/segmentstore
- (Optional) Then, delete the unreferenced checkpoints:
  java -jar oak-run.jar checkpoints install-folder/crx-quickstart/repository/segmentstore rm-unreferenced
- Run the compaction and wait for it to complete:
  java -jar -Dsun.arch.data.model=32 oak-run.jar compact install-folder/crx-quickstart/repository/segmentstore
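If the procedure is scripted, the command sequence above can be assembled programmatically before anything is executed. The sketch below only builds the argument lists shown in the steps; the function name and parameters are hypothetical, and it deliberately does not shut down AEM or take a backup.

```python
def offline_cleanup_commands(segmentstore: str,
                             oak_run_jar: str = "oak-run.jar",
                             clean_checkpoints: bool = True):
    """Return the oak-run invocations for offline revision cleanup, in order.

    Each command is an argument list that mirrors the steps above.
    The optional checkpoint steps can be skipped with clean_checkpoints=False.
    """
    commands = []
    if clean_checkpoints:
        # Optional: list old checkpoints, then remove the unreferenced ones.
        commands.append(["java", "-jar", oak_run_jar, "checkpoints", segmentstore])
        commands.append(["java", "-jar", oak_run_jar, "checkpoints",
                         segmentstore, "rm-unreferenced"])
    # Run the compaction itself.
    commands.append(["java", "-jar", "-Dsun.arch.data.model=32", oak_run_jar,
                     "compact", segmentstore])
    return commands
```

Each list could then be passed to subprocess.run(command, check=True), but only after the backup has been verified and the instance has been stopped.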
Increasing the Performance of Offline Revision Cleanup
The oak-run tool introduces several features that aim to increase the performance of the revision cleanup process and minimize the maintenance window as much as possible.
The list includes several command-line parameters, as described below:
- -mmap. You can set this as true or false. If set to true, memory mapped access is used. If set to false, file access is used. If not specified, memory mapped access is used on 64-bit systems and file access is used on 32-bit systems. On Windows, regular file access is always enforced and this option is ignored. This parameter has replaced the -Dtar.memoryMapped parameter.
- -Dupdate.limit. Defines the threshold for the flush of a temporary transaction to disk. The default value is 10000.
- -Dcompress-interval. Number of compaction map entries to keep until compressing the current map. The default is 1000000. You should increase this value to an even higher number for faster throughput, if enough heap memory is available. This parameter has been removed in Oak version 1.6 and has no effect.
- -Dcompaction-progress-log. The number of compacted nodes that are logged. The default value is 150000, which means that the first 150000 compacted nodes are logged during the operation. Use this with the next parameter documented below.
- -Dtar.PersistCompactionMap. Set this parameter to true to use disk space instead of heap memory for compaction map persistence. Requires the oak-run tool versions 1.4 and higher. For further details, see question 3 in the Offline Revision Cleanup Frequently Asked Questions section. This parameter has been removed in Oak version 1.6 and has no effect.
- --force. Force compaction and ignore a non-matching segment store version.
The --force parameter upgrades the segment store to the latest version, which is incompatible with older Oak versions. Also, consider that no downgrade is possible. Generally, you should use these parameters with caution and only if you are knowledgeable about how to use them.
An example of the parameters in use:
java -Dupdate.limit=10000 -Dcompaction-progress-log=150000 -Dlogback.configurationFile=logback.xml -Xmx8g -jar oak-run-*.jar checkpoints <repository>
Additional Methods of Triggering Revision Cleanup
In addition to the methods presented above, you can also trigger the revision cleanup mechanism by using the JMX console as follows:
- Open the JMX Console by going to http://localhost:4502/system/console/jmx
- Click the RevisionGarbageCollection MBean.
- In the next window, click startRevisionGC() and then Invoke to start the Revision Garbage Collection job.
Offline Revision Cleanup Frequently Asked Questions
- Oak revision: Oak organizes all the content in a large tree hierarchy that consists of nodes and properties. Each snapshot or revision of this content tree is immutable, and changes to the tree are expressed as a sequence of new revisions. Typically, each content modification triggers a new revision.
- Page Version: Versioning creates a "snapshot" of a page at a specific point in time. Typically, a new version is created when a page is activated. For more information, see Working with Page Versions.
InMemoryCompactionMap.findEntry, use the following parameter with the oak-run tool versions 1.4 or higher: -Dtar.PersistCompactionMap=true. The -Dtar.PersistCompactionMap parameter has been removed in Oak version 1.6.