Data Store Garbage Collection
When a conventional WCM asset is removed, the reference to the underlying data store record may be removed from the node hierarchy, but the data store record itself remains. This unreferenced data store record then becomes “garbage” that need not be retained. In instances where a number of garbage assets exist, it is beneficial to get rid of them to preserve space and to optimize backup and filesystem maintenance performance.
For the most part, a WCM application collects information far more often than it deletes it. Although new images are added, even superseding old versions, the version control system retains the old versions and supports reverting to them if needed. Thus the majority of the content we think of as adding to the system is effectively stored permanently. So what is the typical source of “garbage” in the repository that we might want to clean up?
AEM uses the repository as the storage for a number of internal and housekeeping activities:
- Packages built and downloaded
- Temporary files created for publish replication
- Workflow payloads
- Assets created temporarily during DAM rendering
When any of these temporary objects is large enough to require storage in the data store, and when the object eventually passes out of use, the data store record itself remains as “garbage”. In a typical WCM author/publish application, the largest source of garbage of this type is commonly the process of publish activation. When data is replicated to Publish, it is first gathered in collections in an efficient data format called “Durbo” and stored in the repository under /var/replication/data. The data bundles are often larger than the critical size threshold for the data store and therefore wind up stored as data store records. When the replication is complete, the node in /var/replication/data is deleted, but the data store record remains as “garbage”.
Another source of recoverable garbage is packages. Package data, like everything else, is stored in the repository, and packages larger than 4 KB therefore end up in the data store. Over the course of a development project, or over time while maintaining a system, packages may be built and rebuilt many times; each build results in a new data store record, orphaning the previous build’s record.
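The 4 KB threshold is typically governed by the data store’s minRecordLength setting. As a quick, unofficial check, the following shell sketch inspects that setting; the configuration filename follows the conventions described in Configuring node stores and data stores in AEM 6, and the path shown is an assumption that may differ for your installation:

# Sketch: binaries smaller than minRecordLength (for example 4096 bytes = 4 KB)
# are in-lined in the node store rather than written to the data store.
grep minRecordLength crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config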
How does data store garbage collection work?
If the repository has been configured with an external data store, data store garbage collection runs automatically as part of the Weekly Maintenance Window. The system administrator can also run data store garbage collection manually on an as-needed basis. In general, it is recommended that data store garbage collection be performed periodically, but that the following factors be taken into account when planning collections:
- Data store garbage collections take time and may impact performance, so they should be planned accordingly.
- Removal of data store garbage records does not affect normal performance, so this is not a performance optimization.
- If storage utilization and related factors like backup times are not a concern, then data store garbage collection might be safely deferred.
The data store garbage collector first notes the current timestamp when the process begins. The collection is then carried out using a multi-pass mark/sweep algorithm.
In the first (mark) phase, the data store garbage collector performs a comprehensive traversal of all of the repository content. For each content object that holds a reference to a data store record, it locates the corresponding file in the filesystem and performs a metadata update, modifying the “last modified” (MTIME) attribute. Files touched in this phase thus become newer than the initial baseline timestamp.
In the second (sweep) phase, the data store garbage collector traverses the physical directory structure of the data store, much as the Unix find command would (see the sketch after this list). It examines the “last modified” (MTIME) attribute of each file and makes the following determination:
- If the MTIME is newer than the initial baseline timestamp, then either the file was found in the first phase, or it is an entirely new file that was added to the repository while the collection process was ongoing. In either case the record is taken to be active and the file is not deleted.
- If the MTIME is prior to the initial baseline timestamp, then the file is not an actively referenced file and it is considered removable garbage.
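As a conceptual illustration only, not a supported maintenance procedure, the sweep phase’s decision is analogous to the following shell sketch; the data store path and baseline marker file are hypothetical:

# Record the baseline timestamp when collection begins (hypothetical marker file).
touch /tmp/gc-baseline
# ... the mark phase runs here, updating the MTIME of every referenced file ...
# Sweep: files not newer than the baseline were never marked and are garbage.
find crx-quickstart/repository/datastore -type f ! -newer /tmp/gc-baseline -print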
This approach works well for a single node with a private data store. However, the data store may be shared between repositories, in which case potentially active, live references to data store records from the other repositories are not checked, and actively referenced files may be mistakenly removed. It is imperative that the system administrator understand the shared nature of the data store before planning any garbage collection, and use the simple built-in data store garbage collection process only when the data store is known not to be shared.
Running Data Store Garbage Collection
There are three ways of running data store garbage collection, depending on the data store setup on which AEM is running:
- Via Revision Cleanup - a garbage collection mechanism usually used for node store cleanup.
- Via Data Store Garbage Collection - a garbage collection mechanism specific to external data stores, available on the Operations Dashboard.
- Via the JMX Console.
If TarMK is being used as both the node store and data store, then Revision Cleanup can be used for garbage collection of both. However, if an external data store is configured, such as the File System Data Store, then data store garbage collection must be triggered explicitly, separately from Revision Cleanup. It can be triggered either via the Operations Dashboard or via the JMX Console.
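As a quick, unofficial way to tell which case applies, you can check whether a data store OSGi configuration is present; the path below follows the usual AEM 6 installation layout and is an assumption:

# Sketch: an external data store is typically configured via an OSGi .config
# file in the install folder; no match suggests binaries are in-lined in TarMK.
ls crx-quickstart/install/ | grep -i datastore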
The following table shows the data store garbage collection mechanism to use for each supported data store deployment in AEM 6:
Node Store | Data Store | Garbage Collection Mechanism |
TarMK | TarMK | Revision Cleanup (binaries are in-lined with the Segment Store) |
TarMK | External Filesystem | Data Store Garbage Collection task via the Operations Dashboard or the JMX Console |
MongoDB | MongoDB | Data Store Garbage Collection task via the Operations Dashboard or the JMX Console |
MongoDB | External Filesystem | Data Store Garbage Collection task via the Operations Dashboard or the JMX Console |
Running Data Store Garbage Collection via the Operations Dashboard
The Weekly Maintenance Window, available via the Operations Dashboard, contains a built-in task that triggers Data Store Garbage Collection at 1 AM each Sunday.
If you need to run data store garbage collection outside of this time, it can be triggered manually via the Operations Dashboard.
Before running data store garbage collection, you should check that no backups are running at the same time.
- Open the Operations Dashboard via Navigation -> Tools -> Operations -> Maintenance.
- Click or tap the Weekly Maintenance Window.
- Select the Data Store Garbage Collection task and then click or tap the Run icon.
- Data store garbage collection runs and its status is displayed in the dashboard.
Running Data Store Garbage Collection via the JMX Console
This section describes how to manually run data store garbage collection via the JMX Console. If your installation is set up without an external data store, this does not apply to your installation; instead, see the instructions on how to run Revision Cleanup under Maintaining the Repository.
To run garbage collection:
- In the Apache Felix OSGi Management Console, open the Main tab and select JMX from the menu.
- Next, search for and click the Repository Manager MBean (or go to https://<host>:<port>/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Drepository+manager%2Ctype%3DRepositoryManagement).
- Click startDataStoreGC(boolean markOnly).
- Enter “true” for the markOnly parameter if required:

Option | Description |
boolean markOnly | Set to true to only mark references and not sweep in the mark and sweep operation. This mode is to be used when the underlying BlobStore is shared between multiple repositories. For all other cases, set it to false to perform full garbage collection. |

- Click Invoke. CRX runs the garbage collection and indicates when it has completed.
If no external data store is configured, the message “Cannot perform operation: no service of type BlobGCMBean found” is returned after invoking the operation. See Configuring node stores and data stores in AEM 6 for information on how to set up a file data store.
Automating Data Store Garbage Collection
If possible, data store garbage collection should be run when there is little load on the system, for example in the morning.
The Weekly Maintenance Window, available via the Operations Dashboard, contains a built-in task that triggers Data Store Garbage Collection at 1 AM each Sunday. You should check that no backups are running at this time. The start of the maintenance window can be customized via the dashboard as necessary.
If you don’t wish to run data store garbage collection with the Weekly Maintenance Window in the Operations Dashboard, it can also be automated using the wget or curl HTTP clients. In the curl commands below, various parameters might need to be configured for your instance; for example, the hostname (localhost), port (4502), admin password (xyz), and various parameters for the actual data store garbage collection.
Here is an example curl command to invoke data store garbage collection via the command line:
curl -u admin:admin -X POST --data markOnly=true http://localhost:4503/system/console/jmx/org.apache.jackrabbit.oak"%"3Aname"%"3Drepository+manager"%"2Ctype"%"3DRepositoryManagement/op/startDataStoreGC/boolean
The curl command returns immediately.
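As a minimal scheduling sketch, assuming the example command above and an illustrative script path, the call can be wrapped in a script and invoked from cron (cron treats literal % characters specially, so keeping the URL inside a script avoids extra escaping):

#!/bin/bash
# /opt/scripts/datastore-gc.sh (hypothetical path): invoke AEM data store
# garbage collection via the JMX HTTP endpoint. Host, port, and credentials
# are illustrative; adjust them for your instance.
curl -u admin:admin -X POST --data markOnly=false \
  "http://localhost:4503/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Drepository+manager%2Ctype%3DRepositoryManagement/op/startDataStoreGC/boolean"

A hypothetical crontab entry to run the script every Sunday at 2:30 AM:

30 2 * * 0 /opt/scripts/datastore-gc.sh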
Checking Data Store Consistency
The data store consistency check will report any data store binaries that are missing but are still referenced. To start a consistency check, follow these steps:
- Go to the JMX console. For information on how to use the JMX console, see this article.
- Search for the Blob GC MBean and click it.
- Click the checkConsistency() link.
After the consistency check is complete, a message shows the number of binaries reported as missing. If the number is greater than 0, check the error.log for more details on the missing binaries.
Below you will find an example of how the missing binaries are reported in the logs:
11:32:39.673 INFO [main] MarkSweepGarbageCollector.java:600 Consistency check found [1] missing blobs
11:32:39.673 WARN [main] MarkSweepGarbageCollector.java:602 Consistency check failure in the blob store : DataStore backed BlobStore [org.apache.jackrabbit.oak.plugins.blob.datastore.OakFileDataStore], check missing candidates in file /tmp/gcworkdir-1467352959243/gccand-1467352959243
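As a quick follow-up, assuming the default AEM log location (an assumption that may differ per installation), the consistency check summary can be pulled from the log like this:

# Sketch: locate the consistency check summary and any missing-blob warnings.
grep "Consistency check" crx-quickstart/logs/error.log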