How does data store garbage collection work?
If the repository has been configured with an external data store, data store garbage collection runs automatically as part of the Weekly Maintenance Window. The system administrator can also run data store garbage collection manually on an as-needed basis. In general, it is recommended that data store garbage collection be performed periodically, with the following factors taken into account when planning it:
- Data store garbage collections take time and may impact performance, so they should be planned accordingly.
- Removing garbage records from the data store does not affect normal performance, so garbage collection should not be treated as a performance optimization.
- If storage utilization and related factors like backup times are not a concern, then data store garbage collection might be safely deferred.
When the process begins, the data store garbage collector first records the current timestamp, which serves as the initial baseline. The collection is then carried out using a multi-pass mark-and-sweep algorithm.
In the first phase, the data store garbage collector performs a comprehensive traversal of all of the repository content. For each content object that references a data store record, it locates the corresponding file in the filesystem and performs a metadata update, modifying the “last modified” (MTIME) attribute. As a result, every file accessed in this phase becomes newer than the initial baseline timestamp.
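The first (mark) phase can be illustrated with a minimal Python sketch. This is not the repository's actual implementation; the function name `mark_referenced` and the idea of receiving the referenced record files as an iterable of paths are assumptions made only for illustration.

```python
import os
from pathlib import Path
from typing import Iterable


def mark_referenced(record_paths: Iterable[Path]) -> int:
    """Touch every referenced data store file (hypothetical helper).

    Updating the MTIME to "now" makes each referenced file newer than
    the baseline timestamp recorded before the collection started.
    Returns the number of files that were touched.
    """
    touched = 0
    for path in record_paths:
        if path.is_file():
            # times=None sets both access and modification time to now.
            os.utime(path, None)
            touched += 1
    return touched
```

In this sketch, paths that do not resolve to an existing file are simply skipped; how a real collector handles dangling references is not covered here.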
In the second phase, the data store garbage collector traverses the physical directory structure of the data store in much the same way as a “find”. It examines the “last modified” (MTIME) attribute of each file and makes the following determination:
- If the MTIME is newer than the initial baseline timestamp, then either the file was found in the first phase, or it is an entirely new file that was added to the repository while the collection process was ongoing. In either of these cases the record is taken to be active and the file shall not be deleted.
- If the MTIME is prior to the initial baseline timestamp, then the file is not an actively referenced file and it is considered removable garbage.
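The second (sweep) phase can be sketched in the same way. Again, this is only an illustration under stated assumptions: `find_garbage`, its parameters, and the dry-run default are hypothetical, and the baseline is assumed to be a POSIX timestamp captured (for example with `time.time()`) before the mark phase started.

```python
import os
from pathlib import Path
from typing import List


def find_garbage(datastore_root: Path, baseline: float,
                 delete: bool = False) -> List[Path]:
    """Walk the data store directory and collect unreferenced files.

    A file whose MTIME is older than the baseline was neither touched
    by the mark phase nor added during the collection, so it is treated
    as garbage. Files are only removed when delete=True (assumption:
    a dry run is the safer default for a sketch).
    """
    garbage = []
    for dirpath, _dirnames, filenames in os.walk(datastore_root):
        for name in filenames:
            path = Path(dirpath) / name
            # Assumption: a file exactly at the baseline is kept.
            if path.stat().st_mtime < baseline:
                garbage.append(path)
                if delete:
                    path.unlink()
    return sorted(garbage)
```

Calling `find_garbage(root, baseline)` first and reviewing the returned list before repeating the call with `delete=True` keeps the destructive step explicit.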
This approach works well for a single node with a private data store. However, the data store may be shared. In that case, live references to data store records held by other repositories are not checked, and actively referenced files may be mistakenly removed. It is imperative that the system administrator understand whether the data store is shared before planning any garbage collection, and use the simple built-in data store garbage collection process only when the data store is known not to be shared.