Resolving recurring “SegmentNotFoundException” errors in AEM Publish instances

A recurring SegmentNotFoundException error causes Adobe Experience Manager (AEM) Publish instances to crash and become unresponsive. Repository checks show no corruption, and the issue persists after restarts. The problem surfaces during maintenance tasks such as revision cleanup and data store garbage collection, and is caused by improper thread handling in background processes. Switching thread management from Java native thread pools to the Sling-provided ThreadPoolManager resolves the issue.

Description

Environment

Adobe Experience Manager (AEM) On-Premises, v6.5.22.0

Issue/Symptoms

  • The SegmentNotFoundException error appears repeatedly in the AEM Publish instance logs.
  • The AEM Publish instance crashes and becomes completely unresponsive.
  • Continuous error log generation causes severe disk I/O blockage.
  • Repository consistency checks using oak-run report no corruption or structural issues.
  • The issue recurs after restarting AEM, even when maintenance tasks are scheduled outside business hours or temporarily disabled.
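
To gauge how often the error from the first symptom shows up, a small log scan can help. The following is a minimal sketch, not an Adobe-provided tool: the class name SegmentErrorCounter and its methods are hypothetical, and the log file path is supplied by the caller (on a default installation the error log is usually under crx-quickstart/logs, but that is an assumption about your setup).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of AEM: counts log lines that mention the exception.
final class SegmentErrorCounter {
    private static final String MARKER = "SegmentNotFoundException";

    // Counts matching lines in an already loaded list of log lines.
    static long count(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains(MARKER))
                .count();
    }

    // Streams the file line by line so a large error log is not loaded at once.
    // ISO-8859-1 is used because it decodes any byte sequence without failing,
    // and the ASCII marker still matches.
    static long countInFile(Path logFile) throws IOException {
        try (Stream<String> lines = Files.lines(logFile, StandardCharsets.ISO_8859_1)) {
            return lines.filter(line -> line.contains(MARKER)).count();
        }
    }
}
```

Running countInFile against the same log before and after a change gives a simple number to compare during the monitoring period.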

Resolution

Follow these steps to resolve the issue:

  1. Review session management in your custom code and close all repository sessions after use.
  2. Schedule revision cleanup and data store garbage collection outside business hours to reduce conflicts.
  3. Open the JMX console at /system/console/jmx and check the SessionStatistics MBeans for long-running or inactive sessions. Use the InitStackTrace attribute to trace each such session back to the custom code that opened it.
  4. Replace Java native thread pools with the Sling-provided ThreadPoolManager for all background processes in AEM services.
  5. Restart the AEM Publish instance after implementing these changes.
  6. Monitor error logs for several days to confirm SegmentNotFoundException no longer appears.
  7. Verify repository operations remain stable and error-free.
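
Step 1 asks for every repository session to be closed after use. The sketch below shows one way to guarantee that with try/finally. It is illustrative only: SessionLike is a stand-in that mirrors isLive() and logout() from javax.jcr.Session so the example compiles without the JCR API, and SessionTemplate is a hypothetical helper. In real code the session would come from the repository, and the login call throws RepositoryException.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Stand-in for javax.jcr.Session; only the two methods used here are mirrored.
interface SessionLike {
    boolean isLive();
    void logout();
}

// Hypothetical helper: opens a session, runs the work, and always logs out.
final class SessionTemplate {
    static <T> T withSession(Supplier<SessionLike> login, Function<SessionLike, T> work) {
        SessionLike session = null;
        try {
            session = login.get();
            return work.apply(session);
        } finally {
            // Runs on normal return and on exception, so the session cannot leak.
            if (session != null && session.isLive()) {
                session.logout();
            }
        }
    }
}
```

Because the logout sits in the finally block, a failure inside the work function still releases the session, which is the kind of leak the SessionStatistics MBeans in step 3 help to find.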

Additional Considerations:

  • Repository integrity checks using oak-run consistently report healthy nodes and properties, indicating no structural corruption.
  • The issue isn’t resolved by disabling revision cleanup or running offline compaction; proper thread management is required.
  • Manual node removal options may not apply if FileDataStore is configured; always confirm repository health before attempting manual interventions.
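
The thread management change can be sketched as follows. This is a hedged illustration rather than the exact fix applied in any particular codebase. The two interfaces are stand-ins that mirror get(String), release(ThreadPool), and execute(Runnable) from org.apache.sling.commons.threads so the example compiles without Sling on the classpath. BackgroundJobService and the pool name "custom-background-pool" are hypothetical. In AEM the class would be an OSGi component, with the manager injected through @Reference and the lifecycle methods annotated with @Activate and @Deactivate.

```java
// Stand-ins mirroring the parts of org.apache.sling.commons.threads used here.
interface ThreadPool {
    void execute(Runnable task);
}

interface ThreadPoolManager {
    ThreadPool get(String name);
    void release(ThreadPool pool);
}

// Hypothetical service that runs background work on a Sling-managed pool
// instead of creating its own executor.
final class BackgroundJobService {
    private final ThreadPoolManager manager;
    private volatile ThreadPool pool;

    BackgroundJobService(ThreadPoolManager manager) {
        this.manager = manager;
    }

    // Would carry @Activate in an OSGi component: obtain the named pool once.
    void activate() {
        pool = manager.get("custom-background-pool");
    }

    // Hands the job to the managed pool; rejects work when the service is inactive.
    void submit(Runnable job) {
        ThreadPool current = pool;
        if (current == null) {
            throw new IllegalStateException("service is not active");
        }
        current.execute(job);
    }

    // Would carry @Deactivate: give the pool back so the manager can clean it up.
    void deactivate() {
        ThreadPool current = pool;
        pool = null;
        if (current != null) {
            manager.release(current);
        }
    }
}
```

The point of the pattern is that the pool is obtained and released together with the component lifecycle, so no thread created by the service outlives it.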