How to debug SegmentNotFoundException when Issue is reported in AEM 6.x

Description

How to debug SegmentNotFoundException when Issue is reported in AEM 6.x

A SegmentNotFoundException in the error log means a segment is not present any more although someone apparently tries to still access it. There are broadly three different root causes for this: the segment has been removed by manual intervention (e.g. rm -rf /), the segment has been removed by revision garbage collection or the segment cannot be found due to some bug in the code.

There could be exception like following could be seen in logs:

org.apache.jackrabbit.oak.segment.SegmentNotFoundException: Segment d2c720c4-c146-4ab1-ac37-542aad93c33f not found at
org.apache.jackrabbit.oak.segment.file.FileStore$8.call(FileStore.java:602) at
org.apache.jackrabbit.oak.segment.file.FileStore$8.call(FileStore.java:542) at
org.apache.jackrabbit.oak.segment.SegmentCache.getSegment(SegmentCache.java:95) at
org.apache.jackrabbit.oak.segment.file.FileStore.readSegment(FileStore.java:542) at
org.apache.jackrabbit.oak.segment.SegmentId.getSegment(SegmentId.java:125) at org.apache.jackrabbit.oak.segment.Record.getSegment(Record.java:70) at
org.apache.jackrabbit.oak.segment.MapRecord.compare(MapRecord.java:424) at org.apache.jackrabbit.oak.segment.MapRecord.compare(MapRecord.java:433) at
org.apache.jackrabbit.oak.segment.MapRecord.compare(MapRecord.java:391) at
org.apache.jackrabbit.oak.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:608) at
org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:148) at
org.apache.jackrabbit.oak.segment.MapRecord$3.childNodeChanged(MapRecord.java:442) at
org.apache.jackrabbit.oak.segment.MapRecord.compare(MapRecord.java:490) at org.apache.jackrabbit.oak.segment.MapRecord.compare(MapRecord.java:433) at
org.apache.jackrabbit.oak.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:608) at
org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:52) at
org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.updateIndex(AsyncIndexUpdate.java:695) at
org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.runWhenPermitted(AsyncIndexUpdate.java:543) at
org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:402) at
org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:118) at org.quartz.core.JobRunShell.run(JobRunShell.java:202) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

Resolution

There are 2 approaches we can follow to fix the issue and remove the inconsistencies in the repository described below in sections A and B.

A. Revert to the last known good revision of the segment store.

Firstly we would need to use the oak run tool which is a runnable jar1 that contains everything you need for a simple Oak installation and performing oak-related operations.

The check run-mode of oak-run can be used to determine the last known good revision of a segment store.  That can be used to manually revert a corrupt segment store to its latest good revision.

Caution: This process will roll back the data in the system to a previous point in time.  If you would like to avoid losing changes in your system then you can try option B below instead.

To perform the check and restore:

  1. Download a version of oak-run that matches your oak core version from https://mvnrepository.com/artifact/org.apache.jackrabbit/oak-run

  2. To revert a corrupt segment store to its latest good state change into CQ’s working directory (the one containing the crx-quickstartfolder) and backup all files in ./crx-quickstart/repository/segmentstore/.

  3. Run the consistency check,

    java -Xmx6000m -jar oak-run-\*.jar check --bin=-1 /path/to/crx-quickstart/repository/segmentstore
    

    This searches backwards through the revisions until it finds a consistent one:

    Look for message like below:

    main INFO o.a.j.o.p.s.f.t.ConsistencyChecker - Found latest good revision afdb922d-ba53-4a1b-aa1b-1cb044b535cf:234880
    
  4. Revert the repository to this revision by editing ./crx-quickstart/repository/segmentstore/journal.log and deleting all lines after the line containing the latest good revision.

  5. Remove all ./crx-quickstart/repository/segmentstore/\*.bak files.

  6. Run checkpoint clean-up to remove orphaned checkpoints by following command:

    java -Xmx6000m -jar oak-run-\*.jar checkpoints /path/to/crx-quickstart/repository/segmentstore rm-unreferenced
    
  7. Finally compact the repository:

    java -Xmx6000m -jar oak-run-\*.jar compact /path/to/crx-quickstart/repository/segmentstore/
    

There might be cases when the oak run check is not able to find the good revision and we get “ConsistencyChecker - No good revision found” while running the check command.

How to fix corruption when encountering “ConsistencyChecker - No good revision found” on consistency check

B. Remove corrupted nodes manually.

In AEM, TarMK setups with no FileDatastore configured and situations where corruption is in the binaries, you can do the following.

*Caution:* The procedure below is intended for power users. When deleting the corrupt nodes you need to make sure they are not system nodes (such as /home, /jcr:system, etc). Or if they are system nodes then you need to make sure you could restore them.  Please consult with AEM Customer Care team for assistance with the steps documented here if you are unsure.

  1. Stop AEM.

  2. Use the Oak run console and load the childCount groovy script to identify the corrupt nodes in the segment store:

    Load the oak-run console shell:

    java -jar oak-run-\*.jar console crx-quickstart/repository/segmentstore
    

    Run the two commands below in the shell to load the script and run it:

    :load https://gist.githubusercontent.com/stillalex/e7067bcb86c89bef66c8/raw/d7a5a9b839c3bb0ae5840252022f871fd38374d3/childCount.groovy
    countNodes(session.workingNode)
    

    This results in the following output indicating the path to the corrupted node(s):

    21:21:42.029 main ERROR o.a.j.o.p.segment.SegmentTracker - Segment not found: 63ae05a4-b506-445c-baa2-cfa1b13b6e2f. Creation date delta is 3 ms.
    warning unable to read node /content/dam/test.txt/jcr:content/renditions/original/jcr:content
    

    In some cases the issue is linked to binary properties and the childCount groovy script is unable to locate any corrupted nodes.  In these cases you can use the following command instead which will read the first 1024 bytes for every binary encountered during the traversal (Please note this command will be slower and should only be used when the one above doesn’t return the expected results):

    countNodes(session.workingNode,true)
    
  3. Remove all the identified corrupted nodes listed in the output of the last command using rmNodes.groovy

    Load the oak-run console shell using below command:

    java -jar oak-run-\*.jar console crx-quickstart/repository/segmentstore
    

    Load the groovy script:

    :load
    https://gist.githubusercontent.com/stillalex/43c49af065e3dd1fd5bf/raw/9e726a59f75b46e7b474f7ac763b0888d5a3f0c3/rmNode.groovy
    

    Run the rmNode command to remove the corrupt node, replace /path/to/corrupt/node with the path to the corrupt node you need to remove.

    rmNode(session, "/path/to/corrupt/node")
    

    Where the corrupt node path is the path obtained in step 2, for example: /content/dam/test.txt/jcr:content/renditions/original/jcr:content/

    Note: When using oak-run.jar version 1.6.13 and up, set --read-write JVM parameter if you run into an error such as:

    / rmNode(session,"/path/to/corrupt/node")    Removing node /path/to/corrupt/node    ERROR java.lang.UnsupportedOperationException:    Cannot write to read-only store    at org.apache.jackrabbit.oak.segment.SegmentWriterBuilder$1.execute (SegmentWriterBuilder.java:171)    at org.apache.jackrabbit.oak.segment.SegmentWriter.writeNode (SegmentWriter.java:318)    at org.apache.jackrabbit.oak.segment.SegmentNodeBuilder.getNodeState (SegmentNodeBuilder.java:111)    at org.apache.jackrabbit.oak.segment.SegmentNodeStore$Commit.init (SegmentNodeStore.java:581)    at org.apache.jackrabbit.oak.segment.SegmentNodeStore.merge (SegmentNodeStore.java:333)    at org.apache.jackrabbit.oak.spi.state.NodeStore$merge.call (Unknown Source)    at groovysh_evaluate.rmNode (groovysh_evaluate:11)
    
  4. Repeat step 3 for all nodes found in step 2.

    This above rmNode command should return true for the corrupt path, which means it deleted it. Make sure these found three corrupt paths get deleted by re-running the rmNode command on those paths. Next run it should return false.

    If still you see that the same paths are there in repository, then use the patched version of oak-run jar i.e. oak-run-1.2.18-NPR-17596

    What does patched version of Oak run Jar Do?

    This version of jar skips unreadable binaries on compaction replacing them with 0-byte binaries and logging the exception and the path to syserr. The thus compacted repository should then pass oak-run check, the node count script and you should also be able to compact it again using a non- patched oak-run.

  5. Perform a checkpoint cleanup by listing checkpoints using below. If there are more than one checkpoint, then clean them up:

    nohup java -Xmx4096m -jar oak-run-1.2.18.jar checkpoints /app/AEM6/author/crx-quickstart/repository/segmentstore rm-allnohup.out &
    
  6. Run an offline compaction.  If you do not know how to run offline compaction then see here.

  7. Start the server wait for indexing completion.

On this page