SegmentNotFoundException and IllegalArgumentException

Description

Environment
Experience Manager

Issue/Symptoms

Scenario 1
Running an offline compaction can fail with SegmentNotFoundException when the repository has integrity issues.

You observe SegmentNotFoundException in the AEM log files, and AEM is not working as expected.

Scenario 2

Running an offline compaction can fail with SegmentNotFoundException when the repository has integrity issues.

A stack trace similar to the one below shows in the logs:

13:51:21.523 [ main]  ERROR o.a.j.o.p.segment.SegmentTracker - Segment not found: 4d139bc4-150c-4f0a-b82a-40a4e519fe8a. Creation date delta is 4 ms.
org.apache.jackrabbit.oak.plugins.segment.SegmentNotFoundException: Segment 4d139bc4-150c-4f0a-b82a-40a4e519fe8a not found
at org.apache.jackrabbit.oak.plugins.segment.file.FileStore.readSegment(FileStore.java:855) [ oak-run-1.0.22.jar:1.0.22]
at org.apache.jackrabbit.oak.plugins.segment.SegmentTracker.getSegment(SegmentTracker.java:134) ~[ oak-run-1.0.22.jar:1.0.22]
at org.apache.jackrabbit.oak.plugins.segment.SegmentId.getSegment(SegmentId.java:101) [ oak-run-1.0.22.jar:1.0.22]
...
Exception in thread "main" org.apache.jackrabbit.oak.plugins.segment.SegmentNotFoundException: Segment 4d139bc4-150c-4f0a-b82a-40a4e519fe8a not found
at org.apache.jackrabbit.oak.plugins.segment.file.FileStore.readSegment(FileStore.java:855)
at org.apache.jackrabbit.oak.plugins.segment.SegmentTracker.getSegment(SegmentTracker.java:134)
at org.apache.jackrabbit.oak.plugins.segment.SegmentId.getSegment(SegmentId.java:101)
...
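
When the log contains many such errors, it can help to reduce them to the distinct segment IDs involved. The following is a minimal sketch, assuming the "Segment not found: <id>" message format shown above; the function name missing_segments and the log file path are hypothetical.

```shell
# List each distinct segment ID reported as missing in a log file.
# Assumes the "Segment not found: <id>" message format shown in the trace above.
# missing_segments is a hypothetical helper name, not part of oak-run.
missing_segments() {
  grep -o 'Segment not found: [0-9a-f-]*' "$1" | awk '{ print $4 }' | sort -u
}
```

For example, missing_segments crx-quickstart/logs/error.log prints one segment ID per line.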

Scenario 3

Running an offline compaction can fail with IllegalArgumentException when the repository has integrity issues.

A stack trace similar to the one below shows in the logs:

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:77)
at org.apache.jackrabbit.oak.plugins.segment.ListRecord.<init>(ListRecord.java:41)
at org.apache.jackrabbit.oak.plugins.segment.ListRecord.getEntry(ListRecord.java:64)
at org.apache.jackrabbit.oak.plugins.segment.ListRecord.getEntries(ListRecord.java:81)
at org.apache.jackrabbit.oak.plugins.segment.SegmentStream.read(SegmentStream.java:153)
at org.apache.jackrabbit.oak.commons.IOUtils.readFully(IOUtils.java:53)
at org.apache.jackrabbit.oak.plugins.segment.Compactor.getBlobKey(Compactor.java:412)
at org.apache.jackrabbit.oak.plugins.segment.Compactor.compact(Compactor.java:362)
at org.apache.jackrabbit.oak.plugins.segment.Compactor.compact(Compactor.java:321)
at org.apache.jackrabbit.oak.plugins.segment.Compactor.access$500(Compactor.java:54)
at org.apache.jackrabbit.oak.plugins.segment.Compactor$CompactDiff.propertyAdded(Compactor.java:227)
at org.apache.jackrabbit.oak.plugins.segment.CancelableDiff.propertyAdded(CancelableDiff.java:47)
at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:156)
at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:434)
at org.apache.jackrabbit.oak.plugins.segment.Compactor$CompactDiff.diff(Compactor.java:214)
at org.apache.jackrabbit.oak.plugins.segment.Compactor$CompactDiff.childNodeAdded(Compactor.java:263)
at org.apache.jackrabbit.oak.plugins.segment.CancelableDiff.childNodeAdded(CancelableDiff.java:74)
at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:161)
at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:434)
at org.apache.jackrabbit.oak.plugins.segment.Compactor$CompactDiff.diff(Compactor.java:214)
at org.apache.jackrabbit.oak.plugins.segment.Compactor$CompactDiff.childNodeAdded(Compactor.java:263)
at org.apache.jackrabbit.oak.plugins.segment.CancelableDiff.childNodeAdded(CancelableDiff.java:74)
at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:161)
at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:434)
at org.apache.jackrabbit.oak.plugins.segment.Compactor$CompactDiff.diff(Compactor.java:214)
at org.apache.jackrabbit.oak.plugins.segment.Compactor$CompactDiff.childNodeAdded(Compactor.java:263)
at org.apache.jackrabbit.oak.plugins.segment.CancelableDiff.childNodeAdded(CancelableDiff.java:74)
at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:161)
at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:434)
at org.apache.jackrabbit.oak.plugins.segment.Compactor$CompactDiff.diff(Compactor.java:214)
at org.apache.jackrabbit.oak.plugins.segment.Compactor$CompactDiff.childNodeAdded(Compactor.java:263)
at org.apache.jackrabbit.oak.plugins.segment.CancelableDiff.childNodeAdded(CancelableDiff.java:74)

Resolution

Solution

You can follow either of the procedures below to resolve the situation and complete the offline compaction successfully.

Important:  Perform a full backup of your repository before following the steps below.
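
One simple way to take such a backup is a file-level copy of the repository directory. The sketch below assumes AEM is stopped so the files do not change during the copy, and that the target disk has enough free space; the function name backup_repository and the naming of the backup directory are hypothetical.

```shell
# Copy the repository directory to a timestamped sibling directory.
# Assumes AEM is stopped so that the files are not changing during the copy.
# backup_repository is a hypothetical helper name.
backup_repository() {
  src="${1%/}"                                   # for example crx-quickstart/repository
  dest="$src.backup-$(date +%Y%m%d-%H%M%S)"
  # -R copies recursively, -p preserves timestamps and permissions.
  cp -Rp "$src" "$dest" && echo "$dest"
}
```

Calling backup_repository crx-quickstart/repository prints the path of the copy once it completes.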

A. Revert to the last known good revision of the segment store.

The check run-mode of oak-run can be used to determine the last known good revision of a segment store.
You can then use that revision to manually revert a corrupt segment store to its latest good state.

Caution: This process rolls back the data in the system to a previous point in time.
If you would like to avoid losing changes in your system, try Option B below instead.

To perform the check and restore, follow these steps:

  1. Download the oak-run jar file from https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/

  2. Stop AEM.

  3. Run this command:

    java -jar oak-run-*.jar check --bin=-1 crx-quickstart/repository/segmentstore/

    This command searches backwards through the revisions until it finds a consistent one:

    14:00:30.783 [ main] INFO  o.a.j.o.p.s.f.t.ConsistencyChecker - Found latest good revision afdb922d-ba53-4a1b-aa1b-1cb044b535cf:234880

    (If the ConsistencyChecker fails, go to Option B.)

  4. Revert the repository to this revision by editing:

    /crx-quickstart/repository/segmentstore/journal.log

    Delete all lines after the line containing the latest good revision.

    To find out what date and time you are reverting the repository to, run this command in the segmentstore folder (replace afdb922d-ba53-4a1b-aa1b-1cb044b535cf with the latest good revision in your journal.log):

    find . -type f -name "data*.tar" -exec sh -c "tar -tvf {} |grep afdb922d-ba53-4a1b-aa1b-1cb044b535cf" \; -print

    The output would show you an approximate date and time of that revision.

  5. Remove all ./crx-quickstart/repository/segmentstore/*.bak files.

  6. If using AEM 6.0, then download the oak-run version matching what is installed in AEM for the remaining steps.

    Download it from https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/

  7. Run checkpoint clean-up to remove orphaned checkpoints:

    java -jar oak-run-*.jar checkpoints ./crx-quickstart/repository/segmentstore rm-unreferenced

  8. Finally compact the repository:

    java -jar oak-run-*.jar compact ./crx-quickstart/repository/segmentstore/
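
The journal edit in step 4 can also be scripted instead of done by hand. The following is a minimal sketch, not an official tool: it assumes each line of journal.log starts with the revision ID as its first whitespace-separated field, and it writes the trimmed journal to a separate file so you can inspect it before replacing the original. The function name trim_journal and the output file name are hypothetical.

```shell
# Write a copy of journal.log that ends at the line holding the good revision.
# Assumption: each journal line starts with the revision ID as its first field.
# trim_journal is a hypothetical helper name, not part of oak-run.
trim_journal() {
  journal="$1"; revision="$2"; out="$3"
  # Line number of the first line whose first field equals the revision.
  last=$(awk -v rev="$revision" '$1 == rev { print NR; exit }' "$journal")
  if [ -z "$last" ]; then
    echo "revision $revision not found in $journal" >&2
    return 1
  fi
  # Keep everything up to and including that line; drop all later lines.
  head -n "$last" "$journal" > "$out"
}
```

For example, trim_journal journal.log afdb922d-ba53-4a1b-aa1b-1cb044b535cf:234880 journal.log.trimmed leaves the original untouched; after checking the result, move it over journal.log.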

B. Remove corrupted nodes manually.

In AEM TarMK setups with no FileDataStore configured, and in situations where the corruption is in the binaries, you can do the following.

Caution: The procedure below is intended for power users.
When deleting the corrupt nodes, make sure they are not system nodes (such as /home, /jcr:system, etc.).
If they are system nodes, make sure you can restore them.
If you are unsure, consult the AEM Customer Care team for assistance with the steps documented here.

  1. Stop AEM.

  2. Use the oak-run console and load the childCount groovy script to identify the corrupt nodes in the segment store:

    Load the oak-run console shell:

    java -jar oak-run-*.jar console crx-quickstart/repository/segmentstore

    Run the two commands below in the shell to load the script and run it:

    :load https://gist.githubusercontent.com/stillalex/e7067bcb86c89bef66c8/raw/d7a5a9b839c3bb0ae5840252022f871fd38374d3/childCount.groovy

    countNodes(session.workingNode)

    This results in the following output indicating the path to the corrupted node(s):

    21:21:42.029 [ main] ERROR o.a.j.o.p.segment.SegmentTracker - Segment not found: 63ae05a4-b506-445c-baa2-cfa1b13b6e2f. Creation date delta is 3 ms.

    warning unable to read node /content/dam/test.txt/jcr:content/renditions/original/jcr:content

    In some cases the issue is linked to binary properties and the childCount groovy script is unable to locate any corrupted nodes.

    In these cases, you can use the following command instead, which reads the first 1024 bytes of every binary encountered during the traversal. This command is slower and should only be used when the one above doesn't return the expected results:

    countNodes(session.workingNode,true)

  3. Remove all the corrupted nodes listed in the output of the last command using rmNode.groovy.

    Load the oak-run console shell:

    java -jar oak-run-*.jar console crx-quickstart/repository/segmentstore

    Load the groovy script:

    :load https://gist.githubusercontent.com/stillalex/43c49af065e3dd1fd5bf/raw/9e726a59f75b46e7b474f7ac763b0888d5a3f0c3/rmNode.groovy

    Run the rmNode command to remove the corrupt node. Replace /path/to/corrupt/node with the path to the corrupt node you need to remove.

    rmNode(session, "/path/to/corrupt/node")

    The corrupt node path is the path obtained in step 2, for example: /content/dam/test.txt/jcr:content/renditions/original/jcr:content
    Note: When using oak-run.jar version 1.6.13 or later, add the --read-write option to the console command if you run into an error such as:

    /> rmNode(session,"/path/to/corrupt/node")
    Removing node /path/to/corrupt/node
    ERROR java.lang.UnsupportedOperationException:
    Cannot write to read-only store
    at org.apache.jackrabbit.oak.segment.SegmentWriterBuilder$1.execute (SegmentWriterBuilder.java:171)
    at org.apache.jackrabbit.oak.segment.SegmentWriter.writeNode (SegmentWriter.java:318)
    at org.apache.jackrabbit.oak.segment.SegmentNodeBuilder.getNodeState (SegmentNodeBuilder.java:111)
    at org.apache.jackrabbit.oak.segment.SegmentNodeStore$Commit.<init> (SegmentNodeStore.java:581)
    at org.apache.jackrabbit.oak.segment.SegmentNodeStore.merge (SegmentNodeStore.java:333)
    at org.apache.jackrabbit.oak.spi.state.NodeStore$merge.call (Unknown Source)
    at groovysh_evaluate.rmNode (groovysh_evaluate:11)
    
  4. Repeat step 3 for all nodes found in step 2.

    The rmNode command above should return true for a corrupt path, which means the node was deleted.

    Confirm that every corrupt path was deleted by re-running the rmNode command on each path.

    On this second run, it should return false.

    If the same paths are still present in the repository, use the patched version of the oak-run jar (that is, oak-run-1.2.18-NPR-17596).

    What does the patched version of the oak-run jar do?

    This version of the jar skips unreadable binaries during compaction, replaces them with 0-byte binaries, and logs the exception and the path to syserr.

    The repository compacted this way should then pass the oak-run check and the node count script, and you should be able to compact it again using a non-patched oak-run.

  5. Perform a checkpoint cleanup.

    List the checkpoints, and if there is more than one, remove them with the command below:

    nohup java -Xmx4096m -jar oak-run-1.2.18.jar checkpoints /app/AEM6/author/crx-quickstart/repository/segmentstore rm-all >> nohup.out &

  6. Run an offline compaction.

    If you do not know how to run offline compaction, then see Oak offline compaction instructions on GitHub Gist.

  7. Start the server and wait for indexing completion.
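
When the traversal in step 2 reports many unreadable nodes, collecting the paths into a list makes it easier to work through steps 3 and 4. The following is a minimal sketch, assuming you saved the console output to a file and that it uses the "warning unable to read node <path>" message format shown in step 2; the function name corrupt_paths and the file name are hypothetical.

```shell
# Print each distinct node path reported as unreadable in saved console output.
# Assumes the "warning unable to read node <path>" format shown in step 2.
# corrupt_paths is a hypothetical helper name, not part of oak-run.
corrupt_paths() {
  sed -n 's/^.*warning unable to read node //p' "$1" | sort -u
}
```

For example, corrupt_paths console-output.txt prints one path per line, each of which can then be passed to rmNode.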

Cause
A SegmentNotFoundException is thrown when a segment is not present while the compaction attempts to read the node.

There can be different root causes for this:

  1. The segment has been removed by manual intervention (For example: rm -rf /).
  2. The segment has been removed by revision garbage collection.
  3. The segment cannot be found due to some bug in the code.

If the issue is caused by revision garbage collection (Cause #2), make sure that online compaction is disabled to prevent further nodes from becoming corrupted.
