How to analyze common critical AEM issues

Last update: July 23, 2024

Learn about the most common critical AEM issues and how to analyze them.

Description

Environment

Adobe Experience Manager (AEM)

Issue/Symptoms

This article describes the most common critical AEM issues and how to analyze them.

AEM Sites Performance
AEM Assets Performance
Memory Issues
Indexing Issues
Replication Issues
TarMK Corruption Issues

Resolution

AEM Sites Performance Issues

Symptoms of a performance issue

Slow loading of pages
Slow creation or editing of pages
AEM response times are slow
AEM is not responding to some requests
The request.log on AEM shows slow response times

What causes performance issues

Thread contention: Long-running requests such as slow searches, write-heavy background jobs, moving of whole branches of site content, etc.
High CPU utilization
Expensive requests such as expensive searches or inefficient application code, components, etc.
Lack of proper maintenance
Insufficient dispatcher caching
Lack of CDN
Lack of browser caching
Too many scripts loaded on the page, or scripts loaded at the top of the page
CSS loaded throughout page instead of in the HTML head
Insufficient server sizing or incorrect architecture
Memory issues (see below)

How to analyze the performance issue

Capture a series of thread dumps and analyze them.
Check at the OS level whether the AEM Java process is causing high CPU utilization. If it is, run the out-of-the-box profiling tool for a few minutes and analyze the result.
- Linux: use the top command to check CPU utilization.
- Windows: use the Windows Task Manager.
Analyze the request.log file for any slow requests.
Review your system maintenance procedures. See this article for details on AEM maintenance, and ensure that you perform proper maintenance on AEM, including:
- Revision Clean Up (MongoMK and Database DocumentNodeStores only) - daily or more frequently
- Offline Tar Compaction (TarMK only) - bi-weekly
- Data Store Garbage Collection (Systems with FileDataStore or S3 DataStore only) - weekly
- Workflow Purge - weekly
- Version Purge - weekly
- AuditLog Purge - weekly
Review caching strategies implemented at the AEM dispatcher level.
Review your site’s caching.
Use client-side site analysis tools such as the Audits feature in Google Chrome browser Developer Tools panel. These tools will give you recommendations on client-side performance improvements.
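
The request.log analysis step above can be sketched in a few lines of Python. This is a minimal sketch, not an official tool: it assumes the default request.log layout, where each request is written as a "->" line and its response as a "<-" line that share a request id and where the response line ends with the duration in milliseconds. The function name slow_requests and the 1000 ms threshold are illustrative choices; adjust the patterns if your log format has been customized.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed request.log line shapes (default layout; adjust if customized):
#   23/Jul/2024:10:00:00 +0000 [42] -> GET /content/site/en.html HTTP/1.1
#   23/Jul/2024:10:00:03 +0000 [42] <- 200 text/html 3120ms
REQUEST = re.compile(r"\[(\d+)\] -> (\S+) (\S+)")
RESPONSE = re.compile(r"\[(\d+)\] <- (\d{3}) .*?(\d+)ms\s*$")


def slow_requests(
    lines: Iterable[str], threshold_ms: int = 1000
) -> List[Tuple[int, str, str]]:
    """Return (duration_ms, method, path) for slow requests, slowest first."""
    pending: Dict[str, Tuple[str, str]] = {}  # request id -> (method, path)
    slow: List[Tuple[int, str, str]] = []
    for line in lines:
        req = REQUEST.search(line)
        if req:
            pending[req.group(1)] = (req.group(2), req.group(3))
            continue
        resp = RESPONSE.search(line)
        if resp and resp.group(1) in pending:
            method, path = pending.pop(resp.group(1))
            duration = int(resp.group(3))
            if duration > threshold_ms:
                slow.append((duration, method, path))
    return sorted(slow, reverse=True)


if __name__ == "__main__":
    import sys

    # Usage: python slow_requests.py /path/to/request.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
            for duration, method, path in slow_requests(log)[:20]:
                print(f"{duration:>8} ms  {method} {path}")
```

Pairing the two lines by request id rather than by position matters because responses for concurrent requests are interleaved in the log.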

Solutions to common performance issues

Refer to this article for detailed steps on ways to optimize performance.
Review the performance tuning tips

Assets Performance Issues

Symptoms of an Assets performance issue

Slow file uploads to /assets.html or /damadmin UI
Thumbnails are taking too long to be generated
Assets operations such as move, delete, edit, and metadata update take too long

What causes issues with Assets performance

Lack of proper maintenance
Latest fix packs not applied
Optimizations not applied
Inadequate server sizing for the user load

How to analyze the Assets performance issue

Review steps 1-4 in sites performance analysis above.

Solutions to common Assets performance issues

Review the AEM Assets Performance Tuning Guide.
Solutions to some common issue scenarios can be found here.
To tune asset processing performance, see this article.

Memory Issues

Symptoms of a memory issue

AEM crashes randomly and OutOfMemoryError appears in the logs
AEM gets slower over time and eventually crashes
AEM is unresponsive

Diagnosing a memory issue

Search the log files for OutOfMemoryError. Any match indicates a memory issue.
Review the http://aem-host:port/system/console/memoryusage screen

If the “Old Generation” (JDK 7 and earlier) or “Tenured Generation” (JDK8 or later) usage is high, this could be a sign of a heap memory utilization issue. Click “Run Garbage Collector” to request that the JVM run a full heap garbage collection. If heap utilization stays high after requesting GC, there is likely an issue. On an AEM instance with Oak Tar storage, tenured usage higher than 3GB might indicate a problem. High heap utilization on a system with Mongo storage could be due to the in-memory cache configuration.
Take thread dumps and top output and perform thread analysis. Check whether the threads causing high CPU utilization are native JVM Garbage Collection threads. If the threads using the most CPU time are the “VM Thread” or any garbage collection threads, there is likely a memory issue.
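
The thread analysis described above relies on matching the per-thread ids reported by top (for example, from top -H -p <pid> on Linux) with the threads in a thread dump. The following is a minimal sketch of that matching, assuming a HotSpot-style thread dump in which each thread header carries a hexadecimal nid=0x... value that equals the decimal thread id shown by top. The function names and the GC marker strings are illustrative assumptions; GC thread names differ between collectors and JDK versions, and some newer JDKs print the native id in a different form.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed thread-dump header shape: "thread name" ... nid=0x1a2b ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)".*\bnid=0x(?P<nid>[0-9a-fA-F]+)')


def thread_names_by_nid(thread_dump: str) -> Dict[int, str]:
    """Map each native thread id (as a decimal int) to its Java thread name."""
    names: Dict[int, str] = {}
    for line in thread_dump.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            names[int(match.group("nid"), 16)] = match.group("name")
    return names


def label_hot_threads(
    top_rows: Iterable[Tuple[int, float]], thread_dump: str
) -> List[Tuple[str, float]]:
    """top_rows holds (thread id, CPU percent) pairs copied from top output.

    Returns (thread name, CPU percent) pairs, busiest thread first.
    """
    names = thread_names_by_nid(thread_dump)
    ranked = sorted(top_rows, key=lambda row: row[1], reverse=True)
    return [(names.get(tid, "unknown"), cpu) for tid, cpu in ranked]


def looks_like_gc_pressure(
    labelled: List[Tuple[str, float]],
    gc_markers: Tuple[str, ...] = ("VM Thread", "GC"),  # assumed name fragments
) -> bool:
    """True when the busiest thread's name contains a GC-related marker."""
    return bool(labelled) and any(m in labelled[0][0] for m in gc_markers)
```

If the busiest threads resolve to request or job threads instead, the stack traces of those threads in the dump are the better starting point, as in the performance analysis above.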

What causes memory issues

Java application memory leak
Java Finalizer pile-up due to incorrect use of finalizers in custom code
Insufficient max heap configuration

How to analyze the cause of your memory issue

See this article for details on how to capture a heap dump.

The best way to identify the cause of a memory issue is to analyze a heap dump.

Once you’ve captured a heap dump file, open it in Eclipse MAT or the IBM Memory Analyzer tool. In Eclipse MAT, run the Leak Suspects report and open the “Thread Details” view to see potential causes for the memory issue.

Solutions to common memory issues

Optimize your application code to use less memory if you notice long garbage collection pauses. Most garbage collection issues are best solved by optimizing the application rather than by tuning the JVM.
If you have already optimized your application and still experience long GC pauses, focus on tuning the JVM.

Indexing Issues

Symptoms of indexing issues

The following are signs of an issue with AEM/Oak indexing:

Search results are outdated by more than 10 minutes
There are missing search results
Errors are returned either in the UI or logs during search via site UI, Query Builder search, or JCR query execution

Diagnosing an indexing issue

To see if asynchronous indexing is slow or failing, do the following:

Open these URLs on your AEM instance to view stats about the Async indexer:
- http://aemhost:port/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Dasync%2Ctype%3DIndexStats
- http://aemhost:port/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Dfulltext-async%2Ctype%3DIndexStats (this URL applies only to AEM 6.2 and later)
On each of those pages, check these fields:

FailingSince - This indicates when indexing first started failing.

LastError - This is the stack trace showing what is causing indexing to fail. If this is empty then indexing isn’t failing.

LastErrorTime - This indicates the last time indexing threw the error.

LastIndexedTime - If the date and time of this field is over 5 minutes old then indexing is running too slow.
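
The checks on these fields can be expressed as a small rule. The sketch below is illustrative only: it assumes you have already read LastIndexedTime and LastError from the IndexStats page and converted the time to a timezone-aware datetime yourself. The function name assess_async_index and the returned labels are hypothetical; the 5-minute limit is the one stated above.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def assess_async_index(
    last_indexed_time: datetime,
    now: datetime,
    last_error: str = "",
    max_lag: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Apply the IndexStats rules: a non-empty LastError means indexing is
    failing; a LastIndexedTime older than max_lag means it is too slow."""
    findings: List[str] = []
    if last_error.strip():
        findings.append("failing")
    if now - last_indexed_time > max_lag:
        findings.append("slow")
    return findings or ["ok"]


if __name__ == "__main__":
    # Example values typed in by hand from the IndexStats page.
    now = datetime.now(timezone.utc)
    print(assess_async_index(now - timedelta(minutes=12), now))
```

Run the same check against both the async and fulltext-async pages, since one indexing lane can lag or fail while the other is healthy.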

What causes issues with indexing

Improper maintenance or failure to perform maintenance such as Revision Garbage Collection, Workflow Purge, Audit Purge, Version Purge, etc.
Corrupt or missing segments in Tar storage
Revision Corruption in a clustered environment (DocumentNodeStore - Mongo or Database)
An issue with the cluster topology in a clustered environment

How to analyze what is causing indexing issues

See this article for analyzing and fixing indexing issues

Replication Issues

Symptoms of Replication Issues

Publish requests are queueing up in the replication agent queue
Published content does not show up on the publish server
Impact on system performance

What causes Replication issues:

Replication agent is misconfigured and cannot connect to the publish agent
There is an error at the time of replication causing the replication queue to get stuck
The system is slow and replications are getting processed slowly
The replication is happening as part of a custom workflow and the problem is with workflow processing.

How to analyze Replication issues:

Check the replication queue status:
- Active: when items are being processed.
- Idle: when the queue is empty.
- Blocked: when items are in the queue, but cannot be processed; for example, when the agent points to a host that is down or non-existent.
Review the replication configurations if your server is cloned or the agent has been configured recently. For details, see here.
Review the replication agent logs at http://host:port/etc/replication/agents.author/AgentName.log.html#end. If you cannot identify the cause, collect this log and provide it to AEM support.
Review the server error.log in AEMinstall/crx-quickstart/logs. If you cannot identify the cause, collect this log and provide it to AEM support.
If the replication queue is in the “idle” state and none of the above applies, the problem is most likely caused by workflows. If workflows are not being processed, the replication item never reaches the replication queue. To monitor your workflows, check the number of running workflow instances on the workflow dashboard. You can read about administering workflows here.
Replication slows down when the system is under high load or experiences other performance issues.

Solution to Common Replication issues:

Review the Replication queue issues.
If the problem is due to the workflows not running efficiently, you may review the concurrent workflow processing tips.

TarMK Corruption Issues

Symptoms of TarMK Corruption

Instance is inoperable after offline compaction.
Instance stuck in Startup in progress state.
Log files or compaction command output report SegmentNotFoundException.

What causes corruption issues

The segment is removed by manual intervention (e.g. rm -rf ).
The segment is removed by revision garbage collection.
The segment cannot be found due to some bug in the code.
Maintenance tasks are not performed on time, leading to repository growth and low disk space.
AEM is forcefully stopped by killing the Java process.

Diagnosing repository corruption issues:

Review the error.log file and check for SegmentNotFoundException or IllegalArgumentException.
To determine whether a segment has been removed by revision garbage collection, enable debug logging for the org.apache.jackrabbit.oak.plugins.segment.file.TarReader-GC logger and check its output. That logger logs the segment ids of all segments removed by the cleanup phase. Revision garbage collection is the cause of the exception only when the offending segment id appears in the output of that logger.
If you suspect corruption in an external datastore, search the log file for all occurrences of the error “Error occurred while obtaining InputStream for blobId”. This error means that files are missing from your AEM datastore directory.
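
The log checks above can be automated with a simple scan. This is a minimal sketch under stated assumptions: the marker strings come from the steps above, while the function name count_corruption_markers and the short hints attached to each marker are illustrative. Note that IllegalArgumentException is a generic Java exception, so a match is only a prompt to inspect the surrounding stack trace, not proof of corruption.

```python
from collections import Counter
from typing import Iterable

# Marker strings taken from the diagnosis steps; the hints are illustrative.
CORRUPTION_MARKERS = {
    "SegmentNotFoundException": "segment missing from the segment store",
    "IllegalArgumentException": "generic exception; inspect the stack trace",
    "Error occurred while obtaining InputStream for blobId":
        "file missing from the datastore directory",
}


def count_corruption_markers(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each corruption marker."""
    counts: Counter = Counter()
    for line in lines:
        for marker in CORRUPTION_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python scan_error_log.py /path/to/error.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
            for marker, count in count_corruption_markers(log).most_common():
                print(f"{count:>6}  {marker}  ({CORRUPTION_MARKERS[marker]})")
```

A non-zero count for the blobId message points at the datastore rather than the segment store, which changes which repair path applies.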

Solution to repair corruption issues:

Determine the last known good revision of the segment store by using the check run-mode of oak-run, then manually revert the corrupt segment store to that revision. This operation reverts the Oak repository to a previous state in time, so take a complete backup of the repository before performing it.
- To perform check and restore, follow steps mentioned in this article.
- If the check fails with ConsistencyChecker - No good revisions found then implement the steps in part B of this article.
If you are not using a datastore, use an external file, S3, or Azure datastore instead of the default segment store.
- Using a datastore provides better performance.
- Migrate the instance to one with a datastore using crx2oak.
Apply the latest Service Pack and Cumulative Fix Pack and Oak Cumulative Fix Pack.
