Long-term monitoring
Long-term monitoring of an Experience Manager deployment extends the live monitoring of the same system areas over a longer duration. It also includes defining alerts specific to your environment.
Log aggregation and reporting
There are several tools available to aggregate logs, for example, Splunk™ and Elasticsearch, Logstash, and Kibana (the ELK stack). To evaluate the uptime of your Experience Manager deployment, it is important for you to understand log events specific to your system and create alerts based on them. A good knowledge of your development and operations practices can help you tune your log aggregation process to generate critical alerts.
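As a minimal sketch of such an alert, the following shell script counts `*ERROR*` entries in an error.log-style file and raises an alert above a threshold. The log path, sample data, and threshold are illustrative assumptions, not part of any product:

```shell
#!/bin/sh
# Hypothetical sketch: count *ERROR* entries in an AEM-style error.log
# and decide whether to raise an alert. Path and threshold are examples.
LOG=${LOG:-/tmp/aem-error-sample.log}
THRESHOLD=${THRESHOLD:-5}

# Illustrative sample data standing in for a real error.log
cat > "$LOG" <<'EOF'
01.01.2024 10:00:00.000 *INFO* [main] startup complete
01.01.2024 10:01:12.345 *ERROR* [qtp-123] com.example.MyService failure
01.01.2024 10:01:13.000 *ERROR* [qtp-124] com.example.MyService failure
EOF

COUNT=$(grep -c '\*ERROR\*' "$LOG")
if [ "$COUNT" -ge "$THRESHOLD" ]; then
  echo "ALERT: $COUNT errors"
else
  echo "OK: $COUNT errors"
fi
```

In practice, a log aggregator evaluates a rule like this continuously over a sliding time window rather than over a whole file.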
Environment monitoring
Environment monitoring includes monitoring the following:
- Network throughput
- Disk IO
- Memory
- CPU utilization
- JMX MBeans
- External websites
You require external tools, such as New Relic™ and AppDynamics™, to monitor each item. Using these tools, you can define alerts specific to your system, for example, high system utilization, workflow backlogs, health check failures, or unauthenticated access to your website. Adobe does not recommend any particular tool over others. Find the tool that works for you, and use it to monitor the items discussed.
Internal application monitoring
Internal application monitoring includes monitoring the application components that make up the Experience Manager stack, including the JVM and the content repository, as well as monitoring through custom application code built on the platform. In general, it is performed through JMX MBeans that can be monitored directly by many popular monitoring solutions, such as SolarWinds™, HP OpenView™, Hyperic™, Zabbix™, and others. For systems that do not support a direct connection to JMX, you can write shell scripts to extract the JMX data and expose it to these systems in a format that they natively understand.
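Such an extraction script might look like the following sketch, which emits a metric in a simple `name value` format. The fetch step is shown only as a comment, and the URL, port, credentials, and JSON fragment are illustrative assumptions; here a saved response stands in so the parsing step is self-contained:

```shell
#!/bin/sh
# Sketch: extract a JMX attribute and expose it to an external system.
# A real script would first fetch the data, for example:
#   curl -s -u "$USER:$PASS" "http://localhost:4502/system/console/jmx/java.lang:type=Memory"
# (URL, port, and credentials are assumptions for illustration.)
RESPONSE='{"HeapMemoryUsage":{"used":1287651328,"max":2147483648}}'

# Pull the "used" value out of the response
USED=$(echo "$RESPONSE" | sed -n 's/.*"used":\([0-9]*\).*/\1/p')
echo "aem.heap.used $USED"
```

The output line can then be shipped to a monitoring system that ingests plain-text metrics.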
Remote access to the JMX MBeans is not enabled by default. For more information on monitoring through JMX, see Monitoring and Management Using JMX Technology.
In many cases, a baseline is required to effectively monitor a statistic. To create a baseline, observe the system under normal working conditions for a predetermined period and then identify the normal metric.
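That calculation can be sketched as follows: average the samples collected under normal load to get the baseline, then derive an alarm threshold from it. The sample values and the 150% multiplier are illustrative:

```shell
#!/bin/sh
# Sketch: derive a baseline and a 150% alarm threshold from metric
# samples collected under normal working conditions (values illustrative).
SAMPLES="110 95 102 98 105"

# Baseline = average of the observed samples (integer truncation)
BASELINE=$(echo $SAMPLES | tr ' ' '\n' | awk '{s+=$1; n++} END {printf "%d", s/n}')
# Alarm when the live value exceeds 150% of the baseline
THRESHOLD=$((BASELINE * 150 / 100))
echo "baseline=$BASELINE threshold=$THRESHOLD"
```

A real script would poll the relevant MBean on a schedule to collect the samples instead of hard-coding them.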
JVM monitoring
As with any Java-based application stack, Experience Manager depends on the resources that are provided to it through the underlying Java Virtual Machine. You can monitor the status of many of these resources through Platform MXBeans that are exposed by JVM. For more information on MXBeans, see Using the Platform MBean Server and Platform MXBeans.
Here are some baseline parameters that you can monitor for JVM:
Memory
- MBean: java.lang:type=Memory
- URL: /system/console/jmx/java.lang:type=Memory
- Instances: All servers
- Alarm threshold: When the heap or non-heap memory utilization exceeds 75% of the corresponding maximum memory.
- Alarm definition: Either system memory is insufficient, or there is a memory leak in the code. Analyze a thread dump to arrive at a definition.
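The 75% heap check above can be sketched as follows. The byte values stand in for the used and max fields of the HeapMemoryUsage attribute read from the MBean:

```shell
#!/bin/sh
# Sketch: evaluate the 75% heap-utilization alarm. USED and MAX stand in
# for HeapMemoryUsage values read from java.lang:type=Memory.
USED=${USED:-1610612736}   # 1.5 GiB, illustrative
MAX=${MAX:-2147483648}     # 2 GiB, illustrative

PCT=$((USED * 100 / MAX))
if [ "$PCT" -ge 75 ]; then
  echo "ALARM: heap at ${PCT}%"
else
  echo "OK: heap at ${PCT}%"
fi
```

The same comparison applies to the non-heap pools, using the corresponding NonHeapMemoryUsage values.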
Threads
- MBean: java.lang:type=Threading
- URL: /system/console/jmx/java.lang:type=Threading
- Instances: All servers
- Alarm threshold: When the number of threads is greater than 150% of the baseline.
- Alarm definition: Either there is an active runaway process, or an inefficient operation consumes a large amount of resources. Analyze a thread dump to arrive at a definition.
Monitor Experience Manager
Experience Manager also exposes a set of statistics and operations through JMX. These can help assess system health and identify potential problems before they impact users. For more information, see documentation on Experience Manager JMX MBeans.
Here are some baseline parameters that you can monitor for Experience Manager:
Replication agents
- MBean: com.adobe.granite.replication:type=agent,id="<AGENT_NAME>"
- URL: /system/console/jmx/com.adobe.granite.replication:type=agent,id="<AGENT_NAME>"
- Instances: One author and all publish instances (for flush agents)
- Alarm threshold: When the value of QueueBlocked is true or the value of QueueNumEntries is greater than 150% of the baseline.
- Alarm definition: The presence of a blocked queue in the system indicates that the replication target is down or unreachable. Often, network or infrastructure issues cause excessive entries to be queued, which can adversely impact system performance.
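That alarm threshold can be sketched as follows, with illustrative values standing in for the QueueBlocked and QueueNumEntries attributes sampled from the agent's MBean:

```shell
#!/bin/sh
# Sketch: evaluate the replication-agent alarm. QUEUE_BLOCKED and
# QUEUE_ENTRIES stand in for the MBean attributes; values illustrative.
QUEUE_BLOCKED=${QUEUE_BLOCKED:-false}
QUEUE_ENTRIES=${QUEUE_ENTRIES:-120}
BASELINE=${BASELINE:-70}

LIMIT=$((BASELINE * 150 / 100))   # 150% of the observed baseline
if [ "$QUEUE_BLOCKED" = "true" ] || [ "$QUEUE_ENTRIES" -gt "$LIMIT" ]; then
  echo "ALARM: replication queue blocked or backed up"
else
  echo "OK"
fi
```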
Replace <AGENT_NAME> with the name of the replication agent you want to monitor.
Session counter
- MBean:
org.apache.jackrabbit.oak:id=7,name="OakRepository Statistics",type="RepositoryStats"
- URL: /system/console/jmx/org.apache.jackrabbit.oak:id=7,name="OakRepository Statistics",type="RepositoryStats"
- Instances: All servers
- Alarm threshold: When open sessions exceed the baseline by more than 50%.
- Alarm definition: Sessions may be opened through a piece of code and never closed. This can happen slowly over time and eventually cause memory leaks in the system. While the number of sessions fluctuates on a running system, it should not increase continuously.
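A simple way to detect a continuously increasing session count is to compare consecutive samples and flag monotonic growth. The sample values below are illustrative; a real script would poll the session count from the RepositoryStats MBean:

```shell
#!/bin/sh
# Sketch: flag a monotonically growing session count, which suggests
# sessions are opened but never closed. Samples are illustrative.
SAMPLES="200 215 231 248 260"

GROWING=true
PREV=""
for S in $SAMPLES; do
  # Any sample that does not exceed its predecessor breaks the trend
  if [ -n "$PREV" ] && [ "$S" -le "$PREV" ]; then GROWING=false; fi
  PREV=$S
done
echo "growing=$GROWING"
```

Normal fluctuation produces at least one non-increasing step, so only a sustained upward trend triggers the flag.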
Health Checks
Health checks that are available in the operations dashboard have corresponding JMX MBeans for monitoring. In addition, you can write custom health checks to expose additional system statistics.
Here are some out-of-the-box health checks that are helpful to monitor:
System Checks
- MBean: org.apache.sling.healthcheck:name=systemchecks,type=HealthCheck
- URL: /system/console/jmx/org.apache.sling.healthcheck:name=systemchecks,type=HealthCheck
- Instances: One author, all publish servers
- Alarm threshold: When the status is not OK
- Alarm definition: The status of one of the metrics is either WARN or CRITICAL. Check the log attribute for more information on the cause of the issue.
Replication Queue
- MBean: org.apache.sling.healthcheck:name=replicationQueue,type=HealthCheck
- URL: /system/console/jmx/org.apache.sling.healthcheck:name=replicationQueue,type=HealthCheck
- Instances: One author, all publish servers
- Alarm threshold: When the status is not OK
- Alarm definition: The status of one of the metrics is either WARN or CRITICAL. Check the log attribute for more information on the queue that caused the issue.
Response Performance
- MBean: org.apache.sling.healthcheck:name=requestsStatus,type=HealthCheck
- URL: /system/console/jmx/org.apache.sling.healthcheck:name=requestsStatus,type=HealthCheck
- Instances: All servers
- Alarm threshold: When the status is not OK
- Alarm definition: The status of one of the metrics is either WARN or CRITICAL. Check the log attribute for more information on the requests that caused the issue.
Query Performance
- MBean: org.apache.sling.healthcheck:name=queriesStatus,type=HealthCheck
- URL: /system/console/jmx/org.apache.sling.healthcheck:name=queriesStatus,type=HealthCheck
- Instances: One author, all publish servers
- Alarm threshold: When the status is not OK
- Alarm definition: One or more queries are running slowly in the system. Check the log attribute for more information on the queries that caused the issue.
Active Bundles
- MBean: org.apache.sling.healthcheck:name=inactiveBundles,type=HealthCheck
- URL: /system/console/jmx/org.apache.sling.healthcheck:name=inactiveBundles,type=HealthCheck
- Instances: All servers
- Alarm threshold: When the status is not OK
- Alarm definition: Presence of inactive or unresolved OSGi bundles on the system. Check the log attribute for more information on the bundles that caused the issue.
Log Errors
- MBean: org.apache.sling.healthcheck:name=logErrorHealthCheck,type=HealthCheck
- URL: /system/console/jmx/org.apache.sling.healthcheck:name=logErrorHealthCheck,type=HealthCheck
- Instances: All servers
- Alarm threshold: When the status is not OK
- Alarm definition: There are errors in the log files. Check the log attribute for more information on the cause of the issue.
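All of these health checks share the same alarm logic: read the status attribute of the corresponding HealthCheck MBean and alert on anything other than OK. A sketch, with the status value shown as an illustrative assumption:

```shell
#!/bin/sh
# Sketch: map a health-check "status" attribute to an alert decision.
# STATUS stands in for the value read from the HealthCheck MBean.
STATUS=${STATUS:-WARN}

case "$STATUS" in
  OK)            RESULT=ok ;;
  WARN|CRITICAL) RESULT=alert ;;
  *)             RESULT=unknown ;;
esac
echo "$RESULT"
```

When the result is alert, the MBean's log attribute holds the detail needed to diagnose the cause.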
Common issues and resolutions
If you encounter issues while monitoring, here are some troubleshooting tasks that you can perform to resolve common issues with Experience Manager deployments:
- If using TarMK, run Tar compaction often. For more details, see Maintain the repository.
- Check the logs for OutOfMemoryError entries. For more information, see Analyze Memory Problems.
- Check the logs for any references to unindexed queries, tree traversals, or index traversals. These indicate unindexed or inadequately indexed queries. For best practices on optimizing query and indexing performance, see Best practices for queries and indexing.
- Use the workflow console to verify that your workflows perform as expected. If possible, condense multiple workflows into a single workflow.
- Revisit live monitoring, and look for additional bottlenecks or high consumers of any specific resources.
- Investigate the egress points from the client network and the ingress points to the Experience Manager deployment network, including the dispatcher. Frequently, these are bottleneck areas. For more information, see Assets network considerations.
- Up-size your Experience Manager server. Your deployment may be inadequately sized. Adobe Customer Support can help you identify whether your server is undersized.
- Examine the access.log and error.log files for entries around the time something went wrong. Look for patterns that can potentially indicate custom code anomalies, and add them to the list of events you monitor.