From the Adobe Experience Manager (AEM) Assets standpoint, monitoring should include observing and reporting on the following processes and technologies:
System memory usage
System disk IO and IO wait time
System network IO
JMX MBeans for:
OSGi console health checks
Typically, AEM Assets can be monitored in two ways: live monitoring and long-term monitoring.
You should perform live monitoring during the performance testing phase of your development or during high-load situations to understand the performance characteristics of your environment. Typically, live monitoring should be performed using a suite of tools. Here are some recommendations:
VisualVM: VisualVM enables you to view detailed Java VM information, including CPU usage and Java memory usage. In addition, it lets you sample and evaluate code that runs on an instance.
Top: Top is a Linux command that opens up a dashboard, which displays usage statistics, including CPU, memory, and IO usage. It provides a high-level overview of what is happening on an instance.
Htop: Htop is an interactive process viewer. It provides detailed CPU and memory usage in addition to what Top can provide. Htop can be installed on most Linux systems using yum install htop or apt-get install htop.
Iotop: Iotop is a detailed dashboard for disk IO usage. It displays bars and meters that depict the processes that use disk IO and the amount they use. Iotop can be installed on most Linux systems using yum install iotop or apt-get install iotop.
Iftop: Iftop displays detailed information about Ethernet/network usage. It shows per-connection statistics on the entities using the network and the amount of bandwidth they use. Iftop can be installed on most Linux systems using yum install iftop or apt-get install iftop.
Java Flight Recorder (JFR): A commercial tool from Oracle that you can use freely in non-production environments. For more details, see How to Use Java Flight Recorder to Diagnose CQ Runtime Problems.
AEM error.log file: You can investigate the AEM error.log file for details of errors logged in the system. Use the command tail -F quickstart/logs/error.log to identify errors that you should investigate.
Workflow console: Leverage the workflow console to monitor workflows that lag behind or get stuck.
Typically, you use these tools together to obtain a comprehensive idea about the performance of your AEM instance.
These tools are standard tools and not directly supported by Adobe. They don’t require additional licenses.
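As a complement to tailing error.log interactively, the following minimal sketch counts the entries logged at ERROR level in a log file. It assumes that the level appears literally as `*ERROR*` in each entry; the function name `count_errors` is illustrative and not part of AEM.

```shell
#!/bin/sh
# Count the lines logged at ERROR level in an AEM-style log file.
# Assumption: the level marker appears literally as "*ERROR*" in each entry.
# Usage (path taken from the tail command above):
#   count_errors quickstart/logs/error.log
count_errors() {
    # -F treats the pattern as a fixed string, so the asterisks are literal.
    # -c prints the number of matching lines instead of the lines themselves.
    grep -c -F '*ERROR*' "$1"
}
```

Running the function at intervals and comparing the counts gives a rough view of whether the error rate rises under load.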
Long-term monitoring of an AEM instance involves monitoring the same components that are monitored live, but over a longer duration. It also includes defining alerts specific to your environment.
There are several tools available to aggregate logs, for example Splunk™ and Elasticsearch/Logstash/Kibana (ELK). To evaluate the uptime of your AEM instance, it is important for you to understand log events specific to your system and create alerts based on them. A good knowledge of your development and operations practices can help you better understand how to tune your log aggregation process to generate critical alerts.
Environment monitoring includes monitoring the following:
You require external tools, such as New Relic™ and AppDynamics™, to monitor each item. Using these tools, you can define alerts specific to your system, for example high system utilization, workflow backlogs, health check failures, or unauthenticated access to your website. Adobe does not recommend any particular tool over others. Find the tool that works for you, and use it to monitor the items discussed.
Internal application monitoring includes monitoring the application components that make up the AEM stack, including the JVM, the content repository, and custom application code built on the platform. In general, it is performed through JMX MBeans that can be monitored directly by many popular monitoring solutions, such as SolarWinds™, HP OpenView™, Hyperic™, Zabbix™, and others. For systems that do not support a direct connection to JMX, you can write shell scripts to extract the JMX data and expose it to these systems in a format that they natively understand.
Remote access to the JMX MBeans is not enabled by default. For more information on monitoring through JMX, see Monitoring and Management Using JMX Technology.
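As one illustration of exposing a value in a format a monitoring system understands, the sketch below reports a single integer metric as a status line plus an exit code, in the style of Nagios-compatible check plugins (0 = OK, 1 = WARNING, 2 = CRITICAL). That convention is an assumption chosen for the example. The function name `report_metric` and the thresholds are hypothetical, and obtaining the value from JMX is outside the scope of the sketch.

```shell
#!/bin/sh
# Report one integer metric as a status line plus an exit code.
# Assumption: the consuming system follows the Nagios plugin convention
# (0 = OK, 1 = WARNING, 2 = CRITICAL). All names here are hypothetical.
# Arguments: metric name, current value, warning threshold, critical threshold.
report_metric() {
    name=$1; value=$2; warn=$3; crit=$4
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $name=$value"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $name=$value"
        return 1
    fi
    echo "OK - $name=$value"
    return 0
}
```

The comparison uses integer arithmetic only, so fractional values would need to be scaled or rounded before they are passed in.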
In many cases, a baseline is required to effectively monitor a statistic. To create a baseline, observe the system under normal working conditions for a predetermined period and then identify the normal metric.
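A baseline can be derived from collected samples in several ways. The minimal sketch below uses the arithmetic mean, which is an assumption and not a method prescribed here; a median or percentile may suit metrics with spikes better. The function name `baseline_mean` is illustrative.

```shell
#!/bin/sh
# Compute a baseline as the arithmetic mean of samples read from stdin,
# one numeric sample per line. Prints nothing if no samples are supplied.
# Usage: baseline_mean < samples.txt   (samples.txt is a hypothetical file)
baseline_mean() {
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.2f\n", sum / n }'
}
```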
As with any Java-based application stack, AEM depends on the resources that are provided to it through the underlying Java Virtual Machine. You can monitor the status of many of these resources through Platform MXBeans that are exposed by the JVM. For more information on MXBeans, see Using the Platform MBean Server and Platform MXBeans.
Here are some baseline parameters that you can monitor for the JVM:
Note: Information provided by this bean is expressed in bytes.
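Because these values arrive as raw byte counts, it often helps to convert them before comparing them with a threshold. The sketch below, with illustrative function names, converts bytes to mebibytes and expresses used memory as a percentage of the maximum. It assumes you have already read the used and maximum values from the bean; any alert threshold applied to the result is your own choice.

```shell
#!/bin/sh
# Convert a byte count to whole mebibytes (integer division).
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

# Print memory utilization as a whole-number percentage of the maximum.
# A non-positive maximum is treated as "no limit defined", in which case
# no percentage can be computed and the function returns 1.
memory_used_percent() {
    used=$1; max=$2
    if [ "$max" -le 0 ]; then
        echo "n/a"
        return 1
    fi
    echo $(( used * 100 / max ))
}
```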
AEM also exposes a set of statistics and operations through JMX. These can help assess system health and identify potential problems before they impact users. For more information, see documentation on AEM JMX MBeans.
Here are some baseline parameters that you can monitor for AEM:
Instances: One Author and all publish instances (for flush agents)
Alarm threshold: When the value of QueueBlocked is true or the value of QueueNumEntries is greater than 150% of the baseline.
Alarm definition: Presence of a blocked queue in the system indicating that the replication target is down or unreachable. Often, network or infrastructure issues cause excessive entries to be queued, which can adversely impact system performance.
Note: For the MBean and URL parameters, replace <AGENT_NAME> with the name of the replication agent you want to monitor.
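The alarm threshold for replication agents can be sketched as a small predicate. The example assumes that the values of QueueBlocked and QueueNumEntries have already been read from the agent's MBean and that a baseline entry count is known; the function name `queue_alarm` and the message wording are hypothetical.

```shell
#!/bin/sh
# Decide whether a replication queue should raise an alarm.
# Arguments: QueueBlocked (true/false), QueueNumEntries, baseline entry count.
# Returns 0 and prints an ALARM line when the queue is blocked or the entry
# count exceeds 150% of the baseline; otherwise prints OK and returns 1.
queue_alarm() {
    blocked=$1; entries=$2; baseline=$3
    if [ "$blocked" = "true" ]; then
        echo "ALARM: queue is blocked"
        return 0
    fi
    # entries > 1.5 * baseline, written with integers as 2*entries > 3*baseline.
    if [ $(( entries * 2 )) -gt $(( baseline * 3 )) ]; then
        echo "ALARM: $entries entries exceeds 150% of baseline $baseline"
        return 0
    fi
    echo "OK"
    return 1
}
```

Because the function returns 0 only when the alarm condition holds, it can be used directly in an `if` statement inside a larger monitoring script.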
Health checks that are available in the operations dashboard have corresponding JMX MBeans for monitoring. In addition, you can write custom health checks to expose further system statistics.
Here are some out-of-the-box health checks that are helpful to monitor:
In the process of monitoring, if you encounter issues, here are some troubleshooting tasks that you can perform to resolve common issues with AEM instances:
Check OutOfMemoryError logs. For more information, see Analyze memory problems.
Review the error.log files for entries around the time something went wrong. Look for patterns that can potentially indicate custom code anomalies. Add them to the list of events you monitor.