The Operations Dashboard in AEM 6 helps system operators monitor AEM system health at a glance. It also provides auto-generated diagnostic information on relevant aspects of AEM and lets you configure and run self-contained maintenance automation, which can significantly reduce project operations effort and support cases. The Operations Dashboard can be extended with custom health checks and maintenance tasks. Further, Operations Dashboard data can be accessed from external monitoring tools via JMX.
The Operations Dashboard can be accessed by going to Tools - Operations from the AEM Welcome screen.
To access the Operations Dashboard, the logged-in user must be a member of the “Operators” user group. For more information, see the documentation on User, Group and Access Right Administration.
The Health Report system provides information on the health of an AEM instance through Sling Health Checks. This information can be accessed via OSGi, JMX, HTTP requests (as JSON) or the Touch UI. It offers measurements and thresholds for certain configurable counters and, in some cases, information on how to resolve the issue.
It has several features, described below.
The Health Reports are a system of cards indicating good or bad health with regard to a specific product area. These cards are visualizations of the Sling Health Checks, which aggregate data from JMX and other sources and expose processed information again as MBeans. These MBeans can also be inspected in the JMX web console, under the org.apache.sling.healthcheck domain.
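As a rough illustration of how the MBeans in the org.apache.sling.healthcheck domain could be enumerated programmatically, the sketch below queries an MBean server with the standard javax.management API. The class name HealthCheckMBeanLister is hypothetical, and the sketch assumes it runs inside the AEM JVM (or is adapted to use a remote JMX connection); in any other JVM the result is simply empty.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of AEM or Sling.
final class HealthCheckMBeanLister {

    // Matches every MBean of type HealthCheck in the health check domain.
    static final String PATTERN = "org.apache.sling.healthcheck:type=HealthCheck,*";

    /** Builds the query pattern, converting the checked exception into an unchecked one. */
    static ObjectName pattern() {
        try {
            return new ObjectName(PATTERN);
        } catch (MalformedObjectNameException e) {
            throw new IllegalStateException("Invalid object name pattern: " + PATTERN, e);
        }
    }

    /** Returns the sorted "name" keys of all health check MBeans registered on the server. */
    static Set<String> listNames(MBeanServer server) {
        Set<String> names = new TreeSet<>();
        for (ObjectName objectName : server.queryNames(pattern(), null)) {
            String name = objectName.getKeyProperty("name");
            if (name != null) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumption: executed in the same JVM as AEM; elsewhere nothing is printed.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        listNames(server).forEach(System.out::println);
    }
}
```

Each printed name corresponds to the name key of an object name such as org.apache.sling.healthcheck:name=queriesStatus,type=HealthCheck.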
The Health Reports interface can be accessed through the Tools - Operations - Health Reports menu on the AEM Welcome screen, or directly through the following URL:
https://<serveraddress>:port/libs/granite/operations/content/healthreports/healthreportlist.html
The card system exposes three possible states: OK, WARN and CRITICAL. The states are a result of rules and thresholds, which can be configured by hovering the mouse over the card and then clicking the gear icon in the action bar:
There are two types of health checks in AEM 6:
An Individual Health Check is a single health check that corresponds to a status card. Individual Health Checks can be configured with rules or thresholds and they can provide one or more hints and links to solve identified health issues. Let’s take the “Log Errors” check as an example: if there are ERROR entries in the instance logs, you will find them on the details page of the health check. At the top of the page you will see a link to the “Log Message” analyzer in the Diagnosis Tools section, which will enable you to analyze these errors in more detail and reconfigure the loggers.
A Composite Health Check is a check that aggregates information from several individual checks.
Composite health checks are configured with the aid of filter tags. In essence, all single checks that have the same filter tag will be grouped as a composite health check. A Composite Health Check will have an OK status only if all the single checks it aggregates have OK statuses as well.
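The aggregation rule can be sketched in a few lines of plain Java. The text above only states that a composite is OK when every aggregated check is OK; the sketch additionally assumes that the composite otherwise reports the most severe status among its members. The class and enum names are hypothetical and do not mirror the Sling API.

```java
import java.util.List;

// Hypothetical model of composite status aggregation, not the Sling implementation.
final class CompositeStatus {

    // The three card states, declared in order of increasing severity.
    enum Status { OK, WARN, CRITICAL }

    /**
     * Returns OK only if every individual result is OK.
     * Assumption: otherwise the most severe individual status is reported.
     */
    static Status aggregate(List<Status> results) {
        Status worst = Status.OK;
        for (Status status : results) {
            if (status.compareTo(worst) > 0) {
                worst = status;
            }
        }
        return worst;
    }
}
```

For example, aggregating OK, WARN and OK yields WARN under this assumption, so a single degraded check is enough to take the composite card out of the OK state.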
In the Operations Dashboard you can visualize the result of both individual and composite Health Checks.
Creating an individual Health Check involves two steps: implementing a Sling Health Check and adding an entry for the Health Check in the Dashboard’s configuration nodes.
To create a Sling Health Check, you need to create an OSGi component implementing the Sling HealthCheck interface and add this component to a bundle. The properties of the component fully identify the Health Check. Once the component is installed, a JMX MBean is automatically created for the Health Check. See the Sling Health Check Documentation for more information.
Example of a Sling Health Check component, written with OSGi service component annotations:
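import org.apache.sling.hc.api.HealthCheck;
import org.apache.sling.hc.api.Result;
import org.osgi.service.component.annotations.Component;

@Component(service = HealthCheck.class,
    property = {
        HealthCheck.NAME + "=Example Check",
        HealthCheck.TAGS + "=example",
        HealthCheck.TAGS + "=test",
        HealthCheck.MBEAN_NAME + "=exampleHealthCheckMBean"
    })
public class ExampleHealthCheck implements HealthCheck {
    @Override
    public Result execute() {
        // health check code goes here; the returned Result carries the status shown on the card
        return new Result(Result.Status.OK, "Example check passed");
    }
}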
The MBEAN_NAME property defines the name of the MBean that will be generated for this health check.
After creating a Health Check, a new configuration node needs to be created in order to make it accessible in the Operations Dashboard interface. For this step, it is necessary to know the JMX MBean name of the Health Check (the MBEAN_NAME property). To create a configuration for the Health Check, open CRXDE and add a new node (of type nt:unstructured) under the following path: /apps/settings/granite/operations/hc
The following properties should be set on the new node:
Name: sling:resourceType
Type: String
Value: granite/operations/components/mbean
Name: resource
Type: String
Value: /system/sling/monitoring/mbeans/org/apache/sling/healthcheck/HealthCheck/exampleHealthCheckMBean
The resource path above is created as follows: if the mbean name of your Health Check is “test”, add “test” to the end of the path /system/sling/monitoring/mbeans/org/apache/sling/healthcheck/HealthCheck
So the final path will be:
/system/sling/monitoring/mbeans/org/apache/sling/healthcheck/HealthCheck/test
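The path rule above amounts to simple string concatenation. The following minimal sketch shows it; the class and method names are hypothetical and are not part of any AEM API.

```java
// Hypothetical helper that derives the "resource" property value from an MBean name.
final class HealthCheckResourcePath {

    // Base path quoted in the text above.
    static final String BASE =
        "/system/sling/monitoring/mbeans/org/apache/sling/healthcheck/HealthCheck";

    /** Appends the MBean name of the health check to the base path. */
    static String forMBean(String mbeanName) {
        if (mbeanName == null || mbeanName.trim().isEmpty()) {
            throw new IllegalArgumentException("MBean name must not be empty");
        }
        return BASE + "/" + mbeanName.trim();
    }

    public static void main(String[] args) {
        // Prints the resource path for the MBean name "test" used in the text.
        System.out.println(forMBean("test"));
    }
}
```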
Make sure that the /apps/settings/granite/operations/hc path has the following properties set to true:
sling:configCollectionInherit
sling:configPropertyInherit
This will tell the configuration manager to merge the new configurations with the existing ones from /libs.
A Composite Health Check’s role is to aggregate a number of individual Health Checks sharing a set of common features. For instance, the Security Composite Health Check groups together all the individual health checks performing security-related verifications. The first step in creating a composite check is to add a new OSGi configuration. For it to be displayed in the Operations Dashboard, a new configuration node needs to be added, in the same way as for an individual check.
Go to the Web Configuration Manager in the OSGi Console. You can do this by accessing https://serveraddress:port/system/console/configMgr
Search for the entry called Apache Sling Composite Health Check. After you find it, notice that there are two configurations already available: one for the System Checks and another one for the Security Checks.
Create a new configuration by pressing the “+” button on the right hand side of the configuration. A new window will appear, as shown below:
Create a configuration and save it. An MBean will be created with the new configuration.
The purpose of each configuration property is as follows:
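Name (hc.name): the name of the composite health check.
Tags (hc.tags): the tags of this health check.
MBean name (hc.mbean.name): the name of the JMX MBean created for this composite health check.
Filter tags (filter.tags): the tags that the composite aggregates. The composite groups all health checks that have a matching tag in their tags property (hc.tags).
A new JMX MBean is created for each new configuration of the Apache Sling Composite Health Check.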
Finally, the entry for the composite health check that has just been created needs to be added to the Operations Dashboard configuration nodes. The procedure is the same as for individual health checks: a node of type nt:unstructured needs to be created under /apps/settings/granite/operations/hc. The resource property of the node is defined by the value of hc.mbean.name in the OSGi configuration.
If, for example, you created a configuration and set the hc.mbean.name value to diskusage, the configuration nodes will look like this:
Name: Composite Health Check
Type: nt:unstructured
With the following properties:
Name: sling:resourceType
Type: String
Value: granite/operations/components/mbean
Name: resource
Type: String
Value: /system/sling/monitoring/mbeans/org/apache/sling/healthcheck/HealthCheck/diskusage
If you create individual health checks that logically belong under a composite check that is already present in the Dashboard by default, they will be automatically captured and grouped under the respective composite check. Because of this, there is no need to create a new configuration node for these checks.
For example, if you create an individual security health check, all you need to do is assign it the “security” tag. Once it is installed, it will automatically appear under the Security Checks composite check in the Operations Dashboard.
Health Check Name | Description |
Query Performance | This health check was simplified in AEM 6.4, and now checks the recently-refactored The MBean for this health check is org.apache.sling.healthcheck:name=queriesStatus,type=HealthCheck. |
Observation Queue Length | Observation Queue Length iterates over all Event Listeners and Background Observers, compares their
The maximum length of each queue comes from separate configurations (Oak and AEM), and is not configurable from this health check. The MBean for this health check is org.apache.sling.healthcheck:name=ObservationQueueLengthHealthCheck,type=HealthCheck. |
Query Traversal Limits | Query Traversal Limits checks the
The Mbean for this health check is org.apache.sling.healthcheck:name=queryTraversalLimitsBundle,type=HealthCheck. |
Synchronized Clocks | This check is relevant only for document nodestore clusters. It returns the following status:
The Mbean for this health check is org.apache.sling.healthcheck:name=slingDiscoveryOakSynchronizedClocks,type=HealthCheck. |
Asynchronous Indexes | The Asynchronous Indexes check:
Both the Critical and Warn status thresholds are configurable. The Mbean for this health check is org.apache.sling.healthcheck:name=asyncIndexHealthCheck,type=HealthCheck. Note: This health check is available with AEM 6.4 and has been backported to AEM 6.3.0.1. |
Large Lucene Indexes | This check uses the data exposed by the
The thresholds are configurable and the MBean for the health check is org.apache.sling.healthcheck:name=largeIndexHealthCheck,type=HealthCheck. Note: This check is available with AEM 6.4 and has been backported to AEM 6.3.2.0. |
System Maintenance | System Maintenance is a composite check that returns the OK status if all maintenance tasks are running as configured. Keep in mind that:
The MBean for this health check is org.apache.sling.healthcheck:name=systemchecks,type=HealthCheck. |
Replication Queue | This check iterates over replication agents and looks at their queues. For the item at the top of the queue, the check looks at how many times the agent retried replication. If the agent retried replication more than the value of the The MBean for this health check is org.apache.sling.healthcheck:name=replicationQueue,type=HealthCheck. |
Sling Jobs |
Sling Jobs checks the number of jobs queued in the JobManager, compares it to the
maxNumQueueJobs threshold, and:
Only the maximum number of queued jobs parameter is configurable and it has the default value of 1000. The MBean for this health check is org.apache.sling.healthcheck:name=slingJobs,type=HealthCheck. |
Request Performance | This check looks at the
The MBean for this health check is org.apache.sling.healthcheck:name=requestsStatus,type=HealthCheck. |
Log Errors | This check returns the Warn status if there are errors in the log. The MBean for this health check is org.apache.sling.healthcheck:name=logErrorHealthCheck,type=HealthCheck. |
Disk Space | The Disk Space check looks at the
Both thresholds are configurable. The check only works on instances with a Segment Store. The MBean for this health check is org.apache.sling.healthcheck:name=DiskSpaceHealthCheck,type=HealthCheck. |
Scheduler Health Check | This check returns a warning if the instance has Quartz jobs running for more than 60 seconds. The acceptable duration threshold is configurable. The MBean for this health check is org.apache.sling.healthcheck:name=slingCommonsSchedulerHealthCheck,type=HealthCheck. |
Security Checks | The Security check is a composite which aggregates the results of multiple security-related checks. These individual health checks address different concerns from the security checklist available at the Security Checklist documentation page. The check is useful as a security smoke test when the instance is started. The MBean for this health check is org.apache.sling.healthcheck:name=securitychecks,type=HealthCheck |
Active Bundles | Active Bundles checks the state of all bundles and:
The ignore list parameter is configurable. The MBean for this health check is org.apache.sling.healthcheck:name=inactiveBundles,type=HealthCheck. |
Code Cache Check | This is a Health Check that verifies several JVM conditions that can trigger a CodeCache bug present in Java 7:
The MBean for this health check is org.apache.sling.healthcheck:name=codeCacheHealthCheck,type=HealthCheck. |
Resource Search Path Errors | Checks if there are any resources in the path
The MBean for this health check is org.apache.sling.healthcheck:name=resourceSearchPathErrorHealthCheck,type=HealthCheck. |
By default, for an out-of-the-box AEM instance, the health checks run every 60 seconds.
You can configure the Period with the OSGi configuration Query Health Check Configuration (com.adobe.granite.queries.impl.hc.QueryHealthCheckMetrics).
The Health Check Dashboard can integrate with Nagios via the Granite JMX MBeans. The example below illustrates how to add a check that shows used memory on the server running AEM.
Set up and install Nagios on the monitoring server.
Next, install the Nagios Remote Plugin Executor (NRPE).
For more info on how to install Nagios and NRPE on your system, please consult the Nagios Documentation.
Add a host definition for the AEM server. This can be done via the Nagios XI Web Interface, by using the Configuration Manager:
Below is an example of a host configuration file, in case you are using Nagios Core:
define host {
    host_name my.remote.host          ; must match host_name in the service definition
    address 192.168.0.5
    max_check_attempts 3
    check_period 24x7
    check_command check-host-alive
    contacts admin
    notification_interval 60
    notification_period 24x7
}
Install Nagios and NRPE on the AEM server.
Install the check_http_json plugin on both servers.
Define a generic JSON check command on both servers:
define command{
command_name check_http_json-int
command_line /usr/lib/nagios/plugins/check_http_json --user "$ARG1$" --pass "$ARG2$" -u 'https://$HOSTNAME$:$ARG3$/$ARG4$' -e '$ARG5$' -w '$ARG6$' -c '$ARG7$'
}
Add a service for used memory on the AEM server:
define service {
use generic-service
host_name my.remote.host
service_description AEM Author Used Memory
check_command check_http_json-int!<cq-user>!<cq-password>!<cq-port>!system/sling/monitoring/mbeans/java/lang/Memory.infinity.json!{noname}.mbean:attributes.HeapMemoryUsage.mbean:attributes.used.mbean:value!<warn-threshold-in-bytes>!<critical-threshold-in-bytes>
}
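To make the threshold logic behind this service definition concrete, the sketch below reads the used heap of the local JVM through the standard MemoryMXBean and classifies it against warning and critical thresholds. It is only an illustration: the class name and the threshold values are hypothetical, and the strictly-greater-than comparison is an assumption about how the plugin interprets simple thresholds.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical illustration of a used-memory threshold check.
final class HeapUsageCheck {

    enum State { OK, WARNING, CRITICAL }

    /**
     * Classifies a used-heap value in bytes.
     * Assumption: a state is raised when the value exceeds the threshold.
     */
    static State classify(long usedBytes, long warnBytes, long criticalBytes) {
        if (usedBytes > criticalBytes) {
            return State.CRITICAL;
        }
        if (usedBytes > warnBytes) {
            return State.WARNING;
        }
        return State.OK;
    }

    public static void main(String[] args) {
        // Same attribute that the JSON path in the service definition addresses: HeapMemoryUsage.used
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long warnBytes = 512L * 1024 * 1024;      // hypothetical warning threshold
        long criticalBytes = 1024L * 1024 * 1024; // hypothetical critical threshold
        System.out.println(classify(heap.getUsed(), warnBytes, criticalBytes)
            + " used=" + heap.getUsed());
    }
}
```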
Check your Nagios dashboard for the newly created service:
The Operations Dashboard also provides access to Diagnosis Tools that help find and troubleshoot the root causes of warnings coming from the Health Check Dashboard, and that provide important debug information for system operators.
Amongst its most important features are:
You can reach the Diagnosis Tools screen by going to Tools - Operations - Diagnosis from the AEM Welcome screen. You can also access the screen by directly accessing the following URL: https://serveraddress:port/libs/granite/operations/content/diagnosis.html
The log messages User Interface will display all ERROR messages by default. If you want to have more log messages displayed, you need to configure a logger with the appropriate log level.
The log messages use an in-memory log appender and are therefore not related to the log files. As a consequence, changing the log levels in this UI does not change what gets logged in the traditional log files. Adding and removing loggers in this UI only affects the in-memory logger. Also note that changes to the logger configurations only affect future entries of the in-memory logger: entries that are already logged and are no longer relevant are not deleted, but similar entries will not be logged in the future.
You can configure what gets logged by providing logger configurations from the upper left gear button in the UI. There, you can add, remove or update logger configurations. A logger configuration is composed of a log level (WARN / INFO / DEBUG) and a filter name. The filter name filters the source of the log messages that get logged. Alternatively, if a logger should capture all the log messages for the specified level, the filter name should be “root”. Setting the level of a logger triggers the capture of all messages with a level equal to or higher than the one specified.
Examples:
If you plan on capturing all the ERROR messages - no configuration is required. All the ERROR messages are captured by default.
If you plan on capturing all the ERROR, WARN and INFO messages - the logger name should be set to: “root”, and the logger level to: INFO.
If you plan on capturing all the messages coming from a certain package (for example com.adobe.granite) - the logger name should be set to: “com.adobe.granite”, and the logger level to: DEBUG (this will capture all the ERROR, WARN, INFO and DEBUG messages), as shown in the image below.
You cannot set a logger name to capture only ERROR messages via a specified filter. By default, all the ERROR messages are captured.
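The capture rules described above can be modelled in a few lines of plain Java. This is a sketch of the described behavior, not the AEM implementation: the class name is hypothetical, and matching a filter name against a package prefix is an assumption based on the com.adobe.granite example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the in-memory logger configuration rules.
final class InMemoryLogFilter {

    // Log levels in order of increasing severity.
    enum Level { DEBUG, INFO, WARN, ERROR }

    private final Map<String, Level> loggers = new LinkedHashMap<>();

    /** Adds or updates a logger configuration (filter name plus level). */
    void configure(String filterName, Level level) {
        loggers.put(filterName, level);
    }

    /** Returns true if a message from the given source at the given level is captured. */
    boolean captures(String source, Level messageLevel) {
        if (messageLevel == Level.ERROR) {
            return true; // all ERROR messages are captured by default
        }
        for (Map.Entry<String, Level> entry : loggers.entrySet()) {
            String filter = entry.getKey();
            // Assumption: a filter matches its own name and everything below it.
            boolean matches = filter.equals("root")
                || source.equals(filter)
                || source.startsWith(filter + ".");
            // Captured when the message level is equal to or higher than the configured level.
            if (matches && messageLevel.compareTo(entry.getValue()) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```

With a logger for com.adobe.granite at DEBUG, a DEBUG message from a class in that package is captured, while a DEBUG message from an unrelated package is not.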
The log messages user interface does not reflect the actual error log. Unless you are configuring other types of log messages in the UI, you will see ERROR messages only. For how to display specific log messages, see instructions above.
The settings in the diagnosis page do not influence what is logged to the log files and vice versa. So, while the error log might catch INFO messages, you might not see them in the log messages UI. Also, through the UI it’s possible to catch DEBUG messages from certain packages without affecting the error log. For more information on how to configure the log files, see Logging.
With AEM 6.4, maintenance tasks are logged out of the box in a more information-rich format at the INFO level. This allows for better visibility into the state of the maintenance tasks.
In case you are using third party tools (such as Splunk) to monitor and react to maintenance task activity you can make use of the following log statements:
Log level: INFO
DATE+TIME [MaintanceLogger] Name=<MT_NAME>, Status=<MT_STATUS>, Time=<MT_TIME>, Error=<MT_ERROR>, Details=<MT_DETAILS>
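A monitoring script could extract the individual fields from such a statement with a regular expression. The sketch below is one possible approach under stated assumptions: the class name is hypothetical, the logger name is copied verbatim from the format above, and the sample values used to exercise it are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the maintenance task log statement format shown above.
final class MaintenanceLogParser {

    // Field order follows the documented format; the logger name is matched literally.
    private static final Pattern LINE = Pattern.compile(
        "\\[MaintanceLogger\\] Name=(.*?), Status=(.*?), Time=(.*?), Error=(.*?), Details=(.*)$");

    /** Returns the parsed fields, or an empty Optional if the line is not a maintenance statement. */
    static Optional<Map<String, String>> parse(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.find()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Name", matcher.group(1));
        fields.put("Status", matcher.group(2));
        fields.put("Time", matcher.group(3));
        fields.put("Error", matcher.group(4));
        fields.put("Details", matcher.group(5));
        return Optional.of(fields);
    }
}
```

Lines that do not contain the logger marker are ignored, so the parser can be applied to every line of a mixed log stream.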
The Request Performance page allows the analysis of the slowest page requests processed. Only content requests will be registered on this page. More specifically, the following requests will be captured:
Requests accessing resources under /content
Requests accessing resources under /etc/design
Requests having the ".html" extension
The page displays:
By default, the slowest 20 page requests are captured, but the limit can be modified in the Configuration Manager.
The Query Performance page allows the analysis of the slowest queries performed by the system. This information is provided by the repository in a JMX MBean. In Jackrabbit, the com.adobe.granite.QueryStat JMX MBean provides this information, while in the Oak repository, it is offered by org.apache.jackrabbit.oak.QueryStats.
The page displays:
For any given query, Oak attempts to figure out the best way to execute based on the Oak indexes defined in the repository under the oak:index node. Depending on the query, different indexes may be chosen by Oak. Understanding how Oak is executing a query is the first step to optimizing the query.
The Explain Query is a tool that explains how Oak is executing a query. It can be accessed by going to Tools - Operations - Diagnosis from the AEM Welcome Screen, then clicking on Query Performance and switching over to the Explain Query tab.
Features
Once you are in the Explain Query UI, all you need to do in order to use it is enter the query and press the Explain button:
The first entry in the Query Explanation section is the actual explanation. The explanation will show the type of index that was used to execute the query.
The second entry is the execution plan.
Ticking the Include execution time box before running the query will also show how long the query took to execute. The Include Node Count option will report the node count. This additional information can be used to optimize the indexes for your application or deployment.
The purpose of the Index Manager is to facilitate index management such as maintaining indexes, or viewing their status.
It can be accessed by going to Tools - Operations - Diagnosis from the Welcome Screen, and then clicking the Index Manager button.
It can also be accessed directly at this URL: https://serveraddress:port/libs/granite/operations/content/diagnosistools/indexManager.html
The UI can be used to filter indexes in the table by typing in the filter criteria in the search box in the upper left corner of the screen.
This will trigger the download of a zip containing useful information about the system status and configuration. The archive contains instance configurations, a list of bundles, OSGi, Sling metrics and statistics, which can result in a large file. You can reduce the impact of large status files by using the Download Status ZIP window. The window can be accessed from: AEM > Tools > Operations > Diagnosis > Download Status ZIP.
From this window you can select what to export (log files and/or thread dumps) and the number of days of logs included in the download relative to the current date.
This will trigger the download of a zip containing information about the threads present in the system. Information about each thread is provided, such as its status, the classloader and the stacktrace.
You also have the ability to download a snapshot of the heap, in order to analyze it at a later time. Take note that this will trigger the download of a large file, in the order of hundreds of megabytes.
The Automated Maintenance Tasks page is a place where you can view and track recommended maintenance tasks scheduled for periodic execution. The tasks are integrated with the Health Check system. The tasks can also be manually executed from the interface.
In order to get to the Maintenance page in the Operations Dashboard, you need to go to Tools - Operations - Dashboard - Maintenance from the AEM Welcome screen, or directly follow this link:
https://serveraddress:port/libs/granite/operations/content/maintenance.html
The following tasks are available in the Operations Dashboard:
The default timing for the daily maintenance window is 2 to 5 AM. The tasks configured to run in the weekly maintenance window will execute between 1 and 2 AM on Saturdays.
You can also configure the timings by pressing the gear icon on either of the two maintenance cards:
Since AEM 6.1, the existing maintenance windows can also be configured to run monthly.
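The default windows described above can be expressed as a small time check, for example to verify from an external script whether a task is expected to be running. The sketch uses only java.time; the class name is hypothetical, and treating the end of each window as exclusive is an assumption.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical helper modelling the default maintenance windows.
final class MaintenanceWindows {

    /** Default daily window: 2 AM to 5 AM. */
    static boolean inDailyWindow(LocalDateTime time) {
        return within(time.toLocalTime(), LocalTime.of(2, 0), LocalTime.of(5, 0));
    }

    /** Default weekly window: Saturdays, 1 AM to 2 AM. */
    static boolean inWeeklyWindow(LocalDateTime time) {
        return time.getDayOfWeek() == DayOfWeek.SATURDAY
            && within(time.toLocalTime(), LocalTime.of(1, 0), LocalTime.of(2, 0));
    }

    // Assumption: the start is inclusive and the end is exclusive.
    private static boolean within(LocalTime time, LocalTime start, LocalTime end) {
        return !time.isBefore(start) && time.isBefore(end);
    }
}
```

If the windows are reconfigured through the gear icon, the hard-coded times in this sketch would need to be adjusted accordingly.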
For more information on performing Revision Clean Up, see this dedicated article.
By using the Lucene Binaries Cleanup task, you can purge Lucene binaries and reduce the running data store size requirement. This is because Lucene’s binary churn is reclaimed daily, instead of depending on a successful data store garbage collection run as before.
Though the maintenance task was developed to reduce Lucene related revision garbage, there are general efficiency gains when running the task:
You can access the Lucene Binaries Cleanup task from: AEM > Tools > Operations > Maintenance > Daily Maintenance Window > Lucene Binaries Cleanup.
For details on Data Store Garbage Collection, see the dedicated documentation page.
Workflows can also be purged from the Maintenance Dashboard. In order to run the Workflow Purge task, you need to:
For more detailed information about Workflow Maintenance, see this page.
For Audit Log Maintenance, see the separate documentation page.
You can schedule the Version Purge maintenance task to delete old versions automatically. As a result, this minimizes the need to manually use the Version Purge tools. You can schedule and configure the Version Purge task by accessing Tools > Operations > Maintenance > Weekly Maintenance Window and following these steps:
Click the Add button.
Choose Version Purge from the drop-down menu.
To configure the Version Purge task, click the gear icon on the newly created Version Purge maintenance card.
With AEM 6.4, you can stop the Version Purge maintenance task as follows:
To stop the maintenance task means to suspend its execution without losing track of the job already in progress.
In order to optimize the repository size you should run the version purge task frequently. The task should be scheduled outside of business hours when there is a limited amount of traffic.
Custom maintenance tasks can be implemented as OSGi services. As the maintenance task infrastructure is based on Apache Sling’s job handling, a maintenance task must implement the Java interface [org.apache.sling.event.jobs.consumer.JobExecutor](https://sling.apache.org/apidocs/sling7/org/apache/sling/event/jobs/consumer/JobExecutor.html). In addition, it must declare several service registration properties to be detected as a maintenance task, as listed below:
Service Property Name | Description | Example | Type |
granite.maintenance.isStoppable | Boolean attribute defining whether the task can be stopped by the user. If a task declares that it is stoppable it must check during its execution whether it has been stopped and then act accordingly. The default is false. | true | Optional |
granite.maintenance.mandatory | Boolean attribute defining whether a task is mandatory and must be run periodically. If a task is mandatory but currently not in any active schedule window, a Health Check will report this as an error. The default is false. | true | Optional |
granite.maintenance.name | A unique name for the task - this is used to reference the task. This is usually a simple name. | MyMaintenanceTask | Required |
granite.maintenance.title | A title displayed for this task | My Special Maintenance Task | Required |
job.topics | This is a unique topic of the maintenance task. The Apache Sling job handling will start a job with exactly this topic to execute the maintenance task and as the task is registered for this topic it gets executed. The topic must start with com/adobe/granite/maintenance/job/ | com/adobe/granite/maintenance/job/MyMaintenanceTask | Required |
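The requirements in the table can be summarized as a small validation routine. The following is a plain-Java sketch with hypothetical class and method names; it only checks a map of service properties against the rules listed above and does not register anything with OSGi or Sling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical validator for maintenance task service properties.
final class MaintenanceTaskProperties {

    // Prefix that every maintenance task topic must start with.
    static final String TOPIC_PREFIX = "com/adobe/granite/maintenance/job/";

    private static final String[] REQUIRED = {
        "granite.maintenance.name", "granite.maintenance.title", "job.topics"
    };

    /** Returns a list of problems; an empty list means the required properties look valid. */
    static List<String> validate(Map<String, ?> properties) {
        List<String> problems = new ArrayList<>();
        for (String key : REQUIRED) {
            Object value = properties.get(key);
            if (value == null || value.toString().isEmpty()) {
                problems.add("Missing required property: " + key);
            }
        }
        Object topic = properties.get("job.topics");
        if (topic != null && !topic.toString().isEmpty()
                && !topic.toString().startsWith(TOPIC_PREFIX)) {
            problems.add("job.topics must start with " + TOPIC_PREFIX);
        }
        return problems;
    }
}
```

The optional properties granite.maintenance.isStoppable and granite.maintenance.mandatory default to false and are therefore not checked here.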
Apart from the above service properties, the process() method of the JobExecutor interface needs to be implemented by adding the code that should be executed for the maintenance task. The provided JobExecutionContext can be used to output status information, check if the job is stopped by the user and create a result (success or failed).
For situations where a maintenance task should not be run on all installations (for example, run only on the publish instance), you can make the service require a configuration in order to be active by adding @Component(policy=ConfigurationPolicy.REQUIRE). You can then mark the corresponding configuration as being run mode dependent in the repository. For more information, see Configuring OSGi.
Below is an example of a custom maintenance task that deletes files from a configurable temporary directory which have been modified in the last 24 hours:
experiencemanager-java-maintenancetask-sample - src/main/java/com/adobe/granite/samples/maintenance/impl/DeleteTempFilesTask.java
After the service is deployed, it is exposed to the Operations Dashboard UI. You can add it to one of the available maintenance schedules:
This will add a corresponding resource at /apps/granite/operations/config/maintenance/schedule/taskname. If the task is run mode dependent, the property granite.operations.conditions.runmode needs to be set on that node with the values of the runmodes which need to be active for this maintenance task.
The System Overview Dashboard displays a high-level overview of the configuration, hardware and health of the AEM instance. This means that system health status is transparent and all the information is aggregated in a single dashboard.
You can also watch this video for an introduction to the System Overview Dashboard.
To access the System Overview Dashboard, navigate to Tools > Operations > System Overview.
The table below describes all the information displayed in the System Overview Dashboard. Keep in mind that when there is no relevant information to show (for example, no backup is in progress, or no health checks are critical), the respective section displays the “No Entries” message.
You can also download a JSON file summarizing the dashboard information by clicking the Download button in the upper right-hand corner of the dashboard. The JSON endpoint is /libs/granite/operations/content/systemoverview/export.json and it can be used in a curl script for external monitoring.
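As an alternative to a curl script, the endpoint could be called from Java with the standard java.net.http client (Java 11 or later). The sketch below only builds the request; the class name is hypothetical, and the use of HTTP Basic authentication as well as the base URL are assumptions that depend on how the instance is set up.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper that prepares a request for the System Overview JSON export.
final class SystemOverviewExport {

    // Endpoint path as given in the text above.
    static final String ENDPOINT =
        "/libs/granite/operations/content/systemoverview/export.json";

    /** Builds a GET request; assumes the instance accepts HTTP Basic authentication. */
    static HttpRequest buildRequest(String baseUrl, String user, String password) {
        String credentials = user + ":" + password;
        String token = Base64.getEncoder()
            .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create(baseUrl + ENDPOINT))
            .header("Authorization", "Basic " + token)
            .GET()
            .build();
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), and the returned JSON body handed to the monitoring tool of your choice.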
Section | What information is displayed | When is it critical | Links To |
Health Checks |
|
Indicated visually:
|
|
Maintenance Tasks |
|
Indicated visually:
|
|
System |
|
N/A | N/A |
Instance |
|
N/A | N/A |
Repository |
|
N/A | N/A |
Distribution Agents |
|
Indicated visually:
|
Distribution page |
Replication Agents |
|
Indicated visually:
|
Replication page |
Workflows |
For each of the statuses presented above a query is performed, with a limit of 400 milliseconds. At 400 milliseconds, the number of entries obtained up to that point is displayed. |
Not interpreted:
|
Workflow Failures page |
Sling Jobs | Sling job counts - number of jobs in a given status (if any):
|
Not interpreted:
|
N/A |
Estimated Node Counts | Estimated number of:
The total number of nodes is obtained from the nodeCounterMBean, while the rest of the statistics are obtained from IndexInfoService. |
N/A | N/A |
Backup | Displays "Online Backup in Progress" if this is the case. | N/A | N/A |
Indexing | Displays:
If an indexing or query thread is present in the thread dump. |
N/A | N/A |