Queries in Components

Because queries can be one of the more taxing operations done on an AEM system, it is a good idea to avoid them in your components. Having several queries execute each time a page is rendered can often degrade the performance of the system. There are two strategies that can be used to avoid executing queries when rendering components: traversing nodes and prefetching results.

Traversing Nodes

If the repository is designed so that the location of the required data is known in advance, code can retrieve the data directly from those paths without running queries to find it.

An example of this would be rendering content that fits within a certain category. One approach would be to organize the content with a category property that can be queried to populate a component that shows items in a category.

A better approach would be to structure this content in a taxonomy by category so that it can be retrieved directly by path.

For example, if the content is stored in a taxonomy similar to:

/content/myUnstructuredContent/parentCategory/childCategory/contentPiece

The /content/myUnstructuredContent/parentCategory/childCategory node can simply be retrieved, and its children can be parsed and used to render the component.

Also, when you are dealing with a small or homogeneous result set, it can be faster to traverse the repository and gather the required nodes than to craft a query that returns the same result set. As a general rule, avoid queries wherever it is possible to do so.
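As a minimal sketch of the traversal approach, the following Java example resolves a known category path and lists its children instead of searching for them. ContentNode is a hypothetical in-memory stand-in used to keep the example self-contained; in AEM the same role would be played by a JCR Node or a Sling Resource, and the anotherPiece item is invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a repository node; in AEM this role
// is played by a JCR Node or Sling Resource.
final class ContentNode {
    private final String name;
    private final Map<String, ContentNode> children = new LinkedHashMap<>();

    ContentNode(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Returns the existing child with this name, or creates it.
    ContentNode child(String childName) {
        return children.computeIfAbsent(childName, ContentNode::new);
    }

    // Resolves a path such as "/a/b/c" segment by segment;
    // returns null if any segment is missing.
    ContentNode resolve(String path) {
        ContentNode current = this;
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty segment produced by a leading slash
            }
            current = current.children.get(segment);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    // Child names in insertion order.
    List<String> childNames() {
        return new ArrayList<>(children.keySet());
    }
}

final class CategoryTraversal {
    // Lists the content pieces under a known category path without running a query.
    static List<String> itemsInCategory(ContentNode root, String categoryPath) {
        ContentNode category = root.resolve(categoryPath);
        if (category == null) {
            return new ArrayList<>();
        }
        return category.childNames();
    }

    public static void main(String[] args) {
        ContentNode root = new ContentNode("");
        ContentNode category = root.child("content").child("myUnstructuredContent")
                .child("parentCategory").child("childCategory");
        category.child("contentPiece");
        category.child("anotherPiece"); // hypothetical second item
        System.out.println(itemsInCategory(root,
                "/content/myUnstructuredContent/parentCategory/childCategory"));
    }
}
```

The cost of this lookup depends only on the depth of the path and the number of children in the category, not on the size of the repository.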

Prefetching Results

Sometimes the content or the requirements around the component will not allow the use of node traversal as a method of retrieving the required data. In these cases, the required queries need to be executed before the component is rendered so that optimal performance is ensured for the end user.

If the results that are required for the component can be calculated at the time that it is authored and the content is not expected to change, the query can be executed when the author applies settings in the dialog.

If the data or content will change regularly, the query can be executed on a schedule or via a listener for updates to the underlying data. Then, the results can be written to a shared location in the repository. Any components that need this data can then pull the values from this single node without needing to execute a query at runtime.
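The scheduled variant can be sketched in plain Java as follows. This is an illustration under stated assumptions, not an AEM implementation: the PrefetchedResults class and its Supplier are hypothetical, the results are held in memory rather than written to a repository node, and in AEM the refresh would more likely be driven by a scheduled job or an event listener.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Runs an expensive lookup off the render path and keeps the latest results
// in one shared place, so rendering code only ever reads the stored value.
final class PrefetchedResults implements AutoCloseable {
    // Stand-in for the real query (assumption: it returns a list of paths).
    private final Supplier<List<String>> expensiveQuery;
    private final AtomicReference<List<String>> latest =
            new AtomicReference<>(List.<String>of());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PrefetchedResults(Supplier<List<String>> expensiveQuery) {
        this.expensiveQuery = expensiveQuery;
    }

    // Executes the query once and stores an immutable copy of its results.
    void refresh() {
        latest.set(List.copyOf(expensiveQuery.get()));
    }

    // Re-runs the query at a fixed interval, starting immediately.
    // Note: if refresh() throws, the executor cancels later runs.
    void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::refresh, 0, period, unit);
    }

    // Called at render time: no query is executed here.
    List<String> get() {
        return latest.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A component would call get() on every render, while start() or an update listener calling refresh() keeps the stored results current.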

Query Optimization

When running a query that is not using an index, warnings are logged regarding node traversal. If this is a query that is going to run often, create an index. To determine which index a given query is using, the Explain Query tool is recommended. For additional information, DEBUG logging can be enabled for the relevant search APIs.

NOTE
After modifying an index definition, the index must be rebuilt (reindexed). Depending on the size of the index, this may take some time to complete.

When running complex queries, there may be cases in which breaking down the query into multiple smaller queries and joining the data through code after the fact is more performant. The recommendation for these cases is to compare performance of the two approaches to determine which option would be better for the use case in question.
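A rough sketch of joining two smaller result sets in code, together with a helper for timing each approach, might look like this. It assumes that each query returns a list of node paths and that the join key is the path; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CodeSideJoin {
    // Keeps the paths from the first result set that also appear in the second,
    // preserving the order of the first (an inner join on path).
    static List<String> joinOnPath(List<String> first, List<String> second) {
        Set<String> lookup = new HashSet<>(second);
        List<String> joined = new ArrayList<>();
        for (String path : first) {
            if (lookup.contains(path)) {
                joined.add(path);
            }
        }
        return joined;
    }

    // Measures how long a task takes, so the single-query and split-query
    // approaches can be compared on the same data.
    static long elapsedNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }
}
```

A single timing is noisy, so any comparison should be repeated several times against a realistic content volume before one approach is chosen.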

AEM allows writing queries in one of three ways:

  • Via the QueryBuilder API
  • Using XPath
  • Using SQL2

All queries are converted to SQL2 before being run, but the overhead of this conversion is minimal. The main concerns when choosing a query language are therefore readability and the development team's comfort level.

NOTE
By default, QueryBuilder determines the result count, which is slower in Oak than in previous versions of Jackrabbit. To compensate for this, you can use the guessTotal parameter.
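As an illustration, the sketch below assembles a QueryBuilder predicate map with p.guessTotal set. The path /content/mysite, the cq:Page type filter, and the GuessTotalPredicates class are assumptions made for the example. Only the map is built here; in AEM it would then be handed to the QueryBuilder API, which is omitted to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class GuessTotalPredicates {
    // Builds QueryBuilder predicates for pages under a path without
    // requesting an exact total count.
    static Map<String, String> build(String path, int pageSize) {
        Map<String, String> predicates = new LinkedHashMap<>();
        predicates.put("path", path);                          // restrict the search root
        predicates.put("type", "cq:Page");                     // assumed node type
        predicates.put("p.limit", Integer.toString(pageSize)); // page size
        predicates.put("p.guessTotal", "true");                // estimate the total
        return predicates;
    }

    // Renders the predicates one per line, in key=value form.
    static String asText(Map<String, String> predicates) {
        return predicates.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("\n"));
    }
}
```

With guessTotal enabled, the reported total is an estimate, so any pagination logic should treat it as a lower bound rather than an exact count.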

The Explain Query Tool

As with any query language, the first step to optimizing a query is to understand how it will run. To do this, you can use the Explain Query tool that is part of the Operations Dashboard. With this tool, a query can be plugged in and explained. The tool shows a warning if the query will cause issues with a large repository, and it reports the execution time and the indexes that are used. It can also load a list of slow and popular queries that can then be explained and optimized.

DEBUG Logging for Queries

To get some additional information about how Oak is choosing which index to use and how the query engine is actually executing a query, a DEBUG logging configuration can be added for the following packages:

  • org.apache.jackrabbit.oak.plugins.index
  • org.apache.jackrabbit.oak.query
  • com.day.cq.search

Make sure to remove this logger when you have finished debugging your query. It tends to output a large amount of activity and can eventually fill up your disk with log files.

For more information on how to do this, see the Logging documentation.

Index Statistics

Lucene registers a JMX bean that will provide details about indexed content including the size and number of documents present in each of the indexes.

You can reach it by accessing the JMX Console at https://server:port/system/console/jmx

Once you are logged in to the JMX console, perform a search for Lucene Index Statistics to find it. Other index statistics can be found in the IndexStats MBean.

For query statistics, look at the MBean named Oak Query Statistics.

If you would like to dig into your indexes using a tool like Luke, you must use the Oak console to dump the index from the NodeStore to a filesystem directory. For instructions on how to do this, read the Lucene documentation.

You can also extract the indexes in your system in JSON format. To do this, you need to access https://server:port/oak:index.tidy.-1.json

Query Limits

During Development

Set low thresholds for oak.queryLimitInMemory (for example, 10000) and oak.queryLimitReads (for example, 5000), and optimize any expensive query that hits an UnsupportedOperationException saying “The query read more than x nodes…”

This helps avoid resource-intensive queries (that is, queries that are not backed by any index or are backed by an index that covers them poorly). For example, a query that reads 1 million nodes increases I/O and negatively impacts overall application performance. Any query that fails because of the above limits should be analyzed and optimized.
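To make the effect of a read limit concrete, here is a simplified imitation of the mechanism in plain Java. It is not Oak's implementation: ReadLimitGuard is a hypothetical class that counts the entries it reads and stops with an UnsupportedOperationException once the limit is exceeded.

```java
import java.util.Iterator;

final class ReadLimitGuard {
    // Reads entries until the iterator is exhausted and returns how many
    // were read; throws as soon as more than `limit` entries have been read.
    static long countWithLimit(Iterator<?> rows, long limit) {
        long reads = 0;
        while (rows.hasNext()) {
            rows.next();
            reads++;
            if (reads > limit) {
                throw new UnsupportedOperationException(
                        "The query read more than " + limit + " nodes");
            }
        }
        return reads;
    }
}
```

A low limit during development makes an unindexed or poorly indexed query fail quickly on a small repository, instead of surfacing later as a slow query on production content.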

Post-Deployment

  • Monitor the logs for queries triggering large node traversal:

    • *WARN* ... java.lang.UnsupportedOperationException: The query read or traversed more than 100000 nodes. To avoid affecting other tasks, processing was stopped.
    • Optimize the query to reduce the number of traversed nodes
  • Monitor the logs for queries triggering large heap memory consumption:

    • *WARN* ... java.lang.UnsupportedOperationException: The query read more than 500000 nodes in memory. To avoid running out of memory, processing was stopped
    • Optimize the query to reduce the heap memory consumption

For AEM versions 6.0 to 6.2, you can tune the threshold for node traversal via JVM parameters in the AEM start script to prevent large queries from overloading the environment.

The recommended values are:

  • -Doak.queryLimitInMemory=500000
  • -Doak.queryLimitReads=100000

In AEM 6.3, the above two parameters are preconfigured out-of-the-box, and can be persisted via the OSGi QueryEngineSettings.

More information is available at: https://jackrabbit.apache.org/oak/docs/query/query-engine.html#Slow_Queries_and_Read_Limits

Tips for Creating Efficient Indexes

Should I Create an Index?

The first question to ask when creating or optimizing indexes is whether they are required for a given situation. If you are only going to run the query in question once or only occasionally and at an off-peak time for the system through a batch process, it may be better to not create an index at all.

After creating an index, every time the indexed data is updated, the index must be updated as well. Since this carries performance implications for the system, indexes should only be created when they are required.

Also, indexes are only useful if the data contained within the index is unique enough to warrant it. Consider an index in a book and the topics that it covers. When indexing a set of topics in a text, there will usually be hundreds or thousands of entries, which lets you jump to a small subset of pages and quickly find the information that you are looking for. If that index only had two or three entries, each pointing you to several hundred pages, the index would not be useful. The same concept applies to database indexes. If there are only a couple of unique values, the index will not be useful. That being said, an index can also become too large to be useful. To look at index statistics, see Index Statistics above.
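The book-index analogy can be turned into a rough heuristic: divide the number of indexed entries by the number of distinct values to see how many entries a single lookup returns on average. The sketch below is an illustration only; IndexSelectivity is a hypothetical helper and is not how Oak estimates index cost.

```java
import java.util.HashSet;
import java.util.List;

final class IndexSelectivity {
    // Average number of entries that share one property value:
    // total entries divided by distinct values. Returns 0 for an empty list.
    static double averageEntriesPerValue(List<String> values) {
        if (values.isEmpty()) {
            return 0.0;
        }
        return (double) values.size() / new HashSet<>(values).size();
    }
}
```

A result close to 1 means that almost every value is unique, so a lookup narrows the search sharply. A result close to the total number of entries means that the property has only a few distinct values and an index on it would do little to narrow the search.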

Lucene or Property Indexes?

Lucene indexes were introduced in Oak 1.0.9 and offer some powerful optimizations over the property indexes that were introduced in the initial launch of AEM 6. When deciding whether to use Lucene indexes or property indexes, take the following into consideration:

  • Lucene indexes offer many more features than property indexes. For example, a property index can only index a single property while a Lucene index can include many. For more information on all the features available in Lucene indexes, consult the documentation.
  • Lucene indexes are asynchronous. While this offers a considerable performance boost, it can also induce a delay between when data is written to the repository and when the index is updated. If it is vital to have queries return 100% accurate results, a property index would be required.
  • By virtue of being asynchronous, Lucene indexes cannot enforce uniqueness constraints. If this is required, then a property index needs to be put in place.

In general, use Lucene indexes to gain the benefits of higher performance and flexibility, unless there is a compelling need to use property indexes.