Delivering commerce experiences at scale
Learn how to set up and configure Adobe Experience Manager and Adobe Commerce using the Commerce Integration Framework (CIF) to deliver high-performing experiences under load.
Transcript
Hi everyone, I’m Grant Naber, Customer Success Engineer Manager, Adobe Commerce Merit Services team. As I mentioned, over the past year we’ve been running performance tests on an AEM Venia store connected via CIF to an Adobe Commerce Cloud instance, to simulate consistent high load and surges in traffic. The findings from our project have shown the response times we can expect from an out-of-the-box setup. The project has also shown us several optimizations to the default settings of AEM and Adobe Commerce Cloud that make it possible to better handle high levels of load. This presentation will outline, at a high level, some of the challenges we faced when simulating high load, and how we solved them with relatively simple configuration changes. Please note that the AEM settings in this presentation are specific to AEM on Adobe Managed Services or on-premise hosting; some settings on AEM as a Cloud Service are automated and therefore do not need to be manually configured. My colleagues Caleb and Dominique will be available during this session to answer any questions you would like to post in the Q&A chat channel as we go through. Full details on the results of our testing are available in an Adobe white paper, which is available on the Adobe website; the URL is included in this presentation, and links to other, more in-depth documentation are also included throughout.

So what is CIF? CIF is the Commerce Integration Framework. CIF enables AEM to directly access and communicate with an Adobe Commerce instance using Adobe Commerce’s GraphQL APIs. This integration approach leverages the strengths of each application: the authoring, personalization and omnichannel capabilities of AEM, and the e-commerce operations of Adobe Commerce. The AEM dispatcher is a reverse proxy that helps deliver an environment that is both fast and dynamic. It works as part of a static HTML server, such as Apache, with the aim of storing, or caching, as much of the site content as possible in the form of static resources. This approach aims to minimize use of the AEM publisher’s page-rendering functionality and the Adobe Commerce GraphQL service as much as possible. Within CIF there is also support for server-side and client-side communication patterns: as well as utilizing AEM’s dispatcher in the caching strategy, it is possible for the dispatcher to proxy uncacheable requests directly to Adobe Commerce via the Fastly CDN. This allows fully dynamic, real-time content from Adobe Commerce to be displayed within the cached AEM web page itself.

So here we’ve got an example of a PDP, or product detail page, on an AEM front end connected via CIF to a headless Adobe Commerce backend. Commerce pages such as product detail pages and product listing pages are unlikely to change frequently, and so can be rendered once on the AEM publisher and then cached in the dispatcher for future visitors. The CIF core components on the AEM publishers get data from Adobe Commerce via the GraphQL APIs; these pages are created once, cached on the AEM dispatcher, and then delivered to the browser. Where more dynamic attributes such as stock levels, availability or price are displayed, client-side components can be used. We can see an example of this on this PDP page: the areas in purple are calls to Adobe Commerce from the publisher during the building of the page itself, for example the product description, the product attributes and the navigation.
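To make the server-side pattern concrete, here is a minimal sketch of the kind of GraphQL query a CIF component might send to Adobe Commerce when the publisher renders a product detail page. The field selection and the sample SKU are illustrative assumptions, not taken from the presentation:

```graphql
# Hypothetical PDP query: fetches the mostly-static product data that
# the AEM publisher renders once and the dispatcher then caches.
{
  products(filter: { sku: { eq: "VT12" } }) {
    items {
      name
      description { html }
      image { url label }
      price_range {
        minimum_price {
          final_price { value currency }
        }
      }
    }
  }
}
```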
So once this is built on the publisher, it’s then cached in the dispatcher, so no further requests to, or load on, Adobe Commerce result. The areas in red are real-time calls to Adobe Commerce from the client browser itself. Data here is always kept up to date, but it does result in repeated requests to, and load on, Adobe Commerce. However, this is necessary for things like the shopping cart, and also for price information if it is intended to be kept up to date in real time.

In developing a caching strategy, you should first meet with your business users to define and agree acceptable caching lengths, or TTLs, for different content areas. Whilst the initial ask from the business may be to have any updated content immediately available on the site, the end-user performance benefits of an effective caching strategy should be explained for each content area, together with the potential revenue loss if stale content is incorrectly served. We can see an example here. In the left column we’ve got the impact if stale content is served, from low to very high; then the caching area, which shows the content area of the site; how often that content is likely to change; and the acceptable caching TTL, or time to live, that the business can accept for that content. If we look at the low-impact areas, we can see things like the site content HTML pages, which are updated by the CMS, and also site content such as templates, logos, CSS and design images. Because this doesn’t change very often, the business could usually accept a long TTL of a day to a week. At the other end of the scale we’ve got the very-high-impact areas: the checkout, shopping cart and payment pages. On these, every user is unique, so no caching can be put in place. We’ve then got areas in the middle of the scale, such as prices. Even though prices are flagged here as updated frequently, with no caching suitable, the business might agree to a daily price update at a scheduled time on the site, in which case prices could be flagged as cacheable as well. So it depends on what the business is willing to accept.

We’re now going to go through some AEM dispatcher optimizations. Problem: cache expiry is not automated, and some assets that never or very rarely change continue to result in load on AEM. By default, AEM uses a dispatcher flush agent, which is a manual cache invalidation approach. By setting up TTL-based caching, content expiry becomes automated: an additional file, carrying a time equal to the expiry date, is created next to each cached file, and when the cached file is requested past this time it is automatically re-requested from the publisher. This gives an effective caching mechanism that requires no manual intervention or maintenance. Once the product update delay, or TTL, has been acknowledged and accepted by the business stakeholders, as per the expiry values on the previous caching planning slide, you can set the caching duration using parameters such as max-age in seconds: a daily, weekly or monthly update.
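As an illustration, here is a minimal sketch of TTL-based expiry in the /cache section of a dispatcher farm file; the docroot path is a placeholder, and this assumes a dispatcher version that supports enableTTL (AMS or on-premise, as noted earlier). The publisher, or the web server in front of it, also needs to send a Cache-Control max-age header for the content in question:

```
/cache
  {
  # Placeholder docroot; use your environment's actual cache root.
  /docroot "/mnt/var/www/html"
  # Honour Cache-Control/Expires headers from the publisher: the cached
  # copy expires on its own and is re-fetched on the next request,
  # with no flush agent involvement.
  /enableTTL "1"
  }
```

With this in place, a response header such as Cache-Control: max-age=86400 gives the daily update described above.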
Some assets are very unlikely to change, and therefore even requests to the dispatcher can be reduced by caching the relevant files locally in the user’s browser. For example, the site’s logo, which is displayed in the template on every page of the site, would not need to be requested from the dispatcher each time; it can instead be stored in the user’s browser cache. The reduction in bandwidth requirements for each page load has a positive impact on site responsiveness and page load times. This can be done using the Cache-Control max-age response header, and results in a 304 Not Modified response code instead of a 200. Areas of the site that can be set to be cached in the client’s browser include images within the AEM template itself, for example the site logo and template design, CSS files, and site JavaScript files, including the CIF JavaScript files themselves.

Problem: a change to a page or a file in AEM invalidates the whole cache, slowing down the site. The default dispatcher configuration uses a statfileslevel setting of 0. This means that if a file is modified, everything is invalidated, at every directory level. For a busy e-commerce site this places an unnecessary amount of load on the AEM publishers, as the whole site structure becomes invalidated by a single page update. Instead, the statfileslevel setting can be raised to a value corresponding to the depth of the subdirectories in the htdocs directory under the document root, so that when a file located at a certain level is invalidated, only files at that stat level and below are invalidated. For example, let’s say you’ve got a product page template at a path like /content/site/country/language/products/product-page.html (an illustrative path; the slide shows its own URL). Each folder level has a stat level, so if we break that URL down we can see the stat levels in the table here. In this case, if you left the statfileslevel property at the default of 0 and the product page HTML template were updated, every stat file from the document root down to level 4 would be invalidated, causing further requests to the AEM publisher instances for all pages across the site for that single change. However, if the statfileslevel property is set to 4 and a change is made to the product page HTML, only the stat file in the products directory for that specific website country or language is touched.

Problem: updating a high-traffic page on the site causes the publishers to become overloaded under extreme load conditions. Another dispatcher setting to optimize when configuring the stat file level is the grace period. This defines the number of seconds that a stale, auto-invalidated resource may still be served from the cache after the last activation occurred. Without a grace period set, a page under extreme load may attract many concurrent requests to the AEM publishers, potentially slowing down or overloading the system whilst the publishers are trying to rebuild the page and store it in the dispatcher cache. Setting the grace period to 2 seconds, for example, prevents this scenario by continuing to serve the old version of the page from the dispatcher cache whilst the publishers build the new version.
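Both settings live in the /cache section of the dispatcher farm file. A minimal sketch, with the level of 4 matching the illustrative path above:

```
/cache
  {
  # Invalidate only the affected subtree: level 4 corresponds to the
  # /products directory in the illustrative path above, so a product
  # page update no longer touches stat files for the whole site.
  /statfileslevel "4"
  # Serve the stale copy for up to 2 seconds after auto-invalidation
  # while the publishers rebuild the page, avoiding request stampedes.
  /gracePeriod "2"
  }
```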
Problem: a pay-per-click campaign or a viral social media campaign causes heavy publisher load and bypasses the dispatcher cache. E-commerce sites may drive traffic to their sites using pay-per-click search adverts or social media campaigns. Use of these mediums means that a tracking ID is added onto the outbound link from that platform. For example, Facebook adds a Facebook click ID (fbclid) to the URL, and Google adverts add a Google click ID (gclid); this makes incoming links to your AEM front end look like the example here. The gclid changes for every user that clicks the link. This is intended for tracking purposes, but with its default settings AEM sees every such request as a unique page, which therefore bypasses the dispatcher cache and generates unnecessary extra load on the publishers and on Adobe Commerce. During a surge event this can even cause the AEM publishers or Adobe Commerce to become overloaded and unresponsive. To solve this problem, you should configure the dispatcher to ignore all marketing tracking parameters by default, using the ignoreUrlParams setting. When a parameter is set to be ignored, the page is cached the first time it is requested, and subsequent requests for the page are served the cached page regardless of the value of the parameter in the request. Further ideas for dispatcher optimizations can be found by using the dispatcher optimization tool, which is available from the following link.

We’re now going to look at some AEM publisher optimizations that are possible. Problem: uncacheable pages place excessive load on the publishers. Individual components within AEM can be set to be cached, meaning that a GraphQL request to Adobe Commerce is made once, and subsequent requests, up to the configured time limit, are served from the AEM cache and place no further load on Adobe Commerce. Examples of where this is useful are a site navigation based on a category tree, which is shown on all pages, and a search page with faceted search filters in place, which you can see in this example screenshot. These are just two areas that require resource-intensive queries on Adobe Commerce to build, yet are unlikely to change regularly, and so are good choices for caching. With this configured, even when an uncached PDP is built by the publisher, the call to rebuild the site navigation does not need to be repeated to Adobe Commerce, as that component will already have been cached from another page built and cached previously. The resource-intensive GraphQL request for the navigation build therefore does not hit, or put load on, Adobe Commerce, and can be served from the GraphQL cache in AEM CIF instead. The first example shown here caches the navigation component, because it sends the same GraphQL query on every page of the site; it caches the last 100 entries for 10 minutes for the navigation structure. The second example below caches the last 100 faceted search options on a search page for one hour; this is useful on the sample page here, with the faceted search functionality on the left side of the screen.

Another AEM publisher optimization. Problem: checking Fastly on Adobe Commerce Cloud shows that GraphQL requests are not being cached. To solve this, you should set the default HTTP method from POST to GET in CIF. The reason is that only GET requests are cached by Fastly on Adobe Commerce Cloud; POST requests bypass the cache with default settings.

Another problem: Adobe Commerce starts to return 503 errors even though the AEM environment is still able to handle more load. Initially, the max HTTP connections value in the OSGi settings should be set to the default Fastly maximum connections limit, which is currently 200. Even if there are multiple publishers in the AEM farm, the limit should be set to the same value, matching the Fastly setting, on each publisher. The reason is that in some cases one publisher may handle more traffic than the others: if an associated dispatcher is taken out of the farm, for example, all traffic will be routed through the single remaining dispatcher and publisher, and that single publisher will then need all of those HTTP connections available. You should also set the publisher GraphQL connection limits and timeouts to match the Fastly settings.
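The POST-to-GET switch, the component-level GraphQL caching and the connection limit described above all live in the CIF GraphQL client’s OSGi configuration on the publisher, typically a factory configuration for the com.adobe.cq.commerce.graphql.client.impl.GraphqlClientImpl PID. A minimal sketch follows; the identifier, endpoint URL, timeout values and cache entry names are illustrative assumptions, and the cache entry format (name:enable:maxSize:timeoutSeconds) should be checked against your CIF version’s documentation:

```json
{
  "identifier": "default",
  "url": "https://commerce.example.com/graphql",
  "httpMethod": "GET",
  "maxHttpConnections": 200,
  "connectionTimeout": 5000,
  "socketTimeout": 5000,
  "cacheConfigurations": [
    "navigation:true:100:600",
    "searchFilters:true:100:3600"
  ]
}
```

Here httpMethod GET allows Fastly to cache the queries, maxHttpConnections matches the Fastly limit of 200, and the two cache entries correspond to the navigation (100 entries, 10 minutes) and faceted search (100 entries, one hour) examples above.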
So now we’re going to look at some Adobe Commerce Cloud optimizations. Problem: multiple queries in a single GraphQL request are not being cached at Fastly. GraphQL allows you to make multiple queries in a single call. It is important to note that if you include even one query that Adobe Commerce does not cache (for example the cart or customer orders) alongside others that are cacheable (for example categories and products), Fastly will bypass the cache for all queries in the call. Developers should consider this when combining multiple queries, to ensure that potentially cacheable queries are not unintentionally bypassed.
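To illustrate the pitfall, here is a hypothetical combined request (the query shapes and SKU are illustrative): because the cart query is never cached, bundling it with an otherwise cacheable products query causes Fastly to bypass the cache for the entire response. Sending the two queries as separate GET requests keeps the product data cacheable.

```graphql
# Anti-pattern: one uncacheable query forces a cache bypass for the
# whole call, including the otherwise cacheable product data.
{
  products(filter: { sku: { eq: "VT12" } }) {
    items { name }          # cacheable when requested on its own
  }
  customerCart {
    items { quantity }      # per-customer data: never cached by Fastly
  }
}
```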
Problem: Adobe Commerce indexers are running during heavy-traffic times on the site, using up system resources and slowing down performance. By default, indexers in Adobe Commerce are set to update on save. This wouldn’t be an issue for a low-traffic site that is not regularly updated, but for a site with heavy load and many updates this setting causes issues with site response times. For sites expecting heavy load, all indexers should be set to update on schedule. This setting performs targeted updates, only to changed indexes, in the background with the use of cron jobs, and will not impact Adobe Commerce response times or site performance.

Problem: a multinational e-commerce store is showing a high percentage of hits to the origin, bypassing Fastly. To resolve this, you should enable origin shielding to reduce traffic to the Adobe Commerce origin. When a request is received, Fastly, at a local point of presence (POP), checks for cached content and serves it. If it’s not cached there, the request continues to the shield POP to check whether it’s cached there; if the content has been previously requested, even from another global POP, it will be cached. You can see this in the example here: we’ve got a user in France, one in America and one in Japan. Each of them has their own local edge location, which in turn calls the Magento origin server. With origin shielding enabled, you have a single origin shield defined, in this case Fastly Virginia. The user in America hits the Fastly Virginia edge location, which pulls the content in from the Magento origin server; but every other edge location, instead of hitting the origin server directly, calls the origin shield in Fastly Virginia. This reduces load on the Magento origin. You should always choose the origin shield closest to your origin’s data center, not your visitors’ locations, although usually these will be the same.

Problem: slow loading times and high load on the Adobe Commerce Cloud servers when serving catalog images. The solution is Fastly image optimization. This offloads all serving of catalog product images to the Fastly CDN, and moves the resource-intensive catalog image transformation processing onto Fastly and off the Adobe Commerce origin. To enable it, you need to proxy catalog image URLs through the AEM dispatcher directly to Magento, and you also need to have origin shielding enabled. End-user page load times improve, as catalog images are transformed to a device-optimized size and format at the edge location, which eliminates latency by reducing the number of requests back to the Adobe Commerce origin.

You may see something similar to the attached New Relic chart of database response time under load, where the yellow bar increases with the time it takes for database responses to happen. By default, MySQL slave connections are not activated in Adobe Commerce on Cloud, because these settings are only suitable for customers experiencing very high load: the cross-availability-zone latency is higher with slave connections activated, so this setting actually reduces the performance of an Adobe Commerce on Cloud instance receiving only regular load levels. If the Adobe Commerce instance is expecting heavy load, however, activating master/slave connections for MySQL will help performance by spreading the read-only load on the MySQL database across the different nodes. As a guide, on environments with normal load levels, enabling slave connections may slow performance by 10 to 15%, but on clusters with heavy load and traffic there is usually a performance boost of around 10 to 15%. It is therefore really important to load test your environment with the expected traffic levels, to evaluate whether this setting will benefit your response times under load. You can enable slave connections with a simple change to the .magento.env.yaml file.
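A minimal sketch of that change, assuming the MYSQL_USE_SLAVE_CONNECTION deploy-stage variable in .magento.env.yaml:

```yaml
# .magento.env.yaml (sketch): send read-only MySQL queries to the
# replica (slave) connection, spreading read load across the nodes
# of a Pro cluster. Load test before enabling, as discussed above.
stage:
  deploy:
    MYSQL_USE_SLAVE_CONNECTION: true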
So now we’ll look at some infrastructure optimizations. To reduce latency between the AEM publishers and the Adobe Commerce GraphQL service when building pages, the initial provisioning of the two separate infrastructures should be hosted within the same cloud region. The geographical location chosen for both clouds should also be closest to the majority of your customer base, so that client-side GraphQL requests to the Adobe Commerce Cloud origin are served from a nearby location.

Problem: during traffic surges, dispatchers and publishers drop out of the load balancers. Assuming there is a load balancer in the AEM infrastructure, with multiple dispatchers and publishers, the following settings should be considered. Publisher health checks should be reviewed to prevent dispatchers dropping out of service unnecessarily early during load surges, and the timeout setting of the load balancer health checks should be aligned with the publisher timeout settings.

Problem: setting sticky sessions on the AEM load balancer causes requests to Fastly not to be cached. Dispatcher target group stickiness should be disabled, and round-robin load balancing should usually be used. This assumes there is no AEM-specific functionality, and no AEM user sessions, that would require session stickiness to be set, and that user login and session management are done only on Adobe Commerce via GraphQL. Please note that if you do enable session stickiness, it may cause requests to Fastly not to be cached, as by default Fastly does not cache pages with the Set-Cookie header.

If, after all the configurations above, load test results or analysis of the live infrastructure’s performance still indicate that load levels on Adobe Commerce consistently max out the CPU and other system resources, then a move to a split-tier architecture on Adobe Commerce Cloud can be considered. With the standard Adobe Commerce Cloud Pro architecture there are three nodes, each of which contains the full tech stack. Converting to a split-tier architecture changes this to a minimum of six nodes: three become the database (core) nodes, and the other three are reserved for processing web traffic. This means there are greater possibilities for scaling with a split tier: core nodes containing the databases can be scaled vertically very quickly, and web nodes can be scaled both horizontally and vertically, which gives a large amount of flexibility to expand the infrastructure on demand, at periods of high load and activity, on the nodes where the extra resources are needed. If a decision has been made that you’d like to try out a split-tier architecture, you should have a discussion with your customer success manager to enable it.

So, to conclude, thank you very much for your time. For further details on all the recommendations included in this presentation, please see the full white paper; there’s a link available here, and I believe the presentation will be posted after this session. The white paper also includes suggestions on an orders-per-hour-based load testing approach, and some other setting optimizations in more detail, to ready your AEM, CIF and Magento Commerce Cloud site for load. So thank you very much for your time.