Sitemaps

Last update: 2024-01-25
Created for: Intermediate, Developer

Learn how to help boost your SEO by creating sitemaps for AEM Sites.

WARNING

This video demonstrates the use of relative URLs in the sitemap. Sitemaps should use absolute URLs. See Configurations for how to enable absolute URLs, as this is not covered in the video below.

 Transcript

So I want to quickly show you how to enable AEM core component Sitemaps functionality for AEM sites.

And this allows you to generate Sitemaps for search engines, like Google, to help bolster your site’s visibility in search engines.

In this video, we’ll be adding a Sitemap to the WKND website that’s built using core components. And it’s important to note that the WKND site’s pages use the AEM core components page component, as this component type is automatically registered to support Sitemaps.

Sitemaps are considered an opt-in feature, meaning we need to explicitly enable them. There are two ways to do this. The first is to configure the Sitemap scheduler to generate Sitemap files at a defined frequency, and usually we set this to about once a day. Using this method, Sitemaps are created by a background process, cached in AEM, and then served when requested. The second way is to enable on-demand Sitemap generation. Using this method, the Sitemap is regenerated every time a request for the Sitemap is made.

Now, unless you have a very small site, you almost always want to use the scheduled generation approach in production. And the reason for this is threefold. First, as you might guess, as your site grows large, so does the cost to generate the Sitemap. So ideally we generate the Sitemap as few times as we need to, and we do it in the background so it doesn’t affect other processes. Second, Sitemaps are typically crawled infrequently by search engines, and by infrequently, I mean about once a day or even less. If you know your Sitemaps are being crawled more frequently, or if you’re just more comfortable trying to guarantee a higher level of freshness in your Sitemaps, you can of course increase the frequency the Sitemap scheduler runs at. And the third reason to prefer scheduled generation is that it runs as a background process. Because of this, it allows AEM to optimize how it generates and serves the Sitemaps, whereas on-demand Sitemap generation can’t offer a lot of these optimizations due to certain constraints imposed by executing in the context of an HTTP request. And we’ll touch on some of these optimizations a little bit later in the video.

So on-demand generation is okay for use with small sites, but it also comes in quite handy when testing out Sitemap configurations during development, since when we use it, we don’t have to wait for the scheduler to run to see our changes reflected.

So in this video, we’ll manually enable ad hoc (on-demand) Sitemap generation on our AEM SDK Publish service for testing, but configure the scheduler for use on our AEM as a Cloud Service environment.

Okay, so I have an AEM SDK Author and Publish service running locally with the WKND site installed. Let’s get started by enabling ad hoc Sitemap generation on the SDK.

Typically, Sitemaps are only useful on AEM Publish, since that’s where search engines have access to crawl. So let’s configure our Sitemaps there. Log in to your AEM Publish SDK’s OSGi web console, navigate to the Configuration Manager, and locate the Apache Sling Sitemap - Sitemap Generator Manager configuration.

Opening this allows us to enable on-demand Sitemap generation. So let’s check this checkbox. And keep in mind that we’re making this change directly in the AEM SDK just for local development purposes. If you do choose to use on-demand Sitemap generation in your AEM as a Cloud Service environments, you’ll have to create an OSGi configuration for this in your project and deploy it to AEM as a Cloud Service.

With this configured, let’s jump over to AEM Author to select our Sitemap endpoints. When working with Sitemaps in AEM, we must first decide which AEM site tree we want represented by a Sitemap. So let’s say we want each WKND country to have a Sitemap. So the WKND US site will have its own Sitemap, WKND Canada will have its own, and so on and so forth. For this video, we’ll configure it for the US site. So we’ll select the US page and open its page properties.

Navigate to the advanced tab.

And then down at the bottom select generate Sitemap. And save your changes.

Let’s publish these changes to AEM Publish.

Let’s jump over to AEM Publish. And now we can request the page path that we just marked with generate Sitemap with the .sitemap.xml selector and extension.

And this will return a standards compliant XML Sitemap, and this will contain entries for all of the pages under the US pages content hierarchy.

So since the WKND US site is admittedly small, it’s fine to have a single Sitemap for it. However, if sites grow large, you can leverage AEM’s Sitemap index file capabilities. And to do this, all we have to do is request that top-level Sitemap page with the .sitemap-index.xml selector and extension, instead of the .sitemap.xml selector and extension. And this will return a standards-compliant Sitemap index file that in turn lists all the Sitemap files that represent the site. The index file lists the Sitemap for the top-level Sitemap page, as well as the URLs of the Sitemap files for any pages under the top-level page that have also been marked with generate Sitemap. So let’s take a look at how we can use this approach to split our Sitemap up.

We’ll head back to AEM Author, and let’s go ahead and mark the adventures page as having its own Sitemap. So like before, we simply select it, open its page properties, select the advanced tab, and select generate Sitemap and save.

Again, we can publish these changes.

And back on AEM Publish, we can refresh our sitemap-index.xml URL.

Now you can see the Sitemap index has two Sitemap entries in it. The first being the URL to the adventure Sitemap and the second for everything else under the US page. So let’s open each of these URLs.
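As a rough illustration of what a Sitemap index contains, the following Python sketch extracts the listed locations from an index document. The sample XML and its example.com URLs are hypothetical stand-ins, not output captured from AEM; only the sitemaps.org namespace is taken from the Sitemap standard.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical index with two entries, mirroring the two-Sitemap split
# described in the transcript. The URLs are made up for illustration.
SAMPLE_INDEX = """\
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemaps/adventures.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemaps/us.xml</loc></sitemap>
</sitemapindex>
"""


def list_locations(xml_text: str) -> list[str]:
    """Return every <loc> value found in a Sitemap or Sitemap index."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS)]


if __name__ == "__main__":
    for location in list_locations(SAMPLE_INDEX):
        print(location)
```

Because both Sitemap files and Sitemap index files use `<loc>` elements in the same namespace, the same helper can be pointed at either kind of document.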

As we can see, the adventure Sitemap contains only the pages under /content/wknd/us/en/adventures.

And the US Sitemap contains everything else under /content/wknd/us, excluding the URLs already covered by the adventure Sitemap.
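The split described above can be pictured as each page belonging to its nearest ancestor that is marked with generate Sitemap. The Python sketch below models only that observed behavior; it is not AEM’s implementation, and the function name and the sample page paths below the two Sitemap roots are hypothetical.

```python
from typing import Optional

# The two pages marked with "generate Sitemap" in the transcript.
SITEMAP_ROOTS = ["/content/wknd/us", "/content/wknd/us/en/adventures"]


def owning_sitemap_root(page_path: str, roots: list[str]) -> Optional[str]:
    """Return the deepest Sitemap root that is the page itself or one of its ancestors."""
    candidates = [
        root for root in roots
        if page_path == root or page_path.startswith(root + "/")
    ]
    # The longest matching root is the nearest ancestor.
    return max(candidates, key=len, default=None)


if __name__ == "__main__":
    # Hypothetical page paths used only to exercise the function.
    for path in (
        "/content/wknd/us/en/adventures/bali-surf-camp",
        "/content/wknd/us/en/magazine",
        "/content/wknd/ca/en",
    ):
        print(path, "->", owning_sitemap_root(path, SITEMAP_ROOTS))
```

Under this model a page is listed in exactly one Sitemap, which matches the transcript’s note that the US Sitemap excludes URLs already covered by the adventure Sitemap.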

There are a few cool things that the Sitemap index brings to the table. The first is that each Sitemap can be configured to be created using custom generators. This allows custom logic to dictate which entries are included in the Sitemap. The second is that this allows different Sitemaps to be configured to refresh on different schedules. So for example, we could set the US Sitemap to refresh daily, but have the adventure Sitemap refresh hourly.

And third, if the Sitemap files get too big, so over 10 MB or 50,000 entries, AEM will automatically split the Sitemap into chunks. And the Sitemap index file will automatically include links to each Sitemap chunk. Note that these last two only apply to Sitemaps generated using a scheduler and not via on-demand generation.
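To make the chunking idea concrete, here is a minimal Python sketch that splits a list of entries by the 50,000-entry limit mentioned above. It deliberately ignores the file-size limit and is not how AEM implements the split; the function name and the generated paths are hypothetical.

```python
from typing import Iterator

# Entry limit quoted in the transcript; the size limit is not modeled here.
MAX_ENTRIES_PER_SITEMAP = 50_000


def chunk_entries(
    urls: list[str], max_entries: int = MAX_ENTRIES_PER_SITEMAP
) -> Iterator[list[str]]:
    """Yield consecutive slices of at most max_entries URLs."""
    for start in range(0, len(urls), max_entries):
        yield urls[start:start + max_entries]


if __name__ == "__main__":
    # Made-up page paths, just enough to require three chunks.
    urls = [f"/content/wknd/us/en/page-{i}" for i in range(120_000)]
    for number, chunk in enumerate(chunk_entries(urls), start=1):
        print(f"chunk {number}: {len(chunk)} entries")
```

Each resulting chunk would correspond to one Sitemap file linked from the Sitemap index.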

So next, I want to show how we can exclude error pages from the Sitemap.

So back in AEM Author, we can navigate to the pages that we want to exclude. And in this case, let’s exclude our two error pages. So to do this, we’ll simply open the page properties of the page to exclude, head over to the advanced tab, and right above generate Sitemap is a robots tags list.

Let’s add the noindex option. And noindex instructs the Sitemap generator to exclude this page from the Sitemap.

We can save our changes. And let’s do the same thing to the other error page. And note that this noindex setting does not cascade down the content tree, so we couldn’t just have set this on the parent errors page. A quick note here: I’ve been breaking Live Copy inheritance to set some of these values, but in reality, we’d probably be making these changes in the language master and rolling them out, so we wouldn’t have to do this for all our site variations. But this video isn’t about MSM and Live Copy, so I don’t want to confuse things and I’ll just apply these directly. Once we’ve marked our error pages as noindex, we can publish them.
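The non-cascading behavior of noindex can be sketched in Python as a filter that looks only at each page’s own robots tags. This is an illustration of the rule as described, not AEM’s generator code; the page paths and the dictionary layout are assumptions.

```python
# Hypothetical page paths mapped to the robots tags set on each page itself.
PAGES = {
    "/content/wknd/us/en/errors": set(),
    "/content/wknd/us/en/errors/404": {"noindex"},
    "/content/wknd/us/en/errors/500": {"noindex"},
    "/content/wknd/us/en/magazine": set(),
}


def sitemap_entries(pages: dict[str, set[str]]) -> list[str]:
    """Keep pages whose own robots tags lack noindex; ancestors are not consulted."""
    return sorted(
        path for path, robots_tags in pages.items()
        if "noindex" not in robots_tags
    )


if __name__ == "__main__":
    for path in sitemap_entries(PAGES):
        print(path)
```

Note that if noindex were set only on the parent errors page, the 404 and 500 pages would still pass this filter, which is why the transcript sets the tag on each error page individually.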

And then check our Sitemaps back on AEM Publish.

I haven’t refreshed the US Sitemap from the last time we loaded it. So you can see that we still have our errors 404 and errors 500 pages listed. So let’s go ahead and refresh it, now that we’ve added noindex to those two pages.

And there we go, we can see our 404 and 500 error pages are no longer in the Sitemap.

Let’s switch gears and make changes to the WKND project source code, so that the Sitemaps can be used in AEM as a Cloud Service environments. So, I’ll open the WKND code base up here in my IDE.

And as mentioned before, we need to ensure we enable Sitemap generation. And since this is for AEM as a Cloud Service, we’ll use the Sitemap scheduler to generate our Sitemaps. So let’s configure this. To do this, we need to define an OSGi factory configuration that specifies the frequency at which AEM will generate our Sitemaps. So, let’s head over to the ui.config project and create that. Since we only want to generate Sitemaps on AEM Publish, since that’s where they’re being consumed, we’ll add the OSGi configuration in the config.publish folder. We’ll use the Sitemap scheduler’s OSGi persistent identity (PID) as the file name, a tilde as the delimiter, and then postfix it with a semantic identifier. In this case, wknd is a pretty good name. And of course, we’ll use the recommended .cfg.json format.

There are three main configurations we need to add to get a basic Sitemap up and running. The first is scheduler.name, and this is just a name to help us identify what the scheduler does. So it’s always good to give it a semantic name. Let’s call ours WKND Sitemaps. Next is scheduler.expression. And this is quite important, as it defines the cron expression that determines when the scheduler will run and generate the Sitemaps. So whatever we put here will define the interval at which our Sitemap is refreshed. So let’s set ours to run daily during off hours and just run it at 2:00 AM.

Lastly is searchPath. And searchPath defines the AEM paths whose Sitemaps the scheduler will generate. So here, we can just scope it to /content/wknd. Now, when our WKND app is deployed to AEM Publish, AEM will regenerate the Sitemaps on a daily basis based on our scheduler expression.

The other changes we need to make to the code project involve small updates to the AEM Dispatcher configurations. Previously, you may have noticed that I’ve been accessing AEM Publish directly on localhost:4503, but as we all know, AEM Publish is always accessed via AEM Dispatcher outside of our local dev environment. And by default, AEM Dispatcher doesn’t allow the sitemap or sitemap-index selectors. So we need to go to the Dispatcher project, open up the filters.any file, and add an allow rule for requests to the Sitemap endpoints.

So let’s add a new allow rule here.

We’ll set the type to allow.

We’ll set the path to /content/*.

Next we’ll set the selectors. And for this, we’ll add the sitemap-index or sitemap selectors. And this is very important: since we’re using the or syntax, you must use single quotes around this value.

And finally, we’ll add the extension, which will be xml.

The last change we need to make is to the rewrite.rules file in our Dispatcher project. And this is required due to the URL shortening WKND uses, which hides the /content/wknd prefix from its public URLs. So here, let’s add the .xml extension, so any incoming requests with the .xml extension are rewritten to serve off AEM’s resources under /content/wknd. Note that depending on your implementation, you may not want all XML requests to be routed in this manner. So you could certainly be more specific and require that requests end with .sitemap-index.xml or .sitemap.xml. But since WKND doesn’t use other XML requests, let’s keep it simple and just match on .xml. Okay. Let’s commit these changes and deploy our Sitemap-enabled WKND site to AEM as a Cloud Service to see it work on the cloud.

All right, the updated WKND app is now deployed to AEM as a Cloud Service using Cloud Manager, and I’ve made the same changes to the site’s page properties as we did on our local AEM SDK. So here I’ve opened up the WKND site running on the AEM as a Cloud Service Publish service, and let’s request its Sitemap index file.

And here we go, we have that Sitemap index file pointing to the two discrete Sitemap files that comprise the totality of the WKND US site.

And opening each of these Sitemap URLs renders the respective Sitemap.

So now we can submit the Sitemap index URL to our favorite search engines for indexing and give our SEO a boost. Okay. I think that’s about it for getting a basic Sitemap up and running in AEM. Do note that the Sitemap functionality is quite powerful and customizable, allowing you to implement your own generators and filters, giving you complete control over what your Sitemap generates. To learn more about these advanced capabilities, check out the supporting documentation.

Configurations

Absolute sitemap URLs

AEM’s sitemap supports absolute URLs by using Sling mapping. This is done by creating mapping nodes on the AEM services generating sitemaps (typically the AEM Publish service).

An example Sling mapping node definition for https://wknd.com can be defined under /etc/map/https as follows:

Path                      Property name           Property type  Property value
/etc/map/https/wknd-site  jcr:primaryType         String         nt:unstructured
/etc/map/https/wknd-site  sling:internalRedirect  String         /content/wknd/(.*)
/etc/map/https/wknd-site  sling:match             String         wknd.com/$1
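One way to read this mapping is that a resource path matching the sling:internalRedirect pattern is rewritten into the sling:match template, with the scheme taken from the /etc/map/https parent node. The Python sketch below is a simplified model of that reading under those assumptions; it is not Sling’s resource resolver, and the function name and sample path are hypothetical.

```python
import re

# Values copied from the mapping node definition above.
INTERNAL_REDIRECT = r"/content/wknd/(.*)"
MATCH_TEMPLATE = "wknd.com/$1"
SCHEME = "https"  # assumed from the /etc/map/https parent node


def to_absolute_url(resource_path: str) -> str:
    """Map a repository path to an absolute URL; unmatched paths are returned unchanged."""
    match = re.fullmatch(INTERNAL_REDIRECT, resource_path)
    if match is None:
        return resource_path
    return f"{SCHEME}://" + MATCH_TEMPLATE.replace("$1", match.group(1))


if __name__ == "__main__":
    # Hypothetical page path used only to exercise the mapping.
    print(to_absolute_url("/content/wknd/us/en/adventures"))
```

Paths outside /content/wknd fall through untouched in this model, which is why the mapping only affects the WKND site’s sitemap entries.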

The screenshot below illustrates a similar configuration but for http://wknd.local (a local hostname mapping running on http).

[Screenshot: Sitemap absolute URLs configuration]

Sitemap scheduler OSGi configuration

Defines an OSGi factory configuration that sets how often (using a cron expression) sitemaps are generated, regenerated, and cached in AEM.

ui.config/src/main/jcr_content/apps/wknd/osgiconfig/config.publish

{
  "scheduler.name": "WKND Sitemaps",
  "scheduler.expression": "0 0 2 1/1 * ? *",
  "searchPath": "/content/wknd"
}
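The scheduler.expression value can be easier to read once its fields are labeled. The Python sketch below assumes the Quartz-style field order (seconds, minutes, hours, day of month, month, day of week, optional year), which matches the seven-field shape of the expression above; the helper name is hypothetical and the code only labels fields, it does not evaluate schedules.

```python
# Quartz-style field order, assumed here based on the expression's shape.
QUARTZ_FIELDS = (
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
)


def describe_cron(expression: str) -> dict[str, str]:
    """Label each whitespace-separated field of a 6- or 7-field cron expression."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))


if __name__ == "__main__":
    for field, value in describe_cron("0 0 2 1/1 * ? *").items():
        print(f"{field:>12}: {value}")
```

Read this way, the expression fires at second 0, minute 0, hour 2, every day starting from day 1 of the month, in every month and year, which lines up with the daily 2:00 AM run described in the transcript.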

Dispatcher allow filter rule

Allow HTTP requests for the sitemap index and sitemap files.

dispatcher/src/conf.dispatcher.d/filters/filters.any

...

# Allow AEM sitemaps
/0200 { /type "allow" /path "/content/*" /selectors '(sitemap-index|sitemap)' /extension "xml" }
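The rule’s /path, /selectors, and /extension conditions each look at a different part of the request URL. The Python sketch below shows a simplified decomposition of the sitemap URLs into those parts. It assumes no suffix and no dots before the last path segment, it does not reproduce Dispatcher’s matching logic, and the function name is hypothetical.

```python
def decompose(url_path: str) -> tuple[str, str, str]:
    """Split a URL path into (path, selectors, extension), ignoring any suffix."""
    head, _, last_segment = url_path.rpartition("/")
    name, *rest = last_segment.split(".")
    extension = rest[-1] if rest else ""   # text after the final dot
    selectors = ".".join(rest[:-1])        # everything between name and extension
    return f"{head}/{name}", selectors, extension


if __name__ == "__main__":
    for url in (
        "/content/wknd/us.sitemap.xml",
        "/content/wknd/us.sitemap-index.xml",
        "/content/wknd/us.html",
    ):
        print(url, "->", decompose(url))
```

For the two sitemap URLs, the selectors part is sitemap or sitemap-index and the extension is xml, which are the values the allow rule above is written against.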

Apache webserver rewrite rule

Ensure .xml sitemap HTTP requests are routed to the correct underlying AEM page. If URL shortening is not used, or Sling Mappings are used to achieve URL shortening, then this configuration is not needed.

dispatcher/src/conf.d/rewrites/rewrite.rules

...
RewriteCond %{REQUEST_URI} (\.html|\.jpe?g|\.png|\.svg|\.xml)$
RewriteRule ^/(.*)$ /content/${CONTENT_FOLDER_NAME}/$1 [PT,L]
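The effect of this condition and rule on a shortened sitemap URL can be approximated in Python. This sketch assumes CONTENT_FOLDER_NAME resolves to wknd, writes the extension check with the dots escaped so they match literally, and ignores any other conditions hidden behind the elided part of rewrite.rules; it is an approximation, not Apache’s mod_rewrite.

```python
import re

CONTENT_FOLDER_NAME = "wknd"  # assumed value of the Dispatcher variable

# Extension check equivalent in intent to the RewriteCond above.
EXTENSION_CONDITION = re.compile(r"\.(html|jpe?g|png|svg|xml)$")


def rewrite(request_uri: str) -> str:
    """Prefix the content root when the URI ends in a handled extension."""
    if EXTENSION_CONDITION.search(request_uri) is None:
        return request_uri
    return re.sub(r"^/(.*)$", rf"/content/{CONTENT_FOLDER_NAME}/\1", request_uri)


if __name__ == "__main__":
    for uri in ("/us/en.sitemap.xml", "/us/en.sitemap-index.xml", "/us/en.json"):
        print(uri, "->", rewrite(uri))
```

A stricter variant could test only for .sitemap.xml and .sitemap-index.xml endings, as the transcript suggests for projects that serve other XML requests.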

Resources
