Best practices for configuring web crawlers

This article provides best practices for using robots.txt and sitemap.xml files in Adobe Commerce, including configuration and security. These files instruct web crawlers (typically search engine robots) how to crawl pages on a website. Configuring these files can improve site performance and search engine optimization.

NOTE
These best practices apply to projects using the native Adobe Commerce storefront only. They do not apply to Adobe Commerce projects that use other storefront solutions (for example, Adobe Experience Manager, headless).

Affected products and versions

All supported versions of:

  • Adobe Commerce on cloud infrastructure
  • Adobe Commerce on-premises

Adobe Commerce on cloud infrastructure

A default Adobe Commerce project contains a hierarchy that includes a single website, store, and store view. For more complex implementations, you can create additional websites, stores, and store views for a multi-site storefront.

Single-site storefronts

Follow these best practices when configuring the robots.txt and sitemap.xml files for single-site storefronts:

  • Make sure that your project is using ece-tools version 2002.0.12 or later.

  • Use the Admin application to add content to the robots.txt file.

    TIP
    View the auto-generated robots.txt file for your store at <domain.your.project>/robots.txt.
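
    For example, you might add directives such as the following. These rules are illustrative only, not a recommended set; adjust them to your store's catalog and SEO strategy.

    User-agent: *
    Disallow: /checkout/
    Disallow: /customer/
    Disallow: /catalogsearch/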
  • Use the Admin application to generate a sitemap.xml file.

    IMPORTANT
    Due to the read-only file system on Adobe Commerce on cloud infrastructure projects, you must specify the pub/media path before generating the file.
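
    After generation, you can confirm that the file landed in the writable location, assuming you have SSH access to the environment:

    ls -l pub/media/sitemap.xml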
  • Use a custom Fastly VCL snippet to rewrite requests from the root of your site to the pub/media/ location for both files:

    {
      "name": "sitemaprobots_rewrite",
      "dynamic": "0",
      "type": "recv",
      "priority": "90",
      "content": "if ( req.url.path ~ \"^/?sitemap.xml$\" ) { set req.url = \"pub/media/sitemap.xml\"; } else if (req.url.path ~ \"^/?robots.txt$\") { set req.url = \"pub/media/robots.txt\";}"
    }
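
    One way to upload the snippet is through the Fastly API. The following is a sketch that assumes you saved the JSON above as sitemaprobots.json and exported your Fastly API token and service ID as environment variables; you can also manage snippets from the Fastly module in the Admin.

    # List service versions and pick an editable one (clone the active version first if needed)
    curl -H "Fastly-Key: $FASTLY_API_TOKEN" https://api.fastly.com/service/$FASTLY_SERVICE_ID/version

    # Add the snippet to the editable version (replace <editable_version>)
    curl -H "Fastly-Key: $FASTLY_API_TOKEN" -H "Content-Type: application/json" \
         -X POST --data @sitemaprobots.json \
         https://api.fastly.com/service/$FASTLY_SERVICE_ID/version/<editable_version>/snippet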
    
  • Test the rewrite by viewing the files in a web browser. For example, <domain.your.project>/robots.txt and <domain.your.project>/sitemap.xml. Make sure that you request the files from the root path that you configured the rewrite for, not from a different path.
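
    You can also check from the command line; an HTTP 200 response confirms that the snippet is active:

    curl -I https://<domain.your.project>/robots.txt
    curl -I https://<domain.your.project>/sitemap.xml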

INFO
See Add site map and search engine robots for detailed instructions.

Multi-site storefronts

You can set up and run several stores with a single implementation of Adobe Commerce on cloud infrastructure. See Set up multiple websites or stores.

The same best practices for configuring the robots.txt and sitemap.xml files for single-site storefronts apply to multi-site storefronts, with two important differences:

  • Make sure that the robots.txt and sitemap.xml file names contain the names of the corresponding sites. For example:

    • domainone_robots.txt
    • domaintwo_robots.txt
    • domainone_sitemap.xml
    • domaintwo_sitemap.xml
  • Use a slightly modified custom Fastly VCL snippet to rewrite requests from the root of your sites to the pub/media location for both files across your sites:

    {
      "name": "sitemaprobots_rewrite",
      "dynamic": "0",
      "type": "recv",
      "priority": "90",
      "content": "if ( req.url.path == \"/robots.txt\" ) { if ( req.http.host ~ \"(domainone|domaintwo).com$\" ) { set req.url = \"pub/media/\" re.group.1 \"_robots.txt\"; }} else if ( req.url.path == \"/sitemap.xml\" ) { if ( req.http.host ~ \"(domainone|domaintwo).com$\" ) {  set req.url = \"pub/media/\" re.group.1 \"_sitemap.xml\"; }}"
    }
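
    As with single-site storefronts, test each domain after the snippet is active. The host names here are the placeholders from the example above:

    curl -I https://domainone.com/robots.txt
    curl -I https://domaintwo.com/sitemap.xml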
    

Adobe Commerce on-premises

Use the Admin application to configure the robots.txt and sitemap.xml files to prevent bots from scanning and indexing unnecessary content (see Search Engine Robots).

TIP
For on-premises deployments, where you write the files depends on how you installed Adobe Commerce. Write the files to /path/to/commerce/pub/media/ or /path/to/commerce/media, whichever is right for your installation.
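
If you are not sure which path applies, check your web server's document root: when it points to /path/to/commerce/pub, write the files to /path/to/commerce/pub/media/. A quick check, assuming Apache (config locations vary by distribution; adjust for nginx or other servers):

    grep -Ri "documentroot" /etc/apache2 /etc/httpd 2>/dev/null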

Security

Do not expose your Admin path in your robots.txt file. An exposed Admin path is a vulnerability that can lead to site hacking and potential data loss. Remove any entries for the Admin path from the robots.txt file.
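
For example, an entry like the following must not appear in your robots.txt file (the path shown is hypothetical). Listing a custom Admin path does not hide it from attackers; it advertises it:

    Disallow: /admin_a1b2c3/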

For steps to edit the robots.txt file and remove all entries of the Admin path, see Marketing User Guide > SEO and Search > Search Engine Robots.
