Report on LLM and AI-generated traffic

This use case article explores how to use the Customer Journey Analytics derived fields capability as a foundation to report on LLM (Large Language Model) and AI-generated traffic.

NOTE
The effectiveness of the detection methods, detection signatures and implementation strategies depends on your specific data collection method, Experience Platform dataset coverage, and Customer Journey Analytics implementation. Results may vary based on your technical environment, data governance policies, and implementation approach. When using Experience Edge, you’ll need to choose between recording the raw User Agent string or collecting device information.

Detection methods

To detect LLM and AI-generated traffic, distinguish between:

  • LLM crawlers: Collect data for training and retrieval augmented generation (RAG).
  • AI agents: Function as interfaces that perform tasks on behalf of humans. AI agents prefer to interact via APIs, which bypasses web analytics tracking methods. Nonetheless, you can still analyze a significant portion of AI-generated traffic through websites.

Three common core detection methods to identify and monitor LLM and AI-generated traffic are:

  • User agent identification: When a request is made to your server, the HTTP User-Agent header is extracted and analyzed against known AI crawler and agent patterns. This server-side method requires access to HTTP headers and is most effective when implemented at the data collection layer.
  • Referrer classification: The HTTP Referer header (spelled with a single "r" in the HTTP specification) contains the URL of the previous webpage that linked to the current request. This header reveals when users click through to your site from web interfaces like ChatGPT or Perplexity.
  • Query parameter detection: AI services can append URL parameters (particularly UTM parameters) to links. These parameters persist in the URL and can be detected through standard analytics implementations, making these URL parameters valuable indicators even in client-side tracking scenarios.

The following table illustrates how the detection methods can be used against different LLM and AI interaction scenarios.

| Scenario | User agent identification | Referrer classification | Query parameter detection |
| --- | --- | --- | --- |
| Training of a model | Agent (GPTBot, ClaudeBot, and more) can be identified when server-side logging is implemented. | No classification is possible. AI crawlers don’t generate referrers during training. | Detection is impossible. AI crawlers do not add parameters during training. |
| Agent browsing | Agent (ChatGPT-User, claude-web) can be identified when server-side logging captures headers. | Classification is possible if the agent navigates from an AI interface with referrer preservation. | Detection is sometimes possible if the AI service adds tracking parameters. |
| Retrieval augmented generation (RAG) to answer query | Agent (OAI-SearchBot, PerplexityBot) can be identified with server-side logging. | No classification is typically possible as RAG operations often bypass referrer mechanisms. | Detection is rarely possible unless specifically implemented by the AI provider. |
| User clicks through | The agent cannot be identified. AI agent appears as a normal user agent. | Classification is possible when users click links from AI interfaces (chatgpt.com, claude.ai, and more). | Detection is possible when AI services add UTM parameters to outbound links. |
| Traffic visibility conditions | Require server-side logging integration with Customer Journey Analytics or server-side tagging for agent identification. | Classification depends on AI platform referrer policies and proper HTTP header transmission. | Detection requires parameter preservation through redirects and proper URL parameter collection. |

Challenges

LLM and AI agents behave in complex and evolving ways when they interact with digital properties. Their behavior is inconsistent across platforms and versions and depends on the specific AI platform, version, and interaction mode used. This operational diversity makes it difficult to track and categorize LLM and AI-generated traffic within standard analytics frameworks, and it requires nuanced detection and classification methods to maintain data integrity. Specific challenges include:

  • Partial data collection: Some newer AI agents execute limited JavaScript, resulting in incomplete analytics data for client-side implementations. As a result, some interactions are tracked while other interactions are missed.
  • Inconsistent session data: AI agents might execute JavaScript differently across sessions or page types. This execution difference creates fragmented user journeys in Customer Journey Analytics for client-side implementations.
  • Detection challenges: With partial tracking, detection becomes unreliable as certain touchpoints might be invisible to analytics.

Detection signatures

As of August 2025, the following specific signals can be identified for each of the detection methods.

User agent identification

| Crawler | User Agent String | Purpose/Behavior |
| --- | --- | --- |
| GPTBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot | OpenAI's primary web crawler for training ChatGPT and language models |
| ChatGPT-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | Used when ChatGPT browses websites on behalf of users (legacy) |
| ChatGPT-User v2 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/2.0; +https://openai.com/bot | ChatGPT's updated version for on-demand fetching and in-response lookups |
| OAI-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot | ChatGPT's search-focused crawler for discovering content |
| ClaudeBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ClaudeBot/1.0; +claudebot@anthropic.com | Anthropic's crawler for training and updating the Claude AI assistant |
| Claude-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-User/1.0; +Claude-User@anthropic.com) | Supports Claude AI users when individuals ask questions to Claude, it may access websites using a Cl... |
| Claude-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-SearchBot/1.0; +Claude-SearchBot@anthropic.com) | Navigates the web to improve search result quality for Claude AI users by analyzing online content t... |
| PerplexityBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | Perplexity.ai's crawler for real-time web data indexing |
| Perplexity-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://www.perplexity.ai/useragent) | Loads pages when users click Perplexity citations (bypasses robots.txt) |
| Google-Extended | Mozilla/5.0 (compatible; Google-Extended/1.0; +http://www.google.com/bot.html) | Google's AI-focused crawler for Gemini separate from standard Googlebot |
| BingBot | Mozilla/5.0 (compatible; BingBot/1.0; +http://www.bing.com/bot.html) | Microsoft's crawler powering Bing Search and Bing Chat (Copilot) |
| DuckAssistBot | Mozilla/5.0 (compatible; DuckAssistBot/1.0; +http://www.duckduckgo.com/bot.html) | Scrapes content for DuckAssist, DuckDuckGo's private AI answer feature |
| YouBot | Mozilla/5.0 (compatible; YouBot (+http://www.you.com)) | Crawler behind You.com's AI search and browser assistant |
| meta-externalagent | Mozilla/5.0 (compatible; meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)) | Meta's bot for collecting data to train or fine-tune LLMs |
| Amazonbot | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) | Amazon's crawler for search and AI applications |
| Applebot | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot) | Apple's crawler for Spotlight, Siri, and Safari |
| Applebot-Extended | Mozilla/5.0 (compatible; Applebot-Extended/1.0; +http://www.apple.com/bot.html) | Apple's AI-focused crawler for future AI models (opt-in) |
| Bytespider | Mozilla/5.0 (compatible; Bytespider/1.0; +http://www.bytedance.com/bot.html) | ByteDance's AI data collector for TikTok and other services |
| MistralAI-User | Mozilla/5.0 (compatible; MistralAI-User/1.0; +https://mistral.ai/bot) | Mistral's real-time citation fetcher for "Le Chat" assistant |
| cohere-ai | Mozilla/5.0 (compatible; cohere-ai/1.0; +http://www.cohere.ai/bot.html) | Collects textual data for Cohere's language models |
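
Outside of Customer Journey Analytics, the matching logic behind user agent identification can be sketched in a few lines of Python. This is a minimal illustration only, not the product's implementation: the function name `classify_user_agent` and the token list are assumptions made for this example, with the tokens copied from the Crawler column above. Matching is done on the product token rather than the full string, because version numbers in these strings change over time.

```python
# Tokens copied from the Crawler column of the table above.
# "Applebot-Extended" is listed before "Applebot" so the more specific
# token wins when both would match as substrings.
AI_USER_AGENT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Perplexity-User", "Google-Extended",
    "BingBot", "DuckAssistBot", "YouBot", "meta-externalagent", "Amazonbot",
    "Applebot-Extended", "Applebot", "Bytespider", "MistralAI-User", "cohere-ai",
]


def classify_user_agent(user_agent):
    """Return the first known AI token found in the string, or None.

    Hypothetical helper for illustration; the comparison is case-insensitive.
    """
    if not user_agent:
        return None
    lowered = user_agent.lower()
    for token in AI_USER_AGENT_TOKENS:
        if token.lower() in lowered:
            return token
    return None


if __name__ == "__main__":
    sample = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
              "compatible; GPTBot/1.1; +https://openai.com/gptbot")
    print(classify_user_agent(sample))
```

Because BingBot also powers regular Bing Search, you may want to report on it separately from crawlers that are used only for AI purposes.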

Referrer classification

| Source | Referrer | Traffic Type |
| --- | --- | --- |
| ChatGPT | chatgpt.com | Direct traffic from ChatGPT interface |
| Claude | claude.ai | Traffic from Anthropic's Claude interface |
| Google Gemini | gemini.google.com | Traffic from Google's AI assistant |
| Microsoft Copilot | copilot.microsoft.com | Traffic from Microsoft's AI assistant |
| Microsoft Copilot | m365.cloud.microsoft | Traffic from Microsoft's AI assistant (Microsoft 365 cloud services) |
| Perplexity AI | perplexity.ai | Traffic from AI search with citations |
| Meta AI | meta.ai | Traffic from Meta's AI assistant |
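
As a rough sketch of the same idea in code, the following Python example extracts the host from a referrer URL and looks it up against the domains listed above. The function name `classify_referrer` is hypothetical, and treating subdomains such as `www.perplexity.ai` as a match is an assumption of this example rather than documented behavior.

```python
from urllib.parse import urlsplit

# Referrer domains and source names copied from the table above.
AI_REFERRER_DOMAINS = {
    "chatgpt.com": "ChatGPT",
    "claude.ai": "Claude",
    "gemini.google.com": "Google Gemini",
    "copilot.microsoft.com": "Microsoft Copilot",
    "m365.cloud.microsoft": "Microsoft Copilot",
    "perplexity.ai": "Perplexity AI",
    "meta.ai": "Meta AI",
}


def classify_referrer(referrer):
    """Return the AI source for a full referrer URL, or None.

    Hypothetical helper. Expects a URL with a scheme (for example
    "https://chatgpt.com/"); a bare domain has no parsable host here.
    """
    if not referrer:
        return None
    host = (urlsplit(referrer).hostname or "").lower()
    for domain, source in AI_REFERRER_DOMAINS.items():
        # Assumption: exact domain or any subdomain of it counts as a match.
        if host == domain or host.endswith("." + domain):
            return source
    return None


if __name__ == "__main__":
    print(classify_referrer("https://chatgpt.com/"))
```

Comparing on the parsed host rather than searching the whole URL avoids false positives from lookalike domains or from AI domain names that appear in a path or query string.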

Query parameter detection

| LLM Service | Example URL | Query Parameter | Example Value |
| --- | --- | --- | --- |
| ChatGPT | https://www.yoursite.com/product?utm_source=chatgpt.com | utm_source | chatgpt.com |
| Perplexity | https://www.yoursite.com/article?utm_source=perplexity | utm_source | perplexity |
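
To illustrate the parsing step, here is a minimal Python sketch that reads `utm_source` from a page URL and maps it to a service. The function name `detect_ai_utm_source` is an assumption for this example, and the lookup contains only the two values shown in the table above; other services may use different values or none at all.

```python
from urllib.parse import parse_qs, urlsplit

# utm_source values copied from the table above.
AI_UTM_SOURCES = {
    "chatgpt.com": "ChatGPT",
    "perplexity": "Perplexity",
}


def detect_ai_utm_source(page_url):
    """Return the LLM service named by utm_source, or None.

    Hypothetical helper for illustration; values are compared
    case-insensitively after trimming whitespace.
    """
    if not page_url:
        return None
    query = urlsplit(page_url).query
    for value in parse_qs(query).get("utm_source", []):
        service = AI_UTM_SOURCES.get(value.strip().lower())
        if service:
            return service
    return None


if __name__ == "__main__":
    print(detect_ai_utm_source(
        "https://www.yoursite.com/product?utm_source=chatgpt.com"))
```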

Implementation

You can report on LLM and AI-generated traffic within a typical Customer Journey Analytics setup (connection, data views, and workspace projects) through the specific setup and configuration of derived fields, segments, and workspace projects.

Derived fields

To configure detection methods and detection signals, use derived fields as the foundation. For example, define derived fields for user agent identification, query parameter detection, and referrer classification.

LLM/AI user agent identification

Use the Case When derived field function to define a derived field that identifies LLM/AI user agents.

[Image: LLM/AI User Agent Identification]

LLM/AI query parameter detection

Use the URL Parse and Classify derived field functions to define a derived field that detects query parameters.

[Image: LLM/AI UTM Parameter Detection]

LLM/AI referrer classification

Use the URL Parse and Classify derived field functions to define a derived field that classifies referrers.

[Image: LLM/AI Referrer Classification]

Segments

Set up dedicated segments that help you to identify events, sessions or people related to LLM and AI-generated traffic. For example, use the derived fields that you created earlier to define a segment that identifies LLM and AI-generated traffic.

[Image: LLM and AI-generated traffic segment]
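
Segments are defined in the Customer Journey Analytics interface, not in code, but the underlying condition can be sketched as follows. This example assumes the segment should include an event when any one of the three derived fields has a value; the `Event` class and its field names are hypothetical stand-ins for your own derived fields.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    # Hypothetical names standing in for the three derived fields.
    ai_user_agent: Optional[str] = None
    ai_referrer: Optional[str] = None
    ai_utm_source: Optional[str] = None


def in_ai_traffic_segment(event: Event) -> bool:
    """True when at least one derived field carries a value (assumed OR logic)."""
    return any([event.ai_user_agent, event.ai_referrer, event.ai_utm_source])


def filter_ai_events(events: List[Event]) -> List[Event]:
    """Keep only the events that fall into the segment."""
    return [event for event in events if in_ai_traffic_segment(event)]


if __name__ == "__main__":
    events = [Event(ai_referrer="ChatGPT"), Event(), Event(ai_user_agent="GPTBot")]
    print(len(filter_ai_events(events)))
```

Combining the signals with OR keeps the segment broad. If you need to separate crawler traffic from human click-throughs, consider one segment per derived field instead.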

Workspace project

Use the derived fields and segments to report on and analyze LLM and AI-generated traffic. For example, see the annotated project below.

[Image: LLM and AI-generated traffic Workspace project]
