Navigating the multi-stage transformation pipeline: A strategic guide to managing data transformations across Adobe's ecosystem
Adobe’s data collection process spans multiple stages, each capable of transforming your data, which makes it powerful but also complex when issues arise. This article explains what happens at each step and offers a practical framework for deciding where data quality rules should live so you can reduce confusion, build trust in your reporting, and resolve issues faster.
If you’ve worked with Adobe’s data collection stack for any length of time, you’ve likely seen it: a metric looks wrong, a field is missing, or a segment returns unexpected results. When you start investigating, you realize the change could have happened in Launch, Datastream, AEP, Data Distiller, or CJA. It becomes an archaeological dig through multiple systems to figure out where the data shifted.
This isn't a flaw in Adobe's design so much as a consequence of its flexibility. The platform gives you transformation capabilities at nearly every stage, which is powerful — but without a clear strategy for what belongs where, that flexibility becomes a source of persistent confusion and technical debt.
This article aims to help you develop that strategy. By the end, you'll have a practical framework for deciding where specific data quality work belongs across the pipeline, how to avoid common placement mistakes, and how to structure your setup so that future debugging doesn't require an expedition.
To make sense of the full picture, it helps to first understand why Adobe's pipeline doesn't behave like a traditional ETL process — and then look at a model for thinking about where responsibility for data quality should sit at each stage.
First, let's talk about “E-T-T-T-L-T-L-T”
Before we dive in, it's worth addressing what I mean by this acronym and why traditional ETL thinking doesn't map cleanly here.
Breaking down the E-T-T-T-L-T-L-T:
- E (Extract): Data is extracted from user interactions — the browser captures user behavior, page context, and interaction events
- T (Transform 1): Adobe Launch applies business rules and validation
- T (Transform 2): Dynamic Datastream filters and gates
- T (Transform 3): Datastream server-side maps and normalizes
- L (Load 1): Data is loaded into AEP Raw Dataset
- T (Transform 4): Data Distiller transforms raw to curated
- L (Load 2): Curated data is loaded/connected to CJA
- T (Transform 5): CJA Data views apply semantic adjustments
Why traditional ETL thinking fails here
Adobe Launch isn't “extracting” in the traditional Data Warehouse sense — it's capturing and generating event data based on user interactions. The extraction is passive (reading DOM, cookies, dataLayer), but the primary function is event production and business rule application.
Loading happens multiple times:
- Ingestion into AEP Raw Dataset — First persistent storage
- Materialization of Curated Dataset — Data Distiller creates new datasets
- Connection to CJA — Curated dataset made available to CJA (more of a “connect” than a “load”)
A more practical mental model
While E-T-T-T-L-T-L-T accurately describes the technical flow, thinking in terms of stages and their purposes is often more useful in day-to-day work:
- Stage 1 (Launch): Collection & Business Rules
- Stage 2 (Dynamic Datastream): Ingestion Filtering
- Stage 3 (Datastream): Pre-Storage Normalization
- Stage 4 (AEP): Dataset Curation
- Stage 5 (CJA): Presentation Layer
The key insight isn't perfect terminology — it's recognizing that you have transformation capabilities at five distinct stages, and having a strategy for which transformations belong where. That's what this guide is about.
The guiding principles
Before looking at specific stages, it's useful to establish a few foundational principles that tend to hold across well-structured Adobe implementations:
1. Raw = immutable
Your raw event data is generally best stored exactly as it arrives. This supports auditability, enables backfilling, and gives you a reliable debugging foundation. Destructive changes at the raw layer create problems that are difficult to trace later.
2. Single source of truth
Production reports and downstream processes are ideally read from one curated dataset layer (typically in AEP). This is your “golden” dataset — the version everyone trusts.
3. Transform once, present many
Correcting and standardizing data at one point in your pipeline (preferably the curated dataset layer) keeps things simpler. Reporting tools are then left to apply presentational or contextual adjustments only.
4. Clear ownership
Every layer benefits from a clear owner — whether that's client-side developers, data engineering, or analytics teams. When ownership is ambiguous, data issues tend to slip through unaddressed.
5. Observability & data quality KPIs
Automated checks for completeness, schema conformance, duplicates, and latency help issues surface early. Monitoring is most effective when it's built in from the start rather than added reactively.
The strategic framework: What should happen where?
Here's a breakdown of each stage with guidance on its purpose and the transformations that tend to belong there.
The foundation: XDM schema as your data quality contract
Before you implement any transformations, it helps to have a solid foundation. Your XDM (Experience Data Model) schema isn't just documentation — it's your first and most important line of defense for data quality.
Think of your XDM schema as a contract between all stages of your pipeline. It defines what data you expect to receive, what format it should be in, what constraints it must satisfy, and which fields are required versus optional. Everything else in your pipeline either enforces this contract or builds upon it.
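As a rough illustration, the contract idea can be sketched in code. This is not actual XDM tooling (real validation happens in AEP against your registered schema), and the field names and constraints below are hypothetical:

```python
# Minimal sketch of a schema-as-contract check (illustrative only;
# real XDM validation is enforced in AEP against the registered schema).
# Field names and constraints are hypothetical examples.

SCHEMA = {
    "transactionID": {"type": str, "required": True},
    "revenue":       {"type": float, "required": True, "min": 0.0},
    "currencyCode":  {"type": str, "required": False, "exact_len": 3},
}

def validate_event(event: dict) -> list[str]:
    """Return a list of contract violations (empty list = valid)."""
    errors = []
    for field, rules in SCHEMA.items():
        if field not in event:
            if rules["required"]:
                errors.append(f"missing required field: {field}")
            continue
        value = event[field]
        if not isinstance(value, rules["type"]):
            errors.append(f"wrong type for {field}: {type(value).__name__}")
            continue
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field} below minimum: {value}")
        if "exact_len" in rules and len(value) != rules["exact_len"]:
            errors.append(f"{field} has unexpected length: {value!r}")
    return errors

print(validate_event({"transactionID": "T-1001", "revenue": 49.90, "currencyCode": "CHF"}))  # []
print(validate_event({"revenue": -5.0}))  # missing ID + negative revenue
```

The payoff of treating the schema this way is that every later stage can assume the contract holds, instead of re-checking the same basics.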
Stage 1: Adobe Launch (collection & business rules)
Primary purpose: Capture user interactions and apply business logic for event firing
What tends to work well here:
- Business rules and conditional logic — This is where you define what events fire when and under what conditions (for example, “fire purchase event only when order confirmation page loads AND transaction ID exists”)
- Schema validation (ensure the required fields are present)
- Minimal field sanitization (trim whitespace, basic normalization like lowercasing)
- Timestamp generation
- Consent/opt-out verification
What to approach with caution:
- Heavy enrichment or lookups
- Complex joins or concatenation
- PII persistence
- Resource-intensive operations (this runs on your user's device)
Owner: Frontend/Tagging Team
Launch is where your business logic lives — the conditional firing of events based on user behavior and context. This is appropriate here because you're capturing the business intent at the moment it happens. Keeping the execution lightweight protects page performance.
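To make the pattern concrete, here is a sketch of that conditional-firing rule. In a real implementation this lives as a Launch rule condition (UI configuration or client-side JavaScript); the Python below, with its hypothetical page path and dataLayer shape, only illustrates the logic:

```python
# Illustrative sketch of a Launch-style business rule. In practice this
# is configured in the Launch UI or written as client-side JavaScript;
# the URL pattern and dataLayer shape here are hypothetical.

def should_fire_purchase(page_url: str, data_layer: dict) -> bool:
    """Fire the purchase event only on the order confirmation page
    AND when a transaction object with an ID exists."""
    on_confirmation = "/checkout/confirmation" in page_url
    txn = data_layer.get("transaction") or {}
    return on_confirmation and bool(txn.get("id"))

print(should_fire_purchase("/checkout/confirmation", {"transaction": {"id": "T-1001"}}))  # True
print(should_fire_purchase("/checkout/confirmation", {}))                                 # False
```

Keeping the check this cheap is the point: it runs on the user's device, so it should decide whether to fire, not enrich or transform.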
Stage 2: Dynamic datastream configuration (ingestion filtering)
Primary purpose: Gatekeeping at the ingestion level and feature management
What tends to work well here:
- Filter obvious noise events (for example, crawlers and bots)
- Exclude deprecated events — When you've decided to stop tracking certain events (feature toggles, sunset features, deprecated tracking), filtering them here rather than removing instrumentation keeps re-enablement straightforward
- Initial privacy masking (drop raw PII fields before forwarding)
- Consent enforcement as a safety net
Owner: Tagging/Platform Team in collaboration with Privacy/Legal
This stage acts as your bouncer, keeping bad or unwanted data from entering your storage and processing systems. It's also well-suited for managing the lifecycle of tracking: when a feature is deprecated, excluding it here avoids touching Launch code while keeping the option to re-enable later.
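The gatekeeping idea can be sketched as follows. Actual filtering is configured in the Dynamic Datastream, not written as application code; the event names and user-agent heuristic here are hypothetical:

```python
# Sketch of ingestion-level gatekeeping: drop bot traffic and
# deprecated events before they reach storage. Event names and the
# user-agent heuristic are hypothetical; real filtering is configured
# in the Dynamic Datastream, not coded.

DEPRECATED_EVENTS = {"legacy_checkout_v1"}   # sunset events stay filterable for easy re-enablement
BOT_MARKERS = ("bot", "crawler", "spider")

def admit_event(event: dict) -> bool:
    ua = event.get("userAgent", "").lower()
    if any(marker in ua for marker in BOT_MARKERS):
        return False  # obvious noise traffic
    if event.get("eventType") in DEPRECATED_EVENTS:
        return False  # deprecated tracking, excluded without touching Launch
    return True

events = [
    {"eventType": "purchase", "userAgent": "Mozilla/5.0"},
    {"eventType": "purchase", "userAgent": "Googlebot/2.1"},
    {"eventType": "legacy_checkout_v1", "userAgent": "Mozilla/5.0"},
]
admitted = [e for e in events if admit_event(e)]
print(len(admitted))  # 1
```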
Stage 3: Adobe datastream (pre-storage normalization)
Primary purpose: Field mapping and fixed value assignment before data lands in storage
What tends to work well here:
- Field mapping — Map fields to Adobe Analytics or from client-side to server-side XDM (if not the same schema)
- Set fixed values — Add constants like environment indicators, data source identifiers, or processing timestamps
- Basic normalization to match schema expectations (for example, date formats)
- Server-side timestamp generation
- Hashing/anonymization of PII before storage (if collected)
- Lightweight enrichment (geo lookups, device lookups)
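As an illustration of the mapping-plus-fixed-values pattern, here is a sketch of the logic. In practice this is configured in the Datastream mapper rather than coded; the source and target field names are hypothetical:

```python
import hashlib

# Sketch of pre-storage normalization: field mapping, fixed values,
# and PII hashing. The mapping table and field names are hypothetical;
# in practice this is Datastream mapper configuration, not code.

FIELD_MAP = {
    "txn_id":      "commerce.order.purchaseID",
    "order_total": "commerce.order.priceTotal",
}

def normalize(event: dict) -> dict:
    out = {}
    for src, dst in FIELD_MAP.items():
        if src in event:
            out[dst] = event[src]     # rename to the server-side XDM path
    # Fixed values stamped on at ingestion time
    out["dataSource"] = "web"
    out["environment"] = "production"
    # Hash PII before it ever lands in storage
    if "email" in event:
        out["emailHash"] = hashlib.sha256(event["email"].lower().encode()).hexdigest()
    return out

row = normalize({"txn_id": "T-1001", "order_total": 49.90, "email": "Anna@example.com"})
print(row["dataSource"], row["commerce.order.purchaseID"])  # web T-1001
```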
The power and the pitfall:
Datastream is quite capable — you can do extensive transformations here, including complex conditional logic. However, it's often a forgotten layer, and that's where the risk lies. Transformations buried in Datastream configuration can be easy to overlook and harder to troubleshoot than equivalent logic in a more prominent system.
A pragmatic approach: use Datastream for straightforward mappings and basic normalization. Complex business rules and conditional logic tend to be more visible, testable, and maintainable in AEP Data Distiller.
Owner: Data Engineering / Platform Team
Stage 4: Adobe Experience Platform (dataset curation)
This stage of the framework applies if you work with Data Distiller.
Primary purpose: Transform raw datasets into analysis-ready curated datasets
Here's how AEP works in practice:
Raw dataset: This is your immutable source — the data exactly as it arrives from data collection (after Datastream processing). This is what gets ingested into AEP.
Curated dataset(s): Using Data Distiller, you transform your raw dataset into curated datasets prepared for Customer Journey Analytics. Because you typically don't need all raw data flowing into CJA, this is where you make deliberate choices about what to carry forward.
What tends to work well here:
- Data selection and filtering — Choose which events and fields CJA actually needs
- Identity resolution — Merging anonymous and known identities happens at the platform level via the Identity Graph (not as a Distiller function by itself, but the curation job can leverage resolved identities when preparing datasets)
- Historical corrections and backfills when errors are discovered
- Complex aggregations or pre-calculations for performance
- Data type standardization and cleanup
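The selection-and-standardization step can be sketched like this. A real curation job would be a Data Distiller SQL query; this Python version, with hypothetical event and field names, only mirrors the logic:

```python
# Python sketch of a curation step (in practice a Data Distiller SQL
# job). Keeps only the event types CJA needs and standardizes types;
# event and field names are hypothetical.

CJA_EVENTS = {"purchase", "product_view", "add_to_cart"}

def curate(raw_rows: list[dict]) -> list[dict]:
    curated = []
    for row in raw_rows:
        if row.get("eventType") not in CJA_EVENTS:
            continue  # diagnostic/internal events stay in raw only
        revenue = float(row.get("revenue") or 0.0)  # standardize type
        curated.append({
            "eventType": row["eventType"],
            "revenue": revenue,
            "validTransaction": revenue > 0,  # simple business-rule flag
        })
    return curated

raw = [
    {"eventType": "purchase", "revenue": "49.90"},  # string revenue in raw
    {"eventType": "debug_ping"},                    # not needed by CJA
]
print(curate(raw))
```

The raw rows are never modified; the curated list is a new, smaller, typed dataset, which is exactly the raw-immutable / curated-golden split described above.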
Critical consideration: Dataset duplication and licensing
When you maintain both raw and curated datasets with similar data, you're effectively storing data twice. It's worth reviewing your AEP license — you're typically licensed for total volume/rows. The delta between raw and curated is your “waste” from a storage perspective, but also your insurance from an auditability perspective.
Owner: Data Engineering + Analytics (joint ownership), with governance oversight
Stage 5: Customer Journey Analytics (presentation layer)
Primary purpose: Semantic layer and self-service enablement
What tends to work well here:
- Derived fields — CJA's derived field functionality lets you create new dimensions and metrics at the reporting layer. These are semantic and presentational in nature — they help users work with data in familiar business terms, but they're not pipeline transformations in the same sense as Data Distiller jobs
- Calculated metrics for business KPIs
- User-friendly component naming and descriptions
- Report-specific filters and segments
- Attribution model application
- Component organization for ease of discovery
The self-service perspective:
CJA Data views are your semantic layer. You're not changing the underlying data; you're creating a business-friendly interface to technical data. Think of it this way: the lake contains technical field names like commerce.purchases.value, and Data views translate these into business concepts like “Revenue.”
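The derived-field idea (CASE WHEN-style classification applied at read time) can be sketched as follows. CJA derived fields are configured in the Data view, not coded; the SKU prefixes here are invented for illustration:

```python
# Sketch of the semantic translation a Data view performs: technical
# field values mapped to business-friendly labels at report time,
# without changing stored data. SKU prefixes are hypothetical.

def product_line(sku: str) -> str:
    """CASE WHEN-style derived field: classify a SKU by its prefix."""
    if sku.startswith("APP-"):
        return "Apparel"
    if sku.startswith("ELE-"):
        return "Electronics"
    return "Other"

print(product_line("APP-1234"))  # Apparel
print(product_line("XYZ-0001"))  # Other
```

Nothing in the lake changes; the same stored SKU simply surfaces under a business label, which is the essence of a semantic layer.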
Consider being careful about:
- Over-transforming data that should have been fixed upstream
- Creating so many variations that users don't know which to use
- Duplicating transformations across multiple Data views (use shared dimensions and metrics)
- Using Data views as a band-aid for poor data quality upstream
Owner: Analytics/Reporting Team
CJA's flexibility is both its strength and its challenge. You can do a lot here, but it's worth asking: “Am I making this more self-service friendly, or am I compensating for problems that should be fixed in the curated dataset?” The former builds capability; the latter accumulates technical debt.
Practical example: The full E-T-T-T-L-T-L-T workflow
Let's walk through a realistic example of how data flows through the pipeline when a user completes a purchase:
Extract — User interaction
- User completes purchase on website
- The browser captures event context (page URL, timestamp, user agent)
Transform 1 — Collection (Launch)
- Launch evaluates business rules: “Is this checkout confirmation page AND does a transaction object exist?”
- If conditions are met, fires purchase event with order details
- Validates required fields (transaction ID, revenue, product array)
- Checks consent status (user has analytics consent)
- Sets client-side timestamp
- Sends event to Dynamic Datastream
Transform 2 — Filtering (dynamic datastream)
- Checks event type and source
- Excludes deprecated “legacy_checkout_v1” events (feature sunsetted 3 months ago)
- Excludes known bots and crawlers
- Forwards purchase event (passes all filters) to Datastream
Transform 3 — Normalization (datastream server-side)
- Maps field names (for example, ECID as persistent ID)
- Sets fixed values: dataSource = “web”, environment = “production”
- Normalizes date formats
- Hashes email addresses
Load 1 — Ingestion (AEP raw dataset)
- Structured event ingested into AEP
- Stored in the raw dataset exactly as received from Datastream
- Immutable record preserved for audit and replay
Transform 4 — Curation (AEP data distiller)
Raw Dataset → Curated Dataset transformation (for example, runs hourly):
- Filters to events needed for CJA: keeps purchase, product_view, add_to_cart; excludes diagnostic events
- Identity resolution: the curation job works with Identity Graph-resolved identities to link anonymous web visitors with known customers
- Deduplication: removes duplicate purchase events (same transactionID within 5-minute window)
- Business rule application: flags “valid_transaction” (revenue > 0, has shipping address)
- Historical correction: applies retroactive fix for currency bug discovered last week
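The deduplication rule above can be sketched as follows, assuming epoch-second timestamps and hypothetical field names (a real job would express this in Data Distiller SQL):

```python
# Sketch of the curation job's dedup rule: drop repeat purchase events
# with the same transactionID arriving within a 5-minute window.
# Timestamps are epoch seconds; field names are hypothetical.

WINDOW_SECONDS = 5 * 60

def dedupe(events: list[dict]) -> list[dict]:
    last_kept: dict[str, float] = {}
    kept = []
    for e in sorted(events, key=lambda ev: ev["ts"]):
        txn = e["transactionID"]
        if txn in last_kept and e["ts"] - last_kept[txn] < WINDOW_SECONDS:
            continue  # duplicate inside the window, drop it
        last_kept[txn] = e["ts"]
        kept.append(e)
    return kept

events = [
    {"transactionID": "T-1", "ts": 1000.0},
    {"transactionID": "T-1", "ts": 1120.0},  # within 5 min of the kept event: dropped
    {"transactionID": "T-1", "ts": 1500.0},  # outside the window: kept
]
print(len(dedupe(events)))  # 2
```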
Load 2 — Connection (CJA dataset)
- “web_events_curated” dataset (smaller than raw, optimized for CJA)
- Connected to CJA as data source
- Available for Data view configuration
Transform 5 — Presentation (CJA data views)
Business users access curated data through a familiar interface:
- Derived field: “Product Line” extracted from SKU using CASE WHEN logic — a semantic label created at the reporting layer, not a pipeline transformation
- Reference data join: adds product category, brand, and margin data from product catalog (lookup dataset)
- Calculated metric: “Average Order Value” = Total Revenue / Number of Orders
- Component renamed: Technical marketing.trackingCode appears as user-friendly “Campaign Source”
- Attribution applied: Last-touch attribution model for revenue credit
- Segment created: “High-Value Customers” (lifetime revenue > CHF1000)
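The “Average Order Value” calculated metric above reduces to a simple ratio. A minimal sketch, with a zero-order guard added as an assumption (in CJA this is defined once in the Data view, not coded):

```python
# Sketch of the "Average Order Value" calculated metric: defined once,
# computed at report time from aggregates. Zero-order handling is an
# assumption added for safety, not taken from the CJA definition.

def average_order_value(total_revenue: float, order_count: int) -> float:
    return round(total_revenue / order_count, 2) if order_count else 0.0

print(average_order_value(1497.0, 30))  # 49.9
```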
Quick wins: Control rules you can implement today
- XDM Schema Audit: Review your current schema and add type constraints, patterns, and min/max values where missing
- Business Rule Documentation: Document every conditional event firing rule in Launch with clear comments explaining business logic (using API, notes within Adobe Launch or a manually maintained change-log)
- Required Fields Check: Alert when completion rate falls below 100% for critical fields in the raw dataset
- Deprecation Log: Maintain a log of events excluded in Dynamic Datastream with reasons, dates, and re-enablement conditions
- Cardinality Monitoring: Alert on sudden spikes in unique page IDs, user IDs, or product SKUs (often indicates tagging errors)
- Data view Naming Convention: Standardize component naming (for example, all calculated metrics start with “Calc:”, all segments with “Seg:”)
- Storage Audit Dashboard: Monthly review of raw vs. curated dataset sizes, row counts, and cost projections
- Transformation Count by Stage: Track how many transformations exist at each stage (a useful heuristic: most business logic should live in AEP curation)
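Two of these checks, required-field completeness and cardinality spike detection, can be sketched as simple functions. Thresholds and field names are hypothetical; in practice these would run as scheduled queries or alerts against the raw dataset:

```python
# Sketch of two quick-win monitoring checks. The 3x spike factor and
# field names are hypothetical; real checks would be scheduled queries
# against the raw dataset with alerting attached.

def completeness(rows: list[dict], field: str) -> float:
    """Share of rows where a critical field is present and non-empty."""
    if not rows:
        return 1.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def cardinality_spike(today_uniques: int, baseline_uniques: int, factor: float = 3.0) -> bool:
    """Alert when unique-value counts jump well past the baseline."""
    return today_uniques > baseline_uniques * factor

rows = [{"transactionID": "T-1"}, {"transactionID": ""}, {"transactionID": "T-3"}]
print(completeness(rows, "transactionID"))  # ~0.667
print(cardinality_spike(5000, 1200))        # True
```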
Conclusion: Master your pipeline
Adobe's data collection ecosystem gives you genuine flexibility to transform data at multiple stages. The challenge — and the opportunity — is learning to use that flexibility with intention.
The goal this framework works toward is a setup where business users work with familiar concepts and trustworthy numbers, without needing to understand the eight-step technical pipeline behind them. When that's working well, data quality becomes invisible to end users — not because problems don't exist, but because they're being caught and handled at the right layer.
A few principles worth keeping in mind as you apply this:
- Start with a schema — define your data quality contract before anything else
- Keep raw data as immutable as possible
- Handle structural corrections in AEP curation, where they're most visible and testable
- Use CJA for semantic translation, not data cleanup
- Document every transformation — your future self (and your colleagues) will thank you
- Monitor your costs: raw + curated typically means 2x storage
Following this kind of staged approach won't eliminate all data quality problems, but it can dramatically reduce the time spent tracking them down — and give you a clearer picture of where to look when something does go wrong.