
Adobe’s data collection process spans multiple stages, each capable of transforming your data, which makes it powerful but also complex when issues arise. This article explains what happens at each step and offers a practical framework for deciding where data quality rules should live so you can reduce confusion, build trust in your reporting, and resolve issues faster.

Navigating the multi-stage transformation pipeline: A strategic guide to managing data transformations across Adobe's ecosystem

If you’ve worked with Adobe’s data collection stack for any length of time, you’ve likely seen it: a metric looks wrong, a field is missing, or a segment returns unexpected results. When you start investigating, you realize the change could have happened in Launch, Datastream, AEP, Data Distiller, or CJA. Finding where the data shifted becomes an archaeological dig through multiple systems.

This isn't a flaw in Adobe's design so much as a consequence of its flexibility. The platform gives you transformation capabilities at nearly every stage, which is powerful — but without a clear strategy for what belongs where, that flexibility becomes a source of persistent confusion and technical debt.

This article aims to help you develop that strategy. By the end, you'll have a practical framework for deciding where specific data quality work belongs across the pipeline, how to avoid common placement mistakes, and how to structure your setup so that future debugging doesn't require an expedition.

To make sense of the full picture, it helps to first understand why Adobe's pipeline doesn't behave like a traditional ETL process — and then look at a model for thinking about where responsibility for data quality should sit at each stage.

First, let's talk about “E-T-T-T-L-T-L-T”

Before we dive in, it's worth addressing what I mean by this acronym and why traditional ETL thinking doesn't map cleanly here.

Breaking down the E-T-T-T-L-T-L-T:

  1. Extract — User interaction captured by Launch

  2. Transform — Collection logic in Launch

  3. Transform — Filtering in the dynamic datastream

  4. Transform — Normalization in the datastream (server-side)

  5. Load — Ingestion into the AEP raw dataset

  6. Transform — Curation via AEP Data Distiller

  7. Load — Connection of the curated dataset to CJA

  8. Transform — Presentation in CJA Data views

Why traditional ETL thinking fails here

Adobe Launch isn't “extracting” in the traditional data-warehouse sense — it's capturing and generating event data based on user interactions. The extraction is passive (reading the DOM, cookies, and the dataLayer), but the primary function is event production and business rule application.

Loading happens multiple times:

  1. Ingestion into AEP Raw Dataset — First persistent storage

  2. Materialization of Curated Dataset — Data Distiller creates new datasets

  3. Connection to CJA — Curated dataset made available to CJA (more of a “connect” than a “load”)

A more practical mental model

While E-T-T-T-L-T-L-T accurately describes the technical flow, thinking in terms of stages and their purposes is often more useful in day-to-day work:

  1. Adobe Launch — collection & business rules

  2. Dynamic datastream configuration — ingestion filtering

  3. Adobe datastream — pre-storage normalization

  4. Adobe Experience Platform (Data Distiller) — dataset curation

  5. Customer Journey Analytics — presentation layer

The key insight isn't perfect terminology — it's recognizing that you have transformation capabilities at five distinct stages, and having a strategy for which transformations belong where. That's what this guide is about.

The guiding principles

Before looking at specific stages, it's useful to establish a few foundational principles that tend to hold across well-structured Adobe implementations:

1. Raw = immutable

Your raw event data is generally best stored exactly as it arrives. This supports auditability, enables backfilling, and gives you a reliable debugging foundation. Destructive changes at the raw layer create problems that are difficult to trace later.

2. Single source of truth

Production reports and downstream processes ideally read from one curated dataset layer (typically in AEP). This is your “golden” dataset — the version everyone trusts.

3. Transform once, present many

Correcting and standardizing data at one point in your pipeline (preferably the curated dataset layer) keeps things simpler. Reporting tools are then left to apply presentational or contextual adjustments only.

4. Clear ownership

Every layer benefits from a clear owner — whether that's client-side developers, data engineering, or analytics teams. When ownership is ambiguous, data issues tend to slip through unaddressed.

5. Observability & data quality KPIs

Automated checks for completeness, schema conformance, duplicates, and latency help issues surface early. Monitoring is most effective when it's built in from the start rather than added reactively.
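As a minimal illustration of such checks, the sketch below computes a completeness rate and flags duplicate events in a batch. The field names (event_id, page_id) are hypothetical, not Adobe's; real monitoring would run against your AEP datasets.

```python
# Minimal sketch of automated data-quality checks on a batch of events.
# Field names ("event_id", "page_id") are illustrative, not Adobe's.
from collections import Counter

def completeness_rate(events, field):
    """Share of events where a required field is present and non-empty."""
    if not events:
        return 0.0
    filled = sum(1 for e in events if e.get(field) not in (None, ""))
    return filled / len(events)

def duplicate_ids(events, id_field="event_id"):
    """Return event IDs that appear more than once in the batch."""
    counts = Counter(e.get(id_field) for e in events)
    return [eid for eid, n in counts.items() if eid is not None and n > 1]

batch = [
    {"event_id": "a1", "page_id": "home"},
    {"event_id": "a2", "page_id": ""},          # missing page_id value
    {"event_id": "a1", "page_id": "checkout"},  # duplicate event_id
]
```

Running `completeness_rate(batch, "page_id")` on this sample yields 2/3, and `duplicate_ids(batch)` flags "a1" — exactly the kind of signal worth alerting on before it reaches reporting.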

The strategic framework: What should happen where?

Here's a breakdown of each stage with guidance on its purpose and the transformations that tend to belong there.

The foundation: XDM schema as your data quality contract

Before you implement any transformations, it helps to have a solid foundation. Your XDM (Experience Data Model) schema isn't just documentation — it's your first and most important line of defense for data quality.

Think of your XDM schema as a contract between all stages of your pipeline. It defines what data you expect to receive, what format it should be in, what constraints it must satisfy, and what's required versus optional. Everything else in your pipeline either enforces this contract or builds upon it.
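To make the “contract” idea concrete, here is a sketch of what enforcing one looks like. The schema dict below is a simplified stand-in with hypothetical field names — a real XDM schema is defined and enforced in AEP, not hand-coded like this.

```python
# Sketch of a schema "contract" check for a single event.
# The schema is a simplified stand-in for an XDM schema, with
# hypothetical field names; real XDM schemas live in AEP.
SCHEMA = {
    "eventType":            {"type": str,   "required": True},
    "commerce.order.value": {"type": float, "required": True, "min": 0.0},
    "web.pageName":         {"type": str,   "required": False},
}

def violations(event, schema=SCHEMA):
    """Return a list of human-readable contract violations for one event."""
    problems = []
    for field, rules in schema.items():
        value = event.get(field)
        if value is None:
            if rules["required"]:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(value, rules["type"]):
            problems.append(f"wrong type for {field}")
        elif "min" in rules and value < rules["min"]:
            problems.append(f"{field} below minimum {rules['min']}")
    return problems
```

An event that satisfies the contract returns an empty list; anything else returns actionable messages — the same role type constraints and required-field settings play in your XDM schema.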

Stage 1: Adobe Launch (collection & business rules)

Primary purpose: Capture user interactions and apply business logic for event firing

What tends to work well here:

What to approach with caution:

Owner: Frontend/Tagging Team

Launch is where your business logic lives — the conditional firing of events based on user behavior and context. This is appropriate here because you're capturing the business intent at the moment it happens. Keeping the execution lightweight protects page performance.
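The kind of conditional firing logic meant here can be sketched as follows. In practice this lives in Launch's rule builder or custom JavaScript; Python is used purely to make the decision logic explicit, and the context keys are hypothetical.

```python
# Illustration of the kind of conditional firing logic a Launch rule encodes.
# In Launch this is configured in the rule builder or custom JavaScript;
# the context keys below ("page_type", "order_value") are hypothetical.
def should_fire_purchase_event(context):
    """Fire the purchase event only when the business conditions hold."""
    return (
        context.get("page_type") == "order-confirmation"  # right page
        and context.get("order_value", 0) > 0             # real order
        and not context.get("is_internal_user", False)    # exclude internal traffic
    )
```

Keeping each rule this small and explicit is what keeps Launch lightweight: the rule captures business intent at the moment of interaction, nothing more.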

Stage 2: Dynamic datastream configuration (ingestion filtering)

Primary purpose: Gatekeeping at the ingestion level and feature management

What tends to work well here:

Owner: Tagging/Platform Team in collaboration with Privacy/Legal

This stage acts as your bouncer, keeping bad or unwanted data from entering your storage and processing systems. It's also well-suited for managing the lifecycle of tracking: when a feature is deprecated, excluding it here avoids touching Launch code while keeping the option to re-enable later.
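The gatekeeping and deprecation roles described above amount to a simple admit/reject predicate. The sketch below shows that logic with hypothetical event names; in Adobe this filtering is configured in the dynamic datastream, not written as code.

```python
# Sketch of ingestion-level gatekeeping: drop deprecated or unwanted events
# before they reach storage. Event names are hypothetical; in Adobe this is
# configured in the dynamic datastream rather than coded.
DEPRECATED_EVENTS = {"legacy.banner.click", "old.search.refine"}

def admit(event):
    """Admit an event only if it carries a type and is not deprecated."""
    event_type = event.get("eventType")
    return event_type is not None and event_type not in DEPRECATED_EVENTS
```

Note the deprecation set is data, not code: re-enabling a retired event means removing one entry, mirroring how exclusion here avoids touching Launch.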

Stage 3: Adobe datastream (pre-storage normalization)

Primary purpose: Field mapping and fixed value assignment before data lands in storage

What tends to work well here:

The power and the pitfall:

Datastream is quite capable — you can do extensive transformations here, including complex conditional logic. However, it's often a forgotten layer, and that's where the risk lies. Transformations buried in Datastream configuration can be easy to overlook and harder to troubleshoot than equivalent logic in a more prominent system.

A pragmatic approach: use Datastream for straightforward mappings and basic normalization. Complex business rules and conditional logic tend to be more visible, testable, and maintainable in AEP Data Distiller.
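“Straightforward mappings” here means pure renames with no branching, as in the sketch below. The source and target field paths are hypothetical; in Adobe this is configured as Datastream mapping, while anything with conditional business logic would go to Data Distiller instead.

```python
# Sketch of the straightforward renaming that suits Datastream mapping.
# Source and target field paths are hypothetical; unmapped fields pass
# through unchanged. Branching business logic does NOT belong here.
FIELD_MAP = {
    "pageName":   "web.webPageDetails.name",
    "orderTotal": "commerce.order.priceTotal",
}

def map_fields(raw):
    """Rename known source fields to their target schema paths."""
    return {FIELD_MAP.get(key, key): value for key, value in raw.items()}
```

A mapping table like this is easy to review at a glance — which is exactly the property that gets lost when conditional logic is buried in the same layer.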

Owner: Data Engineering / Platform Team

Stage 4: Adobe Experience Platform (dataset curation)

This stage of the framework applies if you work with Data Distiller.

Primary purpose: Transform raw datasets into analysis-ready curated datasets

Here's how AEP works in practice:

Raw dataset: This is your immutable source — the data exactly as it arrives from data collection (after Datastream processing). This is what gets ingested into AEP.

Curated dataset(s): Using Data Distiller, you transform your raw dataset into curated datasets prepared for Customer Journey Analytics. Because you typically don't need all raw data flowing into CJA, this is where you make deliberate choices about what to carry forward.

What tends to work well here:

Critical consideration: Dataset duplication and licensing

When you maintain both raw and curated datasets with similar data, you're effectively storing data twice. It's worth reviewing your AEP license — you're typically licensed for total volume/rows. The delta between raw and curated is your “waste” from a storage perspective, but also your insurance from an auditability perspective.

Owner: Data Engineering + Analytics (joint ownership), with governance oversight

Stage 5: Customer Journey Analytics (presentation layer)

Primary purpose: Semantic layer and self-service enablement

What tends to work well here:

The self-service perspective:

CJA Data views are your semantic layer. You're not changing the underlying data; you're creating a business-friendly interface to technical data. Think of it this way: the lake contains technical field names like commerce.purchases.value, and Data views translate these into business concepts like “Revenue.”
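A semantic layer is, at its core, a lookup from technical paths to business labels. The sketch below shows that translation; the mapping itself is illustrative, and in CJA it is configured in Data views rather than coded.

```python
# Sketch of a semantic layer: translating technical field paths into
# business-friendly names, as a CJA Data view does. The mapping entries
# are illustrative; in CJA this is configuration, not code.
SEMANTIC_LAYER = {
    "commerce.purchases.value": "Revenue",
    "web.webPageDetails.name":  "Page Name",
}

def business_name(technical_field):
    """Return the business-facing label, falling back to the raw path."""
    return SEMANTIC_LAYER.get(technical_field, technical_field)
```

The underlying data never changes — only the label a business user sees, which is precisely the boundary a Data view should respect.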

Consider being careful about:

Owner: Analytics/Reporting Team

CJA's flexibility is both its strength and its challenge. You can do a lot here, but it's worth asking: “Am I making this more self-service friendly, or am I compensating for problems that should be fixed in the curated dataset?” The former builds capability; the latter accumulates technical debt.

Practical example: The full E-T-T-T-L-T-L-T workflow

Let's walk through a realistic example of how data flows through the pipeline when a user completes a purchase:

Extract — User interaction

Transform 1 — Collection (Launch)

Transform 2 — Filtering (dynamic datastream)

Transform 3 — Normalization (datastream server-side)

Load 1 — Ingestion (AEP raw dataset)

Transform 4 — Curation (AEP data distiller)

Raw Dataset → Curated Dataset transformation (for example, runs hourly):
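Data Distiller transformations are written in SQL; the Python sketch below only illustrates the shape of such a raw-to-curated pass under assumed, hypothetical field names: drop rows that fail validity checks, keep a deliberate subset of fields, and standardize a value (the "EUR" default is an assumption for the example).

```python
# Illustrative raw -> curated pass. In Data Distiller this is a SQL job;
# Python is used here only to show the shape. Field names are hypothetical,
# and the "EUR" currency default is an assumed business rule.
KEEP_FIELDS = ("event_id", "eventType", "order_value", "currency")

def curate(raw_rows):
    curated = []
    for row in raw_rows:
        # Exclude rows that fail basic validity checks.
        if row.get("order_value") is None or row["order_value"] < 0:
            continue
        # Carry forward only the fields CJA actually needs.
        out = {key: row.get(key) for key in KEEP_FIELDS}
        # Standardize currency, defaulting missing values (assumed rule).
        out["currency"] = (out.get("currency") or "EUR").upper()
        curated.append(out)
    return curated
```

The raw rows stay untouched in their dataset; the curated output is a new, smaller, analysis-ready dataset — the “transform once, present many” principle in miniature.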

Load 2 — Connection (CJA dataset)

Transform 5 — Presentation (CJA data views)

Business users access curated data through a familiar interface: a technical field like commerce.purchases.value surfaces simply as “Revenue.”

Quick wins: Control rules you can implement today

  1. XDM Schema Audit: Review your current schema and add type constraints, patterns, and min/max values where missing

  2. Business Rule Documentation: Document every conditional event firing rule in Launch with clear comments explaining business logic (using API, notes within Adobe Launch or a manually maintained change-log)

  3. Required Fields Check: Alert when the completeness rate falls below 100% for critical fields in the raw dataset

  4. Deprecation Log: Maintain a log of events excluded in Dynamic Datastream with reasons, dates, and re-enablement conditions

  5. Cardinality Monitoring: Alert on sudden spikes in unique page IDs, user IDs, or product SKUs (often indicates tagging errors)

  6. Data view Naming Convention: Standardize component naming (for example, all calculated metrics start with “Calc:”, all segments with “Seg:”)

  7. Storage Audit Dashboard: Monthly review of raw vs. curated dataset sizes, row counts, and cost projections

  8. Transformation Count by Stage: Track how many transformations exist at each stage (a useful heuristic: most business logic should live in AEP curation)
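Quick win 5 (cardinality monitoring) is one of the easiest to start with. A minimal version, with an illustrative threshold and no Adobe-specific API assumed, can look like this:

```python
# Sketch of cardinality monitoring (quick win 5): flag a sudden spike in
# unique values, which often indicates a tagging error. The 2x threshold
# and the input values are illustrative, not a recommendation.
def cardinality_spike(today_values, baseline_unique, ratio=2.0):
    """True when today's unique count exceeds the baseline by the given ratio."""
    return len(set(today_values)) > baseline_unique * ratio
```

Feeding it yesterday's unique count as the baseline and today's raw values is enough for a first alert; refine the ratio once you know your normal variance.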

Conclusion: Master your pipeline

Adobe's data collection ecosystem gives you genuine flexibility to transform data at multiple stages. The challenge — and the opportunity — is learning to use that flexibility with intention.

The goal this framework works toward is a setup where business users work with familiar concepts and trustworthy numbers, without needing to understand the eight-step technical pipeline behind them. When that's working well, data quality becomes invisible to end users — not because problems don't exist, but because they're being caught and handled at the right layer.

A few principles worth keeping in mind as you apply this: keep raw data immutable, maintain a single source of truth, transform once and present many times, give every layer a clear owner, and build observability in from the start.

Following this kind of staged approach won't eliminate all data quality problems, but it can dramatically reduce the time spent tracking them down — and give you a clearer picture of where to look when something does go wrong.