All data that is successfully ingested into Adobe Experience Platform is persisted within the Data Lake as datasets. A dataset is a storage and management construct for a collection of data, typically a table, that contains a schema (columns) and fields (rows). Datasets also contain metadata that describes various aspects of the data they store.
This document provides a high-level overview of datasets in Experience Platform.
Catalog Service is the system of record for data location and lineage within Experience Platform, and is used to create and manage datasets. Catalog tracks the metadata for each dataset, which includes a reference to the Experience Data Model (XDM) schema the dataset conforms to (explained in the next section) and the number of records ingested into that dataset.
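As a rough illustration of the metadata described above, the following minimal sketch models a dataset entry that holds a schema reference and a running record count. The `DatasetRecord` class, its field names, and the `record_ingestion` helper are hypothetical names invented for this example; they are not the Catalog Service API or its actual data model.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical, simplified stand-in for a Catalog dataset entry."""
    name: str
    schema_ref: str        # identifier of the XDM schema the dataset conforms to
    record_count: int = 0  # number of records ingested into the dataset so far


def record_ingestion(dataset: DatasetRecord, new_records: int) -> DatasetRecord:
    """Return an updated copy of the entry after a successful ingestion."""
    if new_records < 0:
        raise ValueError("new_records must be non-negative")
    return DatasetRecord(
        dataset.name, dataset.schema_ref, dataset.record_count + new_records
    )


# "example-schema-id" is a placeholder, not a real schema identifier.
loyalty = DatasetRecord(name="Loyalty Members", schema_ref="example-schema-id")
loyalty = record_ingestion(loyalty, 250)
print(loyalty.schema_ref, loyalty.record_count)
```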
See the Catalog Service overview for more information.
Experience Data Model (XDM) is the standardized framework by which Platform organizes customer experience data. All data that is ingested into Platform must conform to a pre-defined XDM schema before it can be persisted in the Data Lake as a dataset.
All datasets contain a reference to the XDM schema that constrains the format and structure of the data that they can store. Attempting to upload data to a dataset that does not conform to the dataset’s XDM schema will cause ingestion to fail.
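The conformance requirement can be sketched with a toy validator. Everything in this example is an assumption made for illustration: the schema is reduced to a mapping of field names to Python types, and the rules (every field required, no extra fields) are not Platform's actual validation rules, which operate on full XDM schemas.

```python
# Hypothetical, heavily simplified schema: field name -> required Python type.
# Real XDM schemas are far richer than this.
EXAMPLE_SCHEMA = {"email": str, "loyalty_points": int}


def conformance_errors(record: dict, schema: dict) -> list[str]:
    """List the ways a record deviates from the schema (empty if it conforms)."""
    errors = []
    for name, expected_type in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    for name in record:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors


def ingest(records: list[dict], schema: dict) -> int:
    """Accept the records only if every one conforms; otherwise fail."""
    for index, record in enumerate(records):
        errors = conformance_errors(record, schema)
        if errors:
            raise ValueError(f"record {index} rejected: {', '.join(errors)}")
    return len(records)


accepted = ingest([{"email": "a@example.com", "loyalty_points": 10}], EXAMPLE_SCHEMA)
print(accepted)
```

In this sketch a single non-conforming record causes the whole call to fail, mirroring the statement that non-conforming uploads cause ingestion to fail.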
For more information on XDM, see the XDM System overview.
Adobe Experience Platform Data Ingestion represents the multiple methods by which Platform ingests data from various sources. Regardless of the method of ingestion, all successfully ingested data is converted to batch files. Batches are units of data that consist of one or more files to be ingested as a single unit. These batch files are then added to dedicated datasets and persisted within the Data Lake.
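The relationship between files, batches, and datasets described above can be sketched as a small in-memory model. The `Batch` and `BatchedDataset` classes, the batch ID, and the file names are all hypothetical; this is not how Platform stores data internally, only a picture of "one or more files ingested as a single unit."

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """Hypothetical model of a batch: one or more files ingested as one unit."""
    batch_id: str
    files: list[str]

    def __post_init__(self):
        if not self.files:
            raise ValueError("a batch must contain at least one file")


@dataclass
class BatchedDataset:
    """Hypothetical dataset that only ever grows by whole batches."""
    name: str
    batches: list[Batch] = field(default_factory=list)

    def add_batch(self, batch: Batch) -> None:
        """Add the whole batch at once; its files are never added piecemeal."""
        self.batches.append(batch)

    def file_count(self) -> int:
        return sum(len(batch.files) for batch in self.batches)


web_events = BatchedDataset("Web Events")
web_events.add_batch(Batch("batch-001", ["part-0", "part-1"]))
web_events.add_batch(Batch("batch-002", ["part-0"]))
print(len(web_events.batches), web_events.file_count())
```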
See the Data Ingestion overview for more information.
Adobe Experience Platform Data Governance allows you to manage customer data to ensure compliance with the regulations, restrictions, and policies that apply to data use. The Data Governance framework allows you to apply usage labels that categorize data according to the usage policies that apply to it. Labels can be applied to individual schemas, to fields within those schemas, and to entire datasets. When labels are applied directly to a schema, those labels are propagated to all existing and future datasets that are based on that schema.
Labels can no longer be applied to fields at the dataset level. This workflow has been deprecated in favor of applying labels at the schema level. Any labels previously applied at the dataset object level will still be supported through the Platform UI until May 31, 2024. To ensure that your labels are consistent across all schemas, you must migrate any labels previously attached to fields at the dataset level to the schema level over the coming year. See the section on migrating previously applied labels for instructions on how to do this.
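One way to picture schema-level label propagation is to have each dataset compute its labels from the schema it references, so a label added to the schema shows up on both existing and future datasets. This is a minimal sketch under that assumption; the `LabeledSchema` and `LabeledDataset` classes and the `EXAMPLE-LABEL` value are hypothetical and do not represent how Data Governance is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledSchema:
    """Hypothetical schema carrying a set of usage labels."""
    schema_id: str
    labels: set[str] = field(default_factory=set)


@dataclass
class LabeledDataset:
    """Hypothetical dataset that inherits labels from its schema."""
    name: str
    schema: LabeledSchema
    own_labels: set[str] = field(default_factory=set)  # applied to the whole dataset

    def effective_labels(self) -> set[str]:
        """Labels on the dataset itself plus those inherited from its schema."""
        return self.own_labels | self.schema.labels


schema = LabeledSchema("example-schema-id")
existing = LabeledDataset("Existing dataset", schema)

# Label the schema after one dataset already exists, then create another.
schema.labels.add("EXAMPLE-LABEL")
future = LabeledDataset("Future dataset", schema)

print(existing.effective_labels(), future.effective_labels())
```

Because both datasets read labels through the shared schema reference, neither needs to be updated individually when the schema's labels change.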
See the Data Governance overview for more information on the service. For steps on how to work with usage labels in Platform, refer to the following guides:
Once datasets have been used to store ingested data, those datasets are then used by downstream Platform services to update customer profiles, gain insights through machine learning, and more.
The following is a list of downstream services that use datasets for various operations. Please review the documentation for each service for more information.
By reading this document, you have been introduced to the core uses of datasets in Experience Platform, as well as the various Platform services that utilize datasets. For more details on the many ways datasets are used in Platform, please review the service documentation linked throughout this overview.
For steps on how to interact with datasets within the Experience Platform UI, see the datasets user guide.