All data that is successfully ingested into Adobe Experience Platform is persisted within the Data Lake as datasets. A dataset is a storage and management construct for a collection of data, typically a table, whose schema defines the columns (fields) and whose records form the rows. Datasets also contain metadata that describes various aspects of the data they store.
This document provides a high-level overview of datasets in Experience Platform.
Catalog Service is the system of record for data location and lineage within Experience Platform, and is used to create and manage datasets. Catalog tracks the metadata for each dataset, which includes a reference to the Experience Data Model (XDM) schema the dataset conforms to (explained in the next section) and the number of records ingested into that dataset.
See the Catalog Service overview for more information.
Experience Data Model (XDM) is the standardized framework by which Platform organizes customer experience data. All data that is ingested into Platform must conform to a pre-defined XDM schema before it can be persisted in the Data Lake as a dataset.
All datasets contain a reference to the XDM schema that constrains the format and structure of the data that they can store. Attempting to upload data to a dataset that does not conform to the dataset’s XDM schema will cause ingestion to fail.
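The conformance rule can be illustrated with a minimal sketch. The schema representation, the field names, and the `validation_errors` and `ingest` helpers below are hypothetical and heavily simplified; they are not the XDM format or a Platform API. Rejecting the whole upload on the first non-conforming record is also a simplifying assumption.

```python
from typing import Any

# Hypothetical, simplified stand-in for an XDM schema:
# field name -> expected Python type.
SCHEMA: dict[str, type] = {
    "personId": str,
    "email": str,
    "loyaltyPoints": int,
}


def validation_errors(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return the reasons a record does not conform (empty list if it conforms)."""
    errors = []
    for name, value in record.items():
        if name not in schema:
            errors.append(f"unknown field: {name}")
        elif not isinstance(value, schema[name]):
            errors.append(f"wrong type for {name}: expected {schema[name].__name__}")
    return errors


def ingest(records: list[dict[str, Any]], schema: dict[str, type]) -> int:
    """Accept the records only if every one conforms; otherwise fail."""
    for index, record in enumerate(records):
        errors = validation_errors(record, schema)
        if errors:
            raise ValueError(f"record {index} rejected: {'; '.join(errors)}")
    return len(records)
```

In this sketch, a record with a field that the schema does not define, or with a value of the wrong type, causes `ingest` to raise an error instead of storing anything.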
For more information on XDM, see the XDM System overview.
Adobe Experience Platform Data Ingestion represents the multiple methods by which Platform ingests data from various sources. Regardless of the method of ingestion, all successfully ingested data is converted to batch files. Batches are units of data that consist of one or more files to be ingested as a single unit. These batch files are then added to dedicated datasets and persisted within the Data Lake.
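The relationship between files, batches, and datasets can be sketched as a small data model. The `Batch` and `Dataset` classes and their attributes are hypothetical names chosen for illustration and do not reflect how Platform stores this information internally.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """One or more files that are ingested together as a single unit."""
    batch_id: str
    files: list[str]

    def __post_init__(self) -> None:
        # A batch consists of one or more files, so an empty batch is invalid.
        if not self.files:
            raise ValueError("a batch must contain at least one file")


@dataclass
class Dataset:
    """A dataset that accumulates the batches added to it."""
    name: str
    batches: list[Batch] = field(default_factory=list)

    def add_batch(self, batch: Batch) -> None:
        # The batch is appended whole; its files are never added individually.
        self.batches.append(batch)

    def files(self) -> list[str]:
        """Return all files persisted in this dataset, across batches."""
        return [name for batch in self.batches for name in batch.files]
```

For example, adding a batch of two files and then a batch of one file to the same `Dataset` leaves it with two batches and three files.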
See the Data Ingestion overview for more information.
Adobe Experience Platform Data Governance allows you to manage customer data to ensure compliance with the regulations, restrictions, and policies that apply to data use. The Data Governance framework allows you to apply usage labels that categorize data according to the usage policies that apply to it.
Data usage labels can be applied to entire datasets or individual dataset fields. Labels added at the dataset level are inherited by all fields within that dataset.
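The inheritance rule can be sketched as a set union: the labels in effect on a field are the dataset-level labels plus any labels applied directly to that field. The `effective_labels` function and the label and field names used with it are illustrative assumptions, not a Platform API.

```python
def effective_labels(
    dataset_labels: set[str],
    field_labels: dict[str, set[str]],
    fields: list[str],
) -> dict[str, set[str]]:
    """Compute the labels in effect on each field.

    Every field inherits the dataset-level labels; labels applied
    directly to a field are added on top of the inherited ones.
    """
    return {
        name: dataset_labels | field_labels.get(name, set())
        for name in fields
    }


# Illustrative label names: "C2" applied to the whole dataset,
# "I1" applied only to the "email" field.
labels = effective_labels(
    dataset_labels={"C2"},
    field_labels={"email": {"I1"}},
    fields=["email", "city"],
)
```

Here `email` carries both labels, while `city` carries only the label inherited from the dataset.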
See the Data Governance overview for more information on the service. For steps on how to work with usage labels in Platform, refer to the data usage labels guides.
Once ingested data is stored in datasets, downstream Platform services use those datasets to update customer profiles, gain insights through machine learning, and more.
Several downstream services use datasets for various operations. Review the documentation for each service for more information.
This document introduced the core uses of datasets in Experience Platform, as well as the Platform services that use datasets. For more details on the ways datasets are used in Platform, review the service documentation referenced throughout this overview.
For steps on how to interact with datasets within the Experience Platform UI, see the datasets user guide.