A dataflow is a scheduled task that retrieves and ingests data from a source to a Platform dataset. This tutorial provides steps to configure a new dataflow using your cloud storage account.
This tutorial requires a working understanding of the following components of Adobe Experience Platform:
Additionally, this tutorial requires that you have an established cloud storage account. A list of tutorials for creating different cloud storage accounts in the UI can be found in the source connectors overview.
Experience Platform supports ingestion of the following file formats from external storage:
After creating your cloud storage account, the Select data step appears, providing an interface for you to explore your cloud storage file hierarchy.
Selecting a listed folder allows you to traverse the folder hierarchy into deeper folders. You can select a single folder to ingest all files in the folder recursively. When ingesting an entire folder, you must ensure that all files in the folder share the same schema.
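Because every file in an ingested folder must share the same schema, it can help to check a local copy of the folder before you upload it. The following is a minimal sketch of such a check for CSV files only. It is not part of Platform, and the function names `read_header` and `find_schema_mismatches` are hypothetical. It assumes that "same schema" can be approximated by comparing header rows.

```python
import csv
from pathlib import Path


def read_header(path: Path, delimiter: str = ",") -> tuple[str, ...]:
    """Return the first row of a delimited file as a tuple of column names."""
    with path.open(newline="", encoding="utf-8") as handle:
        # An empty file yields an empty header.
        return tuple(next(csv.reader(handle, delimiter=delimiter), ()))


def find_schema_mismatches(folder: Path, delimiter: str = ",") -> dict[str, tuple[str, ...]]:
    """Return the files whose header differs from the first file's header.

    Files are visited recursively, mirroring folder-level ingestion.
    Keys are paths relative to `folder`; values are the differing headers.
    """
    files = sorted(folder.rglob("*.csv"))
    if not files:
        return {}
    expected = read_header(files[0], delimiter)
    mismatches = {}
    for path in files[1:]:
        header = read_header(path, delimiter)
        if header != expected:
            mismatches[path.relative_to(folder).as_posix()] = header
    return mismatches
```

An empty result means every CSV file found under the folder has the same header row as the first one.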
Once you have a compatible file or folder selected, select the corresponding data format from the Select data format dropdown menu.
The following table displays the appropriate data format for the supported file types:
|File type||Data format|
Select JSON and wait a few seconds for the preview interface to populate.
Unlike delimited and JSON file types, Parquet formatted files are not available for preview.
The preview interface allows you to inspect the contents and structure of a file. By default, the preview interface displays the first file in the folder you selected.
To preview a different file, select the preview icon beside the name of the file you want to inspect.
Once you have inspected the contents and structure of the files in your folder, select Next to ingest all files in the folder recursively.
If you prefer to select a specific file, select the file you want to ingest, and then select Next.
You can set a custom delimiter when ingesting delimited files. Select the Delimiter option and then select a delimiter from the dropdown menu. The menu displays the most frequently used options for delimiters, including a comma (,), a tab (\t), and a pipe (|). If you prefer to use a custom delimiter, select Custom and enter a single-character delimiter of your choice in the pop up input bar.
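To confirm which delimiter a file actually uses before choosing one in the UI, you can parse a sample locally. The sketch below uses Python's standard `csv` module and enforces the single-character rule described above. The names `COMMON_DELIMITERS` and `parse_delimited` are hypothetical, and the code does not reproduce Platform's own parser.

```python
import csv
import io

# The frequently used delimiters named in this tutorial.
COMMON_DELIMITERS = {"comma": ",", "tab": "\t", "pipe": "|"}


def parse_delimited(text: str, delimiter: str) -> list[list[str]]:
    """Parse delimited text into rows, rejecting multi-character delimiters."""
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```

For example, `parse_delimited("id;name\n1;Ada\n", ";")` splits each line on the semicolon, while passing `"::"` raises a `ValueError`.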
Once you have selected your data format and set your delimiter, select Next.
You can ingest compressed JSON or delimited files by specifying their compression type.
In the Select data step, select a compressed file for ingestion and then select its appropriate file type and whether it’s XDM-compliant or not. Next, select Compression type and then select the appropriate compressed file type for your source data.
With a compressed file type identified, select Next to proceed.
The Mapping step appears, providing an interactive interface to map the source data to a Platform dataset. Source files formatted in Parquet must be XDM compliant and do not require you to manually configure the mapping. CSV files require you to explicitly configure the mapping, but allow you to pick which source data fields to map. JSON files that are marked as XDM compliant do not require manual configuration. JSON files that are not marked as XDM compliant require you to explicitly configure the mapping.
Choose a dataset for inbound data to be ingested into. You can either use an existing dataset or create a new one.
Use an existing dataset
To ingest data into an existing dataset, select Existing dataset, then select the dataset icon.
The Select dataset dialog appears. Find the dataset you wish to use, select it, then select Continue.
Use a new dataset
To ingest data into a new dataset, select New dataset and enter a name and description for the dataset in the fields provided. To add a schema, you can enter an existing schema name in the Select schema dialog box. Alternatively, you can select the Schema advanced search to search for an appropriate schema.
During this step, you can enable your dataset for Real-time Customer Profile and create a holistic view of an entity’s attributes and behaviors. Data from all enabled datasets will be included in Profile and changes are applied when you save your dataflow.
Toggle the Profile dataset button to enable your target dataset for Profile.
The Select schema dialog appears. Select the schema you wish to apply to the new dataset, then select Done.
Based on your needs, you can choose to map fields directly, or use data prep functions to transform source data to derive computed or calculated values. For comprehensive steps on using the mapper interface and calculated fields, see the Data Prep UI guide.
For JSON files, in addition to directly mapping fields to other fields, you can directly map objects to other objects and arrays to other arrays. You can also preview and map complex data types such as arrays in JSON files using a cloud storage source connector.
Please note that you cannot map across different types. For example, you cannot map an object to an array, or a field to an object.
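The same-type rule can be illustrated with a small check over parsed JSON values. This is a sketch only: the helper names `kind_of` and `can_map` are hypothetical, and treating every non-object, non-array value as a "field" is a simplifying assumption rather than Platform's actual type model.

```python
from typing import Any


def kind_of(value: Any) -> str:
    """Classify a parsed JSON value as an object, an array, or a plain field."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "field"


def can_map(source_value: Any, target_kind: str) -> bool:
    """Allow a mapping only when source and target are the same kind."""
    return kind_of(source_value) == target_kind
```

With these helpers, `can_map({"city": "Paris"}, "object")` is allowed, while `can_map({"city": "Paris"}, "array")` and `can_map("Paris", "object")` are rejected.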
Platform provides intelligent recommendations for auto-mapped fields based on the target schema or dataset that you selected. You can manually adjust mapping rules to suit your use cases.
Select Preview data to see mapping results of up to 100 rows of sample data from the selected dataset.
During the preview, the identity column is prioritized as the first field, as it is the key information necessary when validating mapping results.
Once your source data is mapped, select Close.
The Scheduling step appears, allowing you to configure an ingestion schedule to automatically ingest the selected source data using the configured mappings. The following table outlines the different configurable fields for scheduling:
|Frequency||Selectable frequencies include
|Interval||An integer that sets the interval for the selected frequency.|
|Start time||A UTC timestamp indicating when the very first ingestion is set to occur.|
|Backfill||A boolean value that determines what data is initially ingested. If Backfill is enabled, all current files in the specified path will be ingested during the first scheduled ingestion. If Backfill is disabled, only the files that are loaded in between the first run of ingestion and the start time will be ingested. Files loaded prior to start time will not be ingested.|
Dataflows are designed to automatically ingest data on a scheduled basis. Start by selecting the ingestion frequency. Next, set the interval to designate the period between two flow runs. The interval’s value must be a non-zero integer greater than or equal to 15.
To set the start time for ingestion, adjust the date and time displayed in the start time box. Alternatively, you can select the calendar icon to edit the start time value. Start time must be greater than or equal to the current time in UTC.
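The two constraints above (an integer interval of at least 15, and a start time that is not earlier than the current UTC time) can be expressed as a short validation routine. This is an illustrative sketch, not Platform's validation logic. `validate_schedule` and `MIN_INTERVAL` are hypothetical names, and the optional `now` parameter exists only to make the check testable.

```python
from datetime import datetime, timezone
from typing import Optional

MIN_INTERVAL = 15  # lower bound stated in this tutorial


def validate_schedule(
    interval: int,
    start_time: datetime,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return a list of problems with the schedule; an empty list means valid."""
    now = now or datetime.now(timezone.utc)
    problems = []

    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(interval, bool) or not isinstance(interval, int):
        problems.append("interval must be an integer")
    elif interval < MIN_INTERVAL:
        problems.append(f"interval must be at least {MIN_INTERVAL}")

    if start_time.tzinfo is None:
        problems.append("start time must be timezone-aware (UTC)")
    elif start_time < now:
        problems.append("start time must not be earlier than the current UTC time")

    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem with a schedule at once.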
Provide values for the schedule and select Next.
For batch ingestion, every ensuing dataflow selects files to be ingested from your source based on their last modified timestamp. This means that batch dataflows select files from the source that are either new or have been modified since the last flow run. Furthermore, you must ensure that there’s a sufficient time span between file upload and a scheduled flow run because files that are not entirely uploaded to your cloud storage account before the scheduled flow run time may not be picked up for ingestion.
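The selection rule described above can be sketched as a filter on last modified timestamps. The `SourceFile` class and `select_for_run` function are hypothetical and only model the behavior. Whether a file modified at exactly the time of the last run is picked up is not stated here, so the strict comparison below is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceFile:
    path: str
    last_modified: datetime  # timezone-aware timestamp reported by the storage


def select_for_run(files: list[SourceFile], last_run: datetime) -> list[str]:
    """Return paths of files that are new or modified since the last flow run."""
    # Assumption: strictly later than the last run counts as "since".
    return [f.path for f in files if f.last_modified > last_run]


if __name__ == "__main__":
    last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)
    files = [
        SourceFile("old.csv", datetime(2023, 12, 31, tzinfo=timezone.utc)),
        SourceFile("new.csv", datetime(2024, 1, 2, tzinfo=timezone.utc)),
    ]
    print(select_for_run(files, last_run))
```

A file that is still uploading when a run starts may not be picked up by that run, which is why the text recommends leaving enough time between upload and the scheduled run.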
To set up one-time ingestion, select the frequency dropdown arrow and select Once. You can continue to make edits to a dataflow set for a one-time frequency ingestion, so long as the start time remains in the future. Once the start time has passed, the one-time frequency value can no longer be edited. Interval and Backfill are not visible when setting up a one-time ingestion dataflow.
It is strongly recommended to schedule your dataflow for one-time ingestion when using the FTP connector.
Once you have provided appropriate values to the schedule, select Next.
The Dataflow detail step appears, allowing you to name your new dataflow and provide a brief description of it.
During this process, you can also enable Partial ingestion and Error diagnostics. Enabling Partial ingestion provides the ability to ingest data containing errors, up to a certain threshold that you can set. Enabling Error diagnostics will provide details on any incorrect data that is batched separately. For more information, see the partial batch ingestion overview.
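As a rough illustration of how an error threshold behaves, the sketch below accepts a batch only while its share of failed rows stays within a limit. The function name `batch_passes` is hypothetical. Expressing the threshold as a percentage of rows, and treating a batch exactly at the threshold as acceptable, are both assumptions made for this example; see the partial batch ingestion overview for the actual behavior.

```python
def batch_passes(total_rows: int, failed_rows: int, threshold_percent: float) -> bool:
    """Return True if the batch's error rate is within the threshold.

    Assumptions: the threshold is a percentage of rows, and a batch
    exactly at the threshold is still accepted.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if not 0 <= failed_rows <= total_rows:
        raise ValueError("failed_rows must be between 0 and total_rows")
    error_percent = 100.0 * failed_rows / total_rows
    return error_percent <= threshold_percent
```

Under these assumptions, a batch of 1,000 rows with 40 failures passes a 5 percent threshold, while the same batch with 60 failures does not.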
Provide values for the dataflow and select Next.
The Review step appears, allowing you to review your new dataflow before it is created. Details are grouped within the following categories:
Once you have reviewed your dataflow, click Finish and allow some time for the dataflow to be created.
Once your dataflow has been created, you can monitor the data that is being ingested through it to see information on ingestion rates, success, and errors. For more information on how to monitor dataflows, see the tutorial on monitoring accounts and dataflows in the UI.
You can delete dataflows that are no longer necessary or were incorrectly created using the Delete function available in the Dataflows workspace. For more information on how to delete dataflows, see the tutorial on deleting dataflows in the UI.
By following this tutorial, you have successfully created a dataflow to bring in data from an external cloud storage source, and gained insight into monitoring datasets. To learn more about creating dataflows, you can supplement your learning by watching the video below. Additionally, incoming data can now be used by downstream Platform services such as Real-time Customer Profile and Data Science Workspace. See the following documents for more details:
The Platform UI shown in the following video is out-of-date. Please refer to the documentation above for the latest UI screenshots and functionality.
The following sections provide additional information for working with source connectors.
When a dataflow is created, it immediately becomes active and ingests data according to the schedule it was given. You can disable an active dataflow at any time by following the instructions below.
Within the Sources workspace, select the Browse tab. Next, select the name of the account that’s associated with the active dataflow you wish to disable.
The Source activity page appears. Select the active dataflow from the list to open its Properties column on the right-hand side of the screen, which contains an Enabled toggle button. Click the toggle to disable the dataflow. The same toggle can be used to re-enable a dataflow after it has been disabled.
Inbound data from your source connector can be used towards enriching and populating your Real-time Customer Profile data. For more information on populating your Real-time Customer Profile data, see the tutorial on Profile population.