Ingest data using Cloud Storage source connectors

This video shows how to easily batch ingest data from cloud storage services into Adobe Experience Platform’s Real-Time Customer Profile and data lake, in a seamless and scalable manner. For more detailed product documentation, see cloud storage on the Source Connectors overview page or the Amazon S3 source connector documentation.

Hi, there. I’m going to give you a quick overview of how to ingest data from your Cloud Storage sources into Adobe Experience Platform. Data ingestion is a fundamental step in getting your data into Experience Platform so you can use it to build 360-degree Real-time Customer Profiles and use them to provide meaningful experiences. You can ingest data from a wide variety of sources, such as Adobe applications, Cloud-based storage, databases, and many others. When you log into Platform, you will see Sources in the left navigation. Clicking Sources will take you to the Source Catalog screen, where we can see all of the source connectors currently available in Platform. Note that there are source connectors for Adobe applications, CRM solutions, Cloud storage providers, and more. Let’s explore how to ingest data from Cloud Storage into Experience Platform. Each source has its specific configuration details, but the general configuration for Cloud Storage sources is somewhat similar. For our video, let’s use the Amazon S3 Cloud Storage. Select the desired source. When setting up a source connector for the first time, you will be provided with an option to configure it. Since this is our first time creating an S3 account, let’s choose to create a new account and provide the source connection details. Complete the required fields for account authentication and then initiate a source connection request. If the connection is successful, click Next to proceed to Data Selection. In this step, we choose the source file for data ingestion and verify the file data format. Note that the ingested file data can be formatted as JSON, XDM Parquet, or delimited. Currently, for delimited files, you have an option to preview sample data of the source file. Let’s proceed to the next step to assign a target Dataset for the incoming data. Let’s choose the New Dataset option and provide a Dataset name and description. To create a Dataset, you need to have an associated schema.
Using the schema finder, assign a schema to this particular Dataset. Upon selecting a schema for this Dataset, Experience Platform performs a mapping between the source file fields and the target fields. This mapping is performed based on the title and type of the fields. This pre-mapping of standard fields is editable. You can add a new field mapping to map a source field to a target field. The Add Calculated Field option lets you run functions on source fields to prepare the data for ingestion. For example, we can combine the first name and the last name fields into a calculated field using the Concatenation function before ingesting the data into a Dataset field. We can also preview the sample result of a calculated field. After reviewing the field mapping, we can also preview data to see how the ingested data will get stored in your Dataset. The mapping looks good, so let’s move to the next step. Scheduling lets you choose a frequency at which data should flow from the source to a Dataset. Let’s select a frequency of 15 minutes for this video and set a start time for the data flow. To allow historical data to be ingested, enable the backfill option. Backfill is a Boolean value that determines what data is initially ingested. If backfill is enabled, all current files in the specified path will be ingested during the first scheduled ingestion. If backfill is disabled, only the files that are loaded in between the first run of ingestion and the start time will be ingested. Let’s move to the data flow detail step and provide a name for your data flow. In the data flow detail step, the Partial Ingestion toggle allows you to enable or disable the use of partial batch ingestion. The Error Threshold allows you to set the percentage of acceptable errors before the entire batch fails. By default, this value is set to 5%. Let’s review the source configuration details and save your changes. We do not see any data flow run statuses yet because we set a frequency of 15 minutes for our data flow runs.
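The backfill and error threshold settings described above can be sketched in a few lines of Python. This is only an illustrative model, not Platform code: the names `SourceFile`, `files_for_first_run`, and `batch_outcome`, and the status strings, are hypothetical. It also assumes that a batch fails when the error percentage exceeds the threshold, and that without partial ingestion any failed record fails the whole batch.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SourceFile:
    """Hypothetical stand-in for a file in the cloud storage path."""
    path: str
    loaded_at: datetime  # when the file landed in the source path


def files_for_first_run(files, start_time, backfill):
    """Pick the files the first scheduled run would ingest (assumed model).

    With backfill enabled, every file currently in the path is ingested.
    With backfill disabled, files loaded before the start time are skipped.
    """
    if backfill:
        return list(files)
    return [f for f in files if f.loaded_at >= start_time]


def batch_outcome(total_records, failed_records, partial_ingestion,
                  error_threshold_pct=5.0):
    """Classify a batch using the assumed partial-ingestion rule."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    if failed_records == 0:
        return "success"
    if not partial_ingestion:
        # Assumption: without partial ingestion, any error fails the batch.
        return "failed"
    error_pct = 100.0 * failed_records / total_records
    # Assumption: the batch fails only when errors exceed the threshold.
    if error_pct <= error_threshold_pct:
        return "success_with_errors"
    return "failed"


if __name__ == "__main__":
    files = [
        SourceFile("loyalty/old.csv", datetime(2024, 1, 1, 8, 0)),
        SourceFile("loyalty/new.csv", datetime(2024, 1, 1, 12, 0)),
    ]
    start = datetime(2024, 1, 1, 10, 0)
    print([f.path for f in files_for_first_run(files, start, backfill=False)])
    print(batch_outcome(1000, 20, partial_ingestion=True))
```

Under these assumptions, a run with 20 failed records out of 1,000 (2%) stays under the default 5% threshold, so the batch still counts as successful even though some records failed.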
So let’s wait for the data flow run to trigger. After refreshing the page, you can now see that our data flow run has completed. Open the data flow run to view more details about the activity. Our last data flow run was completed successfully with a few failed records. You may wonder why the most recent data flow run was successful even though it had failed records. That’s because we enabled partial ingestion when we set up the data flow and chose an error threshold of 5%. Since we enabled error diagnosis for our data flows, we can also see the error code and description in the Data Flow Run Overview window. Experience Platform lets users preview or download the error diagnosis to determine what went wrong with the failed records. Let’s go back to the Data Flow Activity tab. At this point, we have verified that the data flow was completed successfully from the source to our Dataset. Let’s open our Dataset to verify the data flow and its activities. You can open the Luma Customer Loyalty Dataset right from the data flow window, or you can access it using the Datasets option from the left navigation. Under the Dataset Activity, we can see a quick summary of ingested batches and failed batches during a specific time window. Scroll down to view the ingested batch ID. Each batch represents actual data ingestion from a source connector to a target Dataset. With Real-time Customer Profile, we can see a holistic view of each customer that combines data from multiple channels, including online, offline, CRM, and third-party data. To enable our Dataset for Real-time Profile, ensure that the associated schema is enabled for Real-time Profile. Once a schema is enabled for Profile, it cannot be disabled or deleted. Also, fields cannot be removed from the schema after this point. These implications are essential to keep in mind when working with the data in your production environment.
It is recommended to verify and test the data ingestion process to capture and address any issues that may arise before enabling the Dataset and schema for Profile. Now let’s enable Profile for our Dataset and save our changes. In the next successful batch run, data ingested into our Dataset will be used for creating Real-time Customer Profiles. You can also preview the data ingested into the Dataset using the Preview Dataset option. Finally, under the Data Governance tab, we can also add governance labels to the Dataset fields to restrict their usage.
A final confirmation step is to inspect a profile that should have been updated in the last ingestion. Click Profiles in the left navigation and then click Browse. Look up a profile using an identity namespace and a corresponding identity value. In this case, let’s use the loyalty ID and make sure it has the new fields. Adobe Experience Platform allows data to be ingested from external sources while providing you with the ability to structure, label, and enhance incoming data using Platform services.