Data prep for data hygiene

Learn how the Data Prep feature of Experience Platform supports data minimization principles, including how to ingest only the fields you need and hash data during ingestion.

In this video, we’ll show you how to use Data Prep tools in Experience Platform to ensure the data you’re ingesting follows best practices for data privacy and data governance. There are multiple things data stewards and engineers keep in mind when managing customer data in Platform. First and foremost, they need to make sure that all data being collected, stored, and activated through Platform adheres to data minimization principles. These principles require companies to collect only the personal data they really need, and to keep it only for as long as they need it. Secondly, making sure that your data storage stays within the limits of your Platform license entitlements is also a priority. For example, Analytics data typically consumes significantly more storage than CRM data, which makes curating your Analytics data critical to staying within your licensing entitlements. Finally, any third-party contractual obligations you may have, as well as your internal privacy policies and guidelines, will also affect how you handle incoming customer data. Platform’s Data Prep functionality gives you powerful, granular control over your ingested data, such as hashing data and selecting only the required fields and events from a source, so you can be confident that you meet all of your data privacy and governance requirements.

Let’s jump into the Platform interface, where I’ll walk you through these processes. Here you can see I’m uploading a CSV file to ingest into Platform, and I’ve gotten to the step where I’m mapping my source fields to the fields in the target schema in Platform. This is just a one-off ingestion workflow that I’m showing as an example, but when you set up ongoing source connections for other products, you’ll come across the same mapping step, and all the same principles apply.
Following the data minimization principle, we want to restrict our data collection to only what’s required for our business purposes, and this is the perfect step to do so. Let’s say we’re ingesting events from our website, and every event comes with a siteDomain field. We don’t use this data anywhere, and it’s not a required field in our schema. So even though we can map the siteDomain field to this XDM field in the target schema, we instead want to completely exclude this data from ingestion. For this, I’ll simply click the remove icon here and this field won’t be mapped.
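To make the idea concrete, here is a minimal Python sketch of what excluding an unmapped field amounts to: only source fields present in the mapping are carried into the target record, and anything left unmapped (like siteDomain) never reaches Platform. The field names and XDM paths here are illustrative, not taken from a real schema, and this is a conceptual model rather than Platform’s actual mapping engine.

```python
def apply_mapping(source_record: dict, mapping: dict) -> dict:
    """Keep only mapped source fields, renamed to their target XDM paths.

    Any source field absent from the mapping (e.g. siteDomain) is dropped,
    which is the data-minimization behavior of removing a mapping row.
    """
    return {
        target_field: source_record[source_field]
        for source_field, target_field in mapping.items()
        if source_field in source_record
    }

# Hypothetical website event with a field we don't need.
event = {"userId": "u123", "pageName": "home", "siteDomain": "example.com"}

# siteDomain is intentionally absent from the mapping, so it is never ingested.
mapping = {"userId": "endUserID", "pageName": "web.webPageDetails.name"}

minimized = apply_mapping(event, mapping)
```

After this step, `minimized` contains only the two mapped fields; the siteDomain value is simply gone from the ingested record.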
Now let’s see how we can minimize data usability while still bringing data into Platform. Let’s say we only collect email addresses to use as part of our customer identities, and we’re not planning to send actual emails to our customers. In that case, it would make sense to hash this data during ingestion. In Platform, you can do this with calculated fields. As you can see, we collect personal email addresses that, for our example, need to be hashed. I’m going to add a new calculated field, and then I’ll look for the hashing function. You can use the search bar here in the left rail to narrow down the list. Platform supports various hash algorithms, including SHA-1, SHA-256, and MD5; for the full list, refer to the documentation. In our case, I’m going to use SHA-256, which is the function right here. Just click the plus icon to add it.
And as you can see, we now need to provide an argument. This will be the field whose value we want to hash. I’ll simply type personalEmail here. We can hit preview to see what our sample output will be. Then I’ll click Save and our calculated field is added to the source data fields. Now we need to map it to our target schema.
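To see what the calculated field does to each value, here is a small Python sketch using the standard library’s `hashlib`. It assumes the sha256 function produces a hex digest of the input string, and the sample email address is purely illustrative; this is an analogy for the transformation, not Platform’s implementation.

```python
import hashlib

def sha256_field(value: str) -> str:
    """Hash a field value the way a sha256 calculated field would:
    one-way, deterministic, hex-encoded."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Hypothetical personal email address to be hashed at ingestion time.
hashed = sha256_field("jane.doe@example.com")
```

Because the digest is deterministic, the same email address always produces the same 64-character hex string, so the hashed value can still serve as a stable identity key even though the original address is never stored.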
I’ll select the emailID field here, and our field is successfully mapped. Since we don’t need to store the unhashed version of our customer email addresses for business purposes, I can now safely delete this mapping. And here we go; all we have left to do is finish the process.

Now let’s see how we can filter data during ingestion so that you only bring in the data you need. For this example, we’ll use the Adobe Analytics data source. For many Platform users, up to 90% of the data populating their customer profiles comes from Analytics rather than from CRM sources, which is why controlling what data gets ingested from Analytics is important for staying within your Platform licensing entitlements. To see the Filtering step here in the ingestion workflow, you first need to enable your data for Profile when creating a source connection with Adobe Analytics. This is because the filtering functionality only applies to data that goes into the Profile store, not data that goes into the data lake. You can filter data for Profile ingestion at the row and column level: row-level filtering lets you specify which data to include for Profile ingestion, while column-level filtering lets you select which data to exclude.

Let’s start with the row level. Here, you can use the left rail to navigate through the schema hierarchy and find the attributes you want to filter. Depending on the size of your schema, you might want to use the search bar to narrow down the list. Let’s say my Analytics report suite contains reservations for multiple countries, but I only want to bring in data related to the United States. All I need to do is drag and drop the Country attribute here. The dropdown offers different conditions, such as equals, starts with, and so on, but I’m going to keep this set to equals. I’ll type “United States” in the text box and press Enter. Now let’s look at column-level filtering.
Here we have the interactive schema tree that contains all of your schema attributes at the column level. For example, if I want to exclude some mobile application events, I’ll expand the Application column here, select applicationCloses, and then select all of the boolean-type attributes to exclude them.
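The two filtering steps above can be sketched in Python to show how they combine: a row-level condition decides which records are included for Profile ingestion, and a column-level exclusion set drops unwanted attributes from the records that remain. The attribute paths, the condition tuple, and the sample rows are all hypothetical stand-ins for the UI configuration, not actual Platform syntax.

```python
# Hypothetical row-level condition: keep only United States reservations.
ROW_CONDITION = ("country", "equals", "United States")

# Hypothetical column-level exclusion: drop the application-close event attribute.
EXCLUDED_COLUMNS = {"application.applicationCloses.value"}

def keep_row(row: dict) -> bool:
    """Apply the row-level inclusion condition (only 'equals' sketched here)."""
    field, op, expected = ROW_CONDITION
    if op == "equals":
        return row.get(field) == expected
    raise ValueError(f"unsupported condition: {op}")

def drop_columns(row: dict) -> dict:
    """Apply the column-level exclusion to a single record."""
    return {key: value for key, value in row.items() if key not in EXCLUDED_COLUMNS}

rows = [
    {"country": "United States", "application.applicationCloses.value": True, "city": "Seattle"},
    {"country": "Canada", "application.applicationCloses.value": False, "city": "Toronto"},
]

# Rows are filtered first (inclusion), then columns are stripped (exclusion).
filtered = [drop_columns(row) for row in rows if keep_row(row)]
```

The design point this illustrates is the asymmetry called out in the walkthrough: row-level filtering is an allow-list (you state what to include), while column-level filtering is a deny-list (you state what to exclude).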
So these are some of the Data Prep functionalities you can use to improve your business practices related to data privacy and governance. We hope this will help you effectively organize your workflows to ingest and store only the customer data that’s really necessary for your processes. Thanks for watching.