DataFacts - AEP’s first Data Anomaly Detection Tool built on App Builder

Softcrylic utilizes Adobe's App Builder to develop an application that helps identify data inconsistencies within Adobe Experience Platform (AEP).

Transcript
Thank you, guys, for joining us. My name is Jerry Helo. I lead the data activation team at Softcrylic. We're here today to talk about DataFacts. It's one of Adobe's first data anomaly detection tools built on Project Firefly, or as it's known now, App Builder. And with me is Sundar. Sundar, do you want to introduce yourself? Thank you, Jerry. Hi, everyone. My name is Sundar Sridharan, and I'm Senior Director of Software Development and Testing at Softcrylic. For over a decade, I've been helping Softcrylic clients build enterprise software and release it with quality, on time. For the past two years, though, I've been extremely focused on enterprise data architecture and data quality, and data quality keeps me awake at night. I'm very excited to show you a solution, a tool we have built to solve data quality issues within AEP. Excited to be here. Thank you. Awesome. Yeah, I want to highlight that the reason we ended up building this was client need. As we started doing Adobe Experience Platform implementations, we noticed there was a gap in fully identifying when there is a problem with the data being brought into the platform. So why DataFacts? It's really to solve four distinct issues that we've seen. The first one is data validity. Think about the data you bring into the platform. There might be different data sources feeding into a schema, and there is a lot of data in there. It can come in as JSON or as a flat file. To have a consistent source of truth within the platform, you want clean, valid data always being fed into it. In addition to that, completeness. In many cases we want to build this 360-degree view of the customer, but how do we know when certain days of data are missing, or certain attributes for some records are missing? It happens. Data is never perfect. But we want to know when there is an issue. If you combine validity and completeness, on the other extreme are duplicates. Duplicates in some cases not only impact the cost of the platform, since you might be charged based on the number of records and profiles, but also performance. Think about running heavy queries; it's not good to have duplicate data that shouldn't be there. And when it comes to duplicates, it's beyond just a single record. We're talking about identities. Depending on how you're building your identity, whether it's at a household level or a personal account level, duplicates might be sneaking in without you fully knowing. And finally, data consistency. Our tool makes sure that the data you're replicating, whether it's being brought in from a data warehouse or a data lake into Experience Platform, is always consistent, so at any point you can look at records and do validation between both systems with no surprises. With that, I want to hand it over to Sundar, who is going to eventually go through a demo. We're going to show you the tool specifically, but first Sundar is going to walk through the approach, because this is not a simple problem to solve, and it's really neat how we approached and solved it. All you, Sundar. Thank you, Jerry. OK, so as Jerry pointed out, we used to hear about many data quality issues from our customers, and these issues can be categorized into any of these four buckets.
So when we started looking at what would be the right approach, from a technical solution standpoint, to solve these data quality issues, we tried many things. But the two things that really worked out very well are expectations and machine learning. I'll talk about expectations first. It is basically a framework. You can think of expectations, particularly for data users, as a way for them to express their domain knowledge about the data. In technical terms, expectations are like assertions for data. They are very declarative, very flexible, and extensible. You can create expectations like, I expect my column values to be unique for a particular column, or I expect my column values to be in a set which represents departments within our company. Using a framework like this, all your domain knowledge about the data can be converted into code, and these expectations can be used to validate your data as and when new data is ingested into your Adobe Experience Platform. Not only has this rules-based approach helped us, but in some instances, in order to detect data quality issues, we had to look at the past history of the data, identify patterns, do some predictions, and then identify the issues. In those cases, machine learning models really helped us a lot. We built active learning machine learning models that look at your past data, identify the historic trends, build patterns, and learn from the data, and then detect data anomalies and duplicates. These active machine learning models, as I mentioned, keep learning as and when new data is ingested into AEP and then use that knowledge to detect data anomalies and duplicates. Using these two approaches, expectations and our machine learning models, we built DataFacts. From a technology standpoint, DataFacts is an Adobe Project Firefly app, or App Builder app, and it hosts a lot of Softcrylic's data quality automation libraries that identify data quality issues as and when new data is ingested into AEP, and it does all of this automatically. That's the technology side. The two main features of our tool are, first, it alerts AEP data users about data quality issues and anomalies at a record level and a column level for all of your AEP data sources, the data that you are ingesting into AEP. The other big feature is that we have built a lot of data quality dashboards that help our clients measure the quality of the customer data in AEP and also help them identify and resolve data quality issues using our drill-down reports. We have deployed our tool in a couple of our clients' AEP instances, and our clients have realized these benefits. The first is better audience targeting, because your segments can now use quality data about your customers, and with that quality data you are able to target your customers with more accuracy. Some of the campaigns activated through the quality-checked segments have, as we have seen, increased customer engagement, satisfaction, and retention. And in particular, our data dedupe at the customer profile level helps save marketing dollars by reducing cost, by removing duplicate customer profiles, and it also improves campaign effectiveness.
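To make the expectations approach described above concrete, here is a minimal sketch of the two example expectations mentioned in the talk, written with the open-source Great Expectations library. The talk does not name a specific package, and the column names and the department set below are illustrative assumptions, not DataFacts internals.

```python
# Sketch only: expressing domain knowledge as "assertions for data" using the
# open-source Great Expectations library (legacy 0.x pandas API; newer
# releases expose a different entry point). Column names are illustrative.
import great_expectations as ge
import pandas as pd

records = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "department": ["Sales", "Marketing", "Support"],
})
batch = ge.from_pandas(records)

# "I expect my column values to be unique for a particular column"
unique_check = batch.expect_column_values_to_be_unique("customer_id")

# "I expect my column values to be in a set which represents departments"
set_check = batch.expect_column_values_to_be_in_set(
    "department", ["Sales", "Marketing", "Support", "Engineering"]
)

print(unique_check.success, set_check.success)  # True True for this sample
```

The same expectations can be re-evaluated every time a new batch lands, which is the pattern the tool automates on top of AEP ingestion.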
These are some of the benefits that our actual customers have been able to realize using our DataFacts tool. So without further ado, I want to show a quick demo of the tool and then go through a few examples as well. I'm going to move over and show our tool and get started. Here is the general UI of our tool. After our Firefly app is installed in your AEP instance, it starts with a settings screen. Do you mind going full screen, just so that it's a little more visible? Yeah. Thank you so much, Jerry. So after our app is installed on top of your AEP instance, it will automatically read all the data sets in your AEP instance and present them here for you. As a user of our tool, you can go to your data set and enable various quality checks for your data sets. We have different checks, like validating the data, validating the actual quality of the data, the completeness of the data, and any duplicate customer profiles inside your AEP instance as a whole. So you can enable different quality checks, and for each of the quality checks, you can set up different thresholds. Setting this threshold depends upon how many issues you are going to tolerate. If you are really keen on data quality, you set a low threshold. But if you are somewhat lenient, because you know that there are data quality issues in your data to some extent, you can set a little higher threshold. By default, the system automatically sets it to 5%. So you pick the data quality checks that you want to run on your data sets and save your data quality settings. The moment you save the changes, our tool looks at the past three months of data that's already in the data set. It profiles the data, identifies the schemas, and identifies the nature of the data. If it is email data, the appropriate email validations are set. If it is a mobile phone number, the appropriate phone number validations are set. If it's a city or a country, based on the data, our tool sets various data quality checks. More importantly, as and when new data is ingested, a batch is ingested, our tool identifies that there is new data in your AEP instance. It immediately scans the data, runs the data quality checks, and builds the data quality reports. Among our various data quality reports, here is a report on data validity, and here is a report on completeness, which is based on machine learning models and metrics. And here you also have duplicate reports. We also have a quick feedback section where, if any report or data quality issue identified by our tool is not in alignment with your business, you can quickly open a bug request with us, or you can also ask for new features. And also, more importantly, we have in-app notifications and email notifications. Every time new data is ingested, and every time the tool runs data quality checks on the data, it provides notifications. Whether a particular batch has good quality or has data quality issues, it reports that in the in-app notifications. I'm bringing up the in-app notifications. It's taking a little while here. OK, hopefully it's coming up. I'm going to just leave it there.
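The per-check threshold mentioned in the settings walk-through above can be thought of as a simple tolerance on the failure rate of each ingested batch. The sketch below illustrates that idea; the function name, its signature, and its defaults are hypothetical, not the actual DataFacts configuration API.

```python
# Sketch of the threshold idea: a batch fails a quality check when the share
# of bad records exceeds the configured tolerance (the talk mentions a 5%
# default). Names and signature are illustrative assumptions.
DEFAULT_THRESHOLD = 0.05  # 5%

def batch_passes_check(total_records: int, failed_records: int,
                       threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Return True when the batch's failure rate stays within tolerance."""
    if total_records == 0:
        return True  # nothing ingested, nothing to flag
    return (failed_records / total_records) <= threshold

# Using the numbers from the demo's Salesforce Contacts batch:
print(batch_passes_check(500_000, 20_000))                  # True  (4% <= 5%)
print(batch_passes_check(500_000, 20_000, threshold=0.02))  # False (4% > 2%)
```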
Now I'm going to show you a few examples of how, using our DataFacts tool, you, meaning AEP users, can identify data quality issues at a record level and a column level and resolve them. I want to start with the first example, which is the Validity Report. I'm going to pick one example for a selected date; I'm going to select the 27th for this example. So here, you can see I've selected a particular date, and then I'm going to apply a filter. We have various filters: you can filter these reports based on your data set, or even on the batch ID on which the data was ingested, the column name, and so on. And here we also have a metrics section. For this particular day, we saw 18 million records ingested into AEP, out of which 20,000 records failed our quality checks. So here, I'm going to select a particular data set to filter this report. I'm selecting one of our data sets, and then I'm going to select a particular column. Once you select the data set, the system brings up all the columns related to that data set. Because this is a huge data set, it's bringing in all the columns. I'm going to focus on a particular column, which is email address, so I'm selecting the email address. You can also select a particular validation, or expectation, that you are focused on. For our case, I'm going to select a validation that's been done on the email addresses, and I'm going to apply the filter. After you've selected all your filters, you can apply them to look at a specific error record in your data set. Here, I'm applying my filters, and you can see I'm focusing on a particular error validation. This data set is basically a Salesforce Contacts data set, and I'm looking at a particular batch of data that was ingested. In this batch, I can see that almost half a million records were ingested, out of which 20,000 records didn't have any email address, and of the records that did have email addresses, we found about 70 records with invalid email addresses. So now, at a record level, it is showing you where there are data quality issues with respect to this email column. I'm clicking on the details and drilling down into the report. When you drill down, we collect more details about this data quality issue and present them here. You can see the data record that you have selected, and we show what was actually validated. What we expect here is that, for this email address column, the values are valid email addresses, but you can see some of the email addresses are not. Here, there is no full domain name. Here, you can see a period at the end of the username. So these are all invalid email addresses. Suppose these customer profiles are in a segment and are being targeted for an email campaign; these emails are not going to go anywhere, because they are invalid. So this is an example of how, using our DataFacts tool, you can identify data quality issues at a record level and a column level. That was really validating data quality issues for a single column, and I wanted to take a quick moment to show you another example of our tool.
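The two failure modes called out in the email example above, a missing domain and a trailing period in the username, can be caught with a simple format check. The sketch below uses a deliberately simplified regular expression as an assumption; it is not the exact validation DataFacts applies.

```python
# Illustrative email-validity check: flag records whose email is missing,
# has no full domain, or ends the username with a period.
import pandas as pd

EMAIL_PATTERN = r"^[A-Za-z0-9_+\-]+(\.[A-Za-z0-9_+\-]+)*@[A-Za-z0-9\-]+(\.[A-Za-z0-9\-]+)+$"

def email_issues(emails: pd.Series) -> pd.DataFrame:
    """Flag missing and format-invalid email addresses in a column."""
    missing = emails.isna() | (emails.fillna("").str.strip() == "")
    invalid = ~missing & ~emails.fillna("").str.match(EMAIL_PATTERN)
    return pd.DataFrame({"email": emails, "missing": missing, "invalid": invalid})

sample = pd.Series(["jane@example.com", "", "bob@example", "alice.@example.com"])
print(email_issues(sample))
# The empty value is missing; "bob@example" has no full domain; and
# "alice.@example.com" ends the username with a period - the same kinds of
# errors shown in the demo.
```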
In real-world scenarios, there are times when you have to validate the quality of the data based on a column combination. You might have a business rule: when my column is x, I want to see a y value in my other column. Our tool supports those kinds of column combination rules as well. So here, I'm going to show you a quick example of how you can identify issues with a column combination. I'm going to select another date, and I'm going to select this record straight away. I'll go to the details and drill down here. Again, we are bringing in the data quality issues, adding all the details, and presenting them here. For this column combination, what we are validating is whether the city is a valid city for the selected country. Here, you can see in the error records that we have countries, valid country names. But if you look at the customer profiles, they are not complete. We just have a comma here, and there are lots of records like this. You can see there are about 48 records without any city name for customer profiles that belong to the United Kingdom. I can click further on the details here to look at the complete record in context with other data. So here, I'm clicking on the details, another drill-down that's available within our tool. Now our tool is fetching sample complete records from AEP and presenting them here for your review, so that you can look at this error in the context of the other data you have in AEP. You can see that the address country was United Kingdom, with the city just a comma; there's no full city name here. We also provide another feature, which is a copy SQL query. You can copy a query for all the error records, because here we are only presenting a few sample error records. You can take this query, go into your AEP interface, paste in the query, and run it so that you can see all the records. Just to save time, I have already executed this query, and you can see there are about 48 records which don't have a city name for this country. So that's another example of the validity errors that can be identified using our tool. Our tool validates each and every data value in your columns, and sometimes combinations of columns, and then identifies data quality issues. Now I'm going to show data quality validation on completeness. How do we validate completeness of data, particularly within AEP? We look at key metrics: the number of records ingested into the data set, the number of identities added, the number of identities stitched together to create the identity graph, the number of profiles added to your AEP instance, and the number of profiles updated in your AEP instance. These key metrics give you a scale to measure the completeness of your data, and this measurement is done through our machine learning model. So here, I'm going to show you an example of how you can identify a completeness issue. I'm going to focus on a particular date, the 28th, to take an example. I'm going to select a particular data set, and I'm applying my data set filter here.
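Before the completeness walk-through continues, here is a minimal sketch of the column-combination rule from the example above: when a record's country is populated, its city should be a real value rather than a blank or a stray comma. The column names and the rule itself are illustrative assumptions, not the tool's actual logic.

```python
# Sketch of a column-combination check: country is present but the city is
# empty or junk (for example just a comma). Column names are assumptions.
import pandas as pd

def city_country_issues(profiles: pd.DataFrame,
                        country_col: str = "country",
                        city_col: str = "city") -> pd.DataFrame:
    """Return records whose country is set but whose city carries no value."""
    city = profiles[city_col].fillna("").str.strip(" ,.")
    has_country = profiles[country_col].fillna("").str.strip() != ""
    return profiles[has_country & (city == "")]

sample = pd.DataFrame({
    "country": ["United Kingdom", "United Kingdom", "France"],
    "city":    [",", "London", "Paris"],
})
print(city_country_issues(sample))
# Flags the first record: country is United Kingdom but the city is just a
# comma, the same pattern the demo surfaced for 48 contact records.
```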
And here, you can see I'm focusing on a particular error record. What we are presenting is that on September 28, for the Adobe Audience Manager real-time data set, a lot of records were ingested, but not as many as we expect. I'm going to go into the details and show you. On September 28, the number of records ingested into this data set was just 200,000. But based on past patterns, past data, we have seen that millions of records are ingested into this data set on a daily basis. Our machine learning model predicted that at least 4 million records should be ingested on this particular date, but we saw only 200,000. This might lead to incomplete profile creation and incomplete identity graph creation, so we mark this as a data anomaly and alert the AEP users: hey, on this day you didn't have enough data records ingested, so please check your system. And we're not only alerting on low record ingestions or profiles added. We also report at the other end of the range, which we call upper bound anomalies; basically, you're getting more records than usual on a particular day. On the 14th of September, you're getting a little more than the usual number of records. Sometimes this may be OK; you have more traffic, there's some seasonality to it, so you're getting more records. But sometimes, what we have seen is that so many records are being ingested that it may be due to a data pipeline issue, a pipeline running twice, or a batch ingestion issue. So what we are seeing here is how you can identify data completeness issues using these metrics and data anomaly detection. Finally, as a last example, I also wanted to show you how our tool identifies duplicate customer profiles. Detecting duplicate customer profiles is very important in AEP, because duplicate profiles impact your marketing budget. You'll be spending more on email campaigns to the same customer, or on text messaging, or even on direct mail marketing. Basically, duplicates waste the marketing budget. In this example, I'm showing how our tool identifies duplicates. Again, the tool uses machine learning models to do this. We don't identify duplicates just based on repeated values in records. We take the whole customer profile, pick specific features of it, like email addresses, first name, last name, address, and various other attributes, and use those features to learn about the customer profiles. Across the whole AEP system, we identify similar records and then group those similar profiles into clusters. So here, what you can see is that I have a cluster. The cluster ID is 18, and it has 28 customer profiles, which we have marked as duplicates. I'm going into the details and bringing those customer profiles up here. While we are bringing the customer profiles from AEP, I also want to explain this tree map. Basically, what the tree map tells you is that it's not just the current cluster, ID 18, that has 28 duplicate customer profiles. There are other clusters in your AEP system, lots of other clusters of duplicates, and some of them have 11 customer profiles or nine customer profiles, but some have even more. That's what this is representing. So I'm just going to hide this tree map.
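A note on the completeness example above: the talk does not disclose the model behind these metrics, so the following is only a simple baseline illustrating the idea of learning the typical daily record count for a dataset from history and then flagging days that fall far below (lower bound) or above (upper bound) that range.

```python
# Simple illustrative baseline for the lower-bound / upper-bound anomaly idea;
# the actual DataFacts models are not described at this level of detail.
import numpy as np

def classify_daily_count(history_counts, todays_count, k: float = 3.0) -> str:
    """Compare today's ingested-record count against historic daily counts."""
    history = np.asarray(history_counts, dtype=float)
    mean, std = history.mean(), history.std()
    if todays_count < mean - k * std:
        return "lower-bound anomaly: far fewer records than expected"
    if todays_count > mean + k * std:
        return "upper-bound anomaly: far more records than expected"
    return "within the expected range"

# Roughly the scenario from the demo: millions of records per day in the past,
# but only about 200,000 ingested on September 28.
daily_counts = [4_100_000, 3_900_000, 4_050_000, 4_200_000, 3_950_000, 4_000_000]
print(classify_daily_count(daily_counts, 200_000))    # lower-bound anomaly
print(classify_daily_count(daily_counts, 4_020_000))  # within the expected range
```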
Here, I'm going to look at the actual data. You can see this is the actual customer profile data, the email addresses and so on. If you look at the email addresses, you might think, oh, these are all unique profiles, so why are they duplicates? But if you look at the business phone numbers, they all look the same. And if you look at the postal codes, they're the same. How can the same postal code show up for different addresses? And if you look at the names, the first and last name is James Smith. I think this is a perfect example: there is probably a real James Smith whose record was ingested multiple times, but looking at the phone number, it seems like somebody is playing with this customer data, or it's something that got ingested into AEP incorrectly, and these are all duplicate customer profiles. If you send email campaigns to all of them, everything is going to go to the same person, which is James Smith. So this is an example of how you can identify duplicates using our tool. These are just examples of the data quality issues you can identify with it. There's a lot more, but due to time constraints, I was only able to pick these examples to show you. Thank you, Sundar. Yeah, it's amazing that you were able to fit this whole demo into these quick 20 minutes. Excellent, yes. I think one key thing that I want to make sure you touch on before we wrap up is that Adobe offers AEP data flow monitoring. Can you quickly highlight the main things that DataFacts does differently? That's a great question; we get it a lot. Yes, in Adobe there is some level of monitoring, and there are guardrails to take care of data quality issues. So how is our tool, DataFacts, different from Adobe's data flow monitoring? The first thing is, yes, AEP has some guardrails, but what we do with DataFacts is extensive data quality validation. Our tool first profiles the data in a data set. It understands what type of data is in each and every column and record, and based on that profile, it automatically creates various validations. As I mentioned, if it's email data, we create email validations. If we see that a particular column doesn't have any null values, then we make sure that column should not hold any null values in the future either. The same goes for mobile phone numbers and so on. AEP doesn't do that out of the box. So our tool automatically understands the profile of the data and then assigns various validations to it. Number two is completeness of the data. Completeness of the data within AEP, that 360-degree view of the customer profile, is measured based on different metrics. We measure the completeness of data using historic patterns, historic data, and then predict future values. That kind of extensive prediction and identification of data anomalies is not done by AEP. The same goes for customer profile duplicates. As I showed in the example, we are not just looking for values repeated in the records.
We are taking different features of the customers, looking for customers who look alike, combining them to see whether they are duplicate customer profiles, and then presenting that to you. That machine learning based dedupe detection is not available in AEP. Yes, in AEP there is some level of alerting mechanism, and the monitoring dashboards give you data ingestion and data flow failures; our tool also does that. And finally, the root cause analysis: which record, which batch, on which day, at which time you had the data quality issues. That kind of detailed data quality analysis and reporting is not in AEP, and DataFacts does that. Excellent. Yes, I agree with that. There's a lot more that this product offers versus the built-in monitoring. I know we're out of time. The last thing I want to say is thank you, everyone, and thank you, Sundar, for going through the demo. We are not yet in the Adobe Exchange; we're going to be releasing that soon, but that doesn't mean you cannot get your hands on this product. Just hit us up, either me or Sundar, or go to softcrylic.com/datafacts and submit a request. We are actually running a promo, allowing clients to use this product for free with one Adobe data source, so it's a good place to start. For any other questions, please reach out to us. We would love to talk about the product and see how it can help you and your clients solve any data ingestion issues. Great. Thank you, Sundar. Thank you. Thanks, everyone. All right. See you. All right. Bye.
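As a closing illustration of the duplicate-profile discussion above: rather than looking for exactly repeated rows, comparable features are built from each profile and look-alike profiles are grouped into clusters. DataFacts uses machine learning models over many profile attributes; the normalized-key grouping below is only a minimal sketch, and every field name in it is an assumption.

```python
# Minimal sketch of feature-based duplicate clustering. The real tool uses ML
# models over many profile attributes; this simple normalized-key grouping and
# all field names here are illustrative assumptions.
import re
from collections import defaultdict

def profile_key(profile: dict) -> tuple:
    """Reduce a customer profile to a small comparable feature key."""
    name = f"{profile.get('first_name', '')} {profile.get('last_name', '')}".strip().lower()
    phone = re.sub(r"\D", "", profile.get("business_phone", ""))      # digits only
    postal = profile.get("postal_code", "").replace(" ", "").lower()  # normalized
    return (name, phone, postal)

def duplicate_clusters(profiles: list) -> dict:
    """Group look-alike profiles; clusters with more than one member are duplicates."""
    groups = defaultdict(list)
    for profile in profiles:
        groups[profile_key(profile)].append(profile)
    return {cid: members for cid, (_, members) in enumerate(groups.items())
            if len(members) > 1}

profiles = [
    {"first_name": "James", "last_name": "Smith", "email": "james@one.example",
     "business_phone": "(555) 010-2030", "postal_code": "SW1A 1AA"},
    {"first_name": "James", "last_name": "Smith", "email": "j.smith@two.example",
     "business_phone": "555-010-2030", "postal_code": "sw1a 1aa"},
    {"first_name": "Amira", "last_name": "Khan", "email": "amira@three.example",
     "business_phone": "555-777-8888", "postal_code": "EC1A 1BB"},
]
print(duplicate_clusters(profiles))
# The two James Smith records share a normalized phone number and postal code
# despite different emails, so they land in the same duplicate cluster.
```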
