Asset Bulk Ingestion

This session introduces the new bulk asset ingestion feature in AEM Assets as a Cloud Service and covers its scalability and performance.

Continue the conversation in Experience League Communities.

Transcript
Hi, everyone. Thank you for joining June and me today for this session, where we will talk to you about the new bulk import tool in AEM Assets as a Cloud Service, and how this tool can help with a very essential part of an AEM Assets customer journey, one that most often required a customization that took a long time to write and run. June and I both work in the AEM Assets as a Cloud Service engineering team, where he is a senior computer scientist and I am an engineering manager. For our agenda today, I'll cover the bulk import tool overview and its best practices, then hand it over to June to go over the architecture and a quick demo, and then we'll both come back for a quick Q&A if time permits.

Before we talk about the bulk import tool, I would like to take a minute and talk about Assets 30, an internal initiative at Adobe that is focused on reducing the time to value for Assets customers, while also ensuring longer-term adoption by providing contextual in-product guidance for Assets users. Asset ingestion happens to be one of the four pillars of the Assets 30 initiative. If you can't ingest your assets into AEM quickly, there's no way to reduce time to value. The bulk import tool lets you import assets into AEM, either for the initial migration or for special large migrations, for example a product launch. For day-to-day uploads to AEM, we still recommend using the desktop tool or the AEM Assets UI.

The tool can be accessed in the AEM UI by going to the Tools menu, then to Assets, where you should see the Bulk Import card. It's only available to administrators. Once you go to the bulk import tool, you are able to create an import configuration. You are presented with the form that you see on the right side. As you can see, currently we only support Amazon S3 or Azure Blob Storage as the source for your assets. There are a few other fields I would like to point out that control the behavior of the import job, like the minimum and maximum size, the MIME types to include or exclude, and the import mode, which gives you the option to either skip, overwrite, or create a version if the asset already exists in AEM. For more information, you can go to the link that points you to the public documentation for the tool.
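
To make the configuration fields concrete, here is a purely illustrative sketch of the values such an import configuration captures. The class and field names are assumptions made for this example only; this is not an AEM API and not the actual structure the tool stores.

```java
// Hypothetical value object mirroring the form fields described above.
// Field names are illustrative only, not the AEM bulk import configuration format.
public record BulkImportConfigSketch(
        String title,
        String sourceType,         // "azure" or "s3"
        String bucketOrContainer,  // customer-owned storage, not the Adobe-provided data store
        String accessKey,          // credential for the source storage
        String sourceFolder,       // optional prefix: only import assets under this folder
        long minSizeBytes,         // size filter, lower bound
        long maxSizeBytes,         // size filter, upper bound
        String includeMimeTypes,   // e.g. "image/*"
        String excludeMimeTypes,
        String importMode,         // "skip", "replace", or "createVersion"
        String targetFolder        // e.g. "/content/dam/imported"
) {}
```
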
Once you've created a configuration, you get the action buttons to either edit, delete, check, dry run, or run the import job. Edit, delete, and run are pretty self-explanatory. The check button lets you check for connection failures, in other words whether AEM is able to connect to your source storage location. The dry run gives you an estimate of the import job: things like how many assets will be imported, what the total size of the assets is, and an estimated time. The action buttons you see are contextual; not every button shows up every time. For example, the edit button will not show up if the job is running or has already been queued. The check button doesn't show up if the configuration is broken or the job is running. And finally, the view assets button will not show up unless the target folder has been created in AEM. Once you start the job, the action buttons change: now you're able to stop the job, look at the job status, or view the assets in AEM. The view assets button takes you to the target folder that you configured in the job configuration, where you can see which assets have already been migrated to AEM and the folder structure. The job status UI gives you a more detailed view of the job: for example, how much of the job is finished and how much time, or rather estimated time, is still remaining for the job to complete. If it fails, it will also give you failure information.

While the job is running, or after it has finished, you can see that the folder structure in AEM mirrors the folder structure in your source storage location. So if you have folders one, two, and three with assets four, five, and six in your storage location, the same structure will be available in AEM. You can still run the AEM-specific asset processing on the assets. For example, if you have a post-processing workflow configured to assign some metadata or do some business-specific processing after the Asset Compute Service is done, that will still happen. You might also have a processing profile, for example to generate watermark renditions, which again will also run once the assets are imported into AEM.

Here are some best practices for the tool. As mentioned, the tool is only available to administrators in AEM. We recommend you click the check button to verify there are no connection issues before you start a job. We also recommend using the dry run button to get a pretty good estimate of how long it will take. Use the import mode configuration to change the behavior: sometimes you want the assets to be replaced, sometimes you want them to be skipped; this setting is your friend. Don't worry about having to clean up the names of folders or assets, because the tool automatically escapes unsupported characters and makes them AEM friendly as it imports them into AEM. And we recommend using smaller batches for your asset import.

One of the value propositions of AEM as a Cloud Service is continuous innovation. With that theme, I would like to also point out a few things that are in the pipeline for the bulk import tool. We're going to add support for scheduled imports, either one-time or recurring. We'll also add a feature to delete assets in the source location, so once the assets are imported into AEM, we can delete them from your S3 or Azure Blob location. Like I mentioned earlier, currently we only support S3 and Azure; support for SFTP is also in the pipeline. And finally, we're also adding regular-expression-based filtering to the bulk import tool.

I'm sure everybody is now wondering how much time this tool takes. Let me tell you, this tool is pretty fast. In our tests, we've been able to import about 2,000 assets per hour. Our customers have used the tool to migrate about 23 terabytes of data into AEM Assets as a Cloud Service using Azure or S3. And we've done all of this with zero downtime, primarily because AEM Assets as a Cloud Service scales up and down automatically, and because the import job runs in the background and is offloaded, so it doesn't affect your user experience. June will now cover the architecture, which explains how we're able to do this with zero downtime. Over to you, June.
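
As a small aside on the import modes mentioned above, here is a purely illustrative sketch of how the three behaviors differ when an asset already exists at the target path. This is not AEM's implementation; the enum and method names are assumptions made for the example.

```java
// Illustrative only: the observable behavior of the three import modes.
enum ImportMode { SKIP, REPLACE, CREATE_VERSION }

final class ImportModeSketch {
    static String decide(ImportMode mode, boolean assetAlreadyInAem) {
        if (!assetAlreadyInAem) {
            return "import as a new asset";
        }
        switch (mode) {
            case SKIP:           return "leave the existing asset untouched"; // enables incremental imports
            case REPLACE:        return "overwrite the existing asset";
            case CREATE_VERSION: return "add a new version of the existing asset";
            default:             throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        // Rerunning a job in SKIP mode only picks up assets that were not imported before.
        System.out.println(decide(ImportMode.SKIP, true));
    }
}
```
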
Hi, everyone. I am June Zhang, a senior computer scientist working on AEM Assets. As the engineer implementing the bulk import, I'm going to introduce some technical aspects of the tool. Before diving into the architecture diagram, let me quickly go through the most important things for Assets in AEM as a Cloud Service. The first thing is direct binary access, backed by Jackrabbit Oak. With this, we don't actually store the real binary in AEM. Instead, we use Azure to store the real binary, and we only store the binary reference in the AEM data store. That also means that uploading and downloading binaries happens to and from Azure instead of AEM. Secondly, we use the Asset Compute Service for generating renditions and for metadata extraction. The Asset Compute Service is a serverless implementation running on Adobe I/O Runtime.

Back to the diagram of the bulk import. After the admin runs the job from the UI, it calls the cloud storage API to get a paged list of the assets from the input source, and we manage a transfer job pool with high concurrency. The transfer job uses the Azure Put Block request API to do the transfer from the input source to the Azure data store. This is the most important design decision for the bulk import: with the Put Block request API, AEM itself doesn't need to stream any binary. Then, after the transfer job is done, AEM delegates the asset processing job to the Asset Compute Service. With this, AEM itself doesn't handle any heavy jobs like transferring and processing, and it ends up being a pretty reliable and scalable implementation.

The design itself is not so complicated, but the bulk import is implemented with a lot of consideration for resilience. You may know that in the real world, ingestion runs against a real repository meet lots of challenges, like performance, reliability, notification support, and even some special character handling. All of these are properly resolved in this implementation. After this was delivered, one of our real customers successfully migrated dozens of terabytes of assets in just three days.
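
To make the transfer mechanism concrete, here is a minimal sketch of a server-side copy using the Azure Storage SDK's Put Block From URL operation, the kind of API described above: the orchestrating process only issues control calls, and the bytes move between source and target storage without being streamed through it. The connection string, container, blob names, and single-block approach are assumptions for the example; a real transfer of large binaries would be split into multiple staged blocks, and the source URL must be readable by the target (for example via a SAS token).

```java
import com.azure.storage.blob.BlobContainerClientBuilder;
import com.azure.storage.blob.specialized.BlockBlobClient;
import java.util.Base64;
import java.util.List;

// Sketch of a server-side copy: stage the source blob as a block on the target
// via Put Block From URL, then commit the block list. No binary passes through
// this process, which mirrors the "AEM does not stream the binary" design above.
public class ServerSideCopySketch {
    public static void main(String[] args) {
        String targetConnection = System.getenv("TARGET_STORAGE_CONNECTION"); // assumed env variable
        String sourceUrl = args[0]; // pre-signed (SAS) URL of the source asset

        BlockBlobClient target = new BlobContainerClientBuilder()
                .connectionString(targetConnection)
                .containerName("datastore")               // hypothetical target container
                .buildClient()
                .getBlobClient("imported/asset.jpg")      // hypothetical target blob key
                .getBlockBlobClient();

        String blockId = Base64.getEncoder().encodeToString("block-000001".getBytes());
        target.stageBlockFromUrl(blockId, sourceUrl, null); // null range copies the whole source
        target.commitBlockList(List.of(blockId));
    }
}
```
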
Next, let me do a quick demo. After the user accesses the bulk import page in the UI, we need to create the import config. We can give it a name. Currently, we support both input source types, AWS S3 and Azure Blob Storage. For the demo, I'm choosing Blob Storage. We give it the storage account name, the container name, and the access key for the credential. We can also specify a source folder to only import the assets under that folder in the input source. We can also filter the assets by size: we can specify a minimum and maximum size for the input source. We can also filter the assets by MIME type; we can include and exclude MIME types. For instance, if we just want to import some images, we can give it an include MIME type for images. The import mode indicates what we are going to do during the import when the assets have already been imported into AEM. Skip is the default mode; it means that if the assets already exist in the AEM target location, we will not import them again. This is quite useful if users want to do incremental imports. For instance, if we finish importing once from the input source, the users add some additional new assets in the input source, and we want to do the import again, skip mode will not re-import the ones that are already finished. So it's pretty useful for users who want to do incremental imports. We also give it the target folder in AEM. After creating the config, the first thing we suggest doing is to check the health of the config, to see whether there is any connection issue. Another thing we also suggest is doing the dry run. The dry run gives an estimation of how long the job will run and how many assets will be imported. The dry run is very useful for giving an estimation and giving confidence to users before they run the bulk import job.

And since this is really reliable, I'm not afraid to do a real live demo of the import. While it's importing, we can check the job status; we can see some progress over here. After a refresh, the import job is done. We can view the assets after importing. You see, every asset has been imported already, even the folders, so it keeps the folder structure the same as the input source. For instance, this is the input source: we are importing this folder, the new folder of assets, and the tool keeps the folder structure as is, the same as in the input source. Okay, that's my demo. Thank you. Are there any questions?

Hey Jun, I'll start sharing my screen again. Okay. I have a few questions in the chat, so let's go. I'll read off the questions and then we can answer them. Arnie's already answered a few, so let's start from the top. Let's see, where is it? So Raf is asking if there's an API available, or some way to schedule these without having to invoke them through the UI. Raf, we've talked about the scheduler as a feature that we're adding soon, but Jun, is there an API available today? Not yet. So we do think that once we add the feature where you can create either a recurring or a one-time schedule, the need for an API will be a little lower, but adding an API is also something we'll consider in the future. Yeah. Is it on our roadmap? Yeah.

What type of folder does it create? Does it create a Sling ordered folder or a Sling folder, Jun? Sling folder. Okay. There is no configuration available for that, Suman, because a Sling folder is obviously a more scalable node type, so we don't give you the configuration to create an ordered folder. It just creates the Sling folder out of the box. If you have a specific use case for an ordered folder, I recommend you get in touch with us and explain the use case, and we can figure that out.
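
Regarding the folder node type question above, here is a minimal sketch of creating a sling:Folder hierarchy through the JCR API, assuming a JCR Session is already available (for example adapted from a Sling ResourceResolver). The path and helper class are illustrative; this is not the tool's internal code.

```java
import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import org.apache.jackrabbit.commons.JcrUtils;

// Creates the folder (and any missing intermediate folders) typed as sling:Folder,
// the node type the bulk import tool is described as creating for target folders.
public class FolderSketch {
    public static Node ensureFolder(Session session, String absolutePath) throws RepositoryException {
        Node folder = JcrUtils.getOrCreateByPath(absolutePath, "sling:Folder", session);
        session.save();
        return folder;
    }
}
```
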
There's the question about the size of batches; I've already answered that one, so we'll move past it. So Suman, you're asking if this is available to download and use. The tool is actually available out of the box in cloud service. There's nothing for you to install; as long as you're on a fairly recent version of cloud service, you go to Tools, then Assets in the AEM UI, and you should see this tool. And if you don't see it, either way I think you should update to the latest version of cloud service, because we're constantly adding new features and doing bug fixes, but it should be there already.

Once the copy to the target blob storage completes, how does AEM get to know the blob storage location to update the reference? I'm not sure what this question means. Yeah, I think that's a good question; it gets into some technical details. Actually, direct binary access generates the reference and the location in the data store, and we are using that location to do the transfer. So I think that's the answer to the question. I mean, AEM manages all of this and knows everything.

Jun, do you want to answer Gunar's question as well? I don't know the answer to that one: the limiting factor of scaling. Oh, so for the current performance there is actually no real limit on AEM itself, because the processing and the transferring are handled outside. The main limiting factor is the Asset Compute Service, where the processing capacity is limited so that we don't impact any real customer. So for now, I think the current 20K per hour is a pretty good number for our customers.

Okay, Subhan is asking, does it work with a customer-owned AWS or Azure bucket, or only a provided one? So that's a good question; I should have clarified that, so thank you for calling it out. This actually is separate from what Adobe provides, right? You have to have your own AWS or Azure bucket, because there's no way for you to get access to the Adobe-provided one. So this is separate from what Adobe provides as the data store for AEM as a Cloud Service out of the box.

The next question is from Elmera: will the default metadata schema still apply? Yes. Everything on the AEM side, processing and metadata schemas, remains as is, right? So you will still see the schema. If you have any custom processing workflows that run after the Asset Compute Service finishes, they'll still do their thing. So everything from that perspective should still be the same.

Tag, is this available on sandboxes? Yeah, this should be available on your sandbox, again as long as you are on a fairly recent version. Again, I'll point out that you should update to the latest version, but yes, it should be available on sandboxes as well.

Jun, the question is, does the utility retry failed assets? Yeah, that's a good question. Actually, we have lots of retries during the transfer, so if the transfer hits any network issues, there is a retry to make the job succeed. So far we haven't seen such a case; we have enough retries and have never seen failed assets. So yeah, there are retries.

Subban, it is not available for non-cloud-service customers; this feature is specifically only available for cloud service. And it depends on whether they are using direct binary access, so I think the answer is yes; for now, only the cloud service is using the direct binary.

And then, what about the service, is it offloaded for performance? I mean, we use the Asset Compute Service and direct binary upload and download, right? So Jun, do you want to explain more about that? Is the bulk import service offloaded for performance? That's the question. Yeah, that's just as I introduced: the Azure Put Block API. That API, which Azure provides, allows us to transfer the binary from the input source to the target data store, so AEM itself does not stream any real binary. I mean, AEM just doesn't handle any heavy jobs; that's why we get pretty high performance.

Subban is asking, will the Asset Compute Service cause memory issues if it's not disabled during upload? It will not. The Asset Compute Service actually doesn't even run within the AEM JVM, right? It's offloaded to I/O Runtime, which is Adobe's serverless offering. So this has nothing to do with the memory that your AEM is using. Okay.

Yes, Prateek. Again, this is an S3 or an Azure Blob location that the customer owns, so you will have to provide the secret keys or access keys when you configure the job. If I go back in my presentation and you look at this, you can see there is an Azure Blob container and an Azure access key. Obviously, this changes depending on whether you select Azure or Amazon S3, but you do have to provide access information to the tool. Okay.

Jun, does it work with the Simple Queue Service to pull objects from the buckets? No, so far no, we don't have such support. Yeah.

Paul, there is no automatic name change function. What it does is basically escape the unsupported characters and make sure that the node names are JCR compatible. There's no rename functionality. Yeah, for clarification, essentially we are using two parts: one part is the jcr:title, which is the property where we actually store whatever the original name is, while the JCR node name needs to be sanitized to meet AEM compatibility. Okay.
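
Following up on the naming discussion above, here is a small sketch of the two-part approach: keep the original name in jcr:title and derive a JCR-friendly node name. It uses the AEM JcrUtil helper as one way to produce a valid name; the input value is made up, and the exact replacement characters depend on the utility, so treat this as illustrative rather than the tool's actual logic.

```java
import com.day.cq.commons.jcr.JcrUtil;

// The original name is preserved in jcr:title; the node name is sanitized so it is
// JCR/AEM compatible (illegal characters and spaces get escaped or replaced).
public class NameSketch {
    public static void main(String[] args) {
        String originalName = "Summer Campaign (v2).jpg";         // hypothetical source asset name
        String nodeName = JcrUtil.createValidName(originalName);  // sanitized, JCR-safe name
        System.out.println("jcr:title = " + originalName);
        System.out.println("node name = " + nodeName);
    }
}
```
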
There’s no rename functionality. Yeah. For clarification, essentially we are using TruePass. One part is the JCR title. The JCR title is the property where actually store whatever the original name it is. But the JCR name needs to sanitize the info, meet the AEM compatibility. Okay. Jun, another question is when an asset fails, will it just be skipped and try the next asset or will it fail the whole process? It will skip. Okay. It’ll skip and try the next asset. Yeah. And it will also report in the final. Okay. Reindicate which asset is fair. Okay. So, Tarun and Paul, about Google Storage and Dropbox or Box. Like I said, currently our roadmap is only to support SFTPs in addition to Azure and Amazon S3. But Arnie is on the call and will take a note for adding support for Google and Dropbox or Box-like solutions. And we’ll discuss it internally and hopefully be able to add it in the future. So Paul is asking, what if there are spaces in the name? What happens? Jun, I believe the spaces are replaced by dashes. Sorry, I don’t remember. But it’s using the JCR utility, the API. That’s API which escapes the illegal characters. Which basically, I think it replaces space with a dash. Is there a plan to make this available for AMS customers? Arnie, if you’re still on the line, I don’t know if you can talk, but do you mind answering that question in the chat? Plan to make this available for AMS customers? Angelo is asking, the report will show failed assets. And is there a way to import just the failed assets? Go ahead, Jun. Yeah, that’s a good question. Actually, that’s why I introduced escape mode. Escape mode means that you will not import the finished one. So you just rerun the job and you are only importing the failed ones. So yeah, it’s just the work asset. Paul, thank you for saying this is great. We appreciate it. We are happy that you guys like it. Satish, the metadata mapping, again, is something that is on our roadmap. Currently, it’s not available, but it will be soon available where you can actually upload a CSV file along with your import job. And whatever you define in the CSV file will get assigned to the asset. Today, it’s a two-step process because the metadata import feature exists in AM already. So you use the bulk import tool, you import all the assets, and then you create your metadata CSV file and then use the metadata import feature in the product to assign metadata. We’ll make it a one-step, seamless process in the future. Okay, we have one minute left in the session. I don’t see any other questions. Oh, one more. Does the migration report show AEM paths for all assets? No, no. The answer is no because there are too many assets. But actually, we have some debug interface to get the list, but it’s not a public document. But if you really care about that, maybe we can talk to you later to show you how to get the list. Okay, one final question. Paul, yes. So if you use skip in the import mode dropdown, it will basically skip all the existing assets and only import what the delta is. Paul, that’s what the skip does for the import mode. All right. We thank you guys for joining the session. We’re at time. As always, you can reach out to me or June LinkedIn. We are always available. Please send us questions, and we’ll do our best to answer. Thank you, everybody. That’s your final question I’d like to answer. The time is going to be integrated with the work funnel and approval. Actually, after the import is done, it’s actually just doing the post process. 
All right. Thank you, everybody. Hope you enjoyed the conference. Bye. Okay. Bye.

Click here for the session slides.
