Moving Adobe Experience Manager to the Cloud - Challenges, Stories, Solutions

A panel with Ian Boston, Tomek Rekawek, and Carlos Sanchez, on how we successfully moved Adobe Experience Manager to the Cloud.

Continue the conversation in Experience League Communities.

Transcript

Hello, Sherry tells me we are live. So welcome to this panel, which is called Moving AEM to the Cloud: Challenges, Stories and Solutions. The idea is to give some perspective. Jean-Michel was telling you about, I would say, the magic we've been doing to move AEM to the cloud, and in this panel I'm very happy to have some bright colleagues who can tell us a bit more about that. I'm sure we'll have interesting stories. We will be taking audience questions. Tim Marais is watching the stage chat; that's where you can ask your questions. Thank you, Tim. But we'll start with a few questions to my colleagues.

So first, introductions. Can you briefly introduce yourself and why you think I invited you to this panel? Let's start with Carlos.

Thanks, Bertrand. Yeah, my name is Carlos Sanchez. I joined Adobe Cloud Manager, oh, sorry, Adobe AEM just one year and a half ago, and I've been working on the cloud service since then. Before that, I was doing a lot of open source work related to Apache Maven, Jenkins, and others. Thank you. And you're dialing in from A Coruña, right? Yes, from the north of Spain. Yeah, we have quite an international panel today. That's fantastic.

So Ian, what about you? Hi, Bertrand. I guess you asked me to do this because you've known me for such a long time, probably since 2008 or maybe even earlier than that, well before I joined Adobe. I guess I'm here because I've been with Adobe for eight years or more, maybe nine even now, and I've been deeply involved in getting AEM as a cloud service into the cloud, along with the team, which has been fantastic over the past three years getting this working, and the planning before that. And I know Tomek has been working on it for longer than that. And where are you dialing in from? Oh, Cambridge. Cambridge, UK. Great. Thank you, Ian.

And Tomek. Hi, Bertrand. Hi all. So my name is Tomek Rekawek. I work remotely from Poland for the Basel office. I am here probably because I've been working on a few elements of the AEM cloud service: the composite node store, which makes it possible to separate the application from content in the AEM repository. This led to the creation of the AEM Docker image and putting our CMS into a container. This container needs to be run somewhere, so I've been working on the Kubernetes application, in the form of a Helm chart, that runs AEM. And in the meantime, I was part of the team creating the publish cloud persistence based on Azure Blob Storage. Some pretty cool stuff, right Tomek? And if I remember correctly, you're a fan of computer games, right? I remember visiting this museum in Berlin with you and you knew about all these things. Yeah. Especially the classic ones. Fantastic.

So the first question. AEM derives from the CQ5 product, which was not designed to run as a cloud service. I know, I was there when we created it. I'll direct each question to one of you, but the others should feel free to jump in whenever you feel the need. I think panels where everybody agrees are boring, so we'll try to disagree nicely a bit to make it more exciting. This question is for Ian: could you describe some of the key things that needed to be adapted or rewritten to morph CQ5 into the AEM cloud service? Yeah, sure. I think probably the biggest thing that we had to change, or had to think about, was that we wanted to make this system capable of being always up, being fault tolerant, and being able to self-heal across all of the instances. So really the first thing we had to think about was: how do we make AEM disposable?
How do we eliminate all state from inside the cluster, so that we can create and delete instances at will, with no downtime and no impact to production? I think that was probably the biggest hill to climb. And Tomek was heavily involved in making that work, getting the state out there. We are now at a stage, and have been for probably two years, where we can randomly run around all of the clusters deleting stuff, and everything springs back up again before you even notice, provided we don't delete entire clusters, which we don't.

Right. And how long did it take? I remember the first time I heard of this internal project called Skyline. I was not directly involved in it; I had no clue what it was. So how long did it take? About two years until we got the first websites actually running on that? I think we did it faster than that. I mean, it depends when you think it started and when the name came. I'd like to find whoever created that name. It wasn't me. I suddenly saw it pop up and I thought, oh, that's quite a good name. So the name appeared, and I think probably within six months to a year we were going live with our first customers, and then we've been slowly ramping up since then. What's interesting is we made plans a couple of years ago, quite loose plans, and when we look back at those plans, actually we haven't changed that much from what we decided to do then. The principles that we put in place then are still there. And I think the reason is that this planning goes back a lot further than this. If you look at the build systems, maybe six or seven years ago we started to have a build system that could do a release daily, even though we did them yearly.
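To make the composite node store idea from Tomek's introduction concrete, here is a minimal sketch of the routing it implies: reads under the immutable application paths are served from a read-only store baked into the image, while everything else goes to the instance's own content store. All class and method names here are invented for illustration; Oak's real composite node store is considerably more involved.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy illustration of a composite node store: the application area
 * (/apps, /libs) is served from a read-only store baked into the image,
 * while mutable content lives in the instance's own store.
 */
public class CompositeStoreSketch {

    interface NodeStore {
        String read(String path);
        void write(String path, String value);
    }

    /** Read-only store standing in for the immutable application image. */
    static final class AppStore implements NodeStore {
        private final Map<String, String> nodes;
        AppStore(Map<String, String> nodes) { this.nodes = Map.copyOf(nodes); }
        public String read(String path) { return nodes.get(path); }
        public void write(String path, String value) {
            throw new UnsupportedOperationException("application area is immutable");
        }
    }

    /** Mutable store holding the instance's own content. */
    static final class ContentStore implements NodeStore {
        private final Map<String, String> nodes = new ConcurrentHashMap<>();
        public String read(String path) { return nodes.get(path); }
        public void write(String path, String value) { nodes.put(path, value); }
    }

    private final NodeStore app;
    private final NodeStore content;

    CompositeStoreSketch(NodeStore app, NodeStore content) {
        this.app = app;
        this.content = content;
    }

    private boolean isAppPath(String path) {
        return path.startsWith("/apps") || path.startsWith("/libs");
    }

    public String read(String path) {
        return isAppPath(path) ? app.read(path) : content.read(path);
    }

    public void write(String path, String value) {
        if (isAppPath(path)) {
            throw new UnsupportedOperationException("deploy a new image to change " + path);
        }
        content.write(path, value);
    }

    public static void main(String[] args) {
        CompositeStoreSketch repo = new CompositeStoreSketch(
                new AppStore(Map.of("/apps/site/component", "v42")),
                new ContentStore());
        repo.write("/content/site/page", "Hello");
        System.out.println(repo.read("/apps/site/component")); // v42, from the image
        System.out.println(repo.read("/content/site/page"));   // Hello, instance-local
    }
}
```

Splitting the repository this way is what makes instances disposable: the application part ships with the image, so only content needs to live outside the cluster.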

Yeah, those were big changes. Tomek, I think you were involved in speeding up the startup of AEM, right? Because if those containers take too much time to start, that's problematic.

Yeah, I think what you are referring to is the publish startup; we found a trick at some point to make it faster. To give some background: in the cloud service, the publish instances don't share a common repository the way the author does, but each one has its own store. And when the author replicates a page from the author repository to publish, all the publish instances receive this updated page or asset or whatever, and apply it locally in their local store. Now, the tricky part is how we start a new publish. It needs to get an initial version of the repository from somewhere. We came up with this concept of a golden copy of the repository that can be used to start new publish instances, and this golden copy is maintained by, well, a golden publish. Another important feature I need to mention is that this publish persistence engine is an append-only store: every change is appended, in the form of segments, at the end of the repository, and we have a journal pointing to the most recent head record.

So, coming back to your question about speeding up: in the early version of the cloud service, a starting publish copied the whole golden repository into its own area. That was expensive, time-consuming, and maybe not that useful, because most of the data was duplicated between the many copies. However, knowing the append-only nature of the repository allowed us to make the process faster and cheaper. We used this feature in the following way: when a new publish starts, it no longer copies the content from the golden repository; it only references its current head, and then appends its own changes locally. So we can say it creates its own local branch based on the golden copy head, and this is much more efficient than copying everything. I guess to summarize: even if we deploy things in the cloud, it's good to know how they work internally, because this knowledge may lead to a better implementation at a completely different level.

Right. And from your point of view as a gamer, does it feel like solving a game when you're chasing these tricks? I mean tricks in a noble sense, but does it feel like solving a game sometimes? Sometimes. The difference, I guess, is that for a well-designed game you know that there is some way to solve it in an elegant, expected way. In this environment, you don't have this certainty. But, yeah, it certainly feels more satisfying to solve a thing like that than to finish a game. Right. Thank you.

There are also other benefits of this structure. I mean, with the golden master and the way it cycles every 24 hours, the traditional compaction tasks that we all used to have to do when running a publisher on a VM, they've disappeared, haven't they? And you get a fresh repository, fully compacted, fully optimized, every 24 hours, with no waste in it. Is that right? Yeah. In the cloud service, we were able to extract these maintenance tasks, like compaction or garbage collection in any form, out of the AEM instance, which can then do what it does best, which is rendering content, and into external Kubernetes jobs, external services. And it's all possible because, first, we know the structure of the repository and we are using this structure, and second, because the state is no longer kept next to AEM on the same file system, but somewhere in the cloud. In this case, it's Azure Blob Storage. Right. Yeah.
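A minimal sketch of the append-only, golden-copy design Tomek describes, under assumed names (the real persistence is AEM's segment store on Azure Blob Storage, and is far richer than this): the golden publish appends segments and advances a journal head, and a new publish branches from that head instead of copying anything.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the append-only publish persistence: the golden publish
 * appends segments to a shared store and advances a journal head; a new
 * publish branches from that head, keeping only its own local segments.
 */
public class GoldenCopySketch {

    /** Shared, append-only segment store (stands in for Azure Blob Storage). */
    static final class SharedStore {
        private final List<String> segments = new ArrayList<>();
        private int journalHead = -1; // index of the most recent golden segment

        synchronized int append(String segment) {
            segments.add(segment);
            journalHead = segments.size() - 1;
            return journalHead;
        }

        synchronized int head() { return journalHead; }

        /** Segments are fetched one at a time, never copied in bulk. */
        synchronized String fetch(int index) { return segments.get(index); }
    }

    /** A publish instance branched off the golden head. */
    static final class Publish {
        private final SharedStore shared;
        private final int baseHead;                            // golden head at startup
        private final List<String> local = new ArrayList<>();  // own changes only

        Publish(SharedStore shared) {
            this.shared = shared;
            this.baseHead = shared.head(); // just a reference, like branching in Git
        }

        void applyLocalChange(String segment) { local.add(segment); }

        String read(int index) {
            // Anything written after our base comes from the local branch.
            if (index > baseHead) {
                return local.get(index - baseHead - 1);
            }
            return shared.fetch(index); // lazy, on-demand download
        }
    }

    public static void main(String[] args) {
        SharedStore store = new SharedStore();
        store.append("segment-0: initial content");
        store.append("segment-1: replicated page");

        Publish p = new Publish(store);  // instant: no bulk copy at startup
        p.applyLocalChange("segment-2: local change");

        System.out.println(p.read(1)); // fetched on demand from the shared store
        System.out.println(p.read(2)); // served from the local branch
    }
}
```

Because a branch is only a head reference plus lazily fetched segments, startup cost is independent of repository size, which is also the lazy loading mechanism discussed a bit later in the panel.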
Switching to Carlos's background. So Carlos, you joined from CloudBees, which is kind of a pure cloud player, I would say. Did you ever feel depressed looking at our stuff? You know, we have this old, single-tenant, on-premises stuff that needs to be moved to the cloud. Did that ever make you feel depressed? And are you happy with what we achieved? You got me there. Yeah. No, seriously. Before I joined CloudBees, I started playing with Kubernetes. Before that, I was doing a lot of Docker. So this was like 2014, very early in the Kubernetes world. And I said, you know, what can I run on Kubernetes? I was like, okay, let's run Jenkins agents, right? To scale. We were doing Docker builds before, so Kubernetes sounded like a perfect use case for running Jenkins agents. So I started the open source Kubernetes plugin for Jenkins, and then I joined CloudBees after that. And I was also working there on taking Jenkins and running it in the cloud, in a Kubernetes environment. The same sort of challenges that everybody trying to move to a cloud native architecture has today, people that are not starting from a new project where you can choose whatever you want. Everything before that is kind of monoliths everywhere, or some sort of monolith, right? And then it was taking Jenkins, running on Kubernetes, and all the same challenges that we see today with AEM. Good things with Jenkins, good things with AEM. I mean, it was designed to scale horizontally, to increase the number of replicas; that all existed before. Now it's the next step, moving into the cloud native world. And I think it's a very common problem that a lot of people have. So you've kind of been through this journey twice now. Yeah, sort of. It's good. You get better every time, I guess. That's usually what happens. And you started already pretty good, so that's not a problem.

Tomek, there's a question for you from the audience. You presented some of the stuff you explained before at adaptTo() in 2019. Have there been any major changes since then? Or is it still mostly the same stuff that's powering this startup and the repository copies? I don't recall introducing any major changes in the architecture. It's more like we are building on top of it. We were adding new features, like automation, like migration, like the developer console, and we are still doing it. But I guess the base, the bottom, is pretty much the same. Right, so no big changes. It's been an evolution, but it was already kind of the right solution, right? I think that the state that was presented at that adaptTo() conference was already the effect of evolution, of years of evolution before that. Ian, do you recall any changes since then? I don't think we've had many changes since 2019. We have optimized the way some areas of it work, based upon all the evidence we've been gathering over the years, actually running the system first for the engineering teams and then for customers. But I don't think the architecture has particularly changed.

We have another question from the audience: the new publish instances, do they copy everything from this golden master, or is there some lazy loading mechanism? So they don't copy anything, really. The golden instance keeps its repository in the cloud, in Azure Blob Storage, and it's the only instance that is able to write, to append to it. And the normal publishers connect to the same Azure Blob Storage at a given revision, ignoring the future changes. So it's really like creating a branch on top of a commit in Git: you don't need to copy anything when you create a new branch in Git, it's just a reference, and then you write only the changes. So these publishers write their own changes locally, but they don't need to really copy anything. They download the segments on demand. So there is the lazy load mechanism that the question mentions. Right. Yeah, it's a great strategy. You know, the fastest operation is always the one you don't need to do, so that's certainly a good principle.

Can someone, or all of you, describe one thing that was particularly hard in this move, and how you overcame that hurdle? Do you remember something where you thought, oh no, we're not going to get through that? I can recall migration, implementing the migration tooling, because it needed to connect two worlds. On one hand, the old world of on-premises instances, where we can expect any kind of network, persistence and infrastructure configuration; it's very heterogeneous. On the other hand, we have the sterile cloud world, where everything is uniform and requires uniformity. And content migration is about copying the content from the first world to the other. The way we moved forward with this was by creating some middle ground; in this case, it's Azure Blob Storage as well. In the first step, it's up to the old on-prem instance to extract the content and upload it to Azure Blob Storage in a form that can then be read by the cloud service. Then, in the next step, the cloud service has the content in a uniform form, so it can just ingest it as it expects. It doesn't need to know about all the strange configurations of the source instance. So for me, that was probably the most difficult task.
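A sketch of the two-step shape Tomek describes, with invented types and names: the heterogeneous on-prem instance is the one that knows its own quirks, so it normalizes and uploads its content into a blob store acting as the middle ground, and the cloud side ingests only that uniform format, never the source's own layout.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of the migration middle ground: step 1 exports on-prem
 * content into one uniform key scheme in a shared blob store; step 2
 * ingests that scheme and nothing else. All names are illustrative.
 */
public class MigrationSketch {

    /** Stands in for the Azure Blob Storage container used as middle ground. */
    static final class BlobStore {
        private final Map<String, byte[]> blobs = new HashMap<>();
        void put(String key, byte[] data) { blobs.put(key, data); }
        Map<String, byte[]> list() { return Map.copyOf(blobs); }
    }

    /** Step 1: the on-prem instance normalizes its own heterogeneous layout. */
    static void extractAndUpload(Path onPremRepo, BlobStore store) throws Exception {
        try (var paths = Files.walk(onPremRepo)) {
            paths.filter(Files::isRegularFile).forEach(file -> {
                try {
                    // Whatever the source layout, upload under one uniform scheme.
                    String key = "content/" + onPremRepo.relativize(file);
                    store.put(key, Files.readAllBytes(file));
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
    }

    /** Step 2: the cloud side understands exactly one layout, the uniform one. */
    static void ingestAll(BlobStore store) {
        store.list().forEach((key, data) ->
                System.out.println("ingesting " + key + " (" + data.length + " bytes)"));
    }

    public static void main(String[] args) throws Exception {
        Path repo = Files.createTempDirectory("on-prem-repo");
        Files.writeString(repo.resolve("page.html"), "<html>hello</html>");
        BlobStore middleGround = new BlobStore();
        extractAndUpload(repo, middleGround);  // runs against the old world
        ingestAll(middleGround);               // runs in the cloud service
    }
}
```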
For me, I think it was the automation: doing vast amounts of things that we previously did manually, and trying to standardize all of them. Trying to make it so that, well, achieving it so that when you enter Cloud Manager and you click that button, almost 100% of the time, not totally 100% of the time, you get a running instance of the latest version with everything that you need there. And we're still adding stuff in that area. If you want to be using Launch, or you want to be using Analytics, we're adding things in that area. New stuff that's coming out of Sites, a lot of it still needs automating. And we're trying to make that sequence of events faster and faster, so that those using the system can focus more on actually creating the fantastic experience, rather than dealing with all the things that they might previously have had to do manually. And there's a lot of benefit on the side of that: it means that the team managing our infrastructure, involved in updating the OSs, keeps the OS images generally only a few days behind releases. And we take a very early kernel to make sure that we're only a very small number of days behind any patches that come out.

And that's a change for a lot of the engineers working at the core of AEM. They're now also doing production; they're also managing production instances. And the big benefit is they're getting all the data from these production instances to feed back into the engineering process, so they can see exactly what's going on. I mean, the volume of data we're collecting now, compared to what we could see several years ago, means those cycles are much more driven by evidence, and much faster. Right. Yeah, so there are big challenges, but also big benefits for our customers, because we get much better usage data that we can analyze and optimize for. I think that can make a big difference, right? Yeah.

How about you, Carlos? Do you remember any big hurdles that you thought we wouldn't get around? Or get over, I suppose. What about hibernation, Carlos? That was a long journey. Yeah, that actually was simpler than we expected. I remember I was working in a startup doing DevOps automation, DevOps pipelines, like 10 years ago, 2010. And one of the things we were telling customers, and the CEO was telling customers, was: now you're onboarding all these people in the cloud, which was the new thing back then, and the good thing is you can spin up a new VM at any time. But how are you going to prevent or control people doing that and then forgetting to close them, to kill them, to stop them, right? And now you have the same thing on Kubernetes, with containers and environments that people create. And I think to some point it's wrong to ask people to stop what they're doing or to remember to stop something. So we ended up implementing a hibernation solution where, if something doesn't get a request for a period of hours, we just scale it down. And the next time users go in, they get a dialog with the opportunity to spin it back up again. This is great for the development environments that we use day to day, and things like that, because it's really hard for people to just remember all these things. The more you can automate, the better.

Yeah, and that goes back to being able to dispose of the instances, bring them down, reduce the resource usage. And it sounds like we're doing that selfishly, but actually we're not. We're doing it so that we can allow as many people as possible to get free instances, or nearly free instances, and as many as they like. And it works really rather well, in the sense that people can manage their instances, and if they forget about one, it doesn't matter. They can decide not to delete stuff and to put it into cold storage for months on end, or just overnight while they're sleeping. Or maybe not while they're sleeping: if someone works a longer day than everyone else, their instance will probably be up and running the following morning as well.
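A toy sketch of the idle-detection logic Carlos describes, with invented names and thresholds: environments that serve no requests for a configured window are scaled down, and the next visit is the cue to offer waking them back up. The real implementation lives in the platform, not in user code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy hibernation controller: environments with no requests inside the
 * idle window are hibernated by a periodic sweep; a later request is
 * the signal to offer spinning them back up.
 */
public class HibernationSketch {

    enum State { RUNNING, HIBERNATED }

    static final class Environment {
        final String name;
        volatile State state = State.RUNNING;
        volatile Instant lastRequest = Instant.now();
        Environment(String name) { this.name = name; }
    }

    private final Map<String, Environment> environments = new ConcurrentHashMap<>();
    private final Duration idleWindow;

    HibernationSketch(Duration idleWindow) { this.idleWindow = idleWindow; }

    void register(Environment env) { environments.put(env.name, env); }

    /** Called on every incoming request for an environment. */
    void handleRequest(String name) {
        Environment env = environments.get(name);
        if (env.state == State.HIBERNATED) {
            // In the real service the user sees a dialog and chooses to wake it.
            System.out.println(name + " is hibernated; offering to spin it back up");
            env.state = State.RUNNING;
        }
        env.lastRequest = Instant.now();
    }

    /** Periodic sweep, e.g. run by a scheduler every few minutes. */
    void sweep() {
        Instant cutoff = Instant.now().minus(idleWindow);
        for (Environment env : environments.values()) {
            if (env.state == State.RUNNING && env.lastRequest.isBefore(cutoff)) {
                env.state = State.HIBERNATED; // scale the deployment down to zero
                System.out.println("hibernating idle environment " + env.name);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HibernationSketch controller = new HibernationSketch(Duration.ofMillis(100));
        controller.register(new Environment("dev-sandbox"));
        Thread.sleep(200);                        // no requests during the idle window
        controller.sweep();                       // hibernates dev-sandbox
        controller.handleRequest("dev-sandbox");  // offers to spin it back up
    }
}
```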
And when that happens, how long does de-hibernation take if I come back and want to use my instance again? De-hibernation is scaling back up again, so it's an instance startup. It does depend on the size of the various repositories, and it happens to all the components surrounding the AEM instance, not just the author cluster and publish farm; it happens to everything else. Now, if you've got a huge instance, yeah, it takes a little longer to come up. But, you know, we've only applied this to sandboxes. Those are the low-cost development instances that are sized to get you up and running, comparable to running the Quick Start. You know, they're designed for that: get up and running quickly and get going. So those ones go up and down, and I think it's three or four minutes, maybe five max, depending on the load. Yeah. So if it's a developer waiting on that after being inactive for some time, that's not a problem, right? Yeah. We'd like to get it down to a minute or something like that, and we're working on that, but it's going to take a little time. Yeah, no, it's already quite workable, I think.

One question from the audience: does AEM as a cloud service support only the Mongo microkernel, or TAR? What's the story between the Mongo and TAR microkernels here? So the author tier in AEM as a cloud service uses MongoDB, with a Mongo cluster. And for the publish, it's actually TarMK, but with a different plugin: it doesn't store the segments locally in the TAR files, but in Azure storage. I think this is part of the standardization that Ian mentioned; we cannot really support any other persistence, because the cloud gives you a lot if you take care of making the instances as similar to each other as possible. So internally the author uses Mongo and publish uses TarMK, but it's an implementation detail of the service that is not really visible to the customer. On the other hand, customers don't need to worry about tuning the various options and parameters of either Mongo or TarMK, because we are doing it for them. We are monitoring all the required parameters. We are taking care of auto-scaling the MongoDB cluster, so if you have a lot of traffic or a lot of data, the MongoDB cluster will be scaled automatically. And we are taking care of maintenance of the TarMK. So I think it's a great advantage of the cloud service: the persistence that used to be a big problem for customers to tune, now Adobe takes care of it. Yeah. Thank you, Tomek.

That allows us to be up all the time as well. Having that author there, with a proper cluster behind it, means that we can offload all the maintenance operations that were previously consuming request compute in the author. And it means that we can dispose of instances, scale that author up and down horizontally on demand, and do all of those sorts of things. So there are quite significant benefits to doing that. Coupled with that, we've actually got a lot of experience in running the author environment. So, you know, we know how to run it, we've got all the data on it, and we can make it work. All right. So we're basically taking that burden off the customers' shoulders, right, by doing it ourselves. Yeah. Over the years that's been significant. I've been out with quite a number of customers running a Mongo backend, trying to help them make it work better. Usually successful, but not always. So it's not trivial to run it. Right. Yeah. Thank you.
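A toy illustration of the threshold-style automation Tomek mentions for the MongoDB cluster. The metric names, limits and scaling call are all assumptions; the real service monitors far more parameters than this.

```java
/**
 * Toy autoscaling check: scale the database cluster up when traffic or
 * data size crosses a threshold. Run periodically from a scheduler.
 */
public class AutoscaleSketch {

    interface ClusterApi {
        double storageUtilization();   // 0.0 - 1.0
        double opsPerSecond();
        int currentTierSize();
        void scaleTo(int tierSize);
    }

    private static final double STORAGE_LIMIT = 0.80;
    private static final double OPS_LIMIT = 5_000;

    static void check(ClusterApi cluster) {
        boolean storagePressure = cluster.storageUtilization() > STORAGE_LIMIT;
        boolean trafficPressure = cluster.opsPerSecond() > OPS_LIMIT;
        if (storagePressure || trafficPressure) {
            int next = cluster.currentTierSize() + 1;
            System.out.println("scaling cluster up to tier " + next);
            cluster.scaleTo(next);
        }
    }

    public static void main(String[] args) {
        // Fake cluster under storage pressure, for demonstration only.
        check(new ClusterApi() {
            int size = 1;
            public double storageUtilization() { return 0.85; }
            public double opsPerSecond() { return 1_200; }
            public int currentTierSize() { return size; }
            public void scaleTo(int tierSize) { size = tierSize; }
        });
    }
}
```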
So, Tomek was mentioning that on top of moving that stuff to the cloud, we are supporting older instances, older systems, by allowing them to migrate to AEM as a cloud service. As a customer, what challenges should I expect if I want to move? Say I have an AEM setup from six years ago and I'm moving to the cloud service: what are the typical challenges? Well, Tomek, should I start on the well-trodden route? And then I'm going to hand over to you, because you've got a lot of experience in moving AEM customers into the cloud service. Yeah, please do. Okay. So the first step is to get your code into a state where you can get it into Cloud Manager. And if you happen to be a Managed Services customer, you're probably already in Cloud Manager; that allows you to benefit from all the tests and all the checks that are there, to be prepared for the move. And then, as Tomek was saying, there's a set of migration tooling that brings the content across and migrates you. Tomek? Yeah. So I think first is the code, as Ian has mentioned. The second is the content. We are providing the tooling; the whole operation is controlled by a nice UI, installed on the on-prem instance. And it's now able to take care of really large data sets: we've successfully migrated instances with tens of terabytes of assets and hundreds of gigabytes of nodes. Well, it depends on the instance. Apart from these two tasks, which can really be performed and done and finished, another challenge is to change perspective, from a state in which the customer or the partner is responsible for setting up CI/CD, for configuring all the integrations and maybe, as we mentioned before, tweaking the persistence layers and so on. In the cloud, you don't need to do all these things, but you also cannot do them. So there are two sides: the cloud service is much easier and cheaper, but getting to the state where these things are taken care of might require some change in the approach. And if I was in Managed Services, as you were saying, Ian, that would probably be easier, because I would already conform to some conventions. Yeah. And I mean, you heard Jean-Michel talking about release validation, the release orchestrator. As soon as your code is inside Cloud Manager, it'll be tested against the current release, and we'll be tracking information about how ready that code is. And we'll be really keen to help our customers improve their code, make it more suitable for Cloud Manager, make it scale better. There are all sorts of things that you don't really want to be doing in a cloud native environment, and we've got some fantastic ways of getting binary data out there without streaming it through expensive or slow connections, and a lot of advice around making that work. And actually, we really enjoy working with customers to make that happen. Right. Yeah. Thank you.

We have another audience question; I think this one is maybe for Carlos, you can start on this one, about the integrations. If an existing system has integrations with external systems (I know there's some work in progress that we might not be able to talk fully about), can you tell us what I can expect for these integrations? Yeah, so, I mean, you can connect outside through anything that is encrypted: HTTPS, and you can use certificates. So you can connect to anything that is already secure. We have added some functionality for some systems that may require something like a static IP address. So that's a possibility, because you have to think, going back to the previous question, right, what are my challenges moving to the cloud service? It's a very common need for people that are moving to cloud native, figuring out how to adapt to this new way of doing things.
So you have to be aware that your instances get restarted continuously, for upgrades among other things. That makes this very useful, and it's a lot of help in case there's downtime for any unplanned reason. But it's a challenge for customers, for people that are used to the old ways: you cannot tie an instance to a specific IP address, you cannot tie it to a specific host, and we even have multiple load balancers to distribute the load. So chances are you are not going to get the same IP every time you send a request out. To help customers with that, we set up some extra infrastructure, dedicated to those customers, that allows them to have a specific IP that is not shared by anybody else. And I think that goes back to one of the key aims we had when we took AEM into the cloud: we didn't want to do a brand new greenfield implementation with no history, no shared anything. Customers have integrations; you know, that is where most of their content is coming from. So we moved this monolith into the cloud and tried to keep as much functionality as possible, to make it possible to do all those integrations, to connect all of those systems that are out there in the back end. And that's our challenge: to make this environment totally secure, but still allow customers to connect to the services they need to connect to, and to provide it in a way that they can still run cloud native, but it feels like it's really dedicated to them. This environment is their environment, and they're connecting out. So, I mean, that mechanism to get out of the cluster, as if it's your own private tunnel to your back end, that's all part of it, isn't it? Yeah. We're adding more functionality in that sense, helping people that are running on-prem or on AMS onboard into the cloud service with as little disruption as possible.

Yeah. And one thing maybe to mention, more at the application level: we have the Adobe I/O Runtime service, which is our serverless service for Adobe integrations. And we are really going in the direction of enabling that more and more; when you need some glue code, you can write a few I/O Runtime functions, actions they are called, and that will help for that. So between this and I/O Events, I think this also enables different, more cloud native styles of integration as we go. Yeah. I/O Runtime integrations are particularly useful for scaling out. I mean, if you've got a thousand and one jobs you want to do, and you suddenly want to take a load that would previously have taken three weeks and get through it in three days, with I/O Runtime you can ramp it up for a very short period of time and then bring it back down again, just to get that throughput. So it's fantastic for that. Yeah, that's going to be very useful for bursty operations like imports and such.

We have less than three minutes left, and if it's like yesterday, Hopin will cut us off at the exact time, like a Swiss clock, so I'm watching it. Ian, I know you're a keen sailor. We missed each other in Bénodet a few years ago, when I was also sailing there, unfortunately. And you once told me a nice analogy between crossing an ocean on a sailboat and software development, or moving to the cloud. Can you tell us about that? Well, I mean, you know, why would you go across an ocean on a small sailboat? People do.
Why would you move AEM to the cloud? People do. Well, true. Yeah. So, you know, one of the things I say is that crossing an ocean is like walking down a valley, and the valley gets steep. The edges of the valley are so steep that they're curved. If you let anything go wrong, you've taken a step down onto the steeper slope, and if you let a lot of things go wrong, you suddenly find you're deep in this valley and you can't get out. So when we went to the cloud, the very first thing we did, or one of the early things we did, was put the monitoring in place to tell us if there was anything going wrong, if there were going to be mistakes, and we jumped on them straight away. You know, we'd get the evidence back that there could be an issue in some area, and we'd jump on it quickly, and that keeps us out of that valley. I remember crossing the Atlantic: once, as we came out of a port in Nigeria, we noticed a bolt had come out of the forestay of the boat. So we nipped around the corner to spend a night in a restaurant, and the next day we were fixing the bolt. It's that sort of thing, and we've been doing that over the past couple of years. We jump on those issues long before they ever become visible, because if they become visible at this scale, then we're in trouble. Right. Yeah, it's a fantastic analogy. Thank you, Ian.

So, Tomek, you had the game analogy. Ian, you had the sailboat analogy. What about you, Carlos? Do you have a great analogy between software development and something else? You have 20 seconds. Probably not enough. That's a great answer, actually. You know, if you see that you have no time, don't do it. I think that's a wise recommendation. So we'll be off in zero seconds now. Thanks, if we're still on. Thank you very much to the audience for being here, thank you for the questions, and thank you to the panelists. I think it was a great discussion, and I hope we can do it again soon. Thank you. Thank you. Thank you.
