Deep dive into Adobe Experience Manager as a Cloud Service continuous delivery model
Based on the whitepaper "Adobe Experience Manager as a Cloud Service: Continuous delivery model", we will share details about our delivery model to increase customer confidence in Cloud Service and support the adoption of customer functional tests.
Continue the conversation in Experience League Communities.
Transcript
Hello everybody, welcome to this session, a deep dive into the AEM as a Cloud Service continuous delivery model. I am Andres Bott and I’m currently working as a cloud reliability engineer in AEM. A little bit about me: I’ve been working with Adobe for a little over five years. In these five years I’ve worked in different positions and roles, but lately I’m mostly focusing on reliability and deployment engineering in Cloud Service.
This talk is mainly aimed at AEM developers, so it is a little bit more on the technical side. It is mostly for people who are already using AEM as a Cloud Service, or who are interested in it or intend to use it in the near future, and who want to learn a little bit more about the internal workings and what we do on our side for Cloud Service.
On the agenda for today: I will give an overview of the continuous delivery model and how we structure the CI/CD pipeline. Then we will follow the CI/CD pipeline from a developer’s commit on the smallest part until the end, when this commit has reached production. At the end we have some time for Q&A, so if you have any questions, please write them in the chat and we will try to answer them at the end of the session. So without further ado, let’s get started.
One of the main premises of Cloud Service is always current. The intent of always current is to always be running on the latest, shiniest, best-patched version of AEM. To achieve this goal, we have a continuous release model that makes sure that new product features, improvements, and code fixes are deployed on your AEM environment. These changes are thoroughly tested and delivered continuously, without interruption to your AEM deployment. This continuous delivery model also takes away the burden of costly maintenance and upgrades of these environments.
When working in Customer Care, I helped many customers with very complex, tedious, long-running upgrade projects. Sometimes we are talking about customers upgrading not once a year but every two years, because of the cost of the upgrade. With this continuous model, we try to take that burden away from you, the customer, and build it in as one of the features of Cloud Service. So we are moving from very costly upgrades to almost always running the latest version, without any costs for you.
How this looks on our end, on the implementation side: we have hundreds of developers contributing to a common code base, which is Cloud Service. Then we have a CI/CD pipeline that is constantly running. You need to think about it as a funnel with multiple phases of automated tests that run one after the other, and we can only move to the next step once the previous step has passed. This CI/CD pipeline also needs to fulfill many different technical and business requirements; it is on our side, in release engineering, to make sure that these are all fulfilled. Once this whole pipeline has reached the end, you will see the outcome as a new version available in your Cloud Manager UI.
So this is a quick overview. Now we will look into each of the individual phases of this pipeline, starting from the initial state, where a developer commits into a module.
The first step in this pipeline is the module-level validation. A module, as we can think about it, is something like Sites or Assets. We have bigger modules and smaller modules, but overall the teams that take care of the modules have a lot of autonomy in how they test and how they define their testing strategies and their quality gates.
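As a rough illustration of the funnel idea described here, the following Python sketch runs a candidate through ordered quality gates and stops at the first one that fails. It is only a sketch under stated assumptions: the stage names follow the talk, but `first_failing_stage`, `STAGES`, and the checks themselves are hypothetical stand-ins, not Adobe's actual pipeline code.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, check) pair; the check returns True when the gate passes.
Stage = Tuple[str, Callable[[str], bool]]


def first_failing_stage(candidate: str, stages: List[Stage]) -> Optional[str]:
    """Run the stages in order and return the name of the first failing one.

    Returns None when every stage passes, meaning the candidate went
    through the whole funnel.
    """
    for name, check in stages:
        if not check(candidate):
            return name  # stop here: later stages never run
    return None


# Stage names follow the talk; the checks are hypothetical placeholders.
STAGES: List[Stage] = [
    ("module validation", lambda c: True),
    ("pull request assessment", lambda c: True),
    ("end-to-end validation", lambda c: "broken" not in c),
    ("customer update assessment", lambda c: True),
]

if __name__ == "__main__":
    print(first_failing_stage("release-candidate", STAGES))
    print(first_failing_stage("broken-candidate", STAGES))
```

With these placeholder checks, the first call passes every gate and the second one stops at the end-to-end validation, so the customer update assessment is never reached for it.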
They also have a lot of autonomy in the tooling they use. Some teams have decided to use Jenkins, some teams have decided to use CircleCI. Depending on the type of module or the type of code, we have Selenium UI tests, Java unit tests, or Java HTTP tests in place. It is really up to the individual teams to decide what the best strategy is for assuring the quality of their module. This still happens independently: at this level, the Sites team and the Assets team, for example, do not affect each other. This is the first iteration where developers get feedback on whether their change has some level of impact or not. But remember, this is only on the module level, so here the teams are still acting completely independently.
Once this has been verified, we move to the next step in the pipeline, which is the PR assessment, or pull request assessment. Here the modules are integrated into one big unified repository, and the incoming changes are all verified on a pull request model. This means that whenever a new module or library component version is released by an individual team, this update is propagated into this repository as a pull request. This pull request is not accepted directly; it first needs to pass the pull request verification. Here the changes are still tested individually, but already in the whole context of AEM. We have a big set of different tests, like Java HTTP tests and some Java Selenium tests. Also notable at this level, we already have some security tests in place. These tests run against every individual pull request in the AEM context, to make sure that a change in one module does not affect another one. Let’s say, for example, I am a Sites developer and I need to enhance some feature of some backend API.
For the functionality that I want to introduce, the moment we reach this level, maybe my change breaks the functionality of somebody else, for example of the Assets team. At this point, the pull request would be rejected because it does not pass all the validations. These tests are still maintained by the different teams, but we want to ensure that we don’t break cross-team, cross-module functionality. As mentioned, only if all the validations pass is the pull request merged into this big repository.
Once we have reached this point, the next step is the end-to-end validation. This is the first time we integrate the code change into our cloud platform. This means we are now running on a Kubernetes cluster and using the same tools and the same platform as Cloud Service. To ensure the full end-to-end flow, we also make use of Cloud Manager at deployment time to deploy this potential new release on the cloud platform. I’m going to use the terms version and release interchangeably during this talk.
Once this is deployed, we have another set of automated tests. This time it’s a smaller set, but the tests cover more ground. For example, we have Java HTTP functional tests. A notable one is the replication test, which replicates a page and makes sure that it is actually published correctly to the publish instance. Another notable one is a Selenium UI test for the IMS login. This test ensures that a user who is part of the Admin Console, what we internally call IMS, is able to log in to the author instance.
As for the test scenarios, we test different scenarios, the upgrade scenario being the most important, since this is the intended flow for customers: they are intended to always be upgraded to the latest version.
So we keep an internal AEM deployment that we constantly upgrade to the new version, and then we run the validations on this version. But we also test the new-environment scenario, meaning a new, clean, vanilla environment that has no code at all. This is to ensure that a new customer coming to Cloud Service will get a perfectly working environment. Also notable: features that are hidden behind feature toggles are tested with the feature toggle enabled, so that the moment we enable those features using the feature toggle, this will not have an impact on the actual customer environments.
Once we have finalized these end-to-end validations, the code moves forward into the next step, the customer update assessment. Since we run customer code and customizations on our Cloud Service environments, we also need to make sure that our code change does not impact your customer code. For this reason, we have the customer risk assessment, which tries to assess the risk of a change causing problems with customer code. How this works is that we clone the customer environment twice. One clone is kept on the known good version and gives us a baseline. The second clone is updated to the new release candidate. Then a set of validation tools is executed against those two AEM Cloud Service environments, and a risk score is calculated. If this score is too high, we say: OK, we reject this change. It’s not worth it; it’s probably going to cause too many problems. Then somebody needs to go in, investigate, and figure out what happened.
As for the tools that we use on this level of validation, I think they are quite interesting.
We have, for example, a visual comparison tool. It takes the most requested pages from the customer environment, based on the logs that we already have available, and renders them as images. By superimposing the images from the two environments, we can calculate how much they differ. If the images differ too much, the risk assessment score goes very high, indicating that there is a problem with this potential new version. We have an example here on the screen where this actually caught a real regression. This is using the WKND sample page. In this case, we see that the image on the left is very stretched vertically. This release obviously had a problem and was never made public.
Similar to the visual comparison tool, we also have an error log comparison tool that analyzes the log messages, so error messages and warning messages, in terms of quantity and order. If they differ too much between the baseline and the upgraded version, the risk score also goes higher. In addition to those, we have pod startup checks that make sure the pod starts correctly. If this is not the case, obviously the risk score goes high. Similarly, we check that the landing page of your customer application is working correctly, that it’s not returning a 500 error, for example. If that were the case, the risk score would also go very high.
It’s also worth mentioning that as part of this clone environment, we run the custom functional tests, that is, tests that you as a customer can write to make sure that your application works correctly. If your custom functional tests don’t pass, the clone environment will not come up correctly, and in that sense this also has an impact on the risk score assessment.
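To make the risk assessment a bit more concrete, here is a minimal Python sketch of how such signals could be combined. The talk does not describe the actual formula, so everything here is an assumption: the names `pixel_diff_ratio`, `Signals`, `risk_score`, and `REJECT_THRESHOLD`, the choice to treat hard failures as maximum risk, and the threshold value are all hypothetical. Images are represented as plain lists of grayscale pixel rows to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values (hypothetical format)


def pixel_diff_ratio(baseline: Image, candidate: Image, tolerance: int = 0) -> float:
    """Fraction of pixels that differ when the two renderings are overlaid."""
    same_shape = len(baseline) == len(candidate) and all(
        len(a) == len(b) for a, b in zip(baseline, candidate)
    )
    if not same_shape:
        return 1.0  # assumption: a different page size counts as fully different
    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > tolerance
    )
    return changed / total


@dataclass
class Signals:
    visual_diff: float             # 0.0 .. 1.0, from pixel_diff_ratio
    log_diff: float                # 0.0 .. 1.0, from a log comparison
    pod_started: bool              # pod startup check
    landing_page_status: int       # HTTP status of the landing page
    functional_tests_passed: bool  # custom functional tests on the clone


REJECT_THRESHOLD = 0.2  # hypothetical value, not taken from the talk


def risk_score(s: Signals) -> float:
    """Combine the signals into one score between 0.0 and 1.0."""
    hard_failure = (
        not s.pod_started
        or s.landing_page_status >= 500
        or not s.functional_tests_passed
    )
    if hard_failure:
        return 1.0  # assumption: hard failures mean maximum risk
    return max(s.visual_diff, s.log_diff)


def is_rejected(s: Signals) -> bool:
    return risk_score(s) > REJECT_THRESHOLD
```

In this sketch, a candidate whose landing page returns a 500 error is rejected regardless of the other signals, while a candidate with small visual and log differences stays below the threshold.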
So this is a good way to make sure that your application has no problems: if you have a test covering it, you also have an impact on the customer risk score assessment.
Once we have passed all these quality gates on the different levels, we mark this release, or version, as public. We say: OK, this is now the new release, and it is now publicly available. In parallel to this, we also publish a new SDK, so for developers who are using the SDK, it is published at this moment as well. From that moment, the new version is available in Cloud Manager to be used. You can log in to Cloud Manager and you will see a pop-up saying that a new version is available, and you can manually trigger the upgrade. Or, if you create a new environment, it is created using this latest release that we just validated.
What we have not done yet is upgrade the existing environments. For this, we have the last step, here on the very right: the rollout. To perform this rollout, we use another tool that we internally call Release Orchestrator. Release Orchestrator is basically the one that takes care of upgrading all the customer environments that have been enabled to receive these upgrades. It is very interesting to point out here that this is a multistage canary deployment. What this means is that we have a big set of customer deployments, but we start very small. To be honest, we start with a small internal set of customers, then we move to the next level, which is a slightly bigger set, and then a bigger one, until we have upgraded all the environments. If we identify any problems during these phases, the changes are rolled back and we analyze what happened.
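The multistage canary rollout described here can be sketched in a few lines of Python. This is not how Release Orchestrator is implemented; the talk does not go into that. The function `rollout`, the wave layout, and the policy of rolling back every environment touched so far are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence


def rollout(
    waves: Sequence[Sequence[str]],
    upgrade: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Upgrade environments wave by wave, smallest wave first.

    `upgrade` returns True on success. On the first failure, every
    environment touched so far is rolled back in reverse order
    (assumed policy) and the rollout stops.
    """
    touched: List[str] = []
    for wave in waves:
        for env in wave:
            touched.append(env)  # track it even if the upgrade fails halfway
            if not upgrade(env):
                for done in reversed(touched):
                    rollback(done)
                return False
    return True


if __name__ == "__main__":
    # Hypothetical environment names: a small internal wave, then larger ones.
    waves = [["internal-1"], ["customer-a", "customer-b"], ["customer-c"]]
    rolled_back: List[str] = []
    ok = rollout(
        waves,
        upgrade=lambda env: env != "customer-b",  # simulate one failing upgrade
        rollback=rolled_back.append,
    )
    print(ok, rolled_back)
```

Because the waves grow in size, a problem that shows up early only affects the small internal set, which is the point of starting small.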
Again, very notable at this point: Release Orchestrator also makes use of Cloud Manager to deploy this upgrade. This means that the customer functional tests are also executed. So again, if you want to make sure that your code is not impacted by a code change from a new version, this is the best place to chime in and make sure that your application works as expected. After Release Orchestrator has executed successfully, all customer environments are running the latest, shiniest, most secure AEM version that is available.
I gathered some interesting numbers on this slide. I find them very valuable to give a bit of context on the size of the CI/CD pipeline that we are currently running. If we go back to the PR assessment, where all the teams contribute into the big repository, we are talking about an average of 25 pull requests per day. For every single pull request, around 2,235 tests are executed. To guarantee the throughput at this level of pull requests, we run these tests in parallel on almost 50 virtual machines. When we move to the next step, the end-to-end tests that cover more ground, we are executing 114 end-to-end tests. For the customer assessment, we are currently cloning 124 customers. Since we always have two clones, this means almost 250 environments that we are running to validate the customer code. But what I am very proud of is the fact that with all this machinery, we manage to achieve over 99% successful customer upgrades. So we really manage to deliver high quality with very little impact on the customer experience in the end.
With that, I think we have provided a good overview of the details of the continuous delivery model, and we can move on to Q&A. If there are any questions in the chat, we can start with those.
If not, I have already captured one question that was asked previously. Jörg, do we have any questions?
Nope, I haven’t seen any questions, neither in the session chat nor in the Q&A chat.
OK, well then let’s answer the question that was brought up before: What’s the difference between the product functional tests and the end-to-end tests? If you remember, when we were talking about the end-to-end tests, we have something over 100 tests that are executed once we have deployed the code to the cloud environment. But if you are familiar with Cloud Manager, there is also a step that executes product functional tests. The main difference is that the product functional tests, which are executed on every Cloud Manager deployment that you, the customer, can trigger, are a subset. It’s a smaller subset, covering really the bare minimum of functions that we consider required for Cloud Service. As I mentioned before, replication is one of the tests executed as part of the product functional tests. We also have a test that creates a page and deletes it. So these are very basic operations, to ensure that your customization does not break very basic functionality. The end-to-end tests, on the other hand, are a bit broader. They cover test scenarios that we don’t expect customer code to break, but that a product update might conflict with.
I see there is now a new question, from Eka Bam. Hopefully I got the name right. If there is any API deprecation or API change, how will this be addressed in this model? I hope I understand the question correctly. The intent with API changes is to keep backwards compatibility, to announce the API deprecation, and to remove the API only once we have identified that no customers are using the deprecated API anymore. I hope this answers your question.
I don’t know if this was what you intended to ask.
We still have a couple more minutes, and we’ll still be here to answer questions. While we wait, just a friendly reminder: you can log in to experienceleague.adobe.com. There is a forum, and if you have questions that don’t come up now, you can post them in the forum and we will follow up there on particular questions about this session. So feel free to log in and continue the conversation over there.
Okay, so we have another question. There was a final question by Joseph: You mentioned that each PR is subjected to over 1000 tests. Is that leveraging Cloud Manager?
Okay, so the pull-request-level validation is not yet leveraging Cloud Manager. It’s an earlier step where we only test in the AEM context. We don’t test on the platform, but we run all these tests on the same code base and the same code level that is going to be deployed later on the cloud platform. As a follow-up, the end-to-end tests do indeed leverage Cloud Manager to cover the full end-to-end flow of validations.
Okay, and there is the follow-up question: If yes, is there a way to expose this feature to customers? But I think this is not applicable.
Exactly. These over 2,000 tests are very fine-grained and very specific to AEM. Running those tests on every Cloud Manager execution has, I think, been discussed at some point, but given the cost-benefit ratio, we decided against it. That’s the reason why we have the product functional tests and also the end-to-end tests to cover these scenarios.
I think we are reaching... Oh, Joseph, he’s interested. Thank you, Joseph. Another question: Currently, we see that the customer tests are triggered only on the production pipeline. Is that accurate?
If yes, is there any plan to expose this to the dev pipeline as well?
Yes, that’s accurate. Currently these tests are indeed only executed on the production pipeline. This is a request that has come up before, so you’re not alone there, and there are some conversations happening about allowing those tests to be executed as part of the dev pipeline as well. I cannot give you an update on when or if this will happen, but I can definitely say that this request has been made and it is in the conversation.
Thank you, Andres. I think we are reaching the end of the session time, and I don’t see any other questions popping up either. So I will say thank you, everybody, for joining. I hope you enjoyed this session and got a bit more in-depth insight into what AEM is doing to ensure quality in AEM as a Cloud Service. I hope it was meaningful and valuable for you. Have a nice evening, morning, or day, everybody. And bye-bye.