Failure as a CxO Service?

If you’re the CIO or business leader responsible for keeping your channel applications performing and available so that revenue keeps flowing, recent tech blowups in banking must be keeping you up at night. Major outages at BMO and Wells Fargo were truly terrible, and enlightening. How do you make sure your applications scale and self-heal so you don’t become the next failure? It starts with microservices architecture and ends with continuous testing and delivery through DevSecOps.

Epic fail

These recent banking outages didn’t impact just one customer touchpoint. They hit them all hard (web, mobile, customer call centers, etc.), rendering them inaccessible for the better part of a day or more. Instances like this are classic examples of why organizations need to build resiliency and scalability into their application architectures from the start, not as a reactive measure when it’s too late. Any outage or poorly performing customer experience is expensive and damages customer trust. This can be avoided IF you embrace the best practices of microservices, containers, and DevSecOps. Of course, building in resiliency and scale involves a cost trade-off. However, can your company afford not to build in these failsafe measures?

Learning from mistakes

Most modern businesses have already built or are extending their core enterprise applications to customer-facing channels such as web, mobile, IoT, voice, and call center. To have a chance at avoiding outages in those customer-facing applications, companies must create a highly scalable microservices architecture. Respecting the Twelve-Factor App rules is table stakes: each microservice should be a self-contained business function with its own data persistence, a black box with loose or no coupling to other services. By creating microservices that each deliver a finite scope of function, companies reduce the chance of a catastrophic failure in which one or even all of the user interface channels go down because of a single service outage. Yes, the probable cause of the BMO and Wells Fargo outages could have been avoided.
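One common way to keep a single service outage from taking a whole channel down is a circuit breaker. The article does not name this pattern, so treat the following as an illustrative sketch only: the class name, thresholds, and fallback text are all hypothetical. The idea is that after repeated failures, the caller stops hitting the broken service for a cool-down period and serves a fallback instead.

```python
import time


class CircuitBreaker:
    """Illustrative sketch: stop calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds (assumed value)
        self.clock = clock                # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Circuit is open: skip the dependency and degrade gracefully.
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the circuit again
        return result
```

A channel application would wrap each remote call, for example `breaker.call(fetch_balance, lambda: "balance unavailable")`, where `fetch_balance` is a hypothetical function that calls the balance microservice. The rest of the page keeps rendering while that one service is down.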

Considering microservices

Microservices, when executed correctly, are self-contained entities that run in containers managed by orchestration platforms such as Kubernetes or OpenShift. They are configured and tuned to scale by increasing or decreasing the number of instances of a microservice to match user traffic AND to recover from failure, either by failing over to working containers or by restarting the microservice after a container failure or a cloud failure (such as the recent Azure outage). In the past, most enterprises ran redundant data centers and used global load balancing to achieve the resiliency and scale that today’s cloud and container platforms provide. Many think that putting the application or even web services in the cloud is enough. Wrong. It’s not. Companies need to architect and design software and infrastructure to react to failure (some even call this Artificial Intelligence in the Data Center/Cloud).
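To make the scale-and-self-heal idea concrete, here is a toy control loop in Python. It is a sketch of the concept only, not the Kubernetes or OpenShift API: the `Instance` type, the capacity figures, and the replica limits are all assumptions made for illustration. Each pass discards failed instances, starts replacements, and adjusts the instance count to the current traffic.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for one running copy of a microservice."""
    id: int
    healthy: bool = True


def desired_replicas(requests_per_sec, capacity_per_instance,
                     min_replicas=2, max_replicas=10):
    """Instances needed for the current traffic, clamped to assumed limits."""
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_replicas, min(max_replicas, needed))


def reconcile(instances, requests_per_sec, capacity_per_instance, next_id):
    """One pass of the loop: drop failed instances, then match the target count."""
    target = desired_replicas(requests_per_sec, capacity_per_instance)
    running = [i for i in instances if i.healthy]  # failed instances are discarded
    while len(running) < target:                   # replace failures or scale out
        running.append(Instance(id=next_id))
        next_id += 1
    del running[target:]                           # scale in when traffic drops
    return running, next_id
```

A real orchestrator runs this kind of reconciliation continuously against health probes and live metrics. The point for a business leader is that failure handling is declared up front as a target state, not improvised during an outage.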

Failure as a Service

Imagine the application development, infrastructure, and DevOps teams say that the company’s applications can support the user base and survive failures. Should you, as the one accountable for that application portfolio, trust them? I wouldn’t without proof. The old adage applies: trust but verify. The tech team needs to build Failure as a Service (FaaS) into its software.

FaaS is a testing methodology similar to negative testing, but applied at the end-to-end application level. There are commercial products in the marketplace like Gremlin and open source tools like Netflix’s Chaos Monkey that can inject failure into your applications and into the containers and orchestrators that host them. Injecting failure, like unit testing and functional testing, can prove that your applications are ready for prime time by testing how they react to the failure of every part of the application architecture, one by one. FaaS runs automated tests that simulate the failure of each piece and determine whether your application keeps running or recovers within its established non-functional targets: recovery time objective (RTO), recovery point objective (RPO), and mean time to recovery (MTTR).
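The one-by-one approach described above can be sketched in a few lines of Python. This is a simplified, in-process illustration and not how Gremlin or Chaos Monkey work internally: the `Service` class, the `render_dashboard` application, and the service names in the usage note are hypothetical. The harness takes each dependency down in turn, calls the application, and records whether everything else kept working.

```python
class ServiceDown(Exception):
    """Raised when a simulated dependency is unavailable."""


class Service:
    """Hypothetical stand-in for one microservice dependency."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def call(self):
        if not self.healthy:
            raise ServiceDown(self.name)
        return f"{self.name}: ok"


def render_dashboard(services):
    """Application under test: degrades per service instead of failing outright."""
    page = {}
    for svc in services:
        try:
            page[svc.name] = svc.call()
        except ServiceDown:
            page[svc.name] = "temporarily unavailable"
    return page


def run_failure_suite(services, app):
    """Fail each service in turn; report whether the app survived each outage."""
    report = {}
    for victim in services:
        victim.healthy = False  # inject the failure
        try:
            page = app(services)
            # Pass only if every other service still answered normally.
            report[victim.name] = all(
                page[s.name] == f"{s.name}: ok" for s in services if s is not victim
            )
        except Exception:
            report[victim.name] = False  # the whole app went down
        finally:
            victim.healthy = True  # always restore before the next round
    return report
```

Calling `run_failure_suite([Service("accounts"), Service("payments")], render_dashboard)` returns a pass/fail entry per injected failure. An application that lets one `ServiceDown` exception escape fails every round, which is exactly the kind of evidence to ask for before accepting that an app is ready.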

What’s great about injecting failure?

You can trust AND verify that the work has been completed by your staff when they say their apps are ready. You can also go further and run these tests against production on a regular basis, so that regardless of which environment your application runs in, you have evidence that it will withstand failure.

To be ready for failure, teams should already be using DevSecOps from Sprint 0 or the first build of the code. Tech giants like Amazon, Google, Netflix, and others all embrace the mantra, “You build it. You run it.” This puts accountability and responsibility in the delivery team’s hands. It’s up to them to ensure that the quality is there. If it’s not, and things break, they will be the ones getting woken up in the middle of the night to fix the failure. If DevSecOps is in place, then the broken microservice can be fixed, tested, and deployed in a matter of minutes or hours, rather than the days or, worse, months that most enterprises take to fix issues today. For more on DevOps, check out Devbridge’s Pragmatic Approach to Overcoming DevOps Roadblocks white paper.

What’s an enterprise company to do?

Well, most enterprise companies employ tens of thousands of people whose job it is to design, build, test, deploy, and monitor systems that prevent outages and keep customers happy. Those companies should incorporate Failure as a Service into their DevSecOps process and test their microservices so that applications self-heal and self-scale. Doing so lets their CxOs sleep at night, so they have the headspace to think about the next big revenue deal! BMO and Wells Fargo probably did not do this. Shouldn’t you?