Moving towards Cloud-Native, and taking architectural and design approaches that produce Cloud-Native applications, is a trend (for good reasons) that shows no sign of slowing down. It has a lot to do with the big impact cloud computing has made on the software industry, and with the well-known success stories of companies who have adopted this approach and continuously deliver high-quality products to their users with minimal impact in terms of downtime and other issues ✅
More often, we hear only the success stories and not the hundreds of failed adoptions. For companies that want to win big with their Cloud-Native strategies and practices, treating the testing of a Cloud-Native application as an optional, ancillary activity rather than an essential, valuable, and integrated part carries a high risk: a poor-quality product, a poor customer experience, increased churn, fewer new customers, and lost business. In short, another failure story ❌
Before diving deep into the different approaches to effectively testing Cloud-Native applications, let’s first try to understand the term “Cloud-Native” and what it means when we say that a company is “going Cloud-Native” ⛅
Table of Contents: Testing Cloud-Native Applications
- Going Cloud-Native
- Approach to Test Cloud-Native Applications
Going Cloud-Native
As more and more companies migrate (or plan to migrate) from on-premise to the cloud, their focus is to design, architect, and build applications that scale easily, deploy readily to the cloud, and fully utilize the advantages of the computing models provided by cloud platforms (like AWS, Azure, or GCP).
Going “Cloud-Native” and creating Cloud-Native applications refers to the approach of designing, architecting, and building distributed software applications in such a way that they can take complete advantage of the underlying PaaS (Platform-as-a-Service) and IaaS (Infrastructure-as-a-Service) service models offered by the cloud providers. Most often these applications are built as a suite of small microservices (based on the Microservices architectural style).
These loosely coupled microservices run on containerized, dynamically orchestrated platforms (with the help of technologies like Kubernetes and Docker) in dynamic environments provided by public, private, and hybrid clouds. Though there can be numerous reasons why companies go Cloud-Native, some of the most important driving factors are: significantly reduced application downtime, high resiliency, dynamic scaling of cloud resource utilization up and down with business needs, increased development velocity, applications that are highly responsive to user demand, and more focus on innovation, which in turn adds more business value.
Approach to Test Cloud-Native Applications
Testing has always helped us dig deeper, reveal problems, and deliver high-quality products to users. It plays a major role in gathering useful information on the state, maintainability, performance, robustness, and reliability of a product. Once analyzed, this information allows decision-makers to decide on product shipment with more confidence 🚢
When it comes to testing Cloud-Native applications, things become complex compared to the traditional approach used to test applications such as monoliths. Cloud-Native applications are more dynamic, distributed, built on microservices (which can have independent releases), and shipped at faster rates (often with CI/CD and DevOps practices), and they have failure modes that are difficult to anticipate and trace. This requires us to adapt: to review the traditional testing techniques and include some new and modern ways to foresee, detect, respond to, debug, analyze, and report problems.
These testing techniques help us find and reveal information that increases the overall quality of Cloud-Native applications. Testing should therefore be an integral part of every phase of the software development life cycle (both pre-production and post-production), and it should trigger more conversations between BAs, developers, testers, architects, and designers, in which questions are asked, information is gathered and shared, and problems and risks are discussed and evaluated.
Let’s now go through the techniques that can be used to effectively test Cloud-Native applications 👇
Unit Testing, Integration Testing, End-to-End Testing, and Contract Testing
In a Microservices-based Cloud-Native application, testing the small, granular parts of each individual testable service component catches many issues (bugs) early in the development life cycle. These fast, reliable, and automated Unit Tests ascertain whether the individual units/modules of a service component work correctly. They also help developers observe changes in a unit’s state, check the interactions between its objects and their dependencies, and get quick feedback on the state of those components and on whether a code change has regressed something. In addition, they make the application code more testable and easier to refactor.
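As a minimal sketch of such a unit test, consider a hypothetical pricing helper inside a catalog microservice (the function and its rules are invented for illustration); the test pins down its behavior so regressions surface immediately:

```python
import unittest

# Hypothetical unit under test: a pricing helper inside a catalog microservice.
def calculate_discount(price: float, is_premium_user: bool) -> float:
    """Premium users get 10% off; negative prices are rejected."""
    if price < 0:
        raise ValueError("price must be non-negative")
    return round(price * 0.9, 2) if is_premium_user else price

class CalculateDiscountTest(unittest.TestCase):
    def test_premium_user_gets_ten_percent_off(self):
        self.assertEqual(calculate_discount(100.0, True), 90.0)

    def test_regular_user_pays_full_price(self):
        self.assertEqual(calculate_discount(100.0, False), 100.0)

    def test_negative_price_is_rejected(self):
        with self.assertRaises(ValueError):
            calculate_discount(-1.0, True)

# Run the suite programmatically (a CI server would invoke the test runner).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CalculateDiscountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these run in milliseconds, so they can gate every commit in the CI pipeline.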
Once the service components are integrated, Integration Tests, triggered by the CI servers, test the communication paths and interactions between individual service components, or between a service component and an external service, system, or datastore. Since it is difficult to test every integration point, teams have to take a risk-based approach and test with defined goals, scope, and tradeoffs.
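A sketch of the idea, with hypothetical component names: an `OrderService` is wired to the repository it persists through, and the test exercises the communication path between the two. In a real CI pipeline the repository would typically be a real datastore started in a test container; a dict-backed stand-in keeps the example self-contained:

```python
# Hypothetical components under integration test.
class InMemoryOrderRepository:
    """Self-contained stand-in for a real datastore."""
    def __init__(self):
        self._orders = {}

    def save(self, order_id, order):
        self._orders[order_id] = order

    def find(self, order_id):
        return self._orders.get(order_id)

class OrderService:
    def __init__(self, repository):
        self._repository = repository

    def place_order(self, order_id, items):
        if not items:
            raise ValueError("an order needs at least one item")
        self._repository.save(order_id, {"items": items, "status": "PLACED"})
        return self._repository.find(order_id)

def test_order_is_persisted_through_the_repository():
    # The assertion covers the round trip: service -> repository -> service.
    service = OrderService(InMemoryOrderRepository())
    order = service.place_order("o-1", ["book"])
    assert order == {"items": ["book"], "status": "PLACED"}

test_order_is_persisted_through_the_repository()
```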
The comparatively larger End-to-End Tests are difficult to execute for Cloud-Native applications: they involve testing every moving part of the Microservices architecture, run slowly, can be quite flaky, must account for the asynchrony between service components and environments, and can therefore become costly to set up, run, and maintain. Still, teams need to run a few of them, less frequently, to cover the most important user journeys and to verify that the complete application meets the business requirements.
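The shape of such a journey test can be sketched as follows: it boots a tiny stand-in HTTP service (the `/health` endpoint here is hypothetical) and drives it over real HTTP, the same way an E2E suite would drive a deployed environment:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in for a deployed service; a real E2E test would hit a live URL."""
    def do_GET(self):
        body = json.dumps({"status": "UP"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

# Start the service on a free local port, in the background.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Drive the system from the outside, over the wire, as a user's client would.
url = f"http://127.0.0.1:{server.server_port}/health"
with urllib.request.urlopen(url) as response:
    payload = json.loads(response.read())

server.shutdown()
assert payload == {"status": "UP"}
```

The expense the section describes comes from doing this against dozens of real services at once, which is why these tests are kept few and run less often.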
Since individual, independently released microservices are involved, teams need to perform Contract Testing too. A Microservices architecture consists of “provider” service components and “consumer” service components: when a consumer couples to the interface of a provider to use its output, a “service contract” (consisting of the expected input and output) is created between them.
Automation suites of these contract tests can be integrated into the CI pipelines; once run, they verify whether a change in the provider or the consumer component still meets the service contract between them. Creating and running contract tests is an important activity when testing Cloud-Native applications.
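A hand-rolled sketch of the consumer-driven flavor of this idea (real suites would typically use a dedicated tool such as Pact; the service and field names here are hypothetical): the consumer records the response shape it relies on, and a provider-side check verifies that each release still honours it:

```python
# The contract the consumer publishes: field name -> expected type.
USER_CONTRACT = {"id": int, "name": str, "email": str}

def provider_get_user(user_id):
    # Hypothetical provider implementation under verification.
    return {"id": user_id, "name": "Ada", "email": "ada@example.com"}

def verify_contract(response, contract):
    """Return a list of contract violations (empty means the contract holds)."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(f"wrong type for {field}")
    return violations

# Run in the provider's pipeline: a breaking change fails before release.
assert verify_contract(provider_get_user(42), USER_CONTRACT) == []
```

The point is the direction of the check: the provider's pipeline fails as soon as it breaks a shape some consumer depends on, without waiting for a slow end-to-end run.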
Performing functional tests for Cloud-Native applications is important, as they ensure that the product meets the business requirements. But can they provide confidence that the product will respond in the desired manner once it is in production? Can the product degrade gracefully when a server crashes suddenly, a service component goes down, or a dependent service becomes unavailable? Will the product be secure enough in production? Can it manage a sudden spike in user requests?
Testing these non-functional quality aspects is very important when a product is built for the cloud. Any issue or deviation from the expected behavior needs to be detected, debugged, and fixed as soon as possible with minimal effort, and steps need to be taken so that it doesn’t happen again. To keep the chances of such issues reaching production low, and their impact minimal, we have to use good tools (many provided by the cloud vendors themselves) to test the product for the following, and address any potential risk beforehand:
- Performance: e.g. latency; the effect of load balancing, caching, and risk conditions on product performance; benchmark testing to compare and provide feedback on performance results against agreed performance metrics
- Usability
- Load: e.g. the effect on product throughput under close-to-real load conditions
- Security: both static and dynamic
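The load and performance items above can be sketched in miniature: fire concurrent requests at a simulated service call and check a latency percentile against an agreed budget. Real load tests would use a dedicated tool (such as k6, Locust, or JMeter) against a deployed environment; the service call and the 500 ms budget here are stand-ins:

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def call_service():
    """Stand-in for one request; returns observed latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.005)  # simulate ~5 ms of service work
    return time.perf_counter() - start

# 100 requests driven by 20 concurrent workers.
with ThreadPoolExecutor(max_workers=20) as pool:
    latencies = list(pool.map(lambda _: call_service(), range(100)))

p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile
print(f"p95 latency: {p95 * 1000:.1f} ms")
assert p95 < 0.5, "p95 latency exceeded the agreed 500 ms budget"
```

Wiring an assertion like this into the pipeline turns an agreed performance metric into a gate rather than a report someone has to remember to read.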
Chaos Engineering and Failure Mode Testing
When it comes to quality engineering, most of us are aware of techniques like FMEA (Failure Mode and Effects Analysis), which help us identify a product’s potential failure modes and their causes and effects. For monolithic applications, most of the potential failure modes are known and can be identified, and thus can be handled in the code constructs or fixed quickly if not handled.
But for microservices, the number of ways the product can fail in production can be practically unlimited and unpredictable, due to the large amount of complexity involved. In these cases, “Chaos Engineering” can be of much help: it is the practice of deliberately injecting failures in order to build confidence in the system’s capability to withstand unexpected situations in production.
Together with FMEA, it helps us build a more reliable and resilient product: by injecting small, controlled failures, we can detect and analyze them and get a sense of what can possibly go wrong. This helps us adjust existing processes to prevent cascading consequences and plan early for a short MTTR (Mean Time to Recovery/Restore) when failures do occur.
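A toy fault-injection sketch in that spirit (real chaos experiments run against production-like systems with tooling such as Chaos Monkey or Litmus; every name here is hypothetical): a wrapper randomly fails a dependency call, and the caller’s retry-then-fallback path is exercised under the injected failures:

```python
import random

class ChaoticDependency:
    """Wraps a dependency call and injects failures at a chosen rate."""
    def __init__(self, failure_rate, seed=7):
        self._failure_rate = failure_rate
        self._random = random.Random(seed)

    def fetch_recommendations(self):
        if self._random.random() < self._failure_rate:
            raise ConnectionError("injected failure")
        return ["item-1", "item-2"]

def get_recommendations_with_fallback(dependency, retries=2):
    """Degrade gracefully: retry a few times, then fall back to an empty list."""
    for _ in range(retries + 1):
        try:
            return dependency.fetch_recommendations()
        except ConnectionError:
            continue
    return []  # the page still renders, just without recommendations

# Even at a 50% injected failure rate, the caller never raises: it either
# recovers via retry or degrades to the fallback.
results = [get_recommendations_with_fallback(ChaoticDependency(0.5, seed=s))
           for s in range(50)]
assert all(r in (["item-1", "item-2"], []) for r in results)
```

The experiment’s value is the assertion at the end: a controlled failure either confirms the graceful-degradation hypothesis or reveals a cascading consequence while the blast radius is still small.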
Observability, Monitoring, and Log Analysis
As software engineers, we have to perform both pre-production and post-production testing for Cloud-Native applications. If done correctly, testing in production can reveal a lot of valuable information, which acts as important feedback when planning resiliency, scalability, and flexibility for the next releases. But such tests are complex to set up and execute, and we must be careful to perform them correctly and securely, aware of the effects on the business and its users if we don’t.
One approach that helps us better understand the behavior of software in production is “Observability” (or “o11y”): the ability to understand a product’s internal states by observing its external outputs. There are also Monitoring techniques and tools with which we can collect, store, maintain, and query information about application state, behavior, and the interactions between services in production, and interpret and display it as metrics and logs. These logs and metrics can be analyzed further to gain valuable insights or to evaluate and debug issues quickly. Some cloud providers offer out-of-the-box features and tools to help with these activities.
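A minimal sketch of the kind of outputs being described, structured log lines and counters that monitoring tools aggregate (real systems would emit these through an agent or SDK such as OpenTelemetry rather than this hand-rolled version; the metric and field names are illustrative):

```python
import json
import time
from collections import Counter

metrics = Counter()

def log_event(level, message, **fields):
    """Emit one structured (JSON) log line, easy to query in a log store."""
    record = {"ts": time.time(), "level": level, "msg": message, **fields}
    print(json.dumps(record))

def handle_request(path, ok=True):
    """Hypothetical request handler instrumented with metrics and logs."""
    metrics["requests_total"] += 1
    if not ok:
        metrics["requests_failed"] += 1
        log_event("ERROR", "request failed", path=path)
    else:
        log_event("INFO", "request handled", path=path)

handle_request("/orders")
handle_request("/orders", ok=False)

# A derived signal of the sort a dashboard or alert would watch.
error_rate = metrics["requests_failed"] / metrics["requests_total"]
```

Because every log line is machine-parseable and every counter is queryable, questions about the system’s internal state can be answered from its outputs, which is the whole point of observability.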
We have to understand that no matter how much functional and non-functional testing we plan and do, and how much we try to improve the quality of these Cloud-Native applications, end-users will still face issues. The goal is to reduce the risk of unexpected events, to analyze, debug, and fix issues quickly, and to learn from each event and use that knowledge for the next releases.
Finding issues in production can be highly expensive, so we should try to find them as early as possible in the development life cycle. In production itself, we can take advantage of Canary Deployments (rolling a new version out to a small subset of users first), Dark Launches (rolling new/major features out to a subset of users as an initial test), and Smart Feature Toggles/Flags/Bits/Flippers (allowing specific features of an application to be activated or deactivated at will) to continue finding issues with a limited blast radius.
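The toggle mechanism behind these rollouts can be sketched as a percentage-based flag: a stable hash of the user id decides whether a given user sees the new code path, so the same user always gets the same experience. The flag name and percentage are illustrative; real systems often use a feature-flag service (e.g. LaunchDarkly, Unleash) or a config store:

```python
import hashlib

# Rollout percentage per flag; 10 means 10% of users see the new path.
FLAGS = {"new-checkout-flow": 10}

def is_enabled(flag, user_id):
    """Deterministically bucket a user into [0, 100) and compare to rollout %."""
    rollout = FLAGS.get(flag, 0)  # unknown flags are off by default
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout

# The hash is stable, so a user's bucket never changes between requests.
assert is_enabled("new-checkout-flow", "user-42") == is_enabled("new-checkout-flow", "user-42")
```

Raising the percentage widens the canary; setting it to 0 is the kill switch that deactivates the feature without a redeploy.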
But we also have to keep in mind that exhaustive testing with every known strategy is not possible, due to limiting factors such as budget, bandwidth, timelines, time-to-market, the large number of dependent and independent services, and environment availability. Teams therefore need to take a risk-based testing approach, and they have to be aware of the various types of costs incurred when users find issues in production: detection cost, debugging cost, opportunity cost, fixing cost, verification cost, and maintenance cost.
Considering all the factors discussed in this article, it is safe to conclude that, though testing Cloud-Native applications is difficult and challenging, we can bring our expertise and knowledge of established testing techniques and strategies, combine them with new and modern ways, and help deliver high-quality products to users 🚀