Azure infrastructure and application testing

Introduction to quality assurance

Quality assurance (QA) is an umbrella term which encompasses all application software and infrastructure quality tests which are performed to verify software application and infrastructure functional and non-functional attributes, including performance, scalability, availability and security. Azure infrastructure and application testing is a major part of quality assurance. This article includes a discussion on Azure infrastructure and application testing.

There are QA frameworks to be used for both software application and cloud infrastructure testing. Whether you perform QA using testing frameworks or you manually perform a baseline of tests, you need to have test scripts available. Each test script describes the steps to be carried out to test a certain aspect of an application or infrastructure system, as well as the result of each test and subsequent action items, if needed.

This article provides a baseline reference of the steps to be included in each type of cloud infrastructure or software application test script alongside with a testing script template available for download. The article narrows down the discussion of quality assurance into Azure infrastructure and application testing.

Azure infrastructure and application testing types

The following types of Azure infrastructure and application testing types are available:

Application unit testing. A unit test is a test that exercises individual software components or methods, also known as a "unit of work." Unit tests should only test code within the developer's control. They don't test infrastructure concerns. Infrastructure concerns include interacting with databases, file systems, and network resources. xUnit.net is a free, open source, community-focused unit testing tool for the .NET Framework.
Application integration testing. An integration test differs from a unit test in that it exercises two or more software components' ability to function together, also known as their "integration." These tests operate on a broader spectrum of the system under test, whereas unit tests focus on individual components. Often, integration tests do include infrastructure concerns.
Application load (or stress) testing. A load test aims to determine whether or not a system can handle a specified load. For example, the number of concurrent users using an application and the app's ability to handle interactions responsively. Refer to my KB article on Azure App Service load testing (using JUnit) for an example of carrying out load testing for an Azure-based Web application.
Infrastructure and application functional testing. This aims to test all operating systems, firmware and the functional attributes of on-prem or cloud infrastructure components involved in a solution.
Application performance and scalability testing. During performance and scalability testing, you measure the performance metrics of your infrastructure as your application workload increases. This is combined with application software stress/load testing. The performance and scale tests involve setting up and testing autoscale policies for scaling horizontally or vertically when the application load requires it.
Infrastructure and application availability testing. Availability tests mainly focus on testing the resiliency of your infrastructure in the event of hardware or software components failures and the associated potential downtime incurred during these failures. This should test your infrastructure and application code high availability, redundancy and resiliency. Resiliency can come in many forms, including load balancing, failover clustering (active-active or active-passive) and always on availability groups, virtual machine availability sets as well as cloud infrastructure fault domains, update domains, cloud zone and cloud region redundancy. The availability tests focus on creating deliberate failure simulation events, such as for example having an Azure VNET or subnet or storage account going down, to check the impact on the overall architecture.
Infrastructure and application security testing. Security testing aims to test the attack surface of your architecture, the resiliency against malicious attacks, including ransomware, malware and hacking attempts and to discover potential vulnerabilities and security weaknesses in your security configuration by employing techniques such as penetration testing to evaluate IDS/IPS, Web Application Firewall protection, SIEM/SOAR/XDR systems. Azure Advisor, Azure Security Center, Azure Advisor, Microsoft Defender for Cloud and Azure Sentinel are all good tools to employ for security testing.

Site Reliability Engineering (SRE)

The success of your cloud solution depends on its reliability. Reliability can be broadly defined as the probability that the system functions as expected, under the specified environmental conditions, within a specified time. Site reliability engineering (SRE) is a set of principles and practices for creating scalable and highly reliable software systems. Increasingly, SRE is used during the design of digital services to ensure greater reliability. The degree of reliability that's required for a solution depends on the business context.

In the context of Azure infrastructure and application testing, reliability is defined and measured using service level objectives (SLOs) that define the target level of reliability for a service. Achieving the target level assures that consumers are satisfied. The SLO goals can evolve or change depending on the demands of the business. Another important term to note is service level indicator (SLI), which is the metric that's used to calculate the SLO. SLIs are based on insights that are derived from data that's captured as the customer consumes the service.

SLIs are always measured from a customer's point of view. SLOs and SLIs always go hand in hand, and are usually defined in an iterative manner. SLOs are driven by key business objectives, whereas SLIs are driven by what's possible to be measured while implementing the service. The relationship between the monitored metric, the SLI, and the SLO is depicted below:

Azure cloud infrastructure and application testing

Refer to the Reliability and Performance Efficiency pillars of Azure Well Architected Framework for guidance on building scalable and reliable applications.

Chaos Engineering and Azure Chaos Studio

Chaos engineering is a methodology that helps developers attain consistent reliability by hardening services against failures in production. Another way to think about chaos engineering is that it's about embracing the inherent chaos in complex systems and, through experimentation, growing confidence in your solution's ability to handle it.

A common way to introduce chaos is to deliberately inject faults that cause system components to fail. The goal is to observe, monitor, respond to, and improve your system's reliability under adverse circumstances. For example, taking dependencies offline (stopping API apps, shutting down VMs, etc.), restricting access (enabling firewall rules, changing connection strings, etc.), or forcing failover (database level, Azure Front Door, etc.), is a good way to validate that the application is able to handle faults gracefully.

Azure Chaos Studio is a managed service that uses chaos engineering to help you measure, understand, and improve your cloud application and service resilience. Chaos engineering is a methodology by which you inject real-world faults into your application to run controlled fault injection experiments. Chaos Studio helps you avoid negative consequences by validating that your application responds effectively to disruptions and failures. You can use Chaos Studio to test resilience against real-world incidents, like outages or high CPU utilization on virtual machines (VMs).

In Azure Chaos Studio, you first select and enable your targets (either service-direct targets or agent-based targets, such as VMs and VM Scalability Sets).

Then you create Chaos experiments on selected Azure resources, as shown in the example below. Each experiment can introduce either faults or delays, to test the resiliency of the overall infrastructure in relation to the tested Azure resources.

Infrastructure as Code (IaC) testing frameworks and tools

The most prominent Infrastructure As Code languages in Azure are Azure ARM (JSON), Bicep and Terraform. Each of them has its own methods and tools for testing the IaC configuration in the framework of Azure infrastructure and application testing.

Terratest is a tool to be used for testing Terraform modules. Terratest is implemented as a Go library. Terratest provides a collection of helper functions and patterns for common infrastructure testing tasks, like making HTTP requests and using SSH to access a specific virtual machine.
Bicep code can be tested by using Azure Pipelines or Powershell PSRule or even combine PSRule in Azure Pipelines.
ARM templates can be tested by employing the ARM template test toolkit. The Azure Resource Manager template (ARM template) test toolkit checks whether your template uses recommended practices. When your template isn't compliant with recommended practices, it returns a list of warnings with the suggested changes. By using the test toolkit, you can learn how to avoid common problems in template development. You can also test ARM Templates by using Pester and Azure DevOps.
Use Pester for Powershell scripting or Powershell DSC scripts.
Azure Policy Test Framework is a command line tool to test Azure Policy relying on Terraform + Golang.

Testing script template for Azure infrastructure and application testing

You can find a free testing script template to be utilized for Azure infrastructure and application testing in the "Free Downloads" section.