Using elastic scaling to achieve fast smoke tests
Abstract

Webomates is preparing to release Boost, a new feature that speeds up smoke tests to execute hundreds of test cases in 15 minutes. It eliminates the classic tradeoff between speed and coverage. To achieve this, we redesigned our SaaS architecture to incorporate elastic scaling on Amazon Web Services (AWS).

Background

Software engineers define a ‘smoke’ test as a quick health check on one or more software components, typically after a deployment, but also after a significant configuration change. This avoids wasted effort in moving a failed build or configuration change down the pipeline. After a successful smoke test, the engineers execute more extensive integration, feature, or regression tests.

But creating a good smoke test has always been a challenge—the ‘quick’ and the ‘check’ are at odds. Since most of us want a smoke test to complete in minutes, not hours, we typically trade off software coverage for execution speed. And if a smoke test is at the system or application level, we are usually talking GUI tests, i.e. slow relative to unit or API tests. We have to be very choosy. In our experience, system smoke test suites typically execute test cases numbering in the single digits, certainly not in the hundreds.

Introducing Boost

At Webomates, we didn’t like that tradeoff any more than you do, so we broke out of that box. We built a SaaS that executes hundreds (and soon thousands) of GUI and API tests in under 15 minutes. This add-on feature to our CQ (Continuous Quality) service is called Boost. This article describes how we built this capability on Amazon Web Services (AWS).

Boost-enabled cycles can be launched by a human from the CQ Portal or by a Continuous Integration/Continuous Deployment platform, e.g. Jenkins or Bamboo, via the Webomates CI/CD API. You can monitor progress in real time; results show live in the CQ Portal, or the CI/CD requester can poll periodically. Since smoke mode is fully automated, our engineers do not validate defects. You analyze the results at the end and decide whether to proceed with your promotion or go back, fix issues, and try again.
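For a CI/CD requester, the polling loop can be as simple as the sketch below. The status names and the `fetch_status` callable are assumptions for illustration; in practice `fetch_status` would wrap a call to the Webomates CI/CD API.

```python
import time

def poll_until_complete(fetch_status, timeout_s=900, interval_s=15):
    """Poll a test-cycle status source until it reports a terminal state.

    fetch_status: any callable returning a status string (here we assume
    "RUNNING" / "PASSED" / "FAILED"; the real API's values may differ).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("PASSED", "FAILED"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("test cycle did not complete within the SLA window")
```

A pipeline step would call this after triggering the cycle and gate the promotion on the returned status.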

The key differentiator between pre-Boost and Boost is the management of the test nodes, or in the aggregate, the test fleet. Before Boost, the test fleet was non-elastic, or static: manually provisioned, typically two nodes per customer product. The test fleet was almost always idle. With Boost, the fleet is elastic: dynamically created and destroyed as needed, on a per-test-cycle basis.

Fundamentally, Boost was a scaling challenge. If we want something to go faster, we need more compute power, i.e. scale up (the same number of nodes, but each bigger) or scale out (more nodes of the same size). We used our cloud provider’s compute elasticity and a test domain-specific architecture to solve this challenge.

Elastic Scaling

First, let’s define elasticity more precisely:

Originating from the field of physics and economics, the term elasticity is nowadays heavily used in the context of cloud computing. In this context, elasticity is commonly understood as the ability of a system to automatically provision and deprovision computing resources on demand as workloads change. [1]

This is more than just scalability; it is autonomic scaling, or the capability of the system to self-regulate [2]. Pre-cloud systems typically scaled vertically, meaning more resources (processor, memory, storage, etc.) were added to individual hosts. This activity was typically manual, since it involved physical hardware modifications. Data center virtualization made vertical scaling simpler, but not fast or dynamic enough to be real-time adaptive. Cloud systems scale horizontally, meaning homogeneous compute instances (usually virtual machines) are added or removed, using sensors and rules.

Amazon Web Services (AWS) popularized the term autoscaling based on the capability in their core EC2 service. An autoscaling system can react to an activity stimulus, for example CPU utilization or inbound request rate, or proactively change based on a day-of-week and time schedule.
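The reactive style can be sketched as a target-tracking rule: size the fleet so the average of some metric settles near a target. The function below is an illustrative rule of thumb, not AWS's actual algorithm; the 60% target and node bounds are assumptions.

```python
import math

def desired_capacity(current_nodes, cpu_utilization, target=0.6,
                     min_nodes=1, max_nodes=20):
    """Target-tracking sketch: if the fleet currently averages
    cpu_utilization, how many nodes would bring the average near target?
    All thresholds here are illustrative."""
    if cpu_utilization <= 0:
        return min_nodes  # idle fleet shrinks to the floor
    wanted = math.ceil(current_nodes * cpu_utilization / target)
    return max(min_nodes, min(max_nodes, wanted))
```

A scheduled (proactive) policy would instead just map a day/time to a fixed capacity.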

Webomates fleet sizing inputs aren’t simple metrics like current HTTP request volume or CPU, rather they are more domain-specific: the number of current test cycles, and the size of those cycles. There are inherent challenges: forecasting execution time per test case, determining compute needs and provisioning it, allocating computation to compute nodes, not to mention orchestrating all of this activity. We blended the elastic scaling capabilities of AWS (Amazon Web Services) with our existing automation platform, and added some special sauce. The key technology component of this sauce is the test case Scheduler. It manages the forecasting and allocation of test cases to test nodes in order to meet the execution target.

Before we get into that, let’s first look at how we previously ran automation tests.

Non-Elastic Fleets

A request for a new test cycle from the API or Portal is sent to an Orchestrator. The Orchestrator is responsible for setting up the infrastructure, starting the tests when all infrastructure elements are ready, collecting results, and on completion, tearing down that infrastructure.

The infrastructure consists of:

  • The Executor (our script engine)
  • The Distributor (distributes requests to application instances)
  • The Test Fleet (hosts the target applications)

The Executor runs the test scripts, sends commands and queries to the Distributor, and as it executes each test case, sends results back to the Orchestrator. The Distributor works on a centralized hub-and-spoke model. As test nodes boot up, the resident Distributor agents register with the hub, announcing their availability. Once all nodes are up, the Distributor handles allocation of work (commands and queries) to free nodes.
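The hub-and-spoke pattern boils down to a registry of free and busy nodes. The sketch below is an illustrative model of that behavior, not the actual Distributor; the class and method names are invented.

```python
from collections import deque

class DistributorHub:
    """Hub-and-spoke sketch: agents register as their nodes boot, and work
    (commands and queries) is handed to the next free node."""

    def __init__(self):
        self.free = deque()   # registered nodes awaiting work
        self.busy = set()     # nodes currently executing a command

    def register(self, node_id):
        """An agent announces its node's availability to the hub."""
        self.free.append(node_id)

    def dispatch(self, command):
        """Hand a command to a free node; None means the caller must retry."""
        if not self.free:
            return None
        node = self.free.popleft()
        self.busy.add(node)
        return node, command

    def complete(self, node_id):
        """Node finished its command; return it to the free pool."""
        self.busy.discard(node_id)
        self.free.append(node_id)
```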

Diagram 1: Non-Elastic Fleet Start

The test node could be virtual or physical and runs a native or browser-based application. In this article we will focus on virtual machines (VMs) running browser-based applications. We run multiple copies of the browser simultaneously on a mid-level VM, as that is more cost-efficient than running a single application on each of many small VMs.

Diagram 2: Non-Elastic Fleet Stop

The approach works, but performance is capped by limited compute resources. Expanding the number of instances adds labor as well as vendor costs (even stopped EC2 instances incur storage fees). So let’s look at the cloud-native approach.

Elastic Fleets

With Boost, test fleets are created (scaled out) on demand. The Scheduler provides the intelligence to launch fleets and allocate test cases to nodes in that fleet.

Diagram 3: Elastic Fleet Launch

And when the cycle completes, all the infrastructure is destroyed (scaled in to zero). No test nodes are ever idle.

Diagram 4: Elastic Fleet Teardown

We faced three design challenges:

Platform Management

Instead of per-customer product images, we created a collection of common images based on a ‘Platform’: a unique operating system and browser combination, for example Windows 7 / IE 11, or Windows 10 / Chrome 65. Note that a test cycle could spin up more than one Test Fleet if it requires multiple Platforms. We also achieved a higher density of browser instances per node through experimentation and tuning.
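Conceptually, Platform management reduces to a lookup from (OS, browser) pairs to shared machine images, with one fleet launched per distinct Platform a cycle needs. The mapping below is a sketch; the image IDs are placeholders, not real AMIs.

```python
# (OS, browser) -> machine image ID. IDs are placeholders for illustration.
PLATFORM_IMAGES = {
    ("Windows 7", "IE 11"): "ami-PLACEHOLDER-win7-ie11",
    ("Windows 10", "Chrome 65"): "ami-PLACEHOLDER-win10-chrome65",
}

def fleets_for_cycle(required_platforms):
    """Return the distinct images to launch: one Test Fleet per Platform
    required by the cycle, deduplicated."""
    images = set()
    for platform in required_platforms:
        if platform not in PLATFORM_IMAGES:
            raise ValueError(f"no common image for platform {platform}")
        images.add(PLATFORM_IMAGES[platform])
    return images
```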

Forecasting and Allocation

Forecasting flows backwards from the SLA. We take the execution window and subtract the overhead of the prepare (infrastructure setup) and finalize (infrastructure teardown and analysis) phases. Our 15-minute SLA is reduced to approximately 12 minutes. All test cases now have to complete in 12 minutes, less a safety buffer.
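The backwards calculation looks like this; the specific overhead and buffer values are illustrative, not measured Webomates figures.

```python
def execution_budget_s(sla_s=15 * 60, prepare_s=90, finalize_s=90, safety_s=60):
    """Work backwards from the SLA: subtract prepare/finalize overhead and a
    safety buffer to get the window all test cases must fit in.
    Default overhead values are assumptions for illustration."""
    budget = sla_s - prepare_s - finalize_s - safety_s
    if budget <= 0:
        raise ValueError("phase overheads consume the entire SLA")
    return budget
```

With a 15-minute SLA and roughly 3 minutes of prepare/finalize overhead, that leaves about 12 minutes, and the safety buffer comes out of those 12.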

Performance Management

We knew during design that meeting the SLA consistently was going to be a challenge. We built in lots of telemetry: recorded metrics to enable detailed audit and triage. The prepare and finalize phases do have variation, but the bulk of it is in the execution phase.

Dynamic configuration also helped us tune iteratively: we can change global and per-test-suite scheduling parameters without restarting or redeploying the system.
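One common shape for this is a layered config that is re-read from a mutable store on each lookup, so operators can retune live. This is a sketch of the pattern, not our actual implementation; the key names are hypothetical.

```python
import json

class SchedulerConfig:
    """Dynamic configuration sketch: parameters are re-read from a mutable
    store on every lookup, so changes take effect without a restart.
    Per-suite values override global defaults. Key names are illustrative."""

    def __init__(self, read_store):
        self._read_store = read_store  # callable returning a JSON document

    def get(self, suite, key, default=None):
        cfg = json.loads(self._read_store())  # fresh read, no caching
        per_suite = cfg.get("suites", {}).get(suite, {})
        if key in per_suite:
            return per_suite[key]
        return cfg.get("global", {}).get(key, default)
```

In production the store would be a database row or parameter service rather than an in-memory string.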

Comparing Approaches

Looking at the two approaches, we see that there are clear advantages to the elastic approach:

| Approach | Costs | Technical Complexity | Speed | Scale |
| --- | --- | --- | --- | --- |
| Non-Elastic Fleets | Ongoing onboarding | Medium | Slow | Medium |
| Elastic Fleets | One-time R&D | High | Fast | High |

The cloud provider costs are roughly even! Although we need more compute resources for infrastructure (orchestration, scheduling and execution), we need less long term storage since there are no stopped/idle test nodes. The old approach had platform creation costs incurred during new application onboarding. The primary costs of the new platform feature were one-time R&D.

The benefits of the new elastic approach are clear: speed and scale.

Conclusion

Engineering truly elastic systems requires not just a great cloud provider (services, tools, cost controls, and of course reliability) but also a resilient and scalable application architecture. We faced challenges along the way, namely with platform optimization, scheduling, and performance management. Ultimately we succeeded in juicing up the Webomates automation service’s performance. By shortening execution times while maintaining coverage, the Boost-enabled CQ Service helps customers reduce overall cycle times, enabling more frequent, higher quality deployments.

References

[1] “Elasticity in Cloud Computing: What It Is, and What It Is Not”, Herbst et al. https://sdqweb.ipd.kit.edu/publications/pdfs/HeKoRe2013-ICAC-Elasticity.pdf

[2] “Autonomic Computing”, Wikipedia. https://en.wikipedia.org/wiki/Autonomic_computing

Attributions

Orchestrator icon: Load Balancer Vector Image

Chrome logo: http://www.myiconfinder.com/icon/google-chrome-web-metro-ui-browser-browsers-browsing-internet-network-social-media/8353
Hub icon: CC Attribution needed: https://www.iconsdb.com/gray-icons/hub-icon.html
