This article has been originally posted on Swiftype Engineering blog.
For any modern technology company, a comprehensive application test suite is an absolute necessity. Automated testing suites allow developers to move faster while avoiding any loss of code quality or system stability. Software development has seen great benefit come from the adoption of automated testing frameworks and methodologies, however, the culture of automated testing has neglected one key area of modern web application serving stack: web application edge routing and multiplexing rulesets.
From modern load balancer appliances that allow for TCL based rule sets; local or remotely hosted varnish VCL rules; or in the power and flexibility that Nginx and OpenResty make available through LUA, edge routing rulesets have become a vital part of application serving controls.
Over the past decade or so, it has become possible to incorporate more and more logic into edge web server infrastructures. Almost every modern web server has support for scripting, enabling developers to make their edge servers smarter than ever before. Unfortunately, the application logic configured within web servers is often much harder to test than that hosted directly in application code, and thus too often software teams resort to manual testing, or worse, customers as testers, by shipping their changes to production without edge routing testing having been performed.
In this post, I would like to explain the approach Swiftype has taken to ensure that our test suites account for our use of complex edge web server logic
to manage our production traffic flow, and thus that we can confidently deploy changes to our application infrastructure with little or no risk.
Our Web Infrastructure
Before I go into details of our edge web server configuration testing, it may be helpful to share an overview of the infrastructure behind our web services and applications.
Swiftype has evolved from a relatively simple Rails monolith and is still largely powered by a set of Ruby applications served by Unicorn application servers. To balance traffic between the multitude of application instances, we use Haproxy (mainly for its observability features and the fair load balancing implementation). Finally, there is an OpenResty (nginx+lua) layer at the edge of our infrastructure that is responsible for many key functions: SSL termination and enforcement, rate limiting, as well as providing flexible traffic management and routing functionality (written in Lua) customized specifically for the Swiftype API.
Here is a simple diagram of our web application infrastructure:
Testing Edge Web Servers
Swiftype’s edge web server configuration contains thousands of lines of code: from Nginx configs to custom templates rendered during deployment, to complex Lua logic used to manage production API traffic.Any mistake in this configuration, if not caught in testing, could lead to an outage at our edge, and considering that 100% of our API traffic is served through this layer, any outage at the edge is likely to be very impactful to our customers and our business. This is why we have invested time and resources to build a system that allows us to test our edge configuration changes in development and on CI before they are deployed to production systems.
Testing Workflow Overview
The first step in safely introducing change is ensuring that development and testing environments are quarantined from production environments. To do this we have created an “isolated” runtime mode for our edge web server stack. All changes to our edge configurations are first developed and run in this “isolated” mode. The “isolated” mode has no references to production backend infrastructure, and thus by employing the “isolated” mode, developers are able to iterate very quickly in a local environment without fear of harmful repercussions. All tests are written to run as part of the “isolated” mode employ a mock server to emulate production backends and primarily focus on the unit-testing of specific new features that are being implemented.
When we are confident enough in our unit-tested set of changes, we could run the same set of tests in an “acceptance testing” mode when the mock server used in isolated tests is replaced with an Haproxy load balancer with access to production networks. Working on tests and running them in this mode allows us to ensure with the highest degree of certainty that our changes will work in a real production environment since we exercise our whole stack while running the test suite.
Testing Environment Overview
Our testing environment employs Docker containers to serve in place of our production web servers. The test environment is comprised of the following components:
- A loopback network interface on which a full complement of production IPs are configured to account for every service we are planning to test (e.g. a service foo.swiftype.com pointing to an IP address 10.1.0.x in production is tested in a local “isolated” testing environment with IP 10.1.0.x assigned to an alias on the local loopback interface). This allows us to perform end-to-end testing: DNS resolution, TCP service connections to a specific IP address, etc. without needing access to production, nor local /etc/hosts or name resolution changes.
- For use cases where we are testing changes that are not represented in DNS (for example, when preparing edge servers for serving traffic currently handled by a different service), we may still employ local /etc/hosts entries to point the DNS name for a service to a local IP address for the period of testing. In this scenario, we ensure that our tests have been written in a way that is independent of the DNS configuration, and thus that the tests can be reused at a later date, or when the configuration has been deployed to production.
- An OpenResty server instance with the configuration we need to test.
- A test runner process (based on RSpec and a custom framework for writing our tests).
- An optional Mock server. (As noted above, this might be docker in a local test environment, or in CI, and is likely to be used as part of the test runner process, where it emulates an external application/service; serves in place of a production backends; or acts as a local Haproxy instance running a production configuration and may even route traffic to real production backends.
Isolated Testing Walkthrough
Here is how a test for a hypothetical service foo.swiftype.com (registered in DNS as 220.127.116.11) is performed in an isolated environment:
- We automatically assign 18.104.22.168 as an alias on a loopback interface.
- We start a mock server listening on the localhost configured to respond on the same port used by the foo.swiftype.com Nginx server backend (in production, there would be haproxy on that port) with a specific stub response.
- Our test performs a DNS resolution for foo.swiftype.com, receives 10.1.0.x as the IP of the service, connects to the local Nginx instance listening on 10.1.0.x (bound to a loopback interface) and performs a test call.
- Nginx, receiving the test request, performs all configured operations and forwards the request to a backend, which in this case is handled by the local mock server. The call result is then returned by Nginx to the test runner.
- The test runner performs all defined testing against the server response: These tests can be very thorough, as the test runner has access to the server response code, all headers, and also the response body, and can thus confirm that all data returned meets each test’s specifications before concluding if the process as a whole has passed or failed test validation.
- Specific to isolated testing: In some use cases, we may validate the state of the Mock server, verifying that it has received all call we expected it to receive and that each call represented the data and headers expected. This can be very useful for testing changes where our web layer has been configured to alter requests (rewrite, add or remove headers, etc.) prior to passing them to a given backend.
Here is a diagram illustrating a test running in an isolated environment:
Acceptance Testing Walkthrough
When all of our tests have passed in our “isolated” environment, and we want to make sure our configurations work in a non-mock, physically “production-like” environment (or during our periodic acceptance test runs that must also run in a production mirroring environment), we use an “acceptance testing” mode. In this mode, we replace our mock server with a real production Haproxy load balancer instance talking to real production backends (or a subset of backends representing a real production application).
Here is what happens during an acceptance test for the same hypothetical service foo.swiftype.com (registered in DNS as 22.214.171.124):
- We automatically assign 126.96.36.199 as an alias on a loopback interface.
- We start a dedicated production Haproxy instance, with a configuration pointing to production backend applications, and bind this dedicated haproxy instance to localhost. (This exactly mirrors what we do in production, where haproxy is always a dedicated localhost service).
- Our test performs DNS resolution for foo.swiftype.com, receives 10.1.0.x as the IP of the service, connects to a local Nginx instance listening on 10.1.0.x (bound to a loopback interface), and performs a test call.
- Nginx, receiving a test request, performs whatever operations are defined and forwards it to a local Haproxy backend, which in turn sends the request to a production application instance. When a call is complete, the result is returned by Nginx to the test runner.
- The test runner performs all defined checks on the response and defines whether the call and response are identified as passing or failing the test.
Here is a diagram illustrating a test call made in an acceptance testing environment:
Using our edge web server testing framework for the past few years, we have been able to perform hundreds of high-risk changes in our production edge infrastructure without any significant incidents being caused by the deploying of an untested configuration update. Our testing framework provides us the assurances we need, such that we can make very dramatic changes to our web application edge routing (services that affect every production request) and that we can be confident in our ability to introduce these changes safely.
We highly recommend that every engineering team tasked with building or operating complex edge server configurations adopt some level of testing that allows the team to iterate faster without fear of compromising these critical components.