Scaling Infrastructure for Growth
by Aaron Stahl on February 3, 2020
I joined Protenus in 2017, just as we were maturing from the startup phase to the growth phase of the business. One key improvement we identified to drive future growth was reducing the time it took to onboard new customers. This presented some interesting infrastructure challenges because, until this point, hosts and resources in our Amazon Web Services (AWS) account had been manually provisioned and configured to meet requirements set forth by various HIPAA rules and our stringent security policies.
Almost immediately, the DevOps team identified the need to rethink this process from the ground up. In an effort we dubbed “Infrastructure 2.0,” we set out to find an architecture that would reduce the provisioning of a new client to a single button click, if possible. Eventually, after much research (and a few long nights), we decided to fully embrace Infrastructure as Code principles and adopt HashiCorp’s Terraform tool.
Utilizing Infrastructure as Code allowed us to define our cloud resources in a configuration language that could be versioned and stored in source control management (in our case, git) just like any other code. We could then work to apply good design principles from the twelve-factor app methodology to our infrastructure, creating repeatable infrastructure deployments that could be used to grow the business.
For our tooling, we chose Terraform over similar tools for a few reasons. First, Terraform was under very active development, gaining new features almost weekly. Second, it supported AWS and many other providers, giving us plenty of options for the future. Providers such as Vault, Consul, Datadog, and Okta were pivotal to automating our full stack. Finally, Terraform had a large user community to turn to in case we ran into trouble and needed help.
Slice and Dice
Once we had a process and a tool, we needed to design the architecture in a way that allowed us to grow for the future. This meant really thinking about the services we needed to provide our customers--both external customers and internal customers. We decided to break apart our infrastructure into logical pieces and separate them into individual git projects. Doing so afforded us the ability to modify only a portion of our infrastructure at a time, helping to reduce blast radius and (mostly) avoid monolithic code repos.
We ended up agreeing on three main “levels” of our architecture:
- Base Infrastructure: Base configuration of an AWS account--think VPC, routes, peering, and low-level configurations.
- Utility Applications: The behind-the-scenes apps that our applications and automation rely on, such as Jenkins, an ELK cluster, service discovery, and secrets database.
- Per-Client Applications: A cookie-cutter style set of apps that powers the Protenus platform for each and every client (more on this to come).
The base infrastructure level of our environment serves as the foundation upon which all other projects operate. At Protenus, this is one master repo that provisions all AWS accounts with an initial set of resources and configurations. Here we create the VPC, routes between regions and accounts, NACLs, subnets, IAM settings, and other low-level configurations so other projects can function. Using Terraform here is invaluable because we can see exactly how our accounts are provisioned, down to how we assign IP space to the subnets. It also simplifies auditing, since all accounts are identical and contain the necessary pre-configured audit trails.
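To make the idea of auditable IP assignment concrete, here is a minimal Python sketch of carving a VPC range into equal-sized subnets in a deterministic order. This is not our actual Terraform code; the CIDR block, prefix length, and subnet names are all hypothetical, chosen only for illustration.

```python
import ipaddress


def carve_subnets(vpc_cidr: str, new_prefix: int, names: list[str]) -> dict[str, str]:
    """Assign consecutive, equal-sized subnets of vpc_cidr to the given names.

    The same inputs always produce the same plan, which is what makes the
    allocation easy to review and audit.
    """
    network = ipaddress.ip_network(vpc_cidr)
    available = network.subnets(new_prefix=new_prefix)
    plan = {}
    for name in names:
        try:
            plan[name] = str(next(available))
        except StopIteration:
            raise ValueError(f"{vpc_cidr} has no /{new_prefix} left for {name}") from None
    return plan


# Hypothetical VPC range and subnet names, for illustration only.
plan = carve_subnets(
    "10.20.0.0/16", 20, ["public-a", "public-b", "private-a", "private-b"]
)
for name, cidr in plan.items():
    print(f"{name}: {cidr}")
```

Because the allocation is a pure function of its inputs, two accounts provisioned from the same definition end up with the same layout, and any change to it shows up as a diff in source control.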
The utility applications tier is perhaps my favorite level of our infrastructure. The apps contained within this level are often one-offs and sometimes overlooked. Jenkins is one such app. Our Jenkins repository lets us build out this CI tool via helper scripts (a classic chicken-and-egg problem). Once Jenkins is running, it serves as the foundation by which all other apps and stacks are created in our infrastructure.
Additional stacks in this tier include secrets management and service discovery clusters (Vault and Consul) as well as logging (ELK) and a host of security tooling. Even our container orchestration platform (ECS) is a utility application--we define a cluster per account, each configured to suit that account's own needs. Because each of our tools is defined in its own repository, we can easily perform maintenance and updates without disrupting other components.
The most distinctive portion of our infrastructure is our per-client stacks. Early in the design for “Infrastructure 2.0,” we identified the need to scale horizontally to meet our growth needs. After all, tightly coupling stack components only leads to heartbreak when you hit a performance issue or there aren’t any larger EC2 instances to support your needs…
Here at Protenus, we created one “cookie-cutter” repo that handles all resources relating to a client. This Terraform repository contains multiple modules that can be enabled or disabled and configured on a client-by-client basis. We then store configs in a separate repository so that we can easily add more clients by simply adding a directory for them with associated settings. To reduce config duplication (and keep us sane), we keep a default config set and only require that clients override settings that deviate from these defaults.
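The defaults-plus-overrides idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not our actual config format: the setting names, module names, and instance types below are hypothetical.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict: defaults with overrides layered on top.

    Nested dicts are merged key by key; any other value in overrides
    replaces the default outright. Neither input is modified.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical shared defaults: every client gets these unless overridden.
DEFAULTS = {
    "instance_type": "m5.large",
    "modules": {"analytics": True, "reporting": True, "sftp_ingest": False},
}

# Hypothetical client directory contents: only the deviations are listed.
client_overrides = {
    "instance_type": "m5.2xlarge",
    "modules": {"sftp_ingest": True},
}

config = deep_merge(DEFAULTS, client_overrides)
enabled_modules = sorted(name for name, on in config["modules"].items() if on)
print(config["instance_type"], enabled_modules)
```

The client file stays short because it lists only what differs, and changing a default in one place updates every client that has not overridden it.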
Using this methodology, we can easily scale out to any number of clients by simply running this Terraform through Jenkins for the client. This produces a cell architecture for our client resources, isolating them and providing security and horizontal scalability.
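As a rough sketch of what a per-client run might look like, the Python below builds the Terraform commands a CI job could execute for one client. It assumes, purely for illustration, one Terraform workspace per client (already created) and a config layout of `client-configs/defaults.tfvars` plus `client-configs/<client>/overrides.tfvars`; none of these names come from our real setup. It relies on Terraform applying `-var-file` arguments in order, so values in later files take precedence over earlier ones.

```python
from pathlib import PurePosixPath


def terraform_commands(client: str, config_root: str = "client-configs") -> list[list[str]]:
    """Build the argv lists for one client's run (hypothetical layout).

    Commands are returned rather than executed so a CI job can log them
    and then run each one with its own tooling.
    """
    if not client or "/" in client or client.startswith("."):
        raise ValueError(f"invalid client name: {client!r}")
    root = PurePosixPath(config_root)
    var_files = [root / "defaults.tfvars", root / client / "overrides.tfvars"]
    select = ["terraform", "workspace", "select", client]
    apply = ["terraform", "apply", "-auto-approve"]
    # Defaults first, client overrides second, so the overrides win.
    apply += [f"-var-file={path}" for path in var_files]
    return [select, apply]


# "acme-health" is a made-up client name.
commands = terraform_commands("acme-health")
for argv in commands:
    print(" ".join(argv))
```

Keeping each client in its own state means a failed or slow run affects only that client's cell.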
Even with all of our tiering and segmentation of applications at Protenus, we continue to evolve. As our healthcare compliance analytics platform expands, we are constantly adding new features and apps to our infrastructure, affording us the opportunity to revisit old code and make improvements. Also, the Protenus team is growing, bringing a wealth of new ideas and fresh thinking about ways to gain efficiency or improve performance.
If you’ve got any questions or just want to talk about scalable infrastructure, reach out to me via LinkedIn or drop a comment down below.