The GOV.UK Replatforming team are working on rehosting GOV.UK. We're moving from AWS EC2 to Fargate, a serverless compute engine, with Elastic Container Service (ECS) or Elastic Kubernetes Service (EKS) as the orchestration service.
One goal of this migration is to reduce the toil - work that is manual, repetitive, provides no enduring value and we can automate. And as we manage the infrastructure for around 50 applications, we have plenty of work classified as toil.
Orchestration over Puppeteering
- set up and control every server GOV.UK runs on
- configure our monitoring and alerting (Collectd, Statsd, Graphite, Nagios, Icinga, Grafana)
- back up databases
- configure app environments
Moving from running virtual machines (VMs) to containers requires a change to our configuration management toolchain. In the future, we'll stop using Puppet to configure long-lived VMs. Instead, we'll use Terraform and OCI images to run short-lived containers on Fargate.
Opportunity for improvements
Replacing our Puppet configuration gives us an opportunity to improve large parts of the infrastructure behind GOV.UK.
A big source of toil is how we manage credentials with Puppet.
The majority of the apps we run require credentials such as API tokens and database passwords. Puppet is responsible for configuring apps and, since we like to run 12-factor apps, credentials are provided to apps through environment variables. All secrets provided to apps are managed in a git repository using heira-eyaml which is read at deploy time by Puppet.
Internal API requests between apps (such as a request to publish a page on GOV.UK) often require an authorization token. Tokens are issued to applications by Signon, our OAuth 2.0 authorization server.
Setting an expiry time on authorization tokens is recommended in the OAuth 2.0 specification. If a token is only valid for a short time this can reduce the impact should the token be leaked.
Unfortunately, adding, rotating, and deleting a secret is a manual process. Rotating a secret involves:
- Generating a new token
- Updating the token in hiera-eyaml
- Having a code review process
- Deploying Puppet to our Integration, Staging, and Production environments
This process is manual, repetitive, reactive, and scales linearly as we add more applications and app secrets. When credentials must be routinely created or rotated, this distracts developers from more valuable engineering work.
This process creates a significant amount of toil, as we have around 1,800 secrets in hiera-eyaml across all environments. Moreover, there's always a temptation to workaround the toil that rotation brings by creating authorisation tokens with a very long validity period.
Manually generating secrets is particularly painful when bringing up a new environment of GOV.UK, as we have needed to do when building a prototype version of GOV.UK in ECS.
We're also exploring providing cloud-hosted developer environments, to enable developers to spin up a personal, isolated instance of GOV.UK. Unfortunately, creating an instance of GOV.UK requires manually generating many new secrets, which makes spinning up new environments on the fly really expensive.
Automating secret creation
Rather than manually manage thousands of credentials, we've decided to automate the creation and provisioning of secrets.
We've chosen AWS SecretsManager as the replacement for our git-based secret store. SecretsManager integrates tightly with ECS and is a supported provider for Kubernetes Secrets Store.
Signon, GOV.UK's OAuth2 based single sign-on provider, generates about a third of our credentials for applications.
Currently around 600 credentials are issued by Signon, and are manually created and rotated in our git-based secret store.
We replaced this manual process by automatically generating Signon resources such as OAuth applications, users, and authorisation tokens upon creation of a GOV.UK namespace.
We considered implementing a Terraform provider for Signon. We chose not to do this because it is quite easy for credentials handled by Terraform to be stored in the Terraform state file.
Once an application's secrets have been set in SecretsManager, we deploy the app to ECS. When tasks are started, ECS pulls secrets from SecretsManager through a native integration
We also found that around 10% (178) of our credentials are the Rails-specific secret_key_base. These have until now been generated manually for each instance of GOV.UK. To save us from creating these credentials manually, we introduced a job into our CD system to generate app secret_key_base secrets during bootstrapping.
Around 15% (300) of our credentials provide access to AWS resources. We expect these to be replaced by AWS IAM (Identity and Access Management) roles or instance profiles in future.
The final third of our secrets are miscellaneous credentials for third-party services such as Sentry and GOV.UK Notify. The patterns we introduced into our CD system should make it simpler to automatically generate credentials in the future for services that enable this.
Automating secret rotation
Generating secrets automatically is useful for bootstrapping a GOV.UK namespace. To further reduce toil from managing credentials, we've begun automating our credential rotation processes. Signon authorisation tokens are currently the only credential that automatically expires, so we have started there.
When authorisation tokens are near expiry, SecretsManager triggers a Rotation Lambda that interacts with Signon and SecretsManager to issue a new secret version.
While the process roughly follows the recommended process in the AWS documentation, there are a couple of rough edges to the Rotation Lambda that we needed to work around.
One problem we hit is that ECS does not automatically restart Tasks when their secrets have been rotated. To solve this, we have the Rotation Lambda record an event when it has rotated a credential, which triggers a redeployment by our CD system.
While this is a problem that must be solved idiosyncratically when using ECS, the Kubernetes Secrets Store CSI Driver has a Rotation Reconciler feature in alpha which might solve this problem for Kubernetes.
Another problem with the Rotation Lambda is that the SecretsManager trigger has a conservative retry strategy which is not configurable. If an initial createSecret call does not succeed within 5 minutes, the Lambda will be retried the following day. We found that this can cause our bootstrapping process to stall, since the Lambda could fail to create authorisation tokens.
To work around this, our CD system performs the initial token creation step during bootstrapping. Ideally, the Lambda would be responsible for both creating and rotating authorisation tokens, so we hope that this will be a temporary workaround!
So far we've automated the generation of 40% of the credentials used by GOV.UK's applications and infrastructure. As we do other work, such as replacing AWS credentials with IAM roles, we expect that this will increase to around 70% by the completion of our project.
This has made it easier to automatically bring up new, isolated instances of GOV.UK's infrastructure, helping developers to be more productive. This will also make our credential rotation process more robust.
We're excited by the opportunity to further reduce the toil that we experience when managing GOV.UK's infrastructure.
Visit the GDS careers page if you're interested in coming and working with us!