Upgrade & Maintain

The Problem

Even if we codify our infrastructure, that doesn't mean our job is done. Time needs to be spent updating components, adding features, and fixing bugs. Over time this churn can create significant tech debt or, worse, stagnation. Moreover, open source software isn't a silver bullet of 'free updates' (i.e., projects can become stale or abandoned).

Our Solution

We discuss many of the common tasks, best practices, and means of automation, along with guidance on how to lean on that automation without overdoing it early on. The trick with ownership is to take things in stride, starting slow and building up automation.

What needs to be maintained

Making decisions

Read through your ADR documents and make sure that they are up to date. When you make a deliberate effort to change something, try in earnest to document those changes. This doesn't mean you need to write up components, but certainly express when one technology should be preferred over another, and discuss patterns that your team should adopt and rehearse.

Good ADR Examples:

Technology choices (adding, removing)
Scaling up infrastructure (new orgs, new regions, etc)
Analysis, direction, or principles of patterns

Creating components

We have a separate guide on authoring components that can guide you through the methods, but we should also talk about the impact on maintenance separately.

Make sure your component doesn't already exist and that includes checking for PRs. Maximize the use of modules by searching the Terraform Registry and remember that there is a high cost to components:

Keep your providers up to date:
- GitOps and Spacelift solutions will warn you about deprecations in their logs. Read them and diligently create issues for them.
Dependencies should be carefully considered:
- Avoid mixing global and regional resources. Two smaller components will compose better.
If you need to make many instances of a resource, consider DRYing that up with Atmos:
- e.g., if you need to make 4 VPCs, make 4 instances of a component that produces a VPC
- It is significantly easier to DRY up configuration with Atmos than with Terraform
Maintenance will include disabling and enabling components. Make sure that your component respects the enabled flag, or it could be very difficult to update and extend.
Consider versioning and maintaining components outside of your infrastructure repo. If you plan for other organizations to use your component, make sure you practice vendoring.
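As a minimal sketch of what respecting the enabled flag can look like, assume a hypothetical component that manages a single S3 bucket. The variable, resource, and output names below are illustrative and not taken from any particular component:

```hcl
# Hypothetical component: every name here is an assumption for illustration.
variable "enabled" {
  type        = bool
  default     = true
  description = "Set to false to remove every resource this component manages."
}

variable "name" {
  type        = string
  description = "Name of the bucket."
}

resource "aws_s3_bucket" "default" {
  # count gates creation: zero instances exist when the component is disabled.
  count  = var.enabled ? 1 : 0
  bucket = var.name
}

output "bucket_id" {
  # one() returns null for an empty list, so the output stays valid when disabled.
  value = one(aws_s3_bucket.default[*].id)
}
```

Gating every resource and guarding every output this way means that setting enabled to false plans cleanly, which makes a later destroy far less likely to surprise you.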

Updating components

Components in Atmos support vendoring. This means you can version them independently of your infrastructure to best manage the operational cost of updating them. Make sure you read up on how vendoring works in Atmos and carefully read release information for risks and breaking changes.

Updating infrastructure

When you are working on altering your stacks folder, Atmos has several features to help manage the sprawl. Be sure to read up on how Atmos manages stacks.

Some key patterns for success while maintaining stacks:

Validate stacks often while configuring them
Use the describe command to look at imported files
Try to dry up catalog entries after atmos terraform plan is working, not before
- Often, catalog patterns emerge once your components are configured in many environments
- Mixin and layer patterns emerge over many PRs and with maturity. Rushing them can lead to significant tech debt

Operational Headaches

Some situations you should plan for include:

Expect atmos terraform destroy to fail. Test with enabled=false, then destroy.
Updating runners for things like GitOps, GitHub Actions, or Spacelift can be a catch-22. Carefully consider that while you replace them, they could destroy themselves or otherwise mess up state locks.
What order of operations does a set of infrastructure pieces take?
- Document all required ClickOps. Many platforms, AWS included, still require some manual steps
- Tools like Spacelift understand dependencies. You can use Atmos to make sure they are tracked
- Consider Atmos Workflows when steps need to manage resources in peculiar fashions such as using the Terraform -target flag
It's always easier to add/remove than to mutate. Prefer replacing components whenever you are making complex changes.
If availability or global dependencies are a concern:
- ADR Docs should be present to discuss risks and describe how they are mitigated
- Consider using a new stage. The AWS Well-Architected Framework and the twelve-factor methodology describe the patterns of a good platform. You are maintaining a platform.

Secrets rotation

Simply put, SSM Parameter Store is very helpful, but it won't let you know about rotation and drift.

Consider using developer automation features in 1Password to help with secret rotation
You can also use an ADR to document credentials that should expire and when
Atmos workflows can be used to rotate secrets
Terraform has the time_rotating resource
Make sure you are using bot accounts or applications for GitHub secrets. If any of the tokens in your 1Password vault are personal, expect them to break when that person's access changes. You can use the gh auth status command from the gh CLI to verify the user of each token.
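The time_rotating resource mentioned above could be wired up roughly as follows. This is a sketch under assumptions: the 90-day window and the resource names are hypothetical, and the generated password stands in for whatever secret you actually rotate:

```hcl
terraform {
  required_providers {
    time = {
      source = "hashicorp/time"
    }
    random = {
      source = "hashicorp/random"
    }
  }
}

# Assumption: a 90-day rotation window; use whatever your ADR specifies.
resource "time_rotating" "service_token" {
  rotation_days = 90
}

# Hypothetical secret: regenerated whenever the rotation timestamp changes.
resource "random_password" "service_token" {
  length = 32

  keepers = {
    rotation = time_rotating.service_token.id
  }
}
```

Note that the rotation only happens when Terraform runs a plan or apply after the window has passed, so this still relies on a scheduled run or drift detection to take effect.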

Automation and tooling

Renovate

Renovate is a Swiss Army knife for keeping abreast of changes in open source software. Some leading patterns and best practices include:

Renovate can watch for releases and notify you on a dashboard
Renovate can watch Dockerfiles
Renovate can notify you about end-of-life cycles
- Make sure you consider your platform, such as Kubernetes or AMI distributions
- Consider the aforementioned "dashboard" feature so you avoid alert fatigue
Renovate can watch Terraform modules
Renovate can watch Terraform providers

Since it can be daunting to configure Renovate for everything, we recommend starting with only the basic and most crucial sources of tech debt:

Make sure Geodesic updates create PRs
- You'll get a lot of automated updates from this alone, including patches to Terraform and aws-cli
Create module and provider rules for custom components
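For orientation, the pins that module and provider rules act on in a custom component typically look something like the sketch below. The version numbers are placeholders rather than recommendations, and the module inputs are illustrative values:

```hcl
terraform {
  required_version = ">= 1.3.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # placeholder constraint
    }
  }
}

module "vpc" {
  source  = "cloudposse/vpc/aws"
  version = "2.1.0" # placeholder version

  # Illustrative inputs only.
  name                    = "example"
  ipv4_primary_cidr_block = "10.0.0.0/16"
}
```

Keeping versions explicit like this gives an update tool a concrete value to bump in a pull request, which is easier to review than a floating constraint.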

UpdateCLI

UpdateCLI is a tool that can be used to update many different types of software and can implement auto-discovery. While the configuration is more complex than Renovate, it can be customized to do much more in-depth automation.

Considerations:

Auto-discovery quickly leads to alert fatigue. Consider it for sources with high churn.
Updating stacks is possible, and you can even update AMI searches or database versions, but make sure you have a good understanding of the impact of the change before you automate it.

Atmos Component Updater

Atmos has a Component Updater which can be enabled as a GitHub Action.

The Atmos Component Updater will automatically suggest pull requests in your new repository. To do so, we need to create and install a GitHub App and allow GitHub Actions to create and approve pull requests within your GitHub Organization. For more on the Atmos Component Updater, see atmos.tools.

Ensure all requirements are met.
Set up a GitHub App with permission to create Pull Requests. We use a GitHub App because Pull Requests will only trigger other GitHub Actions workflows if the Pull Request is created by a GitHub App or PAT.
Create a GitHub Environment. With environments, the Atmos Component Updater workflow will be required to follow any branch protection rules before running or accessing the environment's secrets. Plus, GitHub natively organizes these Deployments separately in the GitHub UI.

Maturing infrastructure

Many of the topics above concern maturing infrastructure. As you grow, you will find many patterns in how your platform responds to business needs.

This takes time.

Make sure that you hold regular retrospectives on your platform. Patterns to consider for maturing your infrastructure include:

Monthly meetings to sync on tech debt, outages, and vulnerabilities
Rotating ownership of components
Reviewing telemetry and auditing PRs


FAQs

How can I quickly patch a newly vendored component?

We recommend using Terraform override files to quickly patch a component.
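As a rough sketch, assume the vendored component defines a resource addressed as aws_vpc.default (a hypothetical name). Terraform loads files named override.tf or ending in _override.tf after the regular files and merges their blocks into the matching ones, so a patch placed next to the vendored files might look like this:

```hcl
# File name assumption: main_override.tf, placed in the component directory.
# Only the arguments listed here are changed; everything else in the
# vendored definition of aws_vpc.default is left as it is.
resource "aws_vpc" "default" {
  enable_dns_hostnames = true
}
```

Because the patch lives in its own file, the vendored source stays unmodified and the difference from upstream is easy to see in review.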

What if the resources in my component need to move after vendoring?

Consider using Terraform moved configuration, understanding that the state commands can also be codified in Atmos workflows.
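A moved block records the rename in configuration so that Terraform updates state on the next apply instead of destroying and recreating the resource. The addresses below are hypothetical and assume a resource was renamed from main to default:

```hcl
# Requires Terraform 1.1 or later. Both addresses are illustrative.
moved {
  from = aws_vpc.main
  to   = aws_vpc.default
}
```

The same form works when a resource moves into a module, for example with a target address such as module.vpc.aws_vpc.default.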

Should I teach my infrastructure to update itself?

It's best to first do as much manual work as possible. Once you feel like you have a well-analyzed pattern, consider making a PR that adds an ADR and discussing it there. If the ADR holds up to criticism, it should encapsulate what you plan to automate.

Developers want to iterate on infrastructure. How do I manage this?

If developers want to use your platform in a way that affects Terraform state:
- Can these resources be released from state? Then give developers access with aws-teams
If developers want to codify their own infrastructure outside of your platform:
- Do they just need extra environments? Codify them using Atmos template imports
- Can they manage components in a separate repo? Then vendor their component repo. They can use the sandbox account to test their components.

Mostly, the platform you make will need room to iterate, but this can get costly quickly. Make sure to start small and set goals to drive when you can increase the cost of the platform.
