Operational Readiness
WORK IN PROGRESS
This page estimates the operational readiness required to claim ownership of the systems and processes. The level of ownership expected of you depends on your role and responsibilities.
If you’re new to all of this, you will want to start with Choose Your Path.
Organization
As an organization, you have the following in place:
-
Production Readiness Review (PRR) https://sre.google/sre-book/evolving-sre-engagement-model/#the-prr-model-DVsGhWZ
-
Roles, Responsibilities, & Ownership (RACI) for all Components in use
-
You have validated all playbooks (especially important for certain compliance frameworks)
-
You have established SLOs in Datadog for all mission-critical business functions and educated the teams on their value
-
You are aware of your AWS spend and have budgets in place
-
You have a process in place for managing all SaaS/vendor relationships and enterprise agreements
-
You have a formalized incident response process for Application, Infrastructure and Security
-
You have a post-mortem review process
Developer
-
You are comfortable with adding new services to the CI/CD pipeline
-
You are leveraging preview environments (if applicable) as part of your development workflow
-
You know where to look for logs emitted from your service
-
You know how to add telemetry to your services
-
You are able to debug/triage services without direct access to the cluster or pods
-
You are able to log in to AWS using Leapp and experiment in the sandbox account
-
You are familiar with the 12-Factor App methodology (https://cloudposse.com/12-factor-app/)
-
You have developed a Dockerfile for your service(s) and understand what every line in that file does
-
You have developed a Helm chart (if you’re using EKS)
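For the logging and telemetry items in this list, one common pattern (in line with the 12-factor idea of treating logs as event streams) is to emit structured JSON to stdout and let the platform collect it. The sketch below uses only Python's standard library. The logger name `checkout`, the `context` field, and the `order_id` value are hypothetical, and your log pipeline may expect different field names.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Keys passed through `extra=` become attributes on the record.
        # "context" is a hypothetical field name chosen for this sketch.
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("order placed", extra={"context": {"order_id": "A-123"}})
```

Writing one JSON object per line keeps the service unaware of where logs end up, which makes it easier to find and filter them later without direct access to the cluster or pods.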
Operations
-
You are able to debug/triage services without direct access to the cluster or pods
-
You are able to find the logs for services and filter for what you need
-
You are comfortable upgrading EKS with Addons
-
You know how to develop (and have developed) a new Terraform component and add it to stacks
-
You can explain the core Concepts and use them as part of your design process
-
You are current on all of the Conventions used
-
You are familiar with all the backup policies and how to restore backups
-
You are familiar with the upgrade process for all managed services (e.g. RDS, MSK)
-
You are regularly performing upgrades to keep everything current (see How to Keep Everything Up to Date)
-
You are using Spacelift on a daily basis as part of your job
-
You are able to define new IAM roles and configure/manage AWS SSO/SAML
-
You are able to authenticate with all required systems (e.g. AWS, Datadog, OpsGenie, Spacelift)
-
You are familiar with the DNS architecture and how to work with vanity domains and service discovery domains
SRE
-
You are able to debug/triage services without direct access to the cluster or pods
-
You are able to authenticate with all required systems (e.g. AWS, Datadog, OpsGenie, Spacelift)
-
You are able to open up AWS support tickets as needed
-
You are able to open up support tickets with all external vendors
-
You are able to open up internal support tickets and know how to triage them
-
You are aware of and help maintain SLOs
-
You are writing RCAs and post-mortems for incidents
-
You are able to develop new monitors, dashboards, SLOs, synthetics in Datadog
-
You are able to configure incident management in OpsGenie through IaC to control escalation paths
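Maintaining SLOs mostly comes down to error-budget arithmetic: a target of 99.9% leaves 0.1% of requests as the allowed failure budget for the window. The following is a minimal sketch of that calculation in plain Python. It does not call the Datadog API; the function names and sample numbers are hypothetical.

```python
def allowed_failures(target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (1.0 - target) * total_requests


def budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = allowed_failures(target, total_requests)
    return 1.0 - failed_requests / allowed


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
    remaining = budget_remaining(0.999, 1_000_000, 250)
    print(f"error budget remaining: {remaining:.1%}")
```

With these sample numbers the budget is about 1,000 failed requests, so 250 failures leave roughly three quarters of it unspent. A negative result means the SLO was missed for the window.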
Security
-
You are able to ingest CloudTrail logs and identify, classify, and attribute events
-
You are familiar with the IAM architecture and implementation of custom roles
-
You are aware of and able to manage AWS SSO or AWS Federated IAM with integrations to your IdP
-
You know where to look for, and are able to respond to, events surfaced by AWS Security Hub, GuardDuty, etc.
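As a rough illustration of identifying, classifying, and attributing CloudTrail events, the sketch below reads a CloudTrail log document (a JSON object with a top-level `Records` array) and counts events per principal and category. The `read`/`write`/`denied` categories and the error-code matching are assumptions made for this example, not an AWS-defined taxonomy, and error codes vary by service.

```python
from collections import Counter


def attribute(record: dict) -> str:
    """Pick the most specific principal identifier present on the event."""
    identity = record.get("userIdentity", {})
    return (
        identity.get("arn")
        or identity.get("principalId")
        or identity.get("type", "unknown")
    )


def classify(record: dict) -> str:
    """Bucket an event as denied, write, read, or unknown (assumed scheme)."""
    error_code = record.get("errorCode", "")
    # Assumption: treat any AccessDenied/Unauthorized-style code as a denial.
    if "AccessDenied" in error_code or "Unauthorized" in error_code:
        return "denied"
    read_only = record.get("readOnly")
    if read_only is True:
        return "read"
    if read_only is False:
        return "write"
    return "unknown"


def summarize(log_document: dict) -> Counter:
    """Count events by (principal, category) across all records."""
    counts: Counter = Counter()
    for record in log_document.get("Records", []):
        counts[(attribute(record), classify(record))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample records for illustration only.
    sample = {
        "Records": [
            {
                "eventName": "DeleteBucket",
                "readOnly": False,
                "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::111122223333:user/alice"},
            },
            {
                "eventName": "ListBuckets",
                "readOnly": True,
                "errorCode": "AccessDenied",
                "userIdentity": {"type": "AssumedRole", "principalId": "AROAEXAMPLE:bob"},
            },
        ]
    }
    for (principal, category), count in summarize(sample).items():
        print(principal, category, count)
```

In practice this kind of triage is usually done in a log platform or SIEM; the point here is only which record fields (`userIdentity`, `readOnly`, `errorCode`) carry the attribution and classification signal.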
Release Engineer
-
You are able to debug/triage workflows
-
You are able to tune and optimize GitHub Actions (if applicable)