How to Create a Migration Checklist

Problem

Solution

warning

This didn't export cleanly and needs to be reworked.

Pre-cutover Tasks

  • Update EKS Task with Elasticsearch URL
  • Ensure ALB ingress can support 2 active vanity domains
  • Create ACM component
  • Audit Security Groups
  • Implement Database Seeding & Migration Strategy
  • Create release pipeline
  • Redeploy bastion on public and private subnets
  • SSH access via bastion host
  • Restore most recent snapshot from production to staging
  • Attach additional scratch space to EKS tasks for web app
  • Bastion Host user-data/ACL updates
  • Investigate 502s
  • Tune EKS web app and tasks
  • Implement Lambda Log Parser for CloudWatch Logs
  • Update Spacelift to Trigger on Changes to TF Var files
  • Rename SSM Parameters for AWS_* to LEGACY_AWS_*
  • Add feature flag to disable scheduled tasks for Preview
  • Provision Read Replica For Production Database (used by Redshift)
  • Implement Scheduled Tasks
  • Implement Pipeline to Create Preview Environments
  • Decide on Cut Over Plan
  • Deploy sidekiq workers (high priority)
  • Containers should log to stdout
  • Decide on Pipeline Strategy
  • Implement registry/terraform/eks-web-app module
  • Deploy acme to Production (app.acme.com)
  • Integrate EKS Web App with CloudWatch Logs
  • Implement Vanity DNS with EKS Tasks
  • Deploy http://acme.com to Staging (app.acme.com)
  • Deploy http://acme.com as EKS Task with Spacelift
  • Create build pipeline
  • Reduce Scope of IAM Grants for GitHub Runners
  • Create deploy pipeline
  • ETL Postgres Databases to Bastion Instance
  • Import Staging Database to All RDS Clusters for Testing
  • Update Spacelift Config to Assume Role before Apply
  • Implement Preview Environment Destroy Pipeline
  • Increase GitHub Runners volume sizes
  • Make sure all required backing services are provisioned on *acme accounts
  • Setup http://acme.com staging domain
  • Move aurora-postgres from *acme accounts to *acme
  • Setup http://acme.com temp vanity domain
  • Deploy bastion to corp account
  • Update RDS Maintenance Window
  • Provision ECS Bastion Instance with SSM Agent
  • Decide How to Run Database Migrations
  • Decide on Database Seeding Strategy
  • Decide on deployment strategy for repository
  • Decide on Log Group Architecture
  • Implement cloudposse/terraform-aws-code-deploy module
  • Add Instance Profile to GitHub Runners to Support Pushing to ECR
  • Use Postgres terraform provider to manage users
  • Deploy self-hosted GitHub Action Runners with Terraform
  • Proposal: Implement GitOps-driven Continuous Delivery Pipeline for Microservices and Preview Environments
  • Decide on RDS Maintenance Window
  • Move remaining child modules from acme-com to infrastructure registry

Cutover Plan

Rollback Plan
  • Verify Backup Integrity and Recency
  • Ensure ability to perform software rollbacks with automation (e.g. CI/CD)
  • Prepare step-by-step plan to rollback to Heroku
External Availability Monitoring
  • Enable “Real User Monitoring” (RUM). Establish a 1-2 week baseline before launch
  • Enable external synthetic tests 2-4 weeks before launch to identify any potential stability problems (e.g. during deployments)
Exception Logging
  • Ensure you have frontend/javascript exception logging enabled in Datadog
QA
  • Test & Time Restore Process (x minutes)
  • Audit errors/warnings from pg_restore to ensure they are acceptable
  • Coordinate with QA team on acceptance testing
  • Ensure robots.txt blocks crawlers on non-prod environments
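
The restore timing and pg_restore audit items above could be scripted roughly as follows. This is a minimal sketch, not the team's actual tooling: the function names and the `ACCEPTABLE` allow-list are hypothetical, and it assumes pg_restore writes diagnostics to stderr with `pg_restore: error:` and `pg_restore: warning:` prefixes, as recent PostgreSQL releases do. Older releases format these lines differently, so adjust the pattern to match your version.

```python
import re
import subprocess
import time

# Hypothetical allow-list: messages the team has already reviewed and accepted.
ACCEPTABLE = [
    re.compile(r'role ".*" does not exist'),
    re.compile(r"errors ignored on restore"),
]

# Assumed diagnostic format; verify against your pg_restore version.
LINE = re.compile(r"^pg_restore: (error|warning): (.*)$")


def audit_restore_log(stderr_text, acceptable=ACCEPTABLE):
    """Split pg_restore diagnostics into accepted and unreviewed messages."""
    accepted, unreviewed = [], []
    for line in stderr_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # context lines (e.g. TOC entries) are ignored
        level, message = match.groups()
        known = any(pattern.search(message) for pattern in acceptable)
        (accepted if known else unreviewed).append((level, message))
    return accepted, unreviewed


def timed_restore(command):
    """Run a restore command; return (elapsed_seconds, exit_code, stderr)."""
    started = time.monotonic()
    result = subprocess.run(command, capture_output=True, text=True)
    return time.monotonic() - started, result.returncode, result.stderr
```

A rehearsal run might call `timed_restore(["pg_restore", "--no-owner", "--dbname", "staging", "dump.pgcustom"])` (database name and dump path are placeholders), record the elapsed time as the "x minutes" figure, and pass the captured stderr to `audit_restore_log` so that only unreviewed messages need a human decision.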
Load Tests
  • Replicate production workloads to ensure systems handle as expected
  • Tune EKS Autoscaling
  • Verify Alert Escalations
Reduce DNS TTLs
  • Set the SOA negative-caching (minimum) TTL for all apex domains (e.g. acme.com) to 60 seconds to mitigate the effects of negative DNS caching
  • Set TTLs to 60 seconds on branded domains (e.g. acme.com)
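
To confirm the TTL reductions above actually took effect, the output of `dig +noall +answer` can be checked mechanically. The sketch below is illustrative only: the function name and the sample records are assumptions. For SOA records it compares the smaller of the record's own TTL and its MINIMUM field, since that is the value resolvers use for negative caching under RFC 2308.

```python
MAX_TTL = 60  # seconds, the target stated in the checklist


def ttl_violations(dig_answer, max_ttl=MAX_TTL):
    """Return (name, type, effective_ttl) for records above max_ttl.

    Expects lines in the `dig +noall +answer` layout:
        name  ttl  class  type  rdata...
    """
    violations = []
    for line in dig_answer.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0].startswith(";"):
            continue  # skip blank lines and dig comment lines
        name, ttl, _klass, rtype = fields[:4]
        if not ttl.isdigit():
            continue
        effective = int(ttl)
        if rtype == "SOA" and len(fields) == 11 and fields[10].isdigit():
            # SOA rdata ends with the MINIMUM field; the negative-caching
            # TTL is the smaller of it and the SOA record's own TTL.
            effective = min(effective, int(fields[10]))
        if effective > max_ttl:
            violations.append((name, rtype, effective))
    return violations
```

Feeding it the captured output of, for example, `dig +noall +answer acme.com SOA` and `dig +noall +answer acme.com A` returns an empty list once every record is at or below 60 seconds.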
Security
  • Audit Security Groups (EKS & RDS)
Schedule Cut Over
  • Identify all relevant parties, stakeholders
  • Communicate scope of migration and any expected downtime
Prepare Maintenance Page
  • Provide a means to display a maintenance page (if necessary)
  • Should be a static page (e.g. hosted on S3)
  • Update copy as necessary to communicate the extent of the outage or downtime
Perform End-to-End Tests
  • Verify deployments are working
  • Verify software rollbacks are working
  • Verify auto-scaling is working (pods and nodes) - or we can over-provision for go-live
  • Verify TLS certificates are in working order (non-staging)
  • Verify logs are flowing to CloudWatch and Datadog
  • Verify TLD redirects are working
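
One way to script the TLS certificate check above is sketched here, using only Python's standard `ssl` and `socket` modules. The helper names and the 14-day threshold are assumptions for illustration and are not part of the original checklist.

```python
import socket
import ssl


def days_until_expiry(not_after, now):
    """Days between `now` (epoch seconds) and a certificate's notAfter string."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host, port=443, timeout=10):
    """Handshake with full verification; return the leaf cert's notAfter field."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_hosts(hosts, now, min_days=14):
    """Return {host: message} for hosts that fail the handshake or expire soon."""
    problems = {}
    for host in hosts:
        try:
            remaining = days_until_expiry(fetch_not_after(host), now)
        except OSError as exc:  # includes ssl.SSLError and DNS/connect failures
            problems[host] = f"handshake failed: {exc}"
            continue
        if remaining < min_days:
            problems[host] = f"expires in {remaining:.0f} days"
    return problems
```

Calling `check_hosts(["app.acme.com"], now=time.time())` (after `import time`) from a machine outside the VPC exercises the same path a browser would; an empty result means every host presented a valid, unexpired certificate.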
Perform Cut-Over
  • [Choose time] Activate Maintenance Page
  • Delegate http://acme.com zone to new account
  • Take Fresh Production Database Dump on Bastion
  • Load Database Dump
  • Update env vars in Production SSM to use prod settings from 1Password
  • Disable Heroku deployments
  • Perform ACM flip for http://acme.com
  • Disable monitoring?
  • Merge/Rebase main into acme-master
  • Open PR for acme-master into main
  • Replace acme-master with master in GitHub
  • Merge the PR to master
  • Merge the auto-generated PR in infra
  • Confirm ALL deployments in Spacelift
  • Instruct QA team to commence testing on app.acme.com
  • Flip CNAME for http://acme.com to http://acme.com in legacy account
  • Manual TLS validation for http://acme.com ACM
  • Instruct QA team to commence testing on app.acme.com
  • Enable monitoring
  • Deactivate Maintenance Page (happens automatically by flipping DNS)
Post-Cut-over Checklist
  • Verify ability to deploy
  • Monitor customer support tickets
  • Re-enable scheduled EKS tasks for production
  • Review exception logs
  • Review Slow Query Logs
  • Monitor non-200 status codes for anomalies
  • Check Real End User Data
  • Audit Errors/Warnings after loading
  • Ensure robots.txt is permitting indexing in production (SEO)
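
The robots.txt items in this plan (blocked on non-prod, permitted in production) can be checked with Python's standard `urllib.robotparser`. This is a minimal sketch under stated assumptions: the function names, the `"production"` environment label, and the choice of `Googlebot` as the test crawler are all hypothetical.

```python
from urllib import robotparser


def crawlers_allowed(robots_txt, user_agent="Googlebot", path="/"):
    """Return True if the given robots.txt body lets user_agent fetch path."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


def robots_policy_ok(robots_txt, environment):
    """Production must permit crawling; every other environment must block it."""
    allowed = crawlers_allowed(robots_txt)
    return allowed if environment == "production" else not allowed
```

The robots.txt body can be fetched per environment (for example with `urllib.request.urlopen`) and passed to `robots_policy_ok`, which makes the check easy to run both before cutover on staging and again on production afterwards.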

Post Cutover Tasks

  • Ensure Idempotent Plan for Scheduled EKS Tasks
  • Rename acme component to acme-com
  • Configure auto-scaling
  • Fix Bastion host to access Redis
  • Tune Healthcheck Settings
  • Automatically add migrate label
  • Improve Automated PR Descriptions
  • Clean up acme Artifacts In Spacelift (no longer needed after move to acme-com)
  • Update Spacelift for acme
  • Remove unneeded resources from data accounts
Someday
  • Prepare acme.com vanity domain in prod and all DNS records (do not delegate NS)