Decide on Database Schema Migration Strategy
Problem and Context
We must decide how and when to run database migrations. The strategy depends on several factors, including whether automated tooling is already in place to handle migrations and which platform is in use (e.g. ECS or EKS).
Considerations
- During rolling application deployments, there will be a period when two versions of the application are live.
- Migrations that run before the rolling update mean the previous version of the application will be forced to use the new schema.
- Migrations that run after the rolling update mean the new version of the application will be forced to use the previous schema.
- Adjacent releases of the application must be backward compatible with the schema: each release has to work with both the schema version before and after its own migration.
- Never delete columns (or rows); only add columns (see the additive-migration sketch after this list).
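To make the backward-compatibility rule concrete, here is a minimal sketch of an additive ("expand"-style) migration. It assumes Alembic and SQLAlchemy as the migration tooling; the table, column, and revision names are hypothetical.

```python
# A minimal Alembic migration sketch (hypothetical table/column/revision names).
# The "expand" step only adds a nullable column, so both the previous and the
# next application release can keep reading and writing the table.
from alembic import op
import sqlalchemy as sa

revision = "20240101_add_email_verified"
down_revision = "20231201_create_users"


def upgrade():
    # Additive change: old app versions simply ignore the new column.
    op.add_column(
        "users",
        sa.Column("email_verified", sa.Boolean(), nullable=True),
    )


def downgrade():
    # Dropping the column is reserved for a later "contract" release,
    # once no running application version reads it.
    op.drop_column("users", "email_verified")
```

The matching "contract" step (actually dropping the column) would ship in a later release, once no running application version still depends on the old shape of the table.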
Questions
- Should migrations happen before or after the application rollout?
- What should happen during application rollbacks?
- Does the schema get rolled back or stay ahead?
- What software are you using to handle migrations?
- What happens if pods or nodes are scaling during a database migration (e.g. old versions of pods can come up)?
Options
Option 1: Implement migrations as part of the Docker container's entrypoint initialization
- If you have 25 containers running, each one will attempt to obtain a lock (if the tooling supports it) and perform the migration. Many customers don't like that every container attempts this and prefer it to happen before or after the rollout (a sketch of such an entrypoint appears after these bullets).
- At any given point, different versions of the app will be using different versions of the schema (e.g. during rolling updates).
- Long migrations will cause health checks to fail, and the pods will get restarted, aborting the migrations. Extending the health-check timeouts means slowing down recovery for legitimate failures.
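As a rough sketch of what such an entrypoint could look like, the snippet below serializes the migration across replicas with a PostgreSQL advisory lock and then starts the application. PostgreSQL, psycopg2, Alembic, the lock key, and the gunicorn command are all assumptions, not prescribed tooling.

```python
# Hypothetical container entrypoint: serialize migrations across replicas
# with a PostgreSQL advisory lock, then start the application.
import os
import subprocess

import psycopg2

LOCK_KEY = 42_000_001  # arbitrary application-wide lock id (assumption)


def main():
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn.cursor() as cur:
            # Blocks until this replica holds the lock; only one container at a
            # time runs the migration, the rest wait and then apply a no-op.
            cur.execute("SELECT pg_advisory_lock(%s)", (LOCK_KEY,))
            subprocess.run(["alembic", "upgrade", "head"], check=True)
            cur.execute("SELECT pg_advisory_unlock(%s)", (LOCK_KEY,))
    finally:
        conn.close()

    # Replace this process with the real application server.
    os.execvp("gunicorn", ["gunicorn", "app.wsgi"])


if __name__ == "__main__":
    main()
```

Note that the health-check caveat above still applies: if the migration runs longer than the startup/liveness probes allow, the pod is killed mid-migration regardless of the lock.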
Option 2: Implement migrations as part of the CD workflow
- This is nice because we can control when it happens in the release process.
- It's complicated when doing asynchronous deployments with ArgoCD or Spacelift: since deployments happen outside of the GitHub Actions workflow, we don't know when steps are completed (see the sketch below for one way to wait on ArgoCD).
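One way to cope with the asynchronous nature of ArgoCD is to have the workflow poll the Argo CD API until the application reports Synced and Healthy before (or after) running the migration step. The sketch below assumes an Argo CD URL, API token, and application name are available as environment variables; none of this is prescribed by the option itself.

```python
# Hypothetical CD step: wait for an Argo CD application to report
# Synced/Healthy before continuing with the migration step.
import os
import time

import requests

ARGOCD_URL = os.environ["ARGOCD_URL"]        # e.g. https://argocd.example.com (assumption)
APP_NAME = os.environ["ARGOCD_APP_NAME"]     # e.g. "my-service" (assumption)
TOKEN = os.environ["ARGOCD_TOKEN"]


def wait_for_sync(timeout_seconds=600, poll_seconds=10):
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        resp = requests.get(
            f"{ARGOCD_URL}/api/v1/applications/{APP_NAME}",
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        resp.raise_for_status()
        status = resp.json().get("status", {})
        sync = status.get("sync", {}).get("status")
        health = status.get("health", {}).get("status")
        if sync == "Synced" and health == "Healthy":
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"{APP_NAME} did not become Synced/Healthy in time")


if __name__ == "__main__":
    wait_for_sync()
    # ...then run the migration command here, e.g. `alembic upgrade head`.
```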
Option 3: Implement manually triggered workflows (e.g. via GitHub Actions workflow_dispatch)
- You have full control over when it runs, but triggering is not automated in relation to your deployment workflows; only the execution itself is automated (the dispatch can also be scripted, as sketched below).
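For illustration, the dispatch itself can be scripted against the GitHub REST API. The repository, workflow file name, ref, and inputs below are assumptions; the workflow is presumed to declare `on: workflow_dispatch`.

```python
# Hypothetical trigger for a manually dispatched migration workflow.
import os

import requests

OWNER_REPO = "my-org/my-service"   # assumption
WORKFLOW_FILE = "db-migrate.yml"   # assumption: workflow declares `on: workflow_dispatch`
TOKEN = os.environ["GITHUB_TOKEN"]

resp = requests.post(
    f"https://api.github.com/repos/{OWNER_REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {TOKEN}",
    },
    # `inputs` must match inputs declared by the workflow (hypothetical here).
    json={"ref": "main", "inputs": {"environment": "staging"}},
    timeout=30,
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```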
Option 4: Implement a Kubernetes Job or ECS Task
- Easy to implement.
- Works well, when it works. When it fails, it's hard to coordinate what happens with the dependent services.
- Kubernetes will keep re-attempting the migration if the Job exits non-zero. If we squash the exit code, we don't realize it's failing unless there's other monitoring in place (see the Job sketch at the end of this section).
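As a rough sketch of the Kubernetes Job approach, the snippet below uses the official Python client and sets an explicit backoffLimit so a failing migration surfaces as a failed Job instead of retrying indefinitely. The image, command, and namespace are hypothetical.

```python
# Hypothetical migration Job: fail fast and visibly instead of retrying forever.
from kubernetes import client, config


def create_migration_job():
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="db-migrate", namespace="default"),
        spec=client.V1JobSpec(
            backoff_limit=1,                  # retry at most once, then mark the Job Failed
            ttl_seconds_after_finished=3600,  # clean up the finished Job after an hour
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="migrate",
                            image="my-registry/my-app:1.2.3",     # assumption
                            command=["alembic", "upgrade", "head"],  # assumption
                        )
                    ],
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)


if __name__ == "__main__":
    create_migration_job()
```

The same intent can be expressed directly in a GitOps-managed manifest; the key design choice is a low backoffLimit plus alerting on failed Jobs, so failures stay visible rather than being silently retried or squashed.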