# How to Scale Spacelift Runners
## Problem

There are hundreds or thousands of proposed runs pending in Spacelift. Autoscaling will eventually scale up to handle the demand, but we want a faster turnaround time.
## Solution

If too many Spacelift runs are being triggered by PRs that do not need Spacelift to run at the moment, add a label called `spacelift-no-trigger` to the PR to prevent Spacelift stacks from running on each commit. Remove this label before the last commit (before approval and merge) so the Spacelift stacks can be validated.
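For example, the label can be toggled with the GitHub CLI. This is a minimal sketch, assuming `gh` is installed and authenticated; the PR number `1234` is hypothetical:

```bash
# Pause Spacelift triggering while iterating on a PR (PR number is a placeholder)
gh pr edit 1234 --add-label "spacelift-no-trigger"

# Re-enable triggering before the final commit so the stacks are validated
gh pr edit 1234 --remove-label "spacelift-no-trigger"
```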
To reduce Spacelift runs triggered by PRs, ensure that the `spacelift` component is using the latest upstream module, or 0.46.0 at the very least. That release includes an updated policy that cancels the runs for previous commits when a new commit is pushed.
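One way to confirm the pinned version, assuming the component is vendored into `components/terraform/spacelift` and pinned in a `component.yaml` (your repository's paths and layout may differ):

```bash
# Show the pinned upstream version of the spacelift component (path is an assumption)
grep -n 'version' components/terraform/spacelift/component.yaml
```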
## Scaling EC2 Runners via Stack Configurations

While we always recommend making all changes to IaC via Pull Request, if the goal is to increase the size of the Auto Scaling Group while there are hundreds or thousands of pending runs, then the Pull Request will suffer the same fate as the other runs and be blocked from execution until they complete. In that case, use the console or CLI approaches described below.

The parameters that affect scale-up behavior are:
| Setting | Description |
|---|---|
| `min_size` | The minimum number of instances |
| `max_size` | The maximum number of instances |
| `desired_capacity` | The number of instances desired online at all times |
| `health_check_grace_period` | The time (in seconds) after an instance launches before its health is checked |
| `spacelift_agents_per_node` | The number of Spacelift agents running per EC2 instance. Increase for greater density of agents per node. |
| `cpu_utilization_high_threshold_percent` | Percent CPU utilization (integer between 0 and 100) required to trigger a scale-up event |
| `wait_for_capacity_timeout` | Before scaling further, wait this long (e.g. `10m`) for capacity requirements to be met |
The parameters that affect scale-down behavior are:
| Setting | Description |
|---|---|
| `default_cooldown` | The time (in seconds) after a scaling activity completes before another can begin |
| `scale_down_cooldown_seconds` | The time (in seconds) after a scale-down activity completes before another can begin |
| `cpu_utilization_low_threshold_percent` | Percent CPU utilization (integer between 0 and 100) below which a scale-down event is triggered. The higher this number, the more likely the cluster will be scaled down, because nodes are typically not very CPU intensive on average. |
| `max_size` | The maximum number of instances |
Here’s a sample `spacelift-worker-pool` configuration:
```yaml
components:
  terraform:
    spacelift-worker-pool:
      settings:
        spacelift:
          workspace_enabled: true
      vars:
        enabled: true
        ecr_stage_name: artifacts
        ecr_tenant_name: mgmt
        ecr_environment_name: uw2
        ecr_repo_name: infrastructure
        ecr_region: us-west-2
        ecr_account_id: "1234567890"
        spacelift_api_endpoint: https://acme.app.spacelift.io
        spacelift_runner_image: 1234567890.dkr.ecr.us-west-2.amazonaws.com/infrastructure:latest
        instance_type: "c5a.xlarge" # cost-effective x86 instance with 4 vCPU, 8 GB (see: https://aws.amazon.com/ec2/instance-types/)
        wait_for_capacity_timeout: "10m"
        spacelift_agents_per_node: 2
        min_size: 3
        max_size: 50
        desired_capacity: null
        default_cooldown: 300
        scale_down_cooldown_seconds: 2700
        # Set a low scaling threshold to ensure new workers are launched as soon as the current one(s) are busy
        cpu_utilization_high_threshold_percent: 10
        cpu_utilization_low_threshold_percent: 5
        health_check_type: EC2
        health_check_grace_period: 300
        termination_policies:
          - OldestLaunchConfiguration
        ebs_optimized: true
        block_device_mappings:
          - device_name: "/dev/xvda"
            no_device: null
            virtual_name: null
            ebs:
              delete_on_termination: null
              encrypted: false
              iops: null
              kms_key_id: null
              snapshot_id: null
              volume_size: 100
              volume_type: "gp2"
        iam_attributes:
          - admin
        instance_refresh:
          # https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html
          # https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_group#instance_refresh
          strategy: Rolling
          preferences:
            # The number of seconds until a newly launched instance is configured and ready to use.
            # Default behavior is to use the Auto Scaling Group's health check grace period.
            instance_warmup: null
            # The amount of capacity in the Auto Scaling group that must remain healthy during an
            # instance refresh to allow the operation to continue, as a percentage of the desired
            # capacity of the Auto Scaling group.
            min_healthy_percentage: 50
          triggers: null
```
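With the stack configuration updated, the change still has to be applied. Here is a hedged example of applying it directly with atmos from a workstation with admin credentials, bypassing the blocked Spacelift queue; the stack name `mgmt-uw2-auto` is an assumption based on the tenant/environment above and the automation account:

```bash
# Apply the worker pool change directly (stack name is an assumption; use your actual stack)
atmos terraform apply spacelift-worker-pool -s mgmt-uw2-auto
```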
## Scaling EC2 Runners Immediately (via AWS Console)

### Scaling Up

1. Log in to the AWS console for the automation account.
2. Find the Auto Scaling Group under the EC2 menu and increase the Auto Scaling Group size.
### Scaling Down

IMPORTANT: Before scaling Spacelift workers down, check that "Pending runs" and "Busy workers" are both 0. In other words, do not scale down while Spacelift is busy. If a worker is terminated in the middle of a job, it will leave a stale Terraform state lock and possibly a broken state. A locked stack can be fixed using `terraform force-unlock`, as suggested in the command-line output (see the example after the steps below).
1. Log in to the AWS console for the automation account.
2. Find the Auto Scaling Group under the EC2 menu and decrease the Auto Scaling Group size.
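A minimal sketch of clearing a stale lock; the lock ID shown is a placeholder and must be copied from the error message Terraform prints:

```bash
# Run from the affected component's directory; the lock ID below is hypothetical
terraform force-unlock 6638e010-8fb0-46cf-966d-21e806602f3a
```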
## Scaling EC2 Runners Immediately (via `aws` CLI)

Get the name of the ASG, then raise its maximum size and desired capacity:
```bash
export AWS_PROFILE=$NAMESPACE-gbl-auto-admin

# Get the ASG name
ASG_NAME=$(aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups | [?contains(AutoScalingGroupName, `spacelift`)] | [0].AutoScalingGroupName' \
  --output text)

# Raise the max size of the ASG
aws autoscaling update-auto-scaling-group --auto-scaling-group-name "$ASG_NAME" --max-size 10

# Set the desired capacity of the ASG
aws autoscaling set-desired-capacity --auto-scaling-group-name "$ASG_NAME" --desired-capacity 10
```
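To confirm the change took effect, inspect the group's current settings; this query is a sketch using standard AWS CLI options:

```bash
# Show the ASG's current min/max/desired values
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].{Min:MinSize,Max:MaxSize,Desired:DesiredCapacity}' \
  --output table
```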
## k8s (Not Yet Supported)
Cloud Posse has not yet implemented Kubernetes Spacelift runners.