actions-runner-controller
This component creates a Helm release for actions-runner-controller on an EKS cluster.
Usage
Stack Level: Regional
Once the catalog file is created, the file can be imported as follows.
import:
- catalog/eks/actions-runner-controller
...
The default catalog values, e.g. stacks/catalog/eks/actions-runner-controller.yaml:
components:
terraform:
eks/actions-runner-controller:
vars:
enabled: true
name: "actions-runner" # avoids hitting name length limit on IAM role
chart: "actions-runner-controller"
chart_repository: "https://actions-runner-controller.github.io/actions-runner-controller"
chart_version: "0.23.7"
kubernetes_namespace: "actions-runner-system"
create_namespace: true
kubeconfig_exec_auth_api_version: "client.authentication.k8s.io/v1beta1"
# helm_manifest_experiment_enabled feature causes inconsistent final plans with charts that have CRDs
# see https://github.com/hashicorp/terraform-provider-helm/issues/711#issuecomment-836192991
helm_manifest_experiment_enabled: false
ssm_github_secret_path: "/github_runners/controller_github_app_secret"
github_app_id: "REPLACE_ME_GH_APP_ID"
github_app_installation_id: "REPLACE_ME_GH_INSTALLATION_ID"
# use to enable docker config json secret, which can login to dockerhub for your GHA Runners
docker_config_json_enabled: true
# The content of this param should look like:
# {
# "auths": {
# "https://index.docker.io/v1/": {
# "username": "your_username",
# "password": "your_password",
# "email": "your_email",
# "auth": "$(echo "your_username:your_password" | base64)"
# }
# }
# } | base64
ssm_docker_config_json_path: "/github_runners/docker/config-json"
# ssm_github_webhook_secret_token_path: "/github_runners/github_webhook_secret_token"
# The webhook based autoscaler is much more efficient than the polling based autoscaler
webhook:
enabled: true
hostname_template: "gha-webhook.%[3]v.%[2]v.%[1]v.acme.com"
eks_component_name: "eks/cluster"
resources:
limits:
cpu: 500m
memory: 256Mi
requests:
cpu: 250m
memory: 128Mi
runners:
infra-runner:
node_selector:
kubernetes.io/os: "linux"
kubernetes.io/arch: "amd64"
type: "repository" # can be either 'organization' or 'repository'
dind_enabled: true # If `true`, a Docker daemon will be started in the runner Pod.
# To run Docker in Docker (dind), change image to summerwind/actions-runner-dind
# If not running Docker, change image to summerwind/actions-runner to use a smaller image
image: summerwind/actions-runner-dind
# `scope` is org name for Organization runners, repo name for Repository runners
scope: "org/infra"
min_replicas: 0 # Default, overridden by scheduled_overrides below
max_replicas: 20
# Scheduled overrides. See https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#scheduled-overrides
# Order is important. The earlier entry is prioritized higher than later entries. So you usually define
# one-time overrides at the top of your list, then yearly, monthly, weekly, and lastly daily overrides.
scheduled_overrides:
# Override the daily override on the weekends
- start_time: "2024-07-06T00:00:00-08:00" # Start of Saturday morning Pacific Standard Time
end_time: "2024-07-07T23:59:59-07:00" # End of Sunday night Pacific Daylight Time
min_replicas: 0
recurrence_rule:
frequency: "Weekly"
# Keep a warm pool of runners during normal working hours
- start_time: "2024-07-01T09:00:00-08:00" # 9am Pacific Standard Time (10am PDT), start of workday
end_time: "2024-07-01T17:00:00-07:00" # 5pm Pacific Daylight Time (4pm PST), end of workday
min_replicas: 2
recurrence_rule:
frequency: "Daily"
scale_down_delay_seconds: 100
resources:
limits:
cpu: 200m
memory: 512Mi
requests:
cpu: 100m
memory: 128Mi
webhook_driven_scaling_enabled: true
# max_duration is the duration after which a job will be considered completed
# (and the runner killed), even if the webhook has not received a "job completed" event.
# This is to ensure that if an event is missed, it does not leave the runner running forever.
# Set it long enough to cover the longest job you expect to run and then some.
# See https://github.com/actions/actions-runner-controller/blob/9afd93065fa8b1f87296f0dcdf0c2753a0548cb7/docs/automatically-scaling-runners.md?plain=1#L264-L268
max_duration: "90m"
# Pull-driven scaling is obsolete and should not be used.
pull_driven_scaling_enabled: false
# Labels are not case-sensitive to GitHub, but *are* case-sensitive
# to the webhook based autoscaler, which requires exact matches
# between the `runs-on:` label in the workflow and the runner labels.
labels:
- "Linux"
- "linux"
- "Ubuntu"
- "ubuntu"
- "X64"
- "x64"
- "x86_64"
- "amd64"
- "AMD64"
- "core-auto"
- "common"
# Uncomment this additional runner if you want to run a second
# runner pool for `arm64` architecture
#infra-runner-arm64:
# node_selector:
# kubernetes.io/os: "linux"
# kubernetes.io/arch: "arm64"
# # Add the corresponding taint to the Kubernetes nodes running `arm64` architecture
# # to prevent Kubernetes pods without node selectors from being scheduled on them.
# tolerations:
# - key: "kubernetes.io/arch"
# operator: "Equal"
# value: "arm64"
# effect: "NoSchedule"
# type: "repository" # can be either 'organization' or 'repository'
# dind_enabled: false # If `true`, a Docker sidecar container will be deployed
# # To run Docker in Docker (dind), change image to summerwind/actions-runner-dind
# # If not running Docker, change image to summerwind/actions-runner to use a smaller image
# image: summerwind/actions-runner-dind
# # `scope` is org name for Organization runners, repo name for Repository runners
# scope: "org/infra"
# group: "ArmRunners"
# # Tell Karpenter not to evict this pod while it is running a job.
# # If we do not set this, Karpenter will feel free to terminate the runner while it is running a job,
# # as part of its consolidation efforts, even when using "on demand" instances.
# running_pod_annotations:
# karpenter.sh/do-not-disrupt: "true"
# min_replicas: 0 # Set to 0 so that no ARM instance is running idle; set to 1 for faster startups
# max_replicas: 20
# scale_down_delay_seconds: 100
# resources:
# limits:
# cpu: 200m
# memory: 512Mi
# requests:
# cpu: 100m
# memory: 128Mi
# webhook_driven_scaling_enabled: true
# max_duration: "90m"
# pull_driven_scaling_enabled: false
# # Labels are not case-sensitive to GitHub, but *are* case-sensitive
# # to the webhook based autoscaler, which requires exact matches
# # between the `runs-on:` label in the workflow and the runner labels.
# # Leave "common" off the list so that "common" jobs are always
# # scheduled on the amd64 runners. This is because the webhook
# # based autoscaler will not scale a runner pool if the
# # `runs-on:` labels in the workflow match more than one pool.
# labels:
# - "Linux"
# - "linux"
# - "Ubuntu"
# - "ubuntu"
# - "amd64"
# - "AMD64"
# - "core-auto"
Generating Required Secrets
AWS SSM is used to store and retrieve secrets.
Decide on the SSM path for the GitHub secret (PAT or Application private key) and GitHub webhook secret.
Since the secret is automatically scoped by AWS to the account and region where it is stored, we recommend storing it at /github_runners/controller_github_app_secret unless you plan on running multiple instances of the controller. If you plan on running multiple instances of the controller and want to give them different access (otherwise they could share the same secret), you can add a path component to the SSM path, for example /github_runners/cicd/controller_github_app_secret.
ssm_github_secret_path: "/github_runners/controller_github_app_secret"
The preferred way to authenticate is by creating and installing a GitHub App. This is the recommended approach because it allows much more restricted access than using a personal access token, at least until fine-grained personal access token permissions are generally available. Follow the instructions here to create and install the GitHub App.
At the creation stage, you will be asked to generate a private key. This is the private key that will be used to
authenticate the Action Runner Controller. Download the file and store the contents in SSM using the following command,
adjusting the profile and file name. The profile should be the admin
role in the account to which you are deploying
the runner controller. The file name should be the name of the private key file you downloaded.
AWS_PROFILE=acme-mgmt-use2-auto-admin chamber write github_runners controller_github_app_secret -- "$(cat APP_NAME.DATE.private-key.pem)"
You can verify the file was correctly written to SSM by matching the private key fingerprint reported by GitHub with:
AWS_PROFILE=acme-mgmt-use2-auto-admin chamber read -q github_runners controller_github_app_secret | openssl rsa -in - -pubout -outform DER | openssl sha256 -binary | openssl base64
At this stage, record the Application ID and the private key fingerprint in your secrets manager (e.g. 1Password). You will need the Application ID to configure the runner controller, and want the fingerprint to verify the private key.
Proceed to install the GitHub App in the organization or repository you want to use the runner controller for, and record the Installation ID (the final numeric part of the URL, as explained in the instructions linked above) in your secrets manager. You will need the Installation ID to configure the runner controller.
In your stack configuration, set the following variables, making sure to quote the values so they are treated as strings, not numbers.
github_app_id: "12345"
github_app_installation_id: "12345"
OR (obsolete)
- A PAT with the scope outlined in this document.
Save this to the path specified by ssm_github_secret_path using the following command, adjusting the AWS_PROFILE to refer to the admin role in the account to which you are deploying the runner controller:
AWS_PROFILE=acme-mgmt-use2-auto-admin chamber write github_runners controller_github_app_secret -- "<PAT>"
- If using the Webhook Driven autoscaling (recommended), generate a random string to use as the Secret when creating the webhook in GitHub.
Generate the string using 1Password (no special characters, length 45) or by running
dd if=/dev/random bs=1 count=33 2>/dev/null | base64
Store this key in AWS SSM under the same path specified by ssm_github_webhook_secret_token_path
ssm_github_webhook_secret_token_path: "/github_runners/github_webhook_secret"
Dockerhub Authentication
Authenticating with Dockerhub is optional, but when enabled it can improve stability by increasing the number of image pulls allowed for your runners.
To get started, set docker_config_json_enabled to true and ssm_docker_config_json_path to the SSM path where the credentials are stored, for example /github_runners/docker/config-json.
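For example, the relevant component variables might look like this (the path shown is simply the default from the catalog example above):
docker_config_json_enabled: true
ssm_docker_config_json_path: "/github_runners/docker/config-json"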
To create the credentials file, fill out a JSON file locally with the following content:
{
  "auths": {
    "https://index.docker.io/v1/": {
      "username": "your_username",
      "password": "your_password",
      "email": "your_email",
      "auth": "$(echo "your_username:your_password" | base64)"
    }
  }
}
Then write the file to SSM with the following Atmos Workflow:
save/docker-config-json:
description: Prompt for uploading Docker Config JSON to the AWS SSM Parameter Store
steps:
- type: shell
command: |-
echo "Please enter the Docker Config JSON file path"
echo "See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry for information on how to create the file"
read -p "Docker Config JSON file path: " -r DOCKER_CONFIG_JSON_FILE_PATH
if [ -z "$DOCKER_CONFIG_JSON_FILE_PATH" ]
then
echo 'Inputs cannot be blank, please try again!'
exit 1
fi
DOCKER_CONFIG_JSON=$(<"$DOCKER_CONFIG_JSON_FILE_PATH");
ENCODED_DOCKER_CONFIG_JSON=$(echo "$DOCKER_CONFIG_JSON" | base64 -w 0 );
echo $DOCKER_CONFIG_JSON
echo $ENCODED_DOCKER_CONFIG_JSON
export AWS_PROFILE=acme-core-gbl-auto-admin
set -e
chamber write github_runners/docker config-json -- "$ENCODED_DOCKER_CONFIG_JSON"
echo 'Saved Docker Config JSON to the AWS SSM Parameter Store'
Don't forget to update the AWS Profile in the script.
Using Runner Groups
GitHub supports grouping runners into distinct Runner Groups, which allow you to have different access controls for different runners. Read the linked documentation about creating and configuring Runner Groups, which you must do through the GitHub Web UI. If you choose to create Runner Groups, you can assign one or more Runner pools (from the runners map) to groups (only one group per runner pool) by including group: <Runner Group Name> in the runner configuration. We recommend including it immediately after scope.
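For example, a repository-scoped runner pool assigned to a Runner Group might look like the following sketch (the group name is illustrative and must match a Runner Group you have already created in the GitHub UI):
runners:
  infra-runner:
    type: "repository"
    scope: "org/infra"
    group: "InfraRunners"   # must match an existing Runner Group
    min_replicas: 0
    max_replicas: 20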
Using Webhook Driven Autoscaling (recommended)
We recommend using Webhook Driven Autoscaling until GitHub's own autoscaling solution is as capable as the Summerwind solution this component deploys. See this discussion for some perspective on why the Summerwind solution is currently (summer 2024) considered superior.
To use the Webhook Driven Autoscaling, in addition to setting webhook_driven_scaling_enabled to true, you must also install the GitHub organization-level webhook after deploying the component (specifically, the webhook server). The URL for the webhook is determined by the webhook.hostname_template and where it is deployed. The recommended URL is https://gha-webhook.[environment].[stage].[tenant].[service-discovery-domain].
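For reference, the relevant settings from the catalog example above look like this (the hostname domain is illustrative):
webhook:
  enabled: true
  hostname_template: "gha-webhook.%[3]v.%[2]v.%[1]v.acme.com"
runners:
  infra-runner:
    webhook_driven_scaling_enabled: true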
As a GitHub organization admin, go to https://github.com/organizations/[organization]/settings/hooks, and then:
- Click "Add webhook" and create a new webhook with the following settings:
  - Payload URL: copy from the Terraform output webhook_payload_url
  - Content type: application/json
  - Secret: the webhook secret token you generated and stored (in SSM) above
  - Which events would you like to trigger this webhook:
    - Select "Let me select individual events"
    - Uncheck everything ("Pushes" is likely the only thing already selected)
    - Check "Workflow jobs"
  - Ensure that "Active" is checked (should be checked by default)
- Click "Add webhook" at the bottom of the settings page
After the webhook is created, select "edit" for the webhook, go to the "Recent Deliveries" tab, and verify that there is a delivery (of a "ping" event) with a green check mark. If not, verify all the settings and consult the logs of the actions-runner-controller-github-webhook-server pod.
Configuring Webhook Driven Autoscaling
The HorizontalRunnerAutoscaler scaleUpTriggers.duration (see the [Webhook Driven Scaling documentation](https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#webhook-driven-scaling)) is controlled by the max_duration setting for each Runner. The purpose of this timeout is to ensure that, in case a job cancellation or termination event gets missed, the resulting idle runner eventually gets terminated.
How the Autoscaler Determines the Desired Runner Pool Size
When a job is queued, a capacityReservation
is created for it. The HRA (Horizontal Runner Autoscaler) sums up all the
capacity reservations to calculate the desired size of the runner pool, subject to the limits of minReplicas
and
maxReplicas
. The idea is that a capacityReservation
is deleted when a job is completed or canceled, and the pool
size will be equal to jobsStarted - jobsFinished
. However, it can happen that a job will finish without the HRA being
successfully notified about it, so as a safety measure, the capacityReservation
will expire after a configurable
amount of time, at which point it will be deleted without regard to the job being finished. This ensures that eventually
an idle runner pool will scale down to minReplicas
.
If it happens that the capacity reservation expires before the job is finished, the Horizontal Runner Autoscaler (HRA)
will scale down the pool by 2 instead of 1: once because the capacity reservation expired, and once because the job
finished. This will also cause starvation of waiting jobs, because the next in line will have its timeout timer started
but will not actually start running because no runner is available. And if minReplicas
is set to zero, the pool will
scale down to zero before finishing all the jobs, leaving some waiting indefinitely. This is why it is important to set
the max_duration
to a time long enough to cover the full time a job may have to wait between the time it is queued and
the time it finishes, assuming that the HRA scales up the pool by 1 and runs the job on the new runner.
If there are more jobs queued than there are runners allowed by maxReplicas, the timeout timer does not start on the capacity reservation until enough reservations ahead of it are removed for it to be considered as representing an active job. Although there are some edge cases regarding max_duration that seem not to be covered properly (see actions-runner-controller issue #2466), they only merit adding a few extra minutes to the timeout.
Recommended max_duration Duration
Consequences of Too Short of a max_duration Duration
If you set max_duration
to too short a duration, the Horizontal Runner Autoscaler will cancel capacity reservations
for jobs that have not yet finished, and the pool will become too small. This will be most serious if you have set
minReplicas = 0
because in this case, jobs will be left in the queue indefinitely. With a higher value of
minReplicas
, the pool will eventually make it through all the queued jobs, but not as quickly as intended due to the
incorrectly reduced capacity.
Consequences of Too Long of a max_duration Duration
If the Horizontal Runner Autoscaler misses a scale-down event (which can happen because events do not have delivery
guarantees), a runner may be left running idly for as long as the max_duration
duration. The only problem with this is
the added expense of leaving the idle runner running.
Recommendation
As a result, we recommend setting max_duration
to a period long enough to cover:
- The time it takes for the HRA to scale up the pool and make a new runner available
- The time it takes for the runner to pick up the job from GitHub
- The time it takes for the job to start running on the new runner
- The maximum time a job might take
Because the consequences of expiring a capacity reservation before the job is finished can be severe, we recommend
setting max_duration
to a period at least 30 minutes longer than you expect the longest job to take. Remember, when
everything works properly, the HRA will scale down the pool as jobs finish, so there is little cost to setting a long
duration, and the cost looks even smaller by comparison to the cost of having too short a duration.
For lightly used runner pools expecting only short jobs, you can set max_duration to "30m". As a rule of thumb, we recommend setting maxReplicas high enough that jobs never wait on the queue more than an hour.
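As a concrete illustration, a lightly used pool that only runs short jobs might be configured like this (the numbers are illustrative starting points, not a recommendation for every workload):
runners:
  infra-runner:
    min_replicas: 0
    max_replicas: 20          # size so queued jobs wait less than an hour
    max_duration: "30m"       # fine for pools that only run short jobs
    webhook_driven_scaling_enabled: true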
Interaction with Karpenter or other EKS autoscaling solutions
Kubernetes cluster autoscaling solutions generally expect that a Pod runs a service that can be terminated on one Node and restarted on another with only a short duration needed to finish processing any in-flight requests. When the cluster is resized, the cluster autoscaler will do just that. However, GitHub Action Runner Jobs do not fit this model. If a Pod is terminated in the middle of a job, the job is lost. The likelihood of this happening is increased by the fact that the Action Runner Controller Autoscaler is expanding and contracting the size of the Runner Pool on a regular basis, causing the cluster autoscaler to more frequently want to scale up or scale down the EKS cluster, and, consequently, to move Pods around.
To handle these kinds of situations, Karpenter respects an annotation on the Pod:
spec:
template:
metadata:
annotations:
karpenter.sh/do-not-disrupt: "true"
When you set this annotation on the Pod, Karpenter will not evict it. This means that the Pod will stay on the Node it is on, and the Node it is on will not be considered for eviction. This is good because it means that the Pod will not be terminated in the middle of a job. However, it also means that the Node the Pod is on will not be considered for termination, which means that the Node will not be removed from the cluster, which means that the cluster will not shrink in size when you would like it to.
Since the Runner Pods terminate at the end of the job, this is not a problem for the Pods actually running jobs.
However, if you have set minReplicas > 0, then you have some Pods that are just idling, waiting for jobs to be assigned to them. These Pods are exactly the kind of Pods you want terminated and moved when the cluster is underutilized. Therefore, when you set minReplicas > 0, you should NOT set karpenter.sh/do-not-disrupt: "true" on the Pod via the pod_annotations attribute of the runners input. (But wait, there is good news!)
We have requested a feature that will allow you to set karpenter.sh/do-not-disrupt: "true" and minReplicas > 0 at the same time by only annotating Pods running jobs.
Meanwhile, we have implemented this for you using a job startup hook. This hook will set annotations on the Pod when the job starts. When the job finishes, the Pod will be deleted by the controller, so the annotations will not need to be removed. Configure annotations that apply only to Pods running jobs in the running_pod_annotations attribute of the runners input.
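A minimal sketch of that configuration (the runner name and replica count are illustrative):
runners:
  infra-runner:
    min_replicas: 2                          # idle runners stay interruptible
    running_pod_annotations:
      karpenter.sh/do-not-disrupt: "true"    # applied only after a job starts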
Updating CRDs
When updating the chart or application version of actions-runner-controller
, it is possible you will need to install
new CRDs. Such a requirement should be indicated in the actions-runner-controller
release notes and may require some
adjustment to our custom chart or configuration.
This component uses helm
to manage the deployment, and helm
will not auto-update CRDs. If new CRDs are needed,
install them manually via a command like
kubectl create -f https://raw.githubusercontent.com/actions-runner-controller/actions-runner-controller/master/charts/actions-runner-controller/crds/actions.summerwind.dev_horizontalrunnerautoscalers.yaml
Useful Reference
Consult actions-runner-controller documentation for further details.
Variables
Required Variables
chart
(string
) required
Chart name to be installed. The chart name can be a local path, a URL to a chart, or the name of the chart if repository is specified. It is also possible to use the <repository>/<chart> format here if you are running Terraform on a system that the repository has been added to with helm repo add, but this is not recommended.
chart_repository
(string
) requiredRepository URL where to locate the requested chart.
kubernetes_namespace
(string
) requiredThe namespace to install the release into.
region
(string
) requiredAWS Region.
resources
requiredThe cpu and memory of the deployment's limits and requests.
Type:
object({
limits = object({
cpu = string
memory = string
})
requests = object({
cpu = string
memory = string
})
})
runners
required
Map of Action Runner configurations, with the key being the name of the runner. Please note that the name must be in kebab-case.
For example:
organization_runner = {
  type = "organization" # can be either 'organization' or 'repository'
  dind_enabled = true # A Docker daemon will be started in the runner Pod
  image = "summerwind/actions-runner-dind" # If dind_enabled=false, set this to 'summerwind/actions-runner'
  scope = "ACME" # org name for Organization runners, repo name for Repository runners
  group = "core-automation" # Optional. Assigns the runners to a runner group, for access control.
  scale_down_delay_seconds = 300
  min_replicas = 1
  max_replicas = 5
  labels = [
    "Ubuntu",
    "core-automation",
  ]
}
Type:
map(object({
type = string
scope = string
group = optional(string, null)
image = optional(string, "summerwind/actions-runner-dind")
auto_update_enabled = optional(bool, true)
dind_enabled = optional(bool, true)
node_selector = optional(map(string), {})
pod_annotations = optional(map(string), {})
# running_pod_annotations are only applied to the pods once they start running a job
running_pod_annotations = optional(map(string), {})
# affinity is too complex to model. Whatever you assigned affinity will be copied
# to the runner Pod spec.
affinity = optional(any)
tolerations = optional(list(object({
key = string
operator = string
value = optional(string, null)
effect = string
})), [])
scale_down_delay_seconds = optional(number, 300)
min_replicas = number
max_replicas = number
# Scheduled overrides. See https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md#scheduled-overrides
# Order is important. The earlier entry is prioritized higher than later entries. So you usually define
# one-time overrides at the top of your list, then yearly, monthly, weekly, and lastly daily overrides.
scheduled_overrides = optional(list(object({
start_time = string # ISO 8601 format, eg, "2021-06-01T00:00:00+09:00"
end_time = string # ISO 8601 format, eg, "2021-06-01T00:00:00+09:00"
min_replicas = optional(number)
max_replicas = optional(number)
recurrence_rule = optional(object({
frequency = string # One of Daily, Weekly, Monthly, Yearly
until_time = optional(string) # ISO 8601 format time after which the schedule will no longer apply
}))
})), [])
busy_metrics = optional(object({
scale_up_threshold = string
scale_down_threshold = string
scale_up_adjustment = optional(string)
scale_down_adjustment = optional(string)
scale_up_factor = optional(string)
scale_down_factor = optional(string)
}))
webhook_driven_scaling_enabled = optional(bool, true)
# max_duration is the duration after which a job will be considered completed,
# even if the webhook has not received a "job completed" event.
# This is to ensure that if an event is missed, it does not leave the runner running forever.
# Set it long enough to cover the longest job you expect to run and then some.
# See https://github.com/actions/actions-runner-controller/blob/9afd93065fa8b1f87296f0dcdf0c2753a0548cb7/docs/automatically-scaling-runners.md?plain=1#L264-L268
# Defaults to 1 hour programmatically (to be able to detect if both max_duration and webhook_startup_timeout are set).
max_duration = optional(string)
# The name `webhook_startup_timeout` was misleading and has been deprecated.
# It has been renamed `max_duration`.
webhook_startup_timeout = optional(string)
# Adjust the time (in seconds) to wait for the Docker in Docker daemon to become responsive.
wait_for_docker_seconds = optional(string, "")
pull_driven_scaling_enabled = optional(bool, false)
labels = optional(list(string), [])
# If not null, `docker_storage` specifies the size (as `go` string) of
# an ephemeral (default storage class) Persistent Volume to allocate for the Docker daemon.
# Takes precedence over `tmpfs_enabled` for the Docker daemon storage.
docker_storage = optional(string, null)
# storage is deprecated in favor of docker_storage, since it is only storage for the Docker daemon
storage = optional(string, null)
# If `pvc_enabled` is true, a Persistent Volume Claim will be created for the runner
# and mounted at /home/runner/work/shared. This is useful for sharing data between runners.
pvc_enabled = optional(bool, false)
# If `tmpfs_enabled` is `true`, both the runner and the docker daemon will use a tmpfs volume,
# meaning that all data will be stored in RAM rather than on disk, bypassing disk I/O limitations,
# but what would have been disk usage is now additional memory usage. You must specify memory
# requests and limits when using tmpfs or else the Pod will likely crash the Node.
tmpfs_enabled = optional(bool)
resources = optional(object({
limits = optional(object({
cpu = optional(string, "1")
memory = optional(string, "1Gi")
# ephemeral-storage is the Kubernetes name, but `ephemeral_storage` is the gomplate name,
# so allow either. If both are specified, `ephemeral-storage` takes precedence.
ephemeral-storage = optional(string)
ephemeral_storage = optional(string, "10Gi")
}), {})
requests = optional(object({
cpu = optional(string, "500m")
memory = optional(string, "256Mi")
# ephemeral-storage is the Kubernetes name, but `ephemeral_storage` is the gomplate name,
# so allow either. If both are specified, `ephemeral-storage` takes precedence.
ephemeral-storage = optional(string)
ephemeral_storage = optional(string, "1Gi")
}), {})
}), {})
}))
Optional Variables
atomic
(bool
) optionalIf set, installation process purges chart on fail. The wait flag will be set automatically if atomic is used.
Default value:
true
chart_description
(string
) optionalSet release description attribute (visible in the history).
Default value:
null
chart_values
(any
) optionalAdditional values to yamlencode as
helm_release
values.Default value:
{ }
chart_version
(string
) optionalSpecify the exact chart version to install. If this is not specified, the latest version is installed.
Default value:
null
cleanup_on_fail
(bool
) optionalAllow deletion of new resources created in this upgrade when upgrade fails.
Default value:
true
context_tags_enabled
(bool
) optionalWhether or not to include all context tags as labels for each runner
Default value:
false
controller_replica_count
(number
) optionalThe number of replicas of the runner-controller to run.
Default value:
2
create_namespace
(bool
) optionalCreate the namespace if it does not yet exist. Defaults to
false
.Default value:
null
docker_config_json_enabled
(bool
) optionalWhether the Docker config JSON is enabled
Default value:
false
eks_component_name
(string
) optionalThe name of the eks component
Default value:
"eks/cluster"
existing_kubernetes_secret_name
(string
) optionalIf you are going to create the Kubernetes Secret the runner-controller will use
by some means (such as SOPS) outside of this component, set the name of the secret
here and it will be used. In this case, this component will not create a secret
and you can leave the secret-related inputs with their default (empty) values.
The same secret will be used by both the runner-controller and the webhook-server.Default value:
""
github_app_id
(string
) optionalThe ID of the GitHub App to use for the runner controller.
Default value:
""
github_app_installation_id
(string
) optionalThe "Installation ID" of the GitHub App to use for the runner controller.
Default value:
""
helm_manifest_experiment_enabled
(bool
) optionalEnable storing of the rendered manifest for helm_release so the full diff of what is changing can been seen in the plan
Default value:
false
kube_data_auth_enabled
(bool
) optional If true, use an aws_eks_cluster_auth data source to authenticate to the EKS cluster.
Disabled by kubeconfig_file_enabled or kube_exec_auth_enabled.
Default value:
false
kube_exec_auth_aws_profile
(string
) optionalThe AWS config profile for
aws eks get-token
to useDefault value:
""
kube_exec_auth_aws_profile_enabled
(bool
) optional If true, pass kube_exec_auth_aws_profile as the profile to aws eks get-token
Default value:
false
kube_exec_auth_enabled
(bool
) optional If true, use the Kubernetes provider exec feature to execute aws eks get-token to authenticate to the EKS cluster.
Disabled by kubeconfig_file_enabled, overrides kube_data_auth_enabled.
Default value:
true
kube_exec_auth_role_arn
(string
) optionalThe role ARN for
aws eks get-token
to useDefault value:
""
kube_exec_auth_role_arn_enabled
(bool
) optional If true, pass kube_exec_auth_role_arn as the role ARN to aws eks get-token
Default value:
true
kubeconfig_context
(string
) optionalContext to choose from the Kubernetes config file.
If supplied, kubeconfig_context_format will be ignored.
Default value:
""
kubeconfig_context_format
(string
) optional A format string to use for creating the kubectl context name when kubeconfig_file_enabled is true and kubeconfig_context is not supplied.
Must include a single %s which will be replaced with the cluster name.
Default value:
""
kubeconfig_exec_auth_api_version
(string
) optionalThe Kubernetes API version of the credentials returned by the
exec
auth pluginDefault value:
"client.authentication.k8s.io/v1beta1"
kubeconfig_file
(string
) optional The Kubernetes provider config_path setting to use when kubeconfig_file_enabled is true
Default value:
""
kubeconfig_file_enabled
(bool
) optional If true, configure the Kubernetes provider with kubeconfig_file and use that kubeconfig file for authenticating to the EKS cluster
Default value:
false
rbac_enabled
(bool
) optionalService Account for pods.
Default value:
true
s3_bucket_arns
(list(string)
) optional List of ARNs of S3 Buckets to which the runners will have read-write access.
Default value:
[ ]
ssm_docker_config_json_path
(string
) optionalSSM path to the Docker config JSON
Default value:
null
ssm_github_secret_path
(string
) optionalThe path in SSM to the GitHub app private key file contents or GitHub PAT token.
Default value:
""
ssm_github_webhook_secret_token_path
(string
) optionalThe path in SSM to the GitHub Webhook Secret token.
Default value:
""
timeout
(number
) optionalTime in seconds to wait for any individual kubernetes operation (like Jobs for hooks). Defaults to
300
secondsDefault value:
null
wait
(bool
) optionalWill wait until all resources are in a ready state before marking the release as successful. It will wait for as long as
timeout
. Defaults to true.
Default value:
null
webhook
optionalConfiguration for the GitHub Webhook Server.
hostname_template is the format() string to use to generate the hostname via format(var.hostname_template, var.tenant, var.stage, var.environment).
Typically something like "echo.%[3]v.%[2]v.example.com".
queue_limit is the maximum number of webhook events that can be queued up for processing by the autoscaler.
When the queue gets full, webhook events will be dropped (status 500).
Type:
object({
enabled = bool
hostname_template = string
queue_limit = optional(number, 1000)
})Default value:
{
"enabled": false,
"hostname_template": null,
"queue_limit": 1000
}
Context Variables
The following variables are defined in the context.tf file of this module and are part of the terraform-null-label pattern.
additional_tag_map
(map(string)
) optional Additional key-value pairs to add to each map in tags_as_list_of_maps. Not added to tags or id.
This is for some rare cases where resources want additional configuration of tags
and therefore take a list of maps with tag key, value, and additional configuration.
Required: No
Default value:
{ }
attributes
(list(string)
) optional ID element. Additional attributes (e.g. workers or cluster) to add to id, in the order they appear in the list.
New attributes are appended to the end of the list. The elements of the list are joined by the delimiter and treated as a single ID element.
Required: No
Default value:
[ ]
context
(any
) optional Single object for setting entire context at once.
See description of individual variables for details.
Leave string and numeric variables as null to use default value.
Individual variable settings (non-null) override settings in context object,
except for attributes, tags, and additional_tag_map, which are merged.
Required: No
Default value:
{
"additional_tag_map": {},
"attributes": [],
"delimiter": null,
"descriptor_formats": {},
"enabled": true,
"environment": null,
"id_length_limit": null,
"label_key_case": null,
"label_order": [],
"label_value_case": null,
"labels_as_tags": [
"unset"
],
"name": null,
"namespace": null,
"regex_replace_chars": null,
"stage": null,
"tags": {},
"tenant": null
}
delimiter
(string
) optional Delimiter to be used between ID elements.
Defaults to - (hyphen). Set to "" to use no delimiter at all.
Required: No
Default value:
null
descriptor_formats
(any
) optional Describe additional descriptors to be output in the descriptors output map.
Map of maps. Keys are names of descriptors. Values are maps of the form
{
  format = string
  labels = list(string)
}
(Type is any so the map values can later be enhanced to provide additional options.)
format is a Terraform format string to be passed to the format() function.
labels is a list of labels, in order, to pass to format() function.
Label values will be normalized before being passed to format() so they will be identical to how they appear in id.
Default is {} (descriptors output will be empty).
Required: No
Default value:
{ }
enabled
(bool
) optionalSet to false to prevent the module from creating any resources
Required: NoDefault value:
null
environment
(string
) optionalID element. Usually used for region e.g. 'uw2', 'us-west-2', OR role 'prod', 'staging', 'dev', 'UAT'
Required: NoDefault value:
null
id_length_limit
(number
) optional Limit id to this many characters (minimum 6).
Set to 0 for unlimited length.
Set to null to keep the existing setting, which defaults to 0.
Does not affect id_full.
Required: No
Default value:
null
label_key_case
(string
) optional Controls the letter case of the tags keys (label names) for tags generated by this module.
Does not affect keys of tags passed in via the tags input.
Possible values: lower, title, upper.
Default value: title.
Required: No
Default value:
null
label_order
(list(string)
) optionalThe order in which the labels (ID elements) appear in the
id
.
Defaults to ["namespace", "environment", "stage", "name", "attributes"].
You can omit any of the 6 labels ("tenant" is the 6th), but at least one must be present.Required: No
Default value:
null
label_value_case
(string
) optional Controls the letter case of ID elements (labels) as included in id,
set as tag values, and output by this module individually.
Does not affect values of tags passed in via the tags input.
Possible values: lower, title, upper and none (no transformation).
Set this to title and set delimiter to "" to yield Pascal Case IDs.
Default value: lower.
Required: No
Default value:
null
labels_as_tags
(set(string)
) optional Set of labels (ID elements) to include as tags in the tags output.
Default is to include all labels.
Tags with empty values will not be included in the tags output.
Set to [] to suppress all generated tags.
Notes:
The value of the name tag, if included, will be the id, not the name.
Unlike other null-label inputs, the initial setting of labels_as_tags cannot be changed in later chained modules. Attempts to change it will be silently ignored.
Required: No
Default value:
[
"default"
]
name
(string
) optional ID element. Usually the component or solution name, e.g. 'app' or 'jenkins'.
This is the only ID element not also included as a tag.
The "name" tag is set to the full id string. There is no tag with the value of the name input.
Required: No
Default value:
null
namespace
(string
) optionalID element. Usually an abbreviation of your organization name, e.g. 'eg' or 'cp', to help ensure generated IDs are globally unique
Required: NoDefault value:
null
regex_replace_chars
(string
) optionalTerraform regular expression (regex) string.
Characters matching the regex will be removed from the ID elements.
If not set,"/[^a-zA-Z0-9-]/"
is used to remove all characters other than hyphens, letters and digits.Required: No
Default value:
null
stage
(string
) optionalID element. Usually used to indicate role, e.g. 'prod', 'staging', 'source', 'build', 'test', 'deploy', 'release'
Required: NoDefault value:
null
tags
(map(string)
) optionalAdditional tags (e.g.
{'BusinessUnit': 'XYZ'}
).
Neither the tag keys nor the tag values will be modified by this module.Required: No
Default value:
{ }
tenant
(string
) optionalID element (Rarely used, not included by default). A customer identifier, indicating who this instance of a resource is for
Required: NoDefault value:
null
Outputs
metadata
Block status of the deployed release
metadata_action_runner_releases
Block statuses of the deployed actions-runner chart releases
webhook_payload_url
Payload URL for GitHub webhook
Dependencies
Requirements
terraform
, version:>= 1.3.0
aws
, version:>= 4.9.0
helm
, version:>= 2.0
kubernetes
, version:>= 2.0, != 2.21.0
Providers
aws
, version:>= 4.9.0
Modules
Name | Version | Source | Description |
---|---|---|---|
actions_runner | 0.10.1 | cloudposse/helm-release/aws | n/a |
actions_runner_controller | 0.10.1 | cloudposse/helm-release/aws | n/a |
eks | 1.5.0 | cloudposse/stack-config/yaml//modules/remote-state | n/a |
iam_roles | latest | ../../account-map/modules/iam-roles | n/a |
this | 0.25.0 | cloudposse/label/null | n/a |
Resources
The following resources are used by this module:
Data Sources
The following data sources are used by this module:
aws_eks_cluster_auth.eks
(data source)aws_ssm_parameter.docker_config_json
(data source)aws_ssm_parameter.github_token
(data source)aws_ssm_parameter.github_webhook_secret_token
(data source)
References
- cloudposse/terraform-aws-components - Cloud Posse's upstream component
- alb-controller - Helm Chart
- alb-controller - AWS Load Balancer Controller
- actions-runner-controller Webhook Driven Scaling
- actions-runner-controller Chart Values
Changelog
Release 1.470.1
Components PR #1077
Bugfix:
- Fix templating of document separators in the Helm chart template. Affects users who are not using running_pod_annotations.
Release 1.470.0
Components PR #1075
New Features:
- Add support for scheduled overrides of Runner Autoscaler min and max replicas.
- Add option tmpfs_enabled to have runners use RAM-backed ephemeral storage (tmpfs, emptyDir.medium: Memory) instead of disk-backed storage.
- Add wait_for_docker_seconds to allow configuration of the time to wait for the Docker daemon to be ready before starting the runner.
- Add the ability to have the runner Pods add annotations to themselves once they start running a job. (Actually released in release 1.454.0, but not documented until now.)
Changes:
- Previously, syncPeriod, which sets the period in which the controller reconciles the desired runners count, was set to 120 seconds in resources/values.yaml. This setting has been removed, reverting to the default value of 1 minute. You can still set this value by setting the syncPeriod value in the values.yaml file or by setting syncPeriod in var.chart_values.
- Previously, RUNNER_GRACEFUL_STOP_TIMEOUT was hardcoded to 90 seconds. That has been reduced to 80 seconds to expand the buffer between that and forceful termination from 10 seconds to 20 seconds, increasing the chances the runner will successfully deregister itself.
- The inaccurately named webhook_startup_timeout has been replaced with max_duration. webhook_startup_timeout is still supported for backward compatibility, but is deprecated.
Bugfixes:
- Create and deploy the webhook secret when an existing secret is not supplied
- Restore proper order of operations in creating resources (broken in release 1.454.0 (PR #1055))
- If docker_storage is set and dockerdWithinRunnerContainer is true (which is hardcoded to be the case), properly mount the docker storage volume into the runner container rather than the (non-existent) docker sidecar container.
Discussion
Scheduled overrides
Scheduled overrides allow you to set different min and max replica values for the runner autoscaler at different times.
This can be useful if you have predictable patterns of load on your runners. For example, you might want to scale down
to zero at night and scale up during the day. This feature is implemented by adding a scheduled_overrides field to the var.runners map.
See the Actions Runner Controller documentation for details on how they work and how to set them up.
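A compact example, mirroring the catalog entry near the top of this document (times and replica counts are illustrative):
runners:
  infra-runner:
    min_replicas: 0
    max_replicas: 20
    scheduled_overrides:
      # Keep a warm pool of runners during working hours
      - start_time: "2024-07-01T09:00:00-08:00"
        end_time: "2024-07-01T17:00:00-07:00"
        min_replicas: 2
        recurrence_rule:
          frequency: "Daily"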
Use RAM instead of Disk via tmpfs_enabled
The standard gp3 EBS volume used for an EC2 instance's disk storage is limited (unless you pay extra) to 3000 IOPS and 125 MB/s throughput. This is fine for average workloads, but it does not scale with instance size. A 48xlarge instance could host 90 Pods, but all 90 would still be sharing the same single 3000 IOPS and 125 MB/s throughput EBS volume attached to the host. This can lead to severe performance issues, as the whole Node gets locked up waiting for disk I/O.
To mitigate this issue, we have added the tmpfs_enabled
option to the runners
map. When set to true
, the runner
Pods will use RAM-backed ephemeral storage (tmpfs
, emptyDir.medium: Memory
) instead of disk-backed storage. This
means the Pod's impact on the Node's disk I/O is limited to the overhead required to launch and manage the Pod (e.g.
downloading the container image and writing logs to the disk). This can be a significant performance improvement,
allowing you to run more Pods on a single Node without running into disk I/O bottlenecks. Without this feature enabled,
you may be limited to running something like 14 Runners on an instance, regardless of instance size, due to disk I/O
limits. With this feature enabled, you may be able to run 50-100 Runners on a single instance.
The trade-off is that the Pod's data is stored in RAM, which increases its memory usage. Be sure to increase the amount of memory allocated to the runner Pod to account for this. This is generally not a problem, as Runners typically use a small enough amount of disk space that it can be reasonably stored in the RAM allocated to a single CPU in an EC2 instance, so it is the CPU that remains the limiting factor in how many Runners can be run on an instance.
You must configure a memory request for the runner Pod
When using tmpfs_enabled
, you must configure a memory request for the runner Pod. If you do not, a single Pod would
be allowed to consume half the Node's memory just for its disk storage.
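A sketch of a tmpfs-enabled runner with explicit memory settings (the sizes are illustrative; size them for your jobs plus the expected tmpfs usage):
runners:
  infra-runner:
    tmpfs_enabled: true
    resources:
      requests:
        cpu: "1"
        memory: "4Gi"    # must cover job memory plus RAM-backed "disk" usage
      limits:
        memory: "6Gi"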
Configure startup timeout via wait_for_docker_seconds
When the runner starts and Docker-in-Docker is enabled, the runner waits for the Docker daemon to be ready before marking itself ready to run jobs. This is done by polling the Docker daemon every second until it is ready. The default timeout for this is 120 seconds. If the Docker daemon is not ready within that time, the runner will exit with an error. You can configure this timeout by setting wait_for_docker_seconds in the runners map.
As a general rule, the Docker daemon should be ready within a few seconds of the runner starting. However, particularly
when there are disk I/O issues (see the tmpfs_enabled
feature above), the Docker daemon may take longer to respond.
Add annotations to runner Pods once they start running a job
You can now configure the runner Pods to add annotations to themselves once they start running a job. The idea is to
allow you to have idle pods allow themselves to be interrupted, but then mark themselves as uninterruptible once they
start running a job. This is done by setting the running_pod_annotations
field in the runners
map. For example:
running_pod_annotations:
# Prevent Karpenter from evicting or disrupting the worker pods while they are running jobs
# As of 0.37.0, is not 100% effective due to race conditions.
"karpenter.sh/do-not-disrupt": "true"
As noted in the comments above, this was intended to prevent Karpenter from evicting or disrupting the worker pods while they are running jobs, while leaving Karpenter free to interrupt idle Runners. However, as of Karpenter 0.37.0, this is not 100% effective due to race conditions: Karpenter may decide to terminate the Node the Pod is running on but not signal the Pod before it accepts a job and starts running it. Without the availability of transactions or atomic operations, this is a difficult problem to solve, and will probably require a more complex solution than just adding annotations to the Pods. Nevertheless, this feature remains available for use in other contexts, as well as in the hope that it will eventually work with Karpenter.
Bugfix: Deploy webhook secret when existing secret is not supplied
Because deploying secrets with Terraform causes the secrets to be stored unencrypted in the Terraform state file, we give users the option of creating the configuration secret externally (e.g. via SOPS). Unfortunately, at some distant time in the past, when we enabled this option, we broke this component insofar as the webhook secret was no longer being deployed when the user did not supply an existing secret. This PR fixes that.
The consequence of this bug was that, since the webhook secret was not being deployed, the webhook did not reject unauthorized requests. This could have allowed an attacker to trigger the webhook and perform a DoS attack by killing jobs as soon as they were accepted from the queue. A more practical, though unintentional, consequence was that if a repo webhook was installed alongside an org webhook, there was no guard against the webhook server receiving the same payload twice when one of the webhooks was missing the secret or had the wrong secret.
Bugfix: Restore proper order of operations in creating resources
In release 1.454.0 (PR #1055), we reorganized the RunnerDeployment template in the Helm chart to put the RunnerDeployment resource first, since it is the most important resource, merely to improve readability. Unfortunately, the order of operations in creating resources is important, and this change broke the deployment by deploying the RunnerDeployment before creating the resources it depends on. This PR restores the proper order of operations.