Decide on Self-Hosted GitHub Runner Strategy
Problem
GitHub Actions enables organizations to run their own runners (aka workers) free-of-charge. Only problem is we have to host them and there’s a couple of ways of doing it, each with their own pros and cons. For the most part, we don’t consider it optional to deploy runners, if using GitHub Actions.
Considerations
We have prior art for the following strategies. The strategies are not mutually exclusive, but most often we see companies only implement one solution.
Considerations
GitHub Actions is the CI/CD platform offered by GitHub. Its main advantages are that, firstly, it comes at no additional costs on top of the cost of seats for users within the GitHub organization, when using self-hosted Runners. Secondly, there is a thriving ecosystem of open source Actions that can be used within GitHub Action workflows.
Self-hosted Runners on EC2
GitHub Actions Runners can be self-hosted on EC2 or on Kubernetes clusters. This is probably the easiest to deploy and understand how it works, but it’s the least optimal way of managing runners.
On EC2, the Runner executable can be installed using a user-data script, or baked into the AMI of the instance to be deployed to EC2. The runner can be deployed to an Auto Scaling Group (ASG), which usually presents a de-registration problem when the ASG scales in, however EventBridge and AWS Systems Manager can be utilized in tandem to have the runner de-register prior to being terminated (see: Cloud Posse's github-runners component which has this implementation).
It’s also possible to use Spot Instances. Easily use AWS SSM to connect to runners. Autoscale anytime sustained CPU capacity > 5% for ~5 minutes (e.g. doing anything). Scale down when CPU capacity is < 2% for 45 minutes (e.g. doing nothing). Requires a minimum of 1 node online. This is a good route if builds need to run on multiple kinds of architectures (e.g. ARM64 for M1), for which operating the kubernetes node pools would be cumbersome.
If we go this route, we’ll also want to determine if we should deploy the Datadog Monitoring Agent on the nodes.
https://github.com/cloudposse/terraform-aws-ec2-autoscale-group
Self-hosted Runners on Kubernetes
This is our recommended approach
Deploying these Runners on Kubernetes is possible using actions-runner-controller. With this controller, a small-to-medium-sized cluster can house a large number of Runners (depending on their requested Memory and CPU resources), and these Runners can scale automatically using the controller’s HorizontalRunnerAutoscaler
CRD. This has the benefit that it can scale to zero and leverages all the monitoring we have on the platform. This solution also allows for using a custom runner image without having to rebuild an AMI or modify a user-data script and re-launch instances, which would be necessary when deploying the Runners to EC2.
actions-runner-controller
also supports several various mechanisms for scaling the number of Runners: PercentageRunnersBusy
simply scales the Runners up or down based on how many of them are currently busy, without having to maintain a list of repositories used by the Runners, which would be the case in TotalNumberOfQueuedAndInProgressWorkflowRuns
. The most efficient and recommended option for horizontal auto-scaling using the actions-runner-controller
, however, is to enable the controller’s webhook server and configure the HorizontalRunnerAutoscaler
to scale on GitHub webhook events (for event name: check_run
, type: created
, status: queued
). Note that the actions-runner-controller
does not have any logic to automatically create the webhook configuration in GitHub, and hence, the webhook server needs to be exposed and configured manually in GitHub or using the GitHub API. If using aws-load-balancer-controller
, ensure that within actions-runner-controller
Helm chart, githubWebhookServer.ingress.enabled
is set to true
, and if using external-dns
, set githubWebhookServer.ingress.annotations
to include an external-dns.alpha.kubernetes.io/alias
. Finally, configure the webhook in GitHub to match the hostname and port of the endpoint corresponding to the newly-created Ingress object.
In general, Cloud Posse recommends using actions-runner-controller
over EC2-based Runners due to the flexibility in runner sizing, choice of container image, and advanced horizontal scaling options. If however, EC2 Runners need to be utilized due to specific requirements such as a build environment on ARM-based instances, then that option is recommended as well.
https://github.com/summerwind/actions-runner-controller
Repository-wide or Organization-wide Runners
Self-hosted GitHub Actions Runners can be made to be either repository-wide or organization-wide. Runners registered for a specific repository can only run for workflows corresponding to that repository, while Runners registered for an organization can run for any workflow for any repository in an organization, provided that the labels selected by the runs-on
attribute in the workflow definition match the labels corresponding to the runner. Repo-level runners have the befit of reduced scope for PAT, however, pools are not shared across repos so there are wasted resources.
In general, Cloud Posse recommends choosing Organization-wide Runners and ensuring horizontal scaling is configured to adequately respond to an influx of queued runs (see the previous section).
Labeling Runners
Whenever a GitHub Actions Runner is registered, it provides a list of labels to GitHub. Then, workflow definitions can specify which Runners to run on.
For example, if the workflow syntax specifies runs-on: [self-hosted, linux]
, then the runner must be registered with the label linux
.
In another example, if two Runners are registered, one with the labels linux
, ubuntu
, and medium
, and one with the labels linux
, ubuntu
, and large
, and workflow A specifies runs-on: [self-hosted, ubuntu]
and workflow B specifies runs-on: [self-hosted, ubuntu, large]
, then:
-
Workflow A can run on both the first runner and the second runner. It’ll run on whichever is available.
-
Workflow B can only run on the second runner.
Some advanced configurations may involve creating multiple RunnerDeployment
CRDs (actions-runner-controller
) which use different container images with different linux distributions or with different packages installed, then naming the labels accordingly. In general, Cloud Posse’s recommendation is to create meaningful runner labels that can be later referenced by developers writing GHA Workflow YAML files.
Integration with AWS
The GitHub Actions Runners often need to perform continuous integration tasks such as write to S3 or push container images to ECR. With GitHub-hosted Runners this has historically been difficult but is now made possible using GitHub’s support for OIDC, which allows for creating a trust relationship between GitHub-hosted Runners and an IAM role in one of the organization’s AWS account (preferably a dedicated automation
account). For actions-runner-controller
, this has been historically possible for a longer time now on self-hosted GitHub Actions Runners running on actions-runner-controller
using EKS cluster’s OIDC provider (see: IRSA). If using GHA Runners on EC2, an EC2 instance profile can be created, allowing the instances to assume an IAM role.