Implement Telemetry
Monitoring is a key component of any production system. It is important to have visibility into the health of your system and to be able to react to issues before they become problems.
The Problem
Monitoring is a difficult problem to solve. There are many different tools and services that can be used to monitor your system. It is important to have a consistent approach to monitoring that can be applied across all of your systems.
There is often a tradeoff between the cost of monitoring and the value it provides. It is important to have a monitoring solution that is cost effective and provides value to your organization. Another problem is when monitoring is configured incorrectly and causes more problems than it solves, usually seen through ignored alerts or no alerts at all.
Our Solution
We have developed a set of Terraform modules that can be used to deploy a monitoring solution for your system. These modules are designed to be used with Datadog. Datadog is a monitoring service that provides a wide range of features and integrations with other services.
We have broken down the monitoring solution into several components to make it easier to deploy and manage.
Implementation
Foundation
datadog-configuration
: This is a utility component. This component expects Datadog API and APP keys to be stored in SSM or ASM, it then copies the keys to SSM/ASM of each account this component is deployed to. This is for several reasons:- Keys can be easily rotated from one place
- Keys can be set for a group and then copied to all accounts in that group, meaning you could have a pair of api keys and app keys for production accounts and another set for non-production accounts. This component is required for all other components to work. As it also stores information about your Datadog account, which other components will use, such as your Datadog site url, along with providing an easy interface for other components to configure the Datadog provider.
datadog-integration
: This component is the core component binding Datadog to AWS, this component is deployed to every account and sets up all the Datadog Integration tiles with AWS. This is what provides the majority of your metrics to AWS!datadog-lambda-forwarder
: This component is an AWS Lambda function that ships logs from AWS to Datadog. Details of it can be found heredatadog-monitor
: This component deploys monitors via yaml configuration. When you vendor in this component you will find our catalog of pre-built monitors. We deploy this component to every account, our monitors have Terraform interpolation to allow you to set the thresholds for each monitor. This allows you to set different thresholds per stage using the same monitors but different configurations using familiar atmos inheritance.
EKS
datadog-agent
: This component deploys the Datadog agent on EKS, it also deploys the Datadog Cluster Agent, the agent is a daemonset that runs on every node in your cluster (with the exception of fargate (serverless) nodes). This component handles sending Kubernetes metrics, logs, and events to Datadog. This component also can deploy the Datadog Cluster Checks which are a way to run checks on your cluster from within the cluster itself, this is often a cheaper way than Synthetic Monitoring to monitor services in your cluster.datadog-private-location-eks
: This component deploys a private location for Synthetic Monitoring to your EKS cluster. This allows synthetic checks to run even inside a private cluster.
ECS
ecs-service
: This component contains variables that enable Datadog integration with ECS. For more information on how to deploy a service to ecs, see the ecs-service component, specifically thedatadog_agent_sidecar_enabled
variable.datadog-private-location-ecs
: This component deploys a private location for Synthetic Monitoring to your ECS cluster. This allows synthetic checks to run against your ECS cluster.
Additional
datadog-logs-archive
: This component creates a single log archive pipeline for each AWS account. Using this component you can setup multiple logs archive rules.datadog-synthetics
: This component deploys Datadog synthetic checks, which are external health checks for your services, similar to Pingdom.
References
- How to create a Synthetic and SLO
- How to Monitor a new Service
- How to Pass Tags Along to Datadog
- How to Provision and Tune Datadog Monitors by Stage
- How to Setup Datadog Cluster Checks and Network Monitors for External URLs of Applications
- How to Sign Up for Datadog?
- How to use Datadog Metrics for Horizontal Pod Autoscaling (HPA)
- Datadog Log Filtering