How to Setup Datadog Cluster Checks and Network Monitors for External URLs of Applications
Problem
We often want a lightweight way to ensure our endpoints remain healthy. In kubernetes this requires that a loadbalancer be setup with the right annotations and certs to allow traffic to your application. This creates dependencies on cert-manager and other platform tools. The Health Check of a kubernetes app will always use the local IP, which doesn’t really test your networking. We need a way to test that your apps are still ready to receive requests.
Solution
Cluster Checks are configured on the datadog agent, by specifying agent configuration you can set these checks to run once per cluster instead of once per node agent. These Checks test the validity of externally accessible URLs hosted in Kubernetes.
To get started follow this guide: https://docs.datadoghq.com/agent/cluster_agent/clusterchecks/?tab=helm
For helm that means ensuring the following values are set
datadog:
clusterChecks:
enabled: true
# (...)
clusterAgent:
enabled: true
External URLs
We then need to set particular helm values for each installation of the agent in each cluster. the external URL checks must be written into the agent configuration and cannot be dynamically loaded by annotations.
CloudPosse Datadog Agent now supports Cluster Checks this via this pr
Upgrade to the latest and add your network checks as yaml. This follows the same configuration as monitors. Where checks are deep merged and templated, they can be configured per environment.
Datadog Monitors
After your Cluster Checks are setup we need to create monitors for them.
Http check will verify successful HTTP checks on a URL
SSL check will verify the certificate of your URL
https-checks:
name: "(Network Check) ${stage} - HTTPS Check"
type: service check
query: |
"http.can_connect".over("stage:${stage}").by("instance").last(2).count_by_status()
message: |
HTTPS Check failed on <code>{{instance.name}}</code>
in Stage: <code>{{stage.name}}</code>
escalation_message: ""
tags:
managed-by: Terraform
notify_no_data: false
notify_audit: false
require_full_window: true
enable_logs_sample: false
force_delete: true
include_tags: true
locked: false
renotify_interval: 0
timeout_h: 0
evaluation_delay: 0
new_host_delay: 0
new_group_delay: 0
no_data_timeframe: 2
threshold_windows: { }
thresholds:
critical: 1
warning: 1
ok: 1