Skip to main content

How to Setup Datadog Cluster Checks and Network Monitors for External URLs of Applications

Problem

We often want a lightweight way to ensure our endpoints remain healthy. In kubernetes this requires that a loadbalancer be setup with the right annotations and certs to allow traffic to your application. This creates dependencies on cert-manager and other platform tools. The Health Check of a kubernetes app will always use the local IP, which doesn’t really test your networking. We need a way to test that your apps are still ready to receive requests.

Solution

TL;DR:
Use Cluster Network Checks to test your endpoints. These will test external URLs which helps ensure endpoints are healthy.

Cluster Checks are configured on the datadog agent, by specifying agent configuration you can set these checks to run once per cluster instead of once per node agent. These Checks test the validity of externally accessible URLs hosted in Kubernetes.

To get started follow this guide: https://docs.datadoghq.com/agent/cluster_agent/clusterchecks/?tab=helm

For helm that means ensuring the following values are set

datadog:
clusterChecks:
enabled: true
# (...)
clusterAgent:
enabled: true

External URLs

We then need to set particular helm values for each installation of the agent in each cluster. the external URL checks must be written into the agent configuration and cannot be dynamically loaded by annotations.

CloudPosse Datadog Agent now supports Cluster Checks this via this pr

Upgrade to the latest and add your network checks as yaml. This follows the same configuration as monitors. Where checks are deep merged and templated, they can be configured per environment.

Datadog Monitors

After your Cluster Checks are setup we need to create monitors for them.



info

Http check will verify successful HTTP checks on a URL

SSL check will verify the certificate of your URL


https-checks:
name: "(Network Check) ${stage} - HTTPS Check"
type: service check
query: |
"http.can_connect".over("stage:${stage}").by("instance").last(2).count_by_status()
message: |
HTTPS Check failed on <code>{{instance.name}}</code>
in Stage: <code>{{stage.name}}</code>
escalation_message: ""
tags:
managed-by: Terraform
notify_no_data: false
notify_audit: false
require_full_window: true
enable_logs_sample: false
force_delete: true
include_tags: true
locked: false
renotify_interval: 0
timeout_h: 0
evaluation_delay: 0
new_host_delay: 0
new_group_delay: 0
no_data_timeframe: 2
threshold_windows: { }
thresholds:
critical: 1
warning: 1
ok: 1