Skip to main content

Alerting

Learn how to set up an effective alerting system that notifies the right people when issues arise. This involves configuring OpsGenie for incident management and integrating it with Datadog for monitoring, ensuring alerts are properly escalated and managed to prevent alert fatigue and system failures.

Alerting notifies users about changes in a system’s state, typically via emails or text messages sent to specific individuals or groups.

Alerting goes hand in hand with monitoring. When a system is monitored properly, only the right people are notified when something goes wrong.

important

This Quick Start assumes you have read through the Monitoring Quick Start

The Problem

When a system is not monitored properly, it's easy to get overwhelmed with alerts. This leads to alert fatigue. Alert fatigue is a real problem that can lead to a system being ignored and eventually failing.

Furthermore, a system must be in place for alerts to be escalated to the right team and people. When a single system goes down, there are often cascading effects on other services.

Our Solution

Our solution begins with OpsGenie. OpsGenie is a modern incident management platform for operating always-on services, empowering Dev and Ops teams to plan for service disruptions and stay in control during incidents.

We integrate OpsGenie with Datadog to pair it with our monitoring solution.

Implementation

We have a singular component that can be instanced for every team. This component is called opsgenie-team and it handles the surrounding work of setting up a team in OpsGenie.

To get started with this component, you'll need an OpsGenie API key, which you can get from the OpsGenie API page.

Follow the Component README to get started, this will create a catalog entry which should be configured to your company's defaults for every team. Then start by creating a team for each team in your organization.

How it works

Our Monitors have a global configurable variable called alert_tags, this should be set to include @opsgenie-{{team.name}}, such as:

alert_tags: ["@opsgenie-{{team.name}}"]

This creates a message on the Datadog monitor that uses the team tag to send the alert to the correct team in OpsGenie. When an event or alert is triggered, the data's tag of team will be used to dynamically send an alert to the corresponding team in OpsGenie. An important distinction in that the tag is not fetched from the monitor, but the data sent to the monitor.

Having the data's tag of team being used is beneficial, monitors can be configured to send alert to different teams. For example, you can have a single monitor for pods crashlooping on EKS, if each deployment is properly labeled with the team:foo or team:bar tag, then the alert will be sent to the correct team.


Service Level Indicators (SLIs) and Service Level Objectives (SLOs)

SLIs and SLOs are a way to measure the reliability of a service. They are a way to measure the quality of a service.

Sometimes a business has contractual obligations to their SLOs. For example, a business may have a contractual obligation to have 99.9% uptime. This means that the service must be available 99.9% of the time.

Datadog supports SLOs, they can be a set of metrics or monitors. For example, you can have a monitor that checks if a service is up or down. This monitor can be used as an SLO.

You can then use SLO Monitors to report Incidents in OpsGenie, when you create a team, you can specify the level of priority that is considered an Incident. When the alert matches the priority level of an Incident, an Incident is created in OpsGenie. An Incident is a specialized alert that is used to track the progress of an issue, it can have cascading effects on other services.


References

tip

This article goes more in-depth on some of the above topics.

FAQ

How do I set up SSO with OpsGenie?

There are official docs on how to configure SSO/SAML. Those should suffice for using AWS Identity Center. AWS also has docs on adding SAML applications which includes an official config for OpsGenie already. If you don't plan on using AWS IC, there are docs on configuring SSO with other identity providers.