Skip to main content

Decide on Incident Ruleset

Context and Problem Statement

We need to decide the rules that make an alert an incident. This ruleset could be based on priority-level of the alert, message, or by tag.

Opsgenie can escalate an alert into an incident, this marks the alert as more severe and needs more attention than a standard alert. See How to Implement Incident Management with OpsGenie for more details on what an Incident is.

info

Picking a standard here provides a clear understanding to when an alert should become an incident, ideally this is not customized by each team.

Considered Options

tip

Recommended because maps 1-1 with Datadog Severity and provides a clear understanding

Pros

  • Priority is a first-class field in Datadog and Opsgenie

  • Directly maps to Datadog severity level in monitors.

  • P1 & P2 Are considered Critical and High priority, allowing slight variation in the level of incidents.

  • Dynamic based on the Monitoring Platform (e.g. Datadog can say if this alert happens 5x in 1 min, escalate priority)

Option 2 - Priority Level Based (Other)

This could be only P1 or any range.

Pros

  • Directly maps to Datadog severity level in monitors.

  • Dynamic based on the Monitoring Platform (e.g. Datadog can say if this alert happens 5x in 1 min, escalate priority)

Option 3 - Tag Based

Tag based approach would mean any monitor that sends an alert with a tag incident:true becomes an incident.

Pros

  • Dynamic based on the Monitoring Platform (e.g. Datadog can say if this alert happens 5x in 1 min, escalate priority)

Cons

  • Incidents can now be defined in more than one way

  • An extra field must be passed

  • Puts definition of an incident on the monitoring platform.

References