Decide on Incident Ruleset
Context and Problem Statement
We need to decide the rules that make an alert an incident. This ruleset could be based on priority-level of the alert, message, or by tag.
Opsgenie can escalate an alert into an incident, this marks the alert as more severe and needs more attention than a standard alert. See How to Implement Incident Management with OpsGenie for more details on what an Incident is.
Picking a standard here provides a clear understanding to when an alert should become an incident, ideally this is not customized by each team.
Considered Options
Option 1 - Priority Level Based (P1 & P2) (Recommended)
Recommended because maps 1-1 with Datadog Severity and provides a clear understanding
Pros
-
Priority is a first-class field in Datadog and Opsgenie
-
Directly maps to Datadog severity level in monitors.
-
P1 & P2 Are considered Critical and High priority, allowing slight variation in the level of incidents.
-
Dynamic based on the Monitoring Platform (e.g. Datadog can say if this alert happens 5x in 1 min, escalate priority)
Option 2 - Priority Level Based (Other)
This could be only P1 or any range.
Pros
-
Directly maps to Datadog severity level in monitors.
-
Dynamic based on the Monitoring Platform (e.g. Datadog can say if this alert happens 5x in 1 min, escalate priority)
Option 3 - Tag Based
Tag based approach would mean any monitor that sends an alert with a tag incident:true
becomes an incident.
Pros
- Dynamic based on the Monitoring Platform (e.g. Datadog can say if this alert happens 5x in 1 min, escalate priority)
Cons
-
Incidents can now be defined in more than one way
-
An extra field must be passed
-
Puts definition of an incident on the monitoring platform.