Version 1 Documentation
This is Version 1 documentation for the Cloud Posse Reference Architecture. To determine which version you're using, please see Version Identification. To understand why this version changed, please see the blog post.

Decide on Incident Ruleset

Context and Problem Statement

We need to decide which rules turn an alert into an incident. The ruleset could be based on the alert's priority level, its message, or a tag.

Opsgenie can escalate an alert into an incident, which marks the alert as more severe and in need of more attention than a standard alert. See How to Implement Incident Management with OpsGenie for more details on what an Incident is.

info

Picking a standard here gives everyone a clear understanding of when an alert should become an incident. Ideally, this is not customized by each team.

Considered Options

Option 1 - Priority Level Based (P1 & P2)

tip

Recommended because it maps one-to-one with Datadog severity and gives a clear understanding of when an alert becomes an incident.

Pros

  • Priority is a first-class field in Datadog and Opsgenie

  • Directly maps to Datadog severity level in monitors.

  • P1 and P2 are considered Critical and High priority, allowing slight variation in the level of incidents.

  • Dynamic based on the monitoring platform (e.g. Datadog can escalate the priority if an alert fires 5 times in 1 minute).
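A priority-based ruleset can be sketched in a few lines. The following Python is a minimal illustration, not the Opsgenie or Datadog API: the `Alert` class, `INCIDENT_PRIORITIES`, and `is_incident` are hypothetical names introduced here, and the sketch assumes the alert arrives with an Opsgenie-style priority label from P1 to P5.

```python
from dataclasses import dataclass

# Opsgenie-style priority labels; P1 is the most severe.
PRIORITY_NAMES = {
    "P1": "Critical",
    "P2": "High",
    "P3": "Moderate",
    "P4": "Low",
    "P5": "Informational",
}

# Hypothetical ruleset: only P1 and P2 alerts become incidents.
INCIDENT_PRIORITIES = frozenset({"P1", "P2"})


@dataclass(frozen=True)
class Alert:
    """Illustrative alert record; not an Opsgenie or Datadog type."""
    message: str
    priority: str  # expected to be one of "P1".."P5"


def is_incident(alert, incident_priorities=INCIDENT_PRIORITIES):
    """Return True when the alert's priority falls inside the ruleset."""
    if alert.priority not in PRIORITY_NAMES:
        raise ValueError(f"unknown priority: {alert.priority!r}")
    return alert.priority in incident_priorities
```

Because the rule only reads the priority field, the monitoring platform stays free to raise or lower that priority dynamically, and the incident decision follows without any extra field. Passing a different set, for example `frozenset({"P1"})`, expresses a narrower range with the same function.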

Option 2 - Priority Level Based (Other)

This could be only P1 or any range.

Pros

  • Directly maps to Datadog severity level in monitors.

  • Dynamic based on the monitoring platform (e.g. Datadog can escalate the priority if an alert fires 5 times in 1 minute).

Option 3 - Tag Based

A tag-based approach would mean that any alert sent by a monitor with the tag incident:true becomes an incident.
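As a rough sketch of the tag-based rule, the check below scans an alert's tags for incident:true. The function name `has_incident_tag` is hypothetical, tags are assumed to arrive as `key:value` strings, and the case-insensitive comparison is an assumption made for this illustration rather than documented platform behavior.

```python
def has_incident_tag(tags):
    """Return True if any tag in the list is incident:true.

    Assumes tags are "key:value" strings; matching is case-insensitive
    and ignores surrounding whitespace (an assumption for this sketch).
    """
    for tag in tags:
        key, sep, value = tag.partition(":")
        if not sep:
            continue  # a bare tag such as "incident" carries no value
        if key.strip().lower() == "incident" and value.strip().lower() == "true":
            return True
    return False
```

Note that this rule depends on every monitor passing the extra tag correctly, which is the trade-off listed under Cons.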

Pros

  • Dynamic based on the monitoring platform (e.g. Datadog can escalate the priority if an alert fires 5 times in 1 minute).

Cons

  • Incidents can now be defined in more than one way.

  • An extra field must be passed.

  • Puts the definition of an incident on the monitoring platform.
