Here is the general format of Postmortems writeups that we use at Cloud Posse.
Incident #1 - Descriptive Name of Incident (2018-06-06)
|On-call Engineer||John Doe|
|Start Time||2018-06-06 12:58 PST|
|End Time||2018-06-06 13:55 PST|
|Total Duration||39 Minutes|
|Affected Systems||Production Web Tier|
A brief explanation of what happened and what went wrong.
This should read like an executive summary of the problem. It should include:
- When did it start, end and total duration
- It should be entirely blameless - do not assign blame to individuals, only to processes
- List affected systems and if customers were affected
Explain why the outage happened. Ask “why” for each reason given.
- Reason 1
- Reason 2
- Reason 3
- Reason 4
- Reason 5
Explain what is not understood, even after the post-mortem.
List of actions performed to resolve the problem:
- E.g. rolled back site to the previous version
What did not help/work:
These attempted remediations had no positive impact.
- E.g. rebooting nodes, restarting processes
List short-term planned fixes with links to the ticketing system. These are fixes that should be carried out ASAP.
- Fix 1
- Fix 2
- Fix 3
- Fix 4
- List other suggestions to reduce/mitigate/eliminate the root cause
- Include screenshots from monitoring systems (E.g. Pingdom, Datadog, Prometheus, etc)
Related Post Mortems
- List any related Post Mortems
- Include links to relevant documentation, GitHub Issues, etc.