A postmortem (or post-mortem) is a process intended to help you learn from past incidents. Examples include a service going down in production and causing some form of customer impact, or a production deployment failing.
Postmortems typically involve an analysis or discussion soon after an event has taken place. Similar to retrospectives, they provide an excellent opportunity for teams/departments to reflect, learn and improve.
We could perhaps find a nicer word for postmortems such as “Incident Debriefing” or “Incident Review”.
Having read many stories of how the aviation industry used them to dramatically increase safety levels, and having seen first-hand how Ops teams used them effectively, I decided it was time to start using them with development teams. So this is a simple blog post with some first-hand observations of postmortems.
10 Postmortem Tips
- Timely: Schedule the postmortem ASAP, while it’s fresh in people’s heads.
- Facilitator: A facilitator organises and runs the meeting, including the initial prep. Often the person who led the incident response will lead the postmortem, but you don’t have to have been involved in the incident response to facilitate. Sometimes not being involved is a positive, as you may be more objective and neutral.
- Audience: Invite everyone who was involved in the incident, as well as those impacted, e.g. members of the support team and account managers. This builds trust with others in the organisation and provides a feedback mechanism.
- Prep beforehand: Begin to document the timeline of events before the meeting. This makes more efficient use of people’s time in the postmortem. The timeline should include:
  - Investigation steps
  - Communication during the incident
  - Recovery and mitigation steps
- Standard Template: Use a standard template. If your company does not have one, there are plenty online.
- Positives: Don’t forget to focus on what went well.
- Stay focused: The meeting should focus on:
  - A summary of events that transpired
  - How the response was handled
  - What resolution steps were taken
  - Preventive follow-up actions
- Not just for major incidents: This is a cultural shift to a learning mindset, so use postmortems for near misses too.
- Actions and Sharing:
  - Take note of the key preventive actions
  - Assign them to teams/individuals with due dates
  - Share the document in the relevant communication channels that your company uses
- Blameless: Focus on the root cause, learnings and future improvements. One suggestion is to name teams/departments rather than individuals. This builds trust with people, which leads to better postmortems and encourages people to attend them regularly. Etsy considers blameless postmortems key to its success.
“…and that they can give a detailed account without fear of punishment or retribution.”
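If your company has no standard template yet, the tips above on prep, positives and actions can be captured in a small structure. What follows is only a minimal sketch in Python: the names `Postmortem`, `ActionItem` and `render`, the section headings and the sample incident are all made up for illustration and are not taken from any particular template.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    """A preventive follow-up action (hypothetical structure)."""
    description: str
    owner: str  # a team or department rather than an individual, to stay blameless
    due: date


@dataclass
class Postmortem:
    """A bare-bones postmortem document (hypothetical structure)."""
    title: str
    summary: str
    timeline: list[str] = field(default_factory=list)   # investigation, communication, recovery steps
    went_well: list[str] = field(default_factory=list)  # positives worth repeating
    actions: list[ActionItem] = field(default_factory=list)

    def render(self) -> str:
        """Return the document as plain text, ready to share."""
        lines = [self.title, "", "Summary", self.summary, "", "Timeline"]
        lines += [f"- {entry}" for entry in self.timeline]
        lines += ["", "What went well"]
        lines += [f"- {item}" for item in self.went_well]
        lines += ["", "Preventive follow-up actions"]
        lines += [
            f"- {a.description} (owner: {a.owner}, due: {a.due.isoformat()})"
            for a in self.actions
        ]
        return "\n".join(lines)


# Invented sample data, purely for illustration.
review = Postmortem(
    title="Incident review: failed production deployment",
    summary="A production deployment failed and was rolled back.",
    timeline=[
        "Deployment started",
        "Errors reported by the support team",
        "Rollback completed",
    ],
    went_well=["The rollback was quick"],
    actions=[ActionItem("Add a pre-deployment smoke test", "Platform team", date(2030, 1, 31))],
)
print(review.render())
```

Filling in the timeline before the meeting and leaving the actions empty until the discussion keeps the meeting itself focused on the response and the follow-ups.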
Blameless postmortems can have a huge effect on how teams learn from both incidents and near misses, so why not try running a postmortem next time you have an incident impacting customers?
My key takeaway is that you may think the root cause of incidents is technical in nature, but a breakdown in communication is typically the underlying reason. I guess that’s why the aviation industry and Etsy enjoyed such success with postmortems.