Wheel of Misfortune is a game that aims to build the confidence of oncall engineers through
simulated outage scenarios.
With the game, you practice debugging problems under stress, following the incident
management protocol, and communicating effectively with other engineers
on your team and in your organization. It is a great way to train new hires, interns, and
seasoned engineers to become well-rounded oncall engineers.
Terminology
- Scenario: A past or fictional incident case.
- Game Master: The host-coordinator of the session.
- Volunteer: The trainee oncall engineer.
Feel free to fork the repository or download the stable release.
Insert your incident scenarios into the general_incidents.json file inside the incidents/
folder. The file has the following format:
- title: the title of the incident.
- scenario: the description of the incident. It is useful to include URLs from monitoring
systems, dashboards, time-series databases and playbooks.
- difficulty: the difficulty level of the outage.
- ID: the unique ID of the outage (you can just auto-increment).
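As a rough illustration of the fields described above, the following Python sketch builds two
sample entries, checks them, and serializes them to JSON. Several details are assumptions, not
taken from the project: that the file is a top-level JSON array, that the keys are spelled
exactly as in the field list (including the uppercase ID), and that difficulty is a number. The
sample incidents, the example.com URLs, and the validate helper are hypothetical. Compare
against the general_incidents.json shipped with the repository before relying on it.

```python
import json

# Hypothetical sample incidents; key spelling and value types are assumptions.
incidents = [
    {
        "title": "Elevated 5xx rate on the checkout API",
        "scenario": "Error-rate alert fired at 02:14. "
                    "Dashboard: https://grafana.example.com/d/checkout",
        "difficulty": 2,
        "ID": 1,
    },
    {
        "title": "Primary database disk almost full",
        "scenario": "Disk usage alert at 95%. "
                    "Playbook: https://wiki.example.com/playbooks/db-disk",
        "difficulty": 3,
        "ID": 2,
    },
]

REQUIRED_FIELDS = ("title", "scenario", "difficulty", "ID")


def validate(entries):
    """Return a list of problems; an empty list means every entry looks complete."""
    problems = []
    seen_ids = set()
    for index, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
            continue
        if entry["ID"] in seen_ids:
            problems.append(f"entry {index}: duplicate ID {entry['ID']}")
        seen_ids.add(entry["ID"])
    return problems


# JSON text that could be saved as incidents/general_incidents.json.
serialized = json.dumps(incidents, indent=2)
```

Running validate before saving catches missing fields and reused IDs, which is helpful when
IDs are incremented by hand.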
Game Master
- Choose a volunteer to be the primary oncall engineer in front of the group.
- Find a balance between the volunteer's experience and the incident's difficulty.
- Assist the volunteer by answering questions that may arise from each theoretical action or
dashboard observation.
- Engage with the rest of the team and ask for different ways to debug the problem
following the volunteer's explanation.
- Team members may be made available over time for assistance in various topics.
- At the end, have a debrief on the learnings of the session.
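When preparing a session, one way to balance the volunteer's experience against scenario
difficulty is to limit the pool of candidate scenarios before choosing one. The sketch below is
only an illustration of that idea, not part of the game itself. It assumes each incident is a
dictionary with a numeric difficulty field; the pick_scenario helper and the sample data are
hypothetical.

```python
import random


def pick_scenario(incidents, max_difficulty, rng=random):
    """Pick a random incident whose difficulty does not exceed max_difficulty."""
    candidates = [item for item in incidents if item["difficulty"] <= max_difficulty]
    if not candidates:
        raise ValueError("no scenario at or below the requested difficulty")
    return rng.choice(candidates)


# Hypothetical scenarios; a numeric difficulty scale is an assumption.
sample_incidents = [
    {"title": "Expired TLS certificate", "difficulty": 1},
    {"title": "Elevated 5xx rate on the checkout API", "difficulty": 2},
    {"title": "Cross-region replication lag", "difficulty": 4},
]

# A new hire might be capped at difficulty 2, a seasoned engineer at 4.
chosen = pick_scenario(sample_incidents, max_difficulty=2)
print(chosen["title"])
```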
Volunteer
- Spin the wheel and attempt to fix the theoretical outage scenario.
- Explain to the Game Master and the rest of the group what actions you would take (lookup
queries, checks in dashboards, etc.) to find the root causes and eventually solve the
incident.
- Always keep an eye on the time, since this is a simulated incident response scenario and not
a routine troubleshooting process. During a real incident you might face an SLA or SLO
breach, so you should take timing into account.
- Engage with the rest of the group. Keep them in the loop. Ask questions to different
members depending on their expertise.
Most importantly, have fun!
You can read a comprehensive example of how to conduct the exercise here.