The ongoing pandemic offers a reminder of the importance of being prepared when something big goes wrong. When the threat of Covid-19 became evident, disruptions in supply chains, lack of PPE inventory and equipment, and conflicting health policies hindered the ability to respond with optimal clarity and speed.
The costs of IT downtime can be huge
Major, or high-severity, incidents are those with a significant impact on the business. In organizations that rely heavily on IT systems, these incidents don’t occur often, but when they do, a rapid, planned response is critical. The cost of IT downtime can be huge: estimates have ranged from $427 per minute for small businesses to $9,000 per minute for medium and large companies. For ecommerce giant Amazon, the cost of downtime was estimated a few years ago at over $220,000 per minute.
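To put the per-minute estimates above in perspective, the short sketch below scales them to a full outage. It assumes cost grows linearly with duration, which is a simplification; the one-hour outage length and the function name `downtime_cost` are illustrative choices, not figures or terms from the estimates themselves.

```python
# Per-minute downtime cost estimates quoted in the article (USD).
# The Amazon figure was described as "over" this amount, so treat it as a lower bound.
COST_PER_MINUTE = {
    "small business": 427,
    "medium or large company": 9_000,
    "Amazon (historical estimate)": 220_000,
}


def downtime_cost(cost_per_minute: float, minutes: float) -> float:
    """Estimate outage cost, assuming cost scales linearly with duration."""
    return cost_per_minute * minutes


OUTAGE_MINUTES = 60  # hypothetical one-hour outage

for label, rate in COST_PER_MINUTE.items():
    print(f"{label}: ${downtime_cost(rate, OUTAGE_MINUTES):,.0f}")
```

Under these assumptions, a single hour of downtime costs roughly $25,620 for a small business and $540,000 for a medium or large company.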
Day-to-day incident-management processes are typically effective in addressing the large volume of relatively low-impact IT incidents and service requests. The trend has been for these processes to shift towards self-service, automation and asynchronous engagement with support staff (e.g., email interactions with global call centers). Service-desk personnel with limited training and technical skills can handle day-to-day incident-management functions through basic diagnostics, binary decision/knowledge trees and scripted responses. More difficult issues are routed to second- and third-tier escalation teams with technical expertise, but the goal is still to apply the least-technical and cheapest resources available to resolve the incident.
Major incidents are different from their smaller, day-to-day counterparts and require a separate approach. A normal incident typically affects only a few users. For major incidents, the cost of the impact far exceeds the cost of resolution. The key success factors are response time and quality of the response. Time is of the essence, so the goal is to assign the people who can resolve the incident fastest and so minimize business impact. These resources are typically highly trained (and highly paid) subject-matter experts with extensive experience and deep technical troubleshooting skills. The goals are to respond quickly, resolve the immediate impacts, preserve the organization’s reputation and mitigate the operational and customer risk.
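The trade-off described above can be illustrated with a rough calculation. In the sketch below, only the $9,000-per-minute downtime figure comes from the estimates quoted earlier; the hourly rates, resolution times and all names (`Responder`, `total_incident_cost`) are hypothetical values chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Responder:
    name: str
    hourly_rate: float         # USD per hour (hypothetical)
    minutes_to_resolve: float  # expected time to restore service (hypothetical)


def total_incident_cost(responder: Responder, downtime_cost_per_minute: float) -> float:
    """Labor cost plus business impact for the time the incident stays open."""
    labor = responder.hourly_rate * responder.minutes_to_resolve / 60
    impact = downtime_cost_per_minute * responder.minutes_to_resolve
    return labor + impact


DOWNTIME_COST_PER_MINUTE = 9_000.0  # medium/large company estimate quoted earlier

responders = [
    Responder("tier-1 service desk", hourly_rate=40.0, minutes_to_resolve=240),
    Responder("subject-matter expert", hourly_rate=250.0, minutes_to_resolve=45),
]

for r in responders:
    print(f"{r.name}: ${total_incident_cost(r, DOWNTIME_COST_PER_MINUTE):,.2f}")

# Pick whoever minimizes total cost, not whoever has the lowest hourly rate.
best = min(responders, key=lambda r: total_incident_cost(r, DOWNTIME_COST_PER_MINUTE))
print(f"Lowest total cost: {best.name}")
```

With these made-up inputs, the expert's higher hourly rate adds less than $200 in labor, while the faster resolution avoids well over a million dollars of impact. That is why the cheapest-resource logic of day-to-day incident management does not carry over to major incidents.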
Managing perceptions is critical
During an active incident, support staff and executives should rely on major incident management resources to help them take control of the end-to-end process and guide their activities through:
- Understanding the incident and symptoms
- Mitigating the impact and managing risks
- Making sure decisions are visible and data-driven
- Assessing possible causes (if necessary)
- Managing perceptions and expectations
- Returning to normal
Managing major incidents poorly can be disastrous. Controlling the flow of communications and managing perceptions are critical to major incident management. If the official messages from the major incident-management team are not clear and timely, there is a risk that misinformation will overpower the official messages, resulting in greater confusion and a negative customer experience.
In addition to their overall technical and performance impact, major incident activities often extend across business function boundaries, raising questions about decision-making authority. This is a high-stakes environment where management must weigh the expected outcomes of certain actions against their risks. That requires clear, accessible data not only on what is known but also on what is not known. A major incident management process should include cross-functional decision-making guidelines to avoid delays and confusion while a major incident is active.
Don’t stop once the incident is managed
The challenges of major incident management don’t end when service is restored. As with normal incident management processes, the primary objective during a “live” major incident is to mitigate impact and take corrective action to return the business to normal operations. Once that is done, problem management kicks in and the root cause needs to be fully understood. Identifying the root cause and implementing actions to prevent the issue from recurring can be challenging. Amid the confusion of managing the active major incident, critical diagnostic information is often lost or destroyed, impeding root-cause identification. To achieve true IT stability, an integrated major incident and problem management process is needed to secure and document critical “cause information” and ensure that service improvement continues.
The costs and threats of IT downtime can be huge. An investment in major incident response is critical to maintaining IT stability and ongoing business success.
About Kepner-Tregoe
Software and templates don’t solve problems. People solve problems!
What kind of people? People who are curious, ask great questions, make decisions based on facts, and are empowered to lead. They remain focused under pressure and act confidently to do what needs to be done. You’ll find these problem-solving leaders both at our clients and here at Kepner-Tregoe. For over 60 years, Kepner-Tregoe has empowered thousands of companies to solve millions of problems. If we can save millions for a manufacturer, restore IT service for a stock exchange, and help Apollo 13 get back from space, we can help your business achieve success.