Putting Together A Top-Notch IT Incident Management Team

A human error—a very basic one—caused British Airways to suffer an IT outage on May 27, 2017, forcing it to cancel more than 400 flights and leaving 75,000 passengers stranded. An engineer had disconnected a power supply at a data center, and when it was plugged back in, a power surge caused major damage. Net cost to the airline: a whopping 80 million pounds (about $102 million).

This might sound like a lot of money, and it is, but according to Statista, it’s not unusual. For 86% of enterprises, the average cost of an hour of downtime exceeds $300,000. And the hours add up quickly.

The 2019 IT Outage Impact Study found that the typical organization experienced 10 brownouts (where infrastructure or software performs at a degraded level) or outright outages over the past three years. Those 10 incidents easily add up to millions of dollars.
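
To make that arithmetic concrete, here is a minimal sketch in Python. It assumes every incident costs the full $300,000 hourly figure (a brownout may well cost less), and the average durations are hypothetical values chosen for illustration, not figures from either study.

```python
# Back-of-the-envelope downtime cost estimate.
HOURLY_COST_USD = 300_000  # lower bound cited above for 86% of enterprises
INCIDENTS = 10             # typical count over three years, per the study


def downtime_cost(incidents: int, avg_hours: float, hourly_cost: float) -> float:
    """Total cost of a number of incidents with a given average duration."""
    return incidents * avg_hours * hourly_cost


# The average durations below are hypothetical, chosen only for illustration.
for avg_hours in (0.5, 1, 4):
    total = downtime_cost(INCIDENTS, avg_hours, HOURLY_COST_USD)
    print(f"{avg_hours} h per incident -> ${total:,.0f}")
```

Even at an assumed one hour per incident, 10 incidents come to $3 million.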

Not surprisingly, then, 80% of companies reported that the performance and availability of their IT infrastructure top their list of concerns. More than half worry about experiencing an outage so devastating that it will make the mainstream news. And if such an event occurs, 53% expect that heads will roll and that someone will lose their job.

And as much as it would be nice to simply automate responses to IT issues, “Incident response needs people, because successful incident response requires thinking,” wrote Bruce Schneier, in his blog, Schneier on Security, back in 2014. What you need: an IT (major) incident management team with clearly defined roles and responsibilities, trained to fulfill those responsibilities by following a crisis-proven process while effectively communicating with managers, customers and subject-matter experts alike.

The human side to outages

Herein lies the problem: staff and skills shortages are a significant challenge to responding to incidents effectively. Indeed, the Uptime Institute’s 2019 study calls the IT staffing problem a crisis. Sixty-one percent of respondents said they had difficulty retaining or recruiting staff, up from 55% the previous year.

This matters because 60% of organizations believe that their most recent significant downtime event was preventable. If they had better management, processes, or configurations, the outage could have been avoided, they say. For outages that cost more than $1 million, this figure leapt to 74%.

“By under-investing in training, failing to enforce policies, allowing procedures to grow outdated, and underestimating the importance of qualified staff, management sets the stage for a cascade of circumstances that leads to downtime,” wrote Kevin Heslin, chief editor of the Uptime Institute Journal, in a September 2019 blog post about the survey.

Staffing the IT incident management team

An incident is any unexpected event that disrupts normal operation of an IT service. IT incident management is the area of IT service management (ITSM) concerned with returning that service to normal as quickly as possible. Many IT incident management teams use established ITSM frameworks such as the IT Infrastructure Library (ITIL®) or COBIT. Others rely on a combination of proprietary best practices established over time.

Here are some of the most common IT incident management roles to hire and train for.

(Major) Incident managers

These people need to be “in control”. When something goes wrong, they provide immediate structure and leadership, and they are ultimately responsible for bringing services back to normal.

Acts as the central command for an incident
Facilitates the process, end-to-end
Manages involvement of resources
Drives the issue resolution process and tasks SMEs with specific analyses
Produces incident reports
Performs a post-mortem on critical incidents
Adds incidents to an ongoing knowledge base of incidents and solutions
Oversees all the processes involved in the designated incident management workflow
Ensures that incidents are resolved to the point that designated SLAs are met

Process owners

The process owner is responsible for the overall incident response process, including modifying it when necessary to make sure it’s aligned with business goals.

Delineates key performance indicators (KPIs) for determining how operations should function normally
Makes sure KPIs meet business goals
Designs, documents, reviews, and improves processes
Continuously learns from incidents to adjust any aspects of the process to meet overarching business goals

Tier 1 service desk personnel

The Tier 1 service desk is the first point of contact when anyone (a user, customer, manager, or anyone else in the organization) reports an incident. It is made up of people with a basic but broad working knowledge of the most common IT issues, such as password resets or printer problems, as well as solutions to known issues.

Does initial data gathering, assessment and diagnosis of any service report
Acts immediately to restore a failed IT service as quickly as possible
Escalates any issues that can’t be resolved immediately to the Tier 2 service desk
Records all service requests and resolution steps taken
Keeps the person who reported the incident informed about its status

Tier 2 support personnel

This level is typically staffed with people who have advanced knowledge of specific systems. Requests generally come when Tier 1 personnel escalate an issue that they can’t resolve.

Act as subject matter expert on a particular system, software, or technology
Diagnose the issue
Conduct RCA (root cause analysis)
Record everything done to resolve the incident for the knowledgebase
If the incident is resolved, confirm the resolution with the person who reported it
If the incident is unresolved, escalate it to Tier 3 and/or engineering
Deliver subject matter expertise
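
The hand-off between tiers can be sketched in a few lines of Python. This is an illustrative model only, not a feature of any particular ITSM tool: the Incident class, its fields, and the sample incident are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Tier names follow the article; everything else here is hypothetical.
TIERS = ["Tier 1", "Tier 2", "Tier 3 / engineering"]


@dataclass
class Incident:
    summary: str
    tier: int = 0          # index into TIERS; every incident starts at Tier 1
    resolved: bool = False
    log: List[str] = field(default_factory=list)

    def record(self, note: str) -> None:
        """Keep a running record of the steps taken, tagged with the tier."""
        self.log.append(f"{TIERS[self.tier]}: {note}")

    def resolve(self, note: str) -> None:
        """Mark the incident resolved and record how it was fixed."""
        self.record(f"resolved: {note}")
        self.resolved = True

    def escalate(self, reason: str) -> None:
        """Hand the incident to the next tier, if one exists."""
        if self.resolved:
            raise ValueError("cannot escalate a resolved incident")
        if self.tier == len(TIERS) - 1:
            raise ValueError("already at the highest tier")
        self.record(f"escalated: {reason}")
        self.tier += 1


incident = Incident("Payroll database is unreachable")
incident.record("gathered error messages from the reporting user")
incident.escalate("not a known issue; needs database expertise")
incident.resolve("restarted the failed replica and confirmed with the reporter")

for entry in incident.log:
    print(entry)
```

Each log entry carries the tier that wrote it, so the record of what was tried, and by whom, survives every hand-off.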

Conclusion

According to the 2019 IT Outage Impact Study, the top two missed opportunities to avoid outages were failing to identify when systems were near capacity and failing to identify when the performance of critical hardware, software, or network components was slowly but steadily degrading.
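
Both signals lend themselves to simple checks that support, rather than replace, the people watching the systems. The following Python sketch is one possible approach under stated assumptions: the 85% capacity threshold, the 1 ms-per-day slope limit, and the latency readings are all hypothetical, and the trend is estimated with an ordinary least-squares slope.

```python
from typing import Sequence


def near_capacity(used: float, total: float, threshold: float = 0.85) -> bool:
    """True when utilization has reached the (hypothetical) alert threshold."""
    return used / total >= threshold


def trend_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def steadily_degrading(samples: Sequence[float], max_slope: float) -> bool:
    """True when the metric worsens faster than max_slope per interval."""
    return trend_slope(samples) > max_slope


# Hypothetical daily response times in milliseconds (higher is worse).
latency_ms = [120, 124, 123, 129, 133, 131, 138]

print(near_capacity(used=870, total=1000))            # 87% utilization
print(steadily_degrading(latency_ms, max_slope=1.0))  # slope is about 2.8 ms/day
```

No single reading in the sample looks alarming on its own; it is the slope across the week that reveals the slow, steady degradation.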

These missed opportunities are primarily people issues, which can be resolved by putting robust but scalable processes and practices in place and training your IT staff to apply them. Questions to ask yourself when putting together your incident management team include:

Are you building IT capacity faster than hiring the resources to manage it?
Are you having difficulty hiring and retaining skilled IT workers?
Are your IT training and education programs suffering from lack of budget?

Because systems are only getting more complex, especially with cloud entering the picture, outages are going to continue. But many can be avoided, and the rest fixed much more quickly, by investing in the right skilled employees in the right positions, following proven best practices and processes.

About Kepner-Tregoe

Kepner-Tregoe has been the industry leader in problem-solving and service-excellence processes for more than 60 years. The experts at KT have helped companies raise their level of incident- and problem-management performance through tools, training and consulting – leading to highly effective service-management teams ready to respond to your company’s most critical issues.
