The Essential Guide to Incident Response

Sometimes it helps to go back to basics, such as reminding ourselves of the point of incident response (IR). The answer is simple: to keep the business running. But that simplicity is deceptive. This is an incredibly heavy responsibility, as you know if anything has ever gone wrong while you were responding to a major incident.

According to Gartner, every minute your systems are down costs $5,600 on average, which adds up to more than $300,000 per hour ($5,600 × 60 = $336,000). That’s a lot of money, and a lot of pressure.

At Kepner-Tregoe, we put our collective heads together and came up with seven best practices for ensuring the success of your IR program. Some are operational, some technical, and some organizational, but all of them contribute to building a first-class IR team.

Why Incident Response?

ITIL describes an incident as any interruption or disturbance to normal IT services. Let’s make it more personal to your business, and say that an incident is any circumstance in which a system behaves in a way that negatively impacts your customers. It doesn’t have to be an outright system crash. Take a slow-performing email system. Does that constitute an incident? Using our definition, you bet it does! Slow emails mean slower response to customer service enquiries, delayed reactions to requests for proposals (RFPs), slowed product development, and just about every activity your business engages in for profit.

IR is your process for responding to these incidents (and incidents are different from problems, which we will discuss later). Successful IR – which means that it’s both fast and effective – results in improved worker and process efficiency, higher productivity, and, ultimately, higher revenues for the business. It really is a mission-critical operation.

7 best practices for superb incident response

Here are seven best practices that will fine-tune your IR team to make it top-performing.

1. Communicate, communicate, communicate

There has historically been a communication chasm between IT and the rest of the organization. This creates problems when you try to deliver great IR, because many, if not most, of your incidents will be reported by users. They must have an easy way to report, so that you hear about incidents as soon as possible. You also need to keep them informed in real time as you resolve the incident.

By implementing a user-friendly process, you establish trust, encouraging users to work more closely with you. This collaboration is essential for future incidents.

Open up multiple channels to let users raise tickets easily. For example, they should be able to alert the IR team via email, chat, a portal, or an enterprise social network like Yammer.
Create self-service mechanisms so users can solve the easy incidents. Make self-service easily accessible and educate users about the benefits of self-help or using the knowledge base to resolve issues independently.

As the IR team works on fixing the incident, it’s essential to keep everyone updated on the progress in real time. Two pieces of information should be prominently displayed at all times: the incident status (the current resolution state, including the estimated time of completion) and the incident priority (how important it is to resolve this incident relative to other incidents).

Automation can help by sending automatic updates throughout the lifecycle of major incidents. Clear and visible notifications will also prevent users from raising duplicate tickets and overloading the help desk. Even if there’s nothing new to report, tell your stakeholders so on an hourly or half-hourly basis. And have a dedicated line to respond to major incidents immediately and offer support to anyone affected.
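
As a rough illustration of the automated updates described above, here is a minimal Python sketch. The names IncidentUpdate, format_update and next_update_due are hypothetical, as are the "P1"-style priority labels and the 30-minute default cadence; your ticketing tool will have its own fields and scheduling. The point is only that every message leads with status and priority, and that an update goes out even when nothing has changed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IncidentUpdate:
    """Hypothetical shape of one stakeholder update."""
    incident_id: str
    status: str                                   # current resolution state
    priority: str                                 # e.g. "P1"; labels are illustrative
    estimated_completion: Optional[datetime] = None
    note: str = ""


def format_update(update: IncidentUpdate) -> str:
    """Render an update that always leads with status and priority."""
    if update.estimated_completion is not None:
        eta = update.estimated_completion.strftime("%Y-%m-%d %H:%M")
    else:
        eta = "not yet known"
    # Say so explicitly when there is nothing new to report.
    note = update.note or "No change since the last update."
    return (
        f"[{update.incident_id}] Status: {update.status} | "
        f"Priority: {update.priority} | ETA: {eta}\n{note}"
    )


def next_update_due(last_sent: datetime, interval_minutes: int = 30) -> datetime:
    """Fixed cadence: the next update is due whether or not anything changed."""
    return last_sent + timedelta(minutes=interval_minutes)
```

A scheduler or chat integration could call format_update each time next_update_due passes; how the message is actually delivered (email, chat, portal banner) is left out on purpose.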

2. Adopt DevOps Processes

Before DevOps became mainstream, the IR team was basically on its own. The IR team, rather than the people who had actually built the systems, was responsible for all incidents. There was no feedback loop to the developers on how to fix repetitive interruptions to a particular application, for example. There was very little communication at all between the people who built the systems and the ones responsible for fixing them when things went wrong. Indeed, one reason that DevOps was created was to eliminate these organizational silos. This is essential because of the complexity of today’s systems. They are all interconnected, and what affects one is likely to affect others.

With a DevOps structure in place, developers do a better job in building their systems, because they now know they must also support them. No more throwing problems over the wall for another group to worry about. IR teams have support, and, typically, if DevOps is done right, clear documentation of how to keep complex systems up and running.

3. Sense when to “swarm”

Although most businesses have a “tiered” structure for dealing with incidents – Tier 1 is the help desk, Tier 2 involves application specialists, and Tier 3 are generally the system uber-experts and developers – you don’t want to universally enforce this structure when solving major incidents. You want to give your team the freedom to “swarm” when necessary.

This is usually necessary when an issue has a huge business impact. In such cases, you want to deviate from normal tiered IR processes. Swarming replaces that structure with a model of networked collaboration. It originated at Cisco, which wrote about it in its 2008 white paper, “Digital Swarming.” The concept was subsequently adopted by the Consortium for Service Innovation, and developed into a vision entitled “Intelligent Swarming.”

The general idea behind swarming is that instead of escalation, you bring everyone who might be able to help solve an incident into the IR team at the same time. There they brainstorm and bounce ideas off each other, and in general use the group dynamic to come up with fresh and innovative solutions to difficult IR issues.

Core principles of swarming include:

The “tiers” of support are eliminated
There is no escalation from one group to another. Everyone who needs to be on the team is there from the beginning
The case should be given directly to the person or persons most likely to be able to resolve it
The person who takes the case is the one who sees it through to resolution.
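
To make the principle of giving the case directly to the people most likely to resolve it a little more concrete, here is a toy Python sketch. It assumes, purely for illustration, that each case carries a set of topic tags and each responder a set of skill tags; the function name pick_swarm and the overlap-count scoring are our assumptions, not part of any swarming standard or tool.

```python
from typing import Dict, List, Set


def pick_swarm(case_tags: Set[str],
               responders: Dict[str, Set[str]],
               size: int = 3) -> List[str]:
    """Return up to `size` responders ranked by skill overlap with the case.

    Responders with no overlapping skill are left out entirely, so nobody
    is pulled into the swarm just to fill a seat.
    """
    scored = [(len(case_tags & skills), name) for name, skills in responders.items()]
    relevant = [(score, name) for score, name in scored if score > 0]
    # Highest overlap first; ties are broken alphabetically for stable output.
    relevant.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in relevant[:size]]
```

In practice the matching would draw on richer signals, such as who resolved similar cases before, but even this crude version skips the tier-by-tier handoff.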

4. Implement a Don’t-Let-It-Happen-Again policy

You should also take care not to be putting out the same fires over and over again. This means knowing the difference between IR and problem management. IR takes care of getting things back to normal, even if that means only a temporary fix. Problem management is when you find out the root cause of the incident, and fix it.

Note that you can never eliminate incidents entirely; that isn’t realistic. However, effective problem management spares you from fixing the same problem over and over.

5. Get the problem statement and priority right

Probably the single most important thing you can do is understand and articulate what the incident involves. This is called incident classification, but you need to go beyond putting the incident into some basic category and specify the problem statement accurately and precisely.

This should include such parameters as the system(s) impacted, the geographic location, how many internal users are impacted, and what the specific impact on business operations is.

Only when you have a clear problem statement can you set priorities. Proper classification helps in better troubleshooting and improving the resolution time. Then, prioritization ensures that the most business-critical issues are addressed first.
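
One way to picture this is as a structured record plus a sort. The Python sketch below is only an illustration: the ProblemStatement fields mirror the parameters listed above, while the 1-to-5 business_impact scale and the priority_score weighting are assumptions we made up for the example, not a recommended formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProblemStatement:
    """Hypothetical record of a precisely stated incident."""
    summary: str
    systems: List[str]        # system(s) impacted
    location: str             # geographic location
    users_impacted: int       # internal users affected
    business_impact: int      # 1 (minor) to 5 (severe); the scale is an assumption


def priority_score(ps: ProblemStatement) -> int:
    """Toy score: business impact dominates, user count only breaks ties."""
    return ps.business_impact * 1000 + min(ps.users_impacted, 999)


def prioritize(statements: List[ProblemStatement]) -> List[ProblemStatement]:
    """Order incidents so the most business-critical come first."""
    return sorted(statements, key=priority_score, reverse=True)
```

With these made-up weights, an order system outage affecting 50 users (impact 5) sorts ahead of slow email affecting 400 users (impact 3), which is the behavior you want when business impact, not headcount, drives priority.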

6. Encourage a no-blame culture

This is essential. Rather than looking to point fingers when something goes wrong, either in the response itself or in the underlying issue with a system, focus on the problem and on finding the true root cause, whatever that might be. A “blame-and-shame” culture does you no good, and can even slow down IR because people are so afraid of making mistakes.

7. Set the right KPIs and improve them

Key performance indicators (KPIs) are incredibly important because they measure how you’re doing, and give you a quantitative yardstick for seeing whether you are improving. However, be careful about your KPIs. Some give a false idea of how well your IR team is performing and can cause you to prioritize the wrong things. For example, first call resolution (FCR), a common metric, measures how many incidents are resolved on the first call. But chasing that number can lead to hasty decisions and actions when service quality is more important.

Therefore, set up realistic metrics and measure them for constant improvement. Here are some suggested KPIs to track:

Incident volume (per issue category, priority, status, requester, etc.)
Mean time to resolution
Mean time to respond
SLA %
Incidents resolved without escalation
Average cost per incident
Incident reopen rate
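
Several of these KPIs fall out of a handful of timestamps and flags per incident. The Python sketch below shows one possible calculation; the IncidentRecord fields and the kpi_summary function are hypothetical, and we assume both response and resolution times are measured from the moment the incident was opened. Average cost per incident is left out because it depends on cost data that varies by organization.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, List


@dataclass
class IncidentRecord:
    """Hypothetical per-incident data exported from a ticketing tool."""
    opened: datetime
    first_response: datetime
    resolved: datetime
    sla_met: bool
    escalated: bool
    reopened: bool


def _minutes(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


def kpi_summary(records: List[IncidentRecord]) -> Dict[str, float]:
    """Compute a few of the suggested KPIs over a batch of incidents."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one incident record")
    return {
        "incident_volume": n,
        "mean_time_to_respond_min": mean(_minutes(r.opened, r.first_response) for r in records),
        "mean_time_to_resolution_min": mean(_minutes(r.opened, r.resolved) for r in records),
        "sla_pct": 100 * sum(r.sla_met for r in records) / n,
        "resolved_without_escalation_pct": 100 * sum(not r.escalated for r in records) / n,
        "reopen_rate_pct": 100 * sum(r.reopened for r in records) / n,
    }
```

Running the same summary per issue category or priority, rather than over everything at once, gives you the breakdowns suggested in the first item of the list.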

Conclusion: Benefits of effective incident management

We all know the results of poor IR: the business suffers. By contrast, the benefits of doing IR right are manifold. You have smooth business operations. You achieve improved efficiency and productivity within the IT team as well as across the organization. You have much higher user satisfaction as you maintain your SLAs. And, as you get better at IR, you can begin proactively spotting potential major incidents and preventing them before users or customers ever report them. That’s a big win-win-win.

About Kepner-Tregoe

Kepner-Tregoe has been the industry leader in problem-solving and service-excellence processes for more than 60 years. The experts at KT have helped companies raise their level of incident- and problem-management performance through tools, training and consulting – leading to highly effective service-management teams ready to respond to your company’s most critical issues.