Put an end to firefighting with Proactive Problem Management

By Tim Roberts, Christoph Goldenstern

In any organization, whether in service delivery or in operations, there is typically a system for reviewing major incidents once the incident has initially been dealt with. It may be called a Post Incident Review (PIR) or an After-Action Review (a term often applied in the manufacturing environment). The organization may use more than one approach, and different organizational silos may have different methodologies.

Problem management can be reactive or proactive

Reactive problem management aims to find and eliminate the cause of known incidents. Proactive problem management takes a more holistic view: it looks beyond the individual incident and prevents future incidents by identifying and eliminating the systemic root causes that keep producing similar types of incidents.

Proactive problem management accounts for the incident and the cause(s) that triggered it, as well as the broader conditions that made the incident possible, allowing all contributing factors to be analyzed in more depth. A crucial step once this analysis is complete is to carefully place preventive actions that are designed to mitigate specific incidents known to create significant negative impact.

If you review the problems that give your organization the greatest aggravation, you’re likely to find they are recurring incidents that are usually addressed in the short term with a workaround or “patch” but never fully resolved. The team never gets to the “structural causes” and hence never puts measures in place to mitigate the situation in the long term.
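As a rough illustration of how such recurring incidents might be surfaced from an incident log, the sketch below groups records by a category and flags categories that keep coming back without ever receiving a permanent fix. The record fields, the category names and the `min_count` threshold are all assumptions made for this example, not part of any Kepner-Tregoe tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    incident_id: str
    category: str    # hypothetical grouping key, e.g. the affected service
    resolution: str  # assumed values: "workaround" or "permanent_fix"


def recurring_unresolved(records, min_count=2):
    """Categories seen at least min_count times that never got a permanent fix."""
    counts = Counter(r.category for r in records)
    fixed = {r.category for r in records if r.resolution == "permanent_fix"}
    return sorted(c for c, n in counts.items() if n >= min_count and c not in fixed)


# Invented sample data for illustration only.
records = [
    IncidentRecord("INC-1", "payroll-batch", "workaround"),
    IncidentRecord("INC-2", "payroll-batch", "workaround"),
    IncidentRecord("INC-3", "vpn-gateway", "permanent_fix"),
    IncidentRecord("INC-4", "payroll-batch", "workaround"),
    IncidentRecord("INC-5", "vpn-gateway", "workaround"),
]

print(recurring_unresolved(records))  # ['payroll-batch']
```

A list like this is only a starting point: it suggests where a proactive review is likely to pay off, not what the structural cause is.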

A benefit of proactive problem management is the ability to focus on eliminating these systemic, recurring incidents. This methodology is typically applied:

During a postmortem review of major outages/incidents
During live facilitation addressing major incidents
As a component of a Continuous Service Improvement program

Organizations that take an active approach to Continuous Service Improvement will find that this process can identify several opportunities for improving both business processes and technology. After a major incident or outage, we highly recommend conducting a postmortem review that involves the major stakeholders.

What Is Incident Mapping?

At Kepner-Tregoe, Proactive Problem Management is built around a core tool we call “Incident Mapping” – a means of visualizing the incident and mapping out “the event”.

The left side of Figure 1 (see PDF version) provides a snapshot of typical incident review documentation used to facilitate a live meeting to understand a major incident. It is not unusual for this document to run to five or six pages of closely packed text, recording observations, actions and communications over the entire time frame of the incident. The result is a jumble of detailed information that makes it difficult to separate the relevant from the irrelevant and the important from the unimportant.

On the right side of Figure 1, the entire five-page narrative describing the incident is represented in a visual flowchart commonly known as an Incident Map.

It should be readily apparent that the map is much simpler and more practical for the individuals managing the incident. It also is a far superior way to report the incident to senior management who rarely have the time or tolerance for dense, detailed documentation.

When effectively done, this visualization is the product of a process that systematically describes the incident in terms of:

The main problem and its impact
All cause-effect chain(s) that triggered it
The circumstances contributing to the incident’s effect: why the impact was or wasn’t as bad as it could have been
Barriers that were breached: measures that could have interrupted the cause-effect chain, and why they didn’t work
Actions that were taken
Actions that are proposed, chosen or implemented to prevent the issue’s recurrence
Actions to mitigate risks from suggested changes (so as to not cause new, separate incidents!)
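The elements listed above can be thought of as the fields of a simple record. The following is a minimal sketch of one possible representation, assuming Python 3.9 or later; the class names, field names, status values and sample incident are all hypothetical and are not taken from any Kepner-Tregoe software.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    description: str
    caused_by: list["Cause"] = field(default_factory=list)  # deeper links in the chain


@dataclass
class Barrier:
    description: str
    why_it_failed: str


@dataclass
class Action:
    description: str
    owner: str
    status: str  # assumed values: "taken", "proposed", "chosen", "implemented"
    risk_mitigation: str = ""  # guards against the change causing a new incident


@dataclass
class IncidentMap:
    main_problem: str
    impact: str
    causal_chains: list[Cause] = field(default_factory=list)
    circumstances: list[str] = field(default_factory=list)
    breached_barriers: list[Barrier] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)

    def unprotected_actions(self) -> list[Action]:
        """Preventive actions that have no risk mitigation recorded yet."""
        return [a for a in self.actions
                if a.status != "taken" and not a.risk_mitigation]


# Invented example incident for illustration only.
incident = IncidentMap(
    main_problem="Order portal unavailable",
    impact="Customers could not place orders for two hours",
    causal_chains=[Cause("Database disk full", [Cause("Log rotation disabled")])],
    circumstances=["Outage began outside peak trading hours"],
    breached_barriers=[Barrier("Disk usage alert", "Threshold set too high")],
    actions=[
        Action("Freed disk space", "on-call engineer", "taken"),
        Action("Re-enable log rotation", "platform team", "proposed"),
        Action("Lower alert threshold", "monitoring team", "chosen",
               "Trial on one host first"),
    ],
)

print([a.description for a in incident.unprotected_actions()])
# ['Re-enable log rotation']
```

Keeping the risk mitigation next to each proposed action makes it easy to see which recommendations still need to be protected before they are implemented.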

Figure 2 (see PDF) provides a visualization of the incident mapping process. It is a stepwise process that begins with a description of the primary incident, continues with the identification of the causal chain, and culminates in the development and protection of recommendations to help prevent recurrence. Preparation is a crucial step in this process (shown as Step 0). A key component of this step is to ensure that individuals with the proper subject matter expertise are available to contribute.

The sequence of steps can be linked to the tools that will be familiar to individuals who have worked with Kepner-Tregoe (identified on the right) or read about its methodologies. Incident mapping is, above all, a form of highly visual Situation Appraisal. If necessary, other problem-solving and decision-making tools can be used within the context of this methodology/framework.

How Incident Mapping Is Done

This analysis technique (incident mapping) is used widely in IT and Operations departments across a broad array of industries. It has multiple benefits including a quicker, clearer and more productive post-incident review (PIR) with the outputs producing a greater impact on the organization. A PIR might involve 10 to 12 people sitting around a table, typically for an hour or two. It is likely that they all were involved with the original incident in one way or another, and they are reviewing what went on, why it happened, what action was taken, and what can be done to prevent a recurrence.

Existing documentation should be reviewed prior to this review. This should include any actions or restorations that occurred during the incident itself. This data is typically recorded by someone within the support organization, often a major incident manager or a problem manager who was involved in the incident.

The traditional text-heavy version of the incident review tends to be a chronological story of how the problem unfolded. The facilitator must take the group through this document line by line to secure agreement on the chronology and the description. An incident map, on the other hand, provides a dynamic way to run the session, breaking the information down as follows:

Identifying a list of causal chains
Defining a set of items that should have prevented the incident
Showing the circumstances that influenced severity
Clearly identifying owners of all future actions

Many find the mapping process shown in Figure 3 more concise than a text representation, because it shows the incident path as a series of shapes and colors that represent details and events, with arrows that show how they relate to one another. This step-by-step visualization tends to invite quicker consensus than can be achieved through text descriptions.

By using this map, the team can work its way from the Primary Event – the event that made the support team aware an incident had occurred – to the underlying causes, then to identifying barriers, systems or measures designed to break the links between these conditions. From there the team moves into deeper analyses or decision making to investigate an unknown cause, or select certain actions.
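The walk from the Primary Event back to its underlying causes can be sketched as a traversal of a small cause-effect graph. In the example below, the `caused_by` dictionary and its entries are invented for illustration; entries with no recorded cause of their own are treated as candidate underlying causes, or as unknowns that need the deeper analysis described above.

```python
from collections import deque


def underlying_causes(caused_by, primary_event):
    """Walk breadth-first from the primary event and return the causes
    that have no further cause recorded beneath them."""
    seen = {primary_event}
    queue = deque([primary_event])
    leaves = []
    while queue:
        node = queue.popleft()
        causes = caused_by.get(node, [])
        if not causes and node != primary_event:
            leaves.append(node)
        for cause in causes:
            if cause not in seen:  # guard against chains that share a cause
                seen.add(cause)
                queue.append(cause)
    return leaves


# Hypothetical map: each effect points to the causes that produced it.
caused_by = {
    "Customers cannot log in": ["Auth service unavailable"],
    "Auth service unavailable": ["Certificate expired", "No failover configured"],
    "Certificate expired": ["Renewal reminder sent to a departed employee"],
}

for cause in underlying_causes(caused_by, "Customers cannot log in"):
    print(cause)
# No failover configured
# Renewal reminder sent to a departed employee
```

Each link traversed on the way is also a place to ask which barrier should have broken it.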

Organizations have begun incorporating proactive problem management and incident mapping into their post-incident or after-action reviews and, in some instances, into the management of their ongoing incidents. An example is a large and complex government agency in the UK where KT-certified program leaders, who teach and facilitate problem solving within the IT service organization, create a visual map on a whiteboard to document their analysis of an incident and assign follow-up actions. Individuals are always accountable for turning actions chosen into actions taken. This simple incident map provides the basis for management communication and is now the format that management expects to see when an incident is represented.

Increasingly this mapping methodology is being used in “Pre-Mortem” analysis, a type of proactive problem management that takes place prior to an incident. When a major change is being made, stakeholders ask themselves, “If we make this change, what could go wrong?” Pre-Mortem incident maps involve a lot of creative, out-of-the-box thinking. This kind of brainstorming can produce some rather large maps but the formatting provides a structure for an analysis that can be used to prevent or mitigate potential incidents.

A Bridging Technique

The incident map serves as a permanent record of an incident. If it is created during the management of a live incident, the incident map records actions taken at the time and may be updated several times as additional actions are taken and barriers are put in place. This can go all the way through to the conclusion of a post-incident review. The incident map serves as an audit of people’s thinking, so that anyone can understand the journey the incident management team took to reach a root cause and a resolution.

The use of shapes and colors need not follow one specific convention, as long as everyone in the room agrees on what each shape represents. The methodology is not dependent on any specific technology, although the use of software designed to facilitate incident mapping can add considerable value to the process. Examples include flowcharting tools like Microsoft’s Visio® and Causelink™, a software tool aligned to the KT methodology shown below in Figure 4 (see PDF).

The use of a dynamic software tool can help take cause-mapping to the next level by:

Helping to conduct facilitations in real time
Building and leveraging a knowledge database, able to relate data elements such as evidence, notes, tasks, and solutions to each cause
Enabling reporting, action tracking, workflow tracking and search
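To make the idea of relating data elements to each cause more concrete, here is a minimal in-memory sketch. It is not the API of Causelink or of any other product; the `KnowledgeBase` and `CauseRecord` names, their fields and the sample entries are assumptions chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CauseRecord:
    cause: str
    evidence: list = field(default_factory=list)
    notes: list = field(default_factory=list)
    solutions: list = field(default_factory=list)
    tasks: dict = field(default_factory=dict)  # task description -> done?


class KnowledgeBase:
    """Hypothetical store that attaches evidence, notes, tasks and solutions to causes."""

    def __init__(self):
        self._records = {}

    def record(self, cause):
        """Return the record for a cause, creating it on first use."""
        return self._records.setdefault(cause, CauseRecord(cause))

    def search(self, term):
        """Causes whose own text or attached items mention the term."""
        term = term.lower()
        return [
            r.cause for r in self._records.values()
            if any(term in text.lower()
                   for text in [r.cause, *r.evidence, *r.notes, *r.solutions, *r.tasks])
        ]

    def open_tasks(self):
        """(cause, task) pairs that are not yet marked done."""
        return [(r.cause, task)
                for r in self._records.values()
                for task, done in r.tasks.items() if not done]


# Invented entries for illustration only.
kb = KnowledgeBase()
cert = kb.record("Certificate expired")
cert.evidence.append("TLS handshake errors in gateway log")
cert.solutions.append("Renew certificate manually")
cert.tasks["Automate certificate renewal"] = False
kb.record("No failover configured").tasks["Design standby node"] = True

print(kb.search("tls"))  # ['Certificate expired']
print(kb.open_tasks())   # [('Certificate expired', 'Automate certificate renewal')]
```

A real tool would add persistence, user accounts and workflow, but the core relationship is the same: every piece of evidence and every follow-up action hangs off a specific cause on the map.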

This can be particularly relevant to industries and functions that are heavily compliance-driven and have ongoing requirements to provide documentation and audit trails for investigations.

The incident map enables the user to distill an incident down to key elements that can be easily digested by others, whether they were involved in the incident or not. It provides an effective vocabulary and a set of conventions for sense-making by seeing events in a causal chain visually.

Once mastered, incident mapping can be an effective bridging technique to take the organization from reliance on reactive problem management to a continuous improvement mindset.

Tim Roberts, Senior Consultant, Kepner-Tregoe
Tim has led international projects that support continuous improvement through the installation of business processes including Root Cause Analysis, Incident Mapping and Project Management. He has facilitated new strategy development globally at the board level, then designed and coordinated a global implementation with senior management. Tim’s industry focus is within the IT service space, in particular with private and public sector clients for whom IT support is a critical function.

Christoph Goldenstern, VP Innovation & Service Excellence, Kepner-Tregoe

Christoph is a consulting leader with 20+ years of experience helping organizations in the areas of strategy, operational and service improvement. As a member of KT’s executive leadership team and global VP of Strategy and Service Excellence, he is responsible for KT’s business strategy as well as its solutions for IT Service Management and Technical Support.

Tim Roberts (troberts@kepner-tregoe.com) is located in the United Kingdom while Christoph Goldenstern (cgoldenstern@kepner-tregoe.com) is at KT’s corporate offices in the USA. Reach out to them to learn more about how these processes could be a fit for your team.
