“WHEN WILL YOU GET SERVICE RESTORED?” shouts the Senior Vice President again, cutting across the technical discussions on the bridge call for the second time in five minutes.
“The service restart will take forty minutes,” replies the Senior Technical Lead.
“YOU HAVE TO GET IT BACK IN TEN!”
Service Management Under Pressure
Service Management is always under pressure. During business-damaging incidents, when critical systems go down and interrupt minute-to-minute revenue streams, even briefly, the financial losses can be huge.
Besides such major incidents, pressure also builds up from other areas:
- Increasing complexity of the environment
- Interdependency of the products being serviced
- Increased availability of both good and bad data
- Economic pressure, where less is more
- Customer expectations: 100% uptime — why not?
- Personal motivation
- Need for consistency and quality, as these give the business a competitive advantage
There is a tendency to put more technology in place, to send people to as much technical product training as possible, or both. Training is seen as the key driver of certainty and confidence in a job, and more technical training is often the first action that management reaches for. Unfortunately, technical training is only valid for a certain domain and for a short period of time. Is it still possible to keep up with the speed of technology change? Clearly, someone in a support function needs to understand the technology and be on hand when there is an incident, but other things are also essential in order to maximise effectiveness.
How to respond to these pressures?
We know from analysis and experience that the normal first response — jump to cause, retrieve information from your existing database and apply a fix — is often the correct action to take. Technical people rely on their knowledge and experience to solve simple issues. However, if this fails, following the same course of action — sticking with ‘fast thinking’ a second time around — can cost money and reputation. It is then time for some ‘slow thinking’, to gather the evidence and calmly work through the data.
There are two other crucial factors interfering with good sense and a considered response under pressure — biology and psychology.
Four Drivers That Can Help to Keep You on Top
Given the external pressures, consultants at Kepner-Tregoe have researched and defined four key drivers of effective performance.
Many people believe that coping with emergency situations is a matter of in-built character, somehow genetically determined. But when we look at firemen and emergency rescue teams, they consistently approach an incident with a rehearsed and well-defined strategy.
The four aspects that relieve biological pressure and provide for a calm theatre and workplace to ensure a good quality outcome are:
1. Predictable Performance: install the right software to operate the brain hardware
When confronted with a crisis, people look for certainty. We fall back on the internal database stored in our brain and are likely to follow our intuition. This is recommended for familiar situations, but not when faced with new or fairly new problems [1]. We tend to make decisions based upon pattern recognition and emotional tagging [2].
Pattern recognition means that when faced with a new situation we make assumptions based on prior experiences and judgement. In short, we jump to conclusions.
Emotional tagging is the process by which emotional information attaches itself to the thoughts and experiences stored in our memories. We are likely to be biased in what we do by our experience, rather than guided by analyzing the facts.
Predictability depends on the frameworks we use. To improve a support organization and increase the quality and consistency of the support its engineers give, individual frameworks need to be optimized. Every individual needs to use the same ‘brain operating software’ and speak the same problem-solving language. Using the same software enables quick handover and efficient understanding of the current situation. A shared framework also establishes trust in data quality, which saves engineers from redoing a previous engineer’s work.
2. Feedback: Put in place a mentoring system to ensure quality input
One aspect of Incident Management performance is the feedback obtained from a post-mortem review, also known as the Major Incident Report. Often, we find these reviews focused on who called whom, and not enough on how the incident progressed and whether quality data was retrieved at the right moment.
Every incident has a natural process, and it mustn’t be handled by trial and error, which can lead to even more dangerous effects. The first step should always be to keep a record of issues, their impact and their consequences visible to all, along with the investigation and resolution actions. Real-time coaching gives engineers live, moment-by-moment guidance at the transition points (arrows in the diagram above) in the lifecycle of an incident. Instant reviews of the quality of the data make sure that there is quality in every step, with less attention paid to time or speed. This ensures that the next stage in the handling of the incident gets good quality data and guides the team to the right conclusion.
Finger pointing at the guilty parties after failing is sometimes called ‘feedback’, but is really a humiliating de-motivator. In providing real-time feedback, engineers will cement ‘the operating system’ on the job, enabling them to put in place the new behavior more effectively every time. Understanding each step of the process gives them the confidence they need to act quickly: “I have a plan and know the next logical step in the problem-solving process and know how to do it!”
3. Infrastructure: Make sure stored information can be retrieved instantly
You might think of common ticketing systems as a solution. However, many miss key points: they focus on where to store documented cases or whether the customer is contracted for service, rather than on visualizing incidents and problems. For example, most help desk software does not contain an “exact time” box: “when was the equipment or service last known to be running correctly?” The only date in the system is the date the ticket was raised. What impact would it have if we also recorded the exact time at which the problem was first seen?
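To make the “exact time” idea concrete, here is a minimal sketch of a ticket record that stores the last-known-good time and the first-observed time alongside the date the ticket was raised. The class `IncidentTicket`, its field names and the sample timestamps are all hypothetical, chosen for illustration only; they are not taken from any particular help desk product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IncidentTicket:
    """Hypothetical ticket record; all field names are illustrative assumptions."""
    summary: str
    raised_at: datetime                          # when the ticket was logged
    last_known_good: Optional[datetime] = None   # when the service was last seen working
    first_observed: Optional[datetime] = None    # when the problem was first seen

    def search_window(self) -> Optional[timedelta]:
        """Span in which the failure must have started, if both times are known."""
        if self.last_known_good is None or self.first_observed is None:
            return None
        return self.first_observed - self.last_known_good

    def reporting_delay(self) -> Optional[timedelta]:
        """Time between first observation and the ticket being raised."""
        if self.first_observed is None:
            return None
        return self.raised_at - self.first_observed


# Invented sample values, for illustration only.
ticket = IncidentTicket(
    summary="Payment gateway rejecting transactions",
    raised_at=datetime(2024, 3, 1, 10, 30),
    last_known_good=datetime(2024, 3, 1, 9, 40),
    first_observed=datetime(2024, 3, 1, 9, 55),
)
print(ticket.search_window())    # 0:15:00
print(ticket.reporting_delay())  # 0:35:00
```

With only the raised date, the investigation starts from 10:30. With the two extra times, the team knows that whatever changed did so in a fifteen-minute window that closed more than half an hour before the ticket existed.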
Most current systems show a chronological cascade of data, which does not let us see immediately what is going on and what information is missing. With a process in place and good quality templates for information gathering, the engineering team will improve the visibility of the incident [3]. They are no longer working in the foggy depths of a cloud (although they might be working with cloud-enabled software).
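One way a template can show what is missing at a glance is to check a record against a fixed list of questions. The sketch below assumes a hypothetical set of template fields; the article does not specify the actual template, so the names in `TEMPLATE_FIELDS` are placeholders.

```python
from typing import Dict, List, Optional

# Hypothetical template questions; the real template is not given in the article.
TEMPLATE_FIELDS: List[str] = [
    "what_is_affected",
    "what_is_wrong",
    "where_seen",
    "first_seen_at",
    "extent",
]


def missing_fields(record: Dict[str, Optional[str]]) -> List[str]:
    """Return the template fields that are absent or blank, in template order."""
    return [
        name for name in TEMPLATE_FIELDS
        if not (record.get(name) or "").strip()
    ]


# Invented sample record: two answers filled in, one left blank, two absent.
record = {
    "what_is_affected": "Order entry service",
    "what_is_wrong": "Requests time out",
    "where_seen": "",
}
print(missing_fields(record))  # ['where_seen', 'first_seen_at', 'extent']
```

Unlike a chronological log, the output says directly which questions the next engineer still has to answer.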
4. Channels: Show certainty about the strategy
The person shouting for his system needs to get an answer. But what is he actually looking for? The standard text messages, repeated every hour with the same text, do not give confidence to the business about the actions being taken. If the support organization can provide the key stakeholders with quality information about where they are in the troubleshooting process, their emotional states will be much more relaxed and clear thinking can be facilitated. The engineer can rely on the agreed framework and he has a strategy to work from. “I understand your concern, and if I knew the solution, I would immediately put it in place. What I first need to do is to gather some information, to check the symptoms and observations made by the team, and come to a factual understanding.”
It is visibility of a well-managed incident in this cloud of text messages that management expects. What is the progress towards the problem description? How many possible causes have been evaluated? Again, we can report the activity currently underway simply because we have the framework and stage gates.
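As a rough illustration of reporting from stage gates, the sketch below turns the current position in the process into a one-line stakeholder update. The stage names in `STAGES`, the class `IncidentProgress` and the thirty-minute update interval are assumptions made for this example; the article does not list the actual stages.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stage gates, in order; placeholders for the real framework.
STAGES: List[str] = [
    "describe problem",
    "identify possible causes",
    "evaluate causes",
    "confirm cause",
    "restore service",
]


@dataclass
class IncidentProgress:
    """Hypothetical progress record for a single incident."""
    current_stage: str
    causes_identified: int = 0
    causes_evaluated: int = 0
    next_update_minutes: int = 30  # assumed reporting interval

    def status_line(self) -> str:
        """Build a short update that states the stage and the work done so far."""
        step = STAGES.index(self.current_stage) + 1  # raises ValueError if unknown
        return (
            f"Stage {step} of {len(STAGES)}: {self.current_stage}. "
            f"{self.causes_evaluated} of {self.causes_identified} "
            f"possible causes evaluated. "
            f"Next update in {self.next_update_minutes} minutes."
        )


progress = IncidentProgress("evaluate causes", causes_identified=4, causes_evaluated=1)
print(progress.status_line())
# Stage 3 of 5: evaluate causes. 1 of 4 possible causes evaluated. Next update in 30 minutes.
```

The message changes whenever the team passes a gate or rules out a cause, which is what distinguishes it from a standard text repeated every hour.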
KT was on site when a P1 incident kicked off in the network operations of a mobile phone company. The Incident Manager took control, but just fifteen minutes into the call was observed with nothing to do. “What is happening?” “Fred is working the P1; he’s at home.” “What is actually going on?” “No idea — he’ll let us know when he’s done.” Twenty-five minutes later, Fred called in with “I’ve taken a look and it’s definitely nothing to do with the stuff I support.” Control? Lost. Confidence? None.
The four drivers enable the organization to create a calm theatre in which to solve the problems permanently. It is a theatre where expectations are set, frameworks are in place to guide clear thinking and support is ready when needed.
Good decisive actions are always preceded by clear thinking.
[1] Kahneman, D. 2011. “Thinking, Fast and Slow.” Penguin Group.
[2] Campbell, A., Whitehead, J., Finkelstein, S. 2009. “Why Good Leaders Make Bad Decisions.” Harvard Business Review, February.
[3] Gawande, A. 2009. “The Checklist Manifesto: How to Get Things Right.” Metropolitan Books.