When a complex problem puts safety or customers at risk, when time is running out or costs are mounting, root cause analysis must be accurate and fast. For decades, many nuclear plants have used the systematic approach to root cause analysis developed by Kepner-Tregoe. The potentially high stakes of troubleshooting at nuclear power plants provide lessons for any industry. So we asked a group of expert troubleshooters from several nuclear and fossil power generators to identify the actions that improve the speed and accuracy of root cause analysis. While their experience is specific to their industry, the best practices of effective troubleshooting apply wherever the stakes are high.
Here are four key actions gleaned from these nuclear industry troubleshooters that are relevant to any industry. They can make the difference between taking shots in the dark and hitting the bull’s-eye on the first try.
1. Think first, act later.
Every troubleshooter has heard, “Do something. I don’t care what, just do something.” A senior system engineer recounted how much trouble the shot-in-the-dark approach can create. “At one time,” he explained, “our troubleshooting and root cause analysis consisted of determining every possible way a problem could have been caused and then physically dispositioning each one. We did find and fix most problems with this method, but it was very time-consuming and expensive.”
He explained that after select employees from operations, maintenance, training and engineering went through a train-the-trainer program, they began teaching root cause analysis methods back at the plant. Systematic root cause analysis quickly proved its value when a series of generator incidents threatened to force a reactor shutdown, at a cost of $250,000 a day or more, if the problem continued. Despite pressure to get the generator back in service, the troubleshooting team systematically specified the problem and looked for significant differences and changes that had occurred during some but not all generator incidents. They identified a probable cause, tested it against the problem specification and got the generator back online without incident. The new “think first, act later” approach to troubleshooting paid off.
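For readers who think in code, the idea of looking for differences between the runs that had the problem and the runs that did not, then testing a possible cause against that record, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the run records, the condition names and both helper functions are hypothetical, and the sketch is not the Kepner-Tregoe process itself or the plant's actual data.

```python
# Hypothetical run log: the conditions present in each generator run and
# whether the incident occurred. All names and values are made up.
runs = [
    {"conditions": {"new_seal", "night_shift", "high_load"}, "incident": True},
    {"conditions": {"new_seal", "day_shift", "high_load"}, "incident": True},
    {"conditions": {"old_seal", "night_shift", "high_load"}, "incident": False},
    {"conditions": {"old_seal", "day_shift", "low_load"}, "incident": False},
]


def distinctive_conditions(runs):
    """Conditions present in every incident run and in no clean run."""
    incident_sets = [r["conditions"] for r in runs if r["incident"]]
    clean_sets = [r["conditions"] for r in runs if not r["incident"]]
    if not incident_sets:
        return set()
    common_to_incidents = set.intersection(*incident_sets)
    seen_in_clean_runs = set().union(*clean_sets)
    return common_to_incidents - seen_in_clean_runs


def explains(condition, runs):
    """A possible cause passes only if it is present exactly when the
    incident occurred: there in every incident run, absent otherwise."""
    return all((condition in r["conditions"]) == r["incident"] for r in runs)


candidates = distinctive_conditions(runs)
print(candidates)
print(explains("new_seal", runs), explains("high_load", runs))
```

In this made-up log, a condition such as high load is ruled out because it also appears in a run with no incident. Ruling out causes on paper this way is what spares a team from physically checking every possibility.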
2. Resolve one problem at a time.
A major obstacle to successful problem solving under time pressure is failing to identify the one problem that needs to be solved. “Often a system can remain operational even if there are several ongoing problems within it,” said an engineer experienced in facilitating root cause analyses. “Then along comes a problem that disables the entire system. Under time pressure, the goal isn’t to solve all of these problems. It’s to identify and solve the one that caused the system to fail.” Before problem analysis begins, team members must agree on an accurate, specific statement of a single, top-priority problem. This provides focus when the pressure is on and time is critical.
3. Use one process.
When the stakes are high, emotions and adrenaline run high too. Often, when a troubleshooting team first assembles, ideas, especially about cause, are tossed out, shot down and sometimes brought up again. Without a shared systematic process for tackling problems, the team can go in circles indefinitely, wasting time and money while struggling to get a handle on the situation.
When everyone on a troubleshooting team uses the same process, order is quickly restored. Information is gathered in an orderly, step-by-step sequence. Everyone on the team is on the same page: gathering information, developing possible causes, testing those causes to determine which is most probable and, finally, verifying the true cause.
One engineer who is often called on to facilitate under emergency conditions believes the key to success in such circumstances is being firm without being dictatorial. “Let them talk for a while,” he advised. “They’ll use technical jargon, jump to cause, defend their pet theories and try to impress one another with their content knowledge. When they’ve gotten all that out of their system, then you need to lead them down the process road.”
4. Gather the right people.
Organizations often approach root cause analysis with a core team of troubleshooters supplemented by people with special expertise. The expert troubleshooters from power plants agree that assigning the right people is critical when the stakes are high.
“Sometimes managers want to assign the analysis to a separate group of people because the knowledgeable individuals are too busy managing the problem,” noted one engineer. “But if this second group doesn’t have the facts, or if the event is still unfolding, they really can’t do a proper job of root cause analysis. It often takes less than one hour to create a problem specification and test possible causes—if the right people are assigned. Even for busy or critical personnel, this is the best investment of an hour they can possibly make.”