Monitoring for Recurring Problems: A Critical Aspect of Effective Operations

For years, progressive IT operations teams have addressed repetitive incidents by digging into the root cause and permanently resolving the underlying issue. This has improved service availability, because issues are fixed permanently rather than service simply being restored.

In many organizations, it takes significant effort to perform the level of analysis needed to find the root cause, resulting in a focus on only the bigger issues: the ones that are visible to external customers or could bring the business to its knees. What if it were possible to address recurring issues before they cause a failure? Structured problem analysis techniques combined with increasing the scope of event management and monitoring can help an organization be more effective in incident prevention.

Most monitoring efforts focus on identifying outages and performance degradation after they have begun. Predicting a condition that could lead to an incident, and resolving that condition permanently, can prevent the incident from ever occurring.

Common Monitoring Practices

There are many common areas organizations monitor:

Network and circuit status (up / down) and traffic
Load balancer operations
Server and virtual server functionality
Application performance
Security breaches
Data center environment (temperature, electric, etc.)

Event management, like proactive problem management, has traditionally been limited in scope because large volumes of information could not be correlated. Data aggregation and operational intelligence capabilities are now available in many toolsets, and they allow an organization to monitor virtually anything that produces collectable data. Consider the benefits of:

Aggregating normal patterns of data traffic over a network, such that a variation from these patterns can be detected through the use of artificial intelligence, identifying a potential service breach
Tracking application behavior against server memory, disk and CPU utilization to understand normal ranges, providing the ability to identify a potential impact due to a code change before it affects performance
Monitoring disk and table space usage so that growth in database size can be addressed before an impact is felt (in a virtual environment, this can be managed automatically, preventing any potential incidents from occurring)
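
To make the idea of "normal ranges" concrete, the following is a minimal sketch of baseline deviation detection. It substitutes a simple z-score test for the artificial intelligence tooling mentioned above, so it illustrates the principle rather than any particular product. The function name `find_deviations`, the threshold of 3 standard deviations and the sample memory figures are all assumptions made for illustration.

```python
from statistics import mean, stdev


def find_deviations(baseline, recent, z_threshold=3.0):
    """Return (index, value, z_score) for recent samples far from the baseline.

    `baseline` is a list of samples collected during normal operation;
    `recent` is the list of new samples to check against it.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)  # requires at least two baseline samples
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return [(i, v, float("inf")) for i, v in enumerate(recent) if v != mu]
    deviations = []
    for i, value in enumerate(recent):
        z = (value - mu) / sigma
        if abs(z) >= z_threshold:
            deviations.append((i, value, z))
    return deviations


# Hypothetical server memory utilization (%) before and after a code change.
baseline_memory = [41, 43, 40, 42, 44, 41, 42, 43, 40, 42]
after_release = [42, 44, 58, 61]

flagged = find_deviations(baseline_memory, after_release)
for index, value, z in flagged:
    print(f"sample {index}: {value}% is {z:.1f} standard deviations from normal")
```

In this sketch the last two samples are flagged, which is the kind of early signal that could prompt a review of the code change before users notice slower performance.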

The goal here is to collect as much information about the operational environment as sensors and monitoring tools can provide, then use artificial and operational intelligence tools to identify variations from expected results. In conjunction with a good event management tool, these variations can then be classified appropriately:

Critical: the variation indicates an outage of a critical system
Major: the variation indicates loss of a feature/function of a service
Minor: there is a performance or other degradation of functionality
Warning: no degradation or outage has occurred, but a threshold is being approached; immediate intervention might prevent an operational incident
Informational: a variation from normal operation has occurred, but it is not yet critical enough to cause a concern
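
The five levels above can be sketched as a simple classification rule. This is an illustrative assumption about how an event management tool might encode them, not the behavior of any specific product; the `Variation` fields and the `classify` function are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Alert levels from the list above, ordered so higher means more severe."""
    INFORMATIONAL = 1
    WARNING = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


@dataclass
class Variation:
    """Hypothetical facts a monitoring tool has established about a variation."""
    critical_system_down: bool = False
    feature_lost: bool = False
    degraded: bool = False
    near_threshold: bool = False


def classify(variation: Variation) -> Severity:
    """Pick the most severe level whose condition is met."""
    if variation.critical_system_down:
        return Severity.CRITICAL
    if variation.feature_lost:
        return Severity.MAJOR
    if variation.degraded:
        return Severity.MINOR
    if variation.near_threshold:
        return Severity.WARNING
    # A variation from normal was seen, but nothing above applies yet.
    return Severity.INFORMATIONAL


print(classify(Variation(near_threshold=True)).name)
```

Checking the conditions from most to least severe means a variation that meets several conditions at once is always reported at its highest level.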

Typically, Critical and Major alerts trigger formal (major) incident management procedures and, if the incidents are extensive or repetitive, ultimately lead to a root cause analysis and repair. This is reactive problem management, and it works well to eliminate costly and repetitive incidents. Addressing repetitive Minor, Warning and Informational alerts, however, provides the opportunity to correct the cause before the first incident ever occurs.

The challenge for many organizations lies in dealing with multiple issues at a time. When an organization is struggling to address critical and major issues in a timely manner, the thought of expanding scope is met with significant cynicism. Yet in today’s operational environments, reaching this level is critical to avoiding the cost of operational outages.

The key is in marrying structured problem-solving techniques with machine learning and artificial intelligence that record and categorize issues, so that IT engineers can focus their problem-solving efforts more quickly and with better data flowing into the analysis process. Kepner-Tregoe techniques, combined with an expanded monitoring program, can help an organization achieve this.

Getting Started

Achieving this is an iterative process.

Step 1: First, an operations organization needs to be able to react successfully to critical and major incidents. Wherever possible, automated responses need to be available to restore service. Only when automation fails should notification of the appropriate teams become necessary. Automated response can not only restore basic service more quickly, it also leaves more time to address the root cause and eliminate the bigger issue permanently (note: in some cases, an automated response that makes a change to mitigate the issue is the first step, and permanent resolution may be a longer-term goal).
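
The "automate first, notify only on failure" pattern in Step 1 might look like the sketch below. The `respond` function and its `remediate`, `is_healthy` and `notify` callables are hypothetical stand-ins for whatever restart scripts, health checks and paging integrations an organization actually has.

```python
def respond(alert, remediate, is_healthy, notify, max_attempts=2):
    """Try automated remediation first; notify a team only if it fails.

    Returns "restored" if the service is healthy after remediation,
    otherwise "escalated" after the appropriate team has been notified.
    """
    for _ in range(max_attempts):
        try:
            remediate(alert)
        except Exception:
            # A remediation step that raises counts as a failed attempt.
            continue
        if is_healthy(alert):
            return "restored"
    notify(alert)
    return "escalated"


# Usage with stand-in callables (assumptions for illustration only).
notified = []

outcome_ok = respond(
    {"service": "checkout"},
    remediate=lambda alert: None,   # e.g. restart a process
    is_healthy=lambda alert: True,  # the health check passes afterwards
    notify=notified.append,
)

outcome_failed = respond(
    {"service": "billing"},
    remediate=lambda alert: None,
    is_healthy=lambda alert: False,  # automation did not restore service
    notify=notified.append,
)

print(outcome_ok, outcome_failed, notified)
```

Only the second alert reaches a human, which is the point of the step: engineers spend their time on the issues automation could not resolve and on root causes.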

Step 2: Once the critical issues are "in control", data collected from minor alerts, warnings and informational alerts should be used to establish patterns. This is where operational intelligence and other automated analytic tools can help identify potential repetitive issues. While there may not be a need to address them immediately, they should be logged as problems, analyzed, and addressed with a temporary, automated band-aid that prevents them from leading to incidents. Where the band-aid is not successful, the appropriate teams should be notified to address the condition before significant incidents occur.
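
As a rough sketch of the pattern-finding in Step 2, low-severity alerts can be grouped by source and condition, and any group that recurs can be raised as a candidate problem record. The alert fields, the `recurring_patterns` function and the threshold of three occurrences are assumptions for illustration; real operational intelligence tools use far richer correlation.

```python
from collections import Counter

# Severities that do not trigger incident management on their own.
LOW_SEVERITIES = {"minor", "warning", "informational"}


def recurring_patterns(alerts, min_occurrences=3):
    """Return ((source, condition), count) pairs for repeated low-severity alerts."""
    counts = Counter(
        (alert["source"], alert["condition"])
        for alert in alerts
        if alert["severity"] in LOW_SEVERITIES
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_occurrences]


# Hypothetical alert history.
alerts = [
    {"source": "db01", "condition": "tablespace_above_80pct", "severity": "warning"},
    {"source": "web03", "condition": "memory_above_baseline", "severity": "informational"},
    {"source": "db01", "condition": "tablespace_above_80pct", "severity": "warning"},
    {"source": "db01", "condition": "tablespace_above_80pct", "severity": "warning"},
    {"source": "lb01", "condition": "pool_member_down", "severity": "major"},
]

candidates = recurring_patterns(alerts)
for (source, condition), count in candidates:
    print(f"log problem: {condition} on {source} seen {count} times")
```

Here the repeated table space warning on `db01` is the only pattern that crosses the threshold, so it would be logged as a problem and given an automated band-aid, such as extending the table space, while the cause is analyzed.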

Step 3: The third and final step is to look for a permanent solution to issues that have band-aids applied. This means determining the cause using structured problem analysis techniques, then permanently fixing those issues that make financial sense to resolve. It’s not necessary to resolve everything: if an automated response to a minor issue prevents an incident from occurring, the automation may be all that is needed.
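
One simple way to frame the financial test in Step 3 is to compare the one-time cost of a permanent fix with the recurring cost of living with the band-aid over a planning horizon. The sketch below is an assumption about how such a comparison could be made; the function name, the three-year horizon and all figures are hypothetical.

```python
def worth_fixing(fix_cost, occurrences_per_year, cost_per_occurrence, horizon_years=3.0):
    """Return True if the recurring cost over the horizon exceeds the fix cost."""
    recurring_cost = occurrences_per_year * cost_per_occurrence * horizon_years
    return recurring_cost > fix_cost


# A minor issue fully handled by automation: cheap to live with.
print(worth_fixing(fix_cost=20_000, occurrences_per_year=50, cost_per_occurrence=20))

# A band-aid that still costs engineer time each month: the fix pays for itself.
print(worth_fixing(fix_cost=20_000, occurrences_per_year=12, cost_per_occurrence=1_500))
```

The first case totals 3,000 over three years against a 20,000 fix, so the automation alone is enough; the second totals 54,000, which justifies the permanent repair.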

Ultimately, the value of this exercise lies in using the tools that are now available to expand the scope of a monitoring and event management practice, then using that expansion to prevent costly incidents from occurring. This level of analysis and response can not only protect the organization’s revenue stream, it can also ensure confidence in customer-facing operations.

About Kepner-Tregoe

Kepner-Tregoe has been the industry leader in problem-solving and service-excellence processes for more than 60 years. The experts at KT have helped companies raise their level of incident- and problem-management performance through tools, training and consulting – leading to highly effective service-management teams ready to respond to your company’s most critical issues.

