
Incident

"Any event that deviates from the (expected) standard operation of a system."

An incident is often simply a user requesting help for something that is not working. For example “I can’t see my network drive”, “I can’t access the Internet”, “I can’t send email”. It is any situation where something does not work and the specific details are not known.

Work around

Problem Management may identify work-arounds while investigating problems. These should be made known to Incident Management so that they can be passed to the user until the permanent fix is implemented.

8.0 Problem Management

8.1 Introduction

Problems have a tendency to happen, no matter how well things are running. Even with the most reliable IT, service delivery will be troubled by disruptions that cannot always be avoided.

We have learnt that an incident is a deviation from standard operation. This means that users can face many incidents and a lot of the time they will face the same incident many times.

A user calls with an "incident". The Service Desk captures the call and gives great Incident Management support: "re-boot your PC and see if that fixes it". It does. The user is happy. The next day the same user calls with the same incident and gets the same great incident support: "Re-boot your PC". On the third day the user does not call again; they just re-boot their PC and start to live with the issue. Then they start to tell other users: just re-boot your PC, that'll fix it. All of a sudden we have a plague of PC re-booting users!

So how do we avoid the plague? By introducing Problem Management. The first incident that was fixed by re-booting the PC should have been passed to Problem Management.

An example where Problem Management can make a difference:

In an organization a user calls the Service Desk with the complaint that his document is not printing. The Service Desk investigates the incident and sees that the print queue has the status ‘On Hold’. The Service Desk releases the queue, the document is printed and the Incident is closed. A few minutes later another user calls the Service Desk: their document is not printing. As a different person answers the phone, he investigates the Incident, sees that the queue is on hold, releases the queue and the Incident is closed.

In the meantime users print their documents over and over again as they think they’ve done something wrong. The next day users still call with the same problem: the document is still not printing. The Service Desk releases the queue when applicable, the document is printed and the Incident is closed.

If Problem Management were in place a problem would have been identified and recorded. The "Known Error" related to this problem would be found in the configuration of the Printer. The solution, to reconfigure the printer so the queue is automatically released, would be found and implemented. The stream of Incidents regarding this printer would cease.

The releasing of the queue by the Service Desk would be used as a workaround to restore the IT service in the event of the printer facing a similar issue in the future.
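The printer example suggests a simple rule of thumb: when the same symptom keeps recurring on the same C.I., the incidents should be handed to Problem Management. The following is a minimal Python sketch of that idea. The `Incident` fields, the `problem_candidates` helper and the threshold of three are illustrative assumptions, not part of ITIL.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    incident_id: int
    config_item: str  # the affected C.I. (hypothetical naming scheme)
    symptom: str      # short description captured by the Service Desk


def problem_candidates(incidents, threshold=3):
    """Return (config_item, symptom) pairs seen at least `threshold` times."""
    counts = Counter((i.config_item, i.symptom) for i in incidents)
    return sorted(key for key, n in counts.items() if n >= threshold)


incidents = [
    Incident(1, "printer-01", "queue on hold"),
    Incident(2, "printer-01", "queue on hold"),
    Incident(3, "laptop-17", "cannot send email"),
    Incident(4, "printer-01", "queue on hold"),
]

candidates = problem_candidates(incidents)
print(candidates)
```

Here the three "queue on hold" incidents on the same printer are grouped into one candidate problem, while the single email incident is left with Incident Management.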

8.2 Objective

The objective of Problem Management is to minimize the total impact of problems on the organization. Problem Management plays an important role in the detection and repair of problems to prevent their reoccurrence.

The following slide says this in a different way but also introduces the crucial element of proactive problem management.

8.3 Process Description

The Problem Management process is focused on finding weaknesses in the IT infrastructure and through the use of Change Management removing them so that future disruptions do not occur. The process focuses on finding patterns between incidents, problems and known errors. These three areas are key things to understand in this "root cause analysis". The basic principle is starting with many possibilities and narrowing down to a final root cause.

Note: "Root Cause Analysis" is often used interchangeably with Problem Management. The ITIL Framework doesn't prescribe what a process area should be called and Root Cause Analysis is fine. However, Root Cause Analysis is typically a reactive exercise. ITIL's Problem Management caters for reactive work, but more importantly recognizes the value of proactive problem management. We use Root Cause Analysis interchangeably with Problem Management.

Incidents:

An incident is defined as a deviation from the standard expected operation of a service. It is a general description of something that has gone wrong. It is not known what the exact cause is at this stage. For example users will call a Help Desk and say, “I can’t print”, “I can’t access the Internet”, “I can’t see my network drive”. They expected to be able to do these things yet could not, so these are "incidents".

Problem:

A problem is the “unknown underlying cause of one or more incidents”. This is the second stage of "root cause analysis"/problem management. From the general incidents, more investigation will uncover an underlying cause of these incidents. A “network problem” is a good example of a problem definition in this case. Users don't call saying I have a "network problem", they call and say "I can't save to my H: drive" or "I can't print or surf the web". IT staff then piece all these incidents together and identify that we are facing a "network problem".

Root cause analysis has taken us closer to finding the root cause but not completely. A problem is then a more specific definition.

Known Error:

A Known Error is the final step in the root cause analysis process. A Known Error can be defined as, “when the root cause of the problem is known”. In our network problem example it is where the faulty equipment or system has been identified.

This is the end of the root cause analysis process. Following the above example the Known error would be “Router x is faulty”.
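The three stages can be sketched as a small data model: incidents are attached to a problem record, and the problem becomes a Known Error once its root cause is filled in. This is an illustrative Python sketch only; the class and field names are assumptions and do not come from any ITIL tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProblemRecord:
    description: str                                    # e.g. "network problem"
    incident_refs: list = field(default_factory=list)   # incidents pieced together
    root_cause: Optional[str] = None                    # unknown until diagnosed
    workaround: Optional[str] = None                    # passed to Incident Management

    @property
    def is_known_error(self) -> bool:
        # A problem becomes a Known Error once its root cause is known.
        return self.root_cause is not None


problem = ProblemRecord("network problem")
problem.incident_refs += ["cannot save to H: drive", "cannot print", "cannot surf the web"]
before = problem.is_known_error

problem.root_cause = "Router x is faulty"
after = problem.is_known_error
print(before, after)
```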

From the above we see the initial general issues being faced through to the final definition of the root cause. The following diagram illustrates this flow.

The process of Problem Management requires the following inputs:

• Incident records and details about Incidents

• Known Errors

• Information about C.I.’s from the CMDB

• Information from other processes (eg. Service Level Management provides information about required timeframes, Change Management provides information about recent changes that may be part of identifying the known error).

The outputs of the process are:

• RFC’s (Request for Change) to start the change process to solve the Known Errors.

• Management Information

• Work arounds

• Known Errors

• Update Problem records and solved problems records if the known error is solved.

The following picture summarizes this. The center diamond highlights the Problem Management activities which we will look at in the next module.

8.4 Activities

ITIL Problem Management has four primary activities, as follows:

• Problem Control

• Error Control

• Proactive Problem Management

• Completion of Major Problem Reviews

Problem Control

The main activities within Problem Control are:

• Identification and recording of Problems

o Some problems can be identified by processes outside Problem Management (eg. Capacity Management).

o It is also wise to note that the time, effort and cost that goes into fixing problems must be weighed up against the benefits of doing so. If costs outweigh benefits a simple Problem record can be created that links all affected C.I.'s, RFC's and Incidents.

• Classification of Problems

o This activity centres on understanding the impact of the problem on agreed service levels. Classification of problems is similar to Incident classification (impact, urgency, priority).

• Investigation and diagnosis of Problems

o This is the step where we get to understand what it is that is causing the problem. This step is vastly different from Incident Management investigation where the focus is "rapid restoration" of service.
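The classification step above mentions impact, urgency and priority. One common way to derive a priority from the other two is a simple matrix; the Python sketch below assumes three levels for each and a five-step priority scale. Both the scale and the formula are assumptions chosen for illustration, and an organization would normally define its own.

```python
# Assumed three-level scale; 1 is the most severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str) -> int:
    """Combine impact and urgency into a priority from 1 (highest) to 5 (lowest)."""
    return LEVELS[impact] + LEVELS[urgency] - 1


print(priority("high", "high"))
print(priority("high", "low"))
print(priority("low", "low"))
```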

Error Control

Error Control is the process in which the Known Errors are researched and corrected. The request for change comes from this sub-activity and is submitted to Change Management and then following approval the change is actioned.

Proactive Problem Management

The best problems are the ones that never happen!

Proactive Problem Management focuses on analysing data gathered from other processes, with the goal of defining “Problems” before they cause incidents. These problems are then passed to the Problem Control and Error Control procedures, as if they had already happened.

The activity includes:

• Trend analysis

• Using data to highlight potentially weak components.

• Targeting preventative action

• Trend analysis can lead to identifying general problem areas.

The aim of proactive Problem Management is to redirect efforts away from always being reactive, to proactively preventing incidents occurring in the first place.
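As a rough illustration of trend analysis, the Python sketch below compares incident counts per C.I. across two periods and flags components whose count is rising. The function name, the sample data and the `min_increase` threshold are hypothetical.

```python
from collections import Counter


def rising_components(previous, current, min_increase=2):
    """Return C.I.s whose incident count rose by at least `min_increase`."""
    before = Counter(previous)
    now = Counter(current)
    # Counter returns 0 for a C.I. that had no incidents in the earlier period.
    return sorted(ci for ci in now if now[ci] - before[ci] >= min_increase)


# Each entry is the C.I. named in one incident record (made-up data).
last_month = ["router-x", "printer-01", "disk-07"]
this_month = ["router-x", "router-x", "router-x", "disk-07", "printer-01", "printer-01"]

flagged = rising_components(last_month, this_month)
print(flagged)
```

A flagged component would then be raised as a Problem and handled through Problem Control and Error Control in the usual way.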

Completion of Major Problem reviews

At the end of a major problem cycle, there should be a review to learn:

1. What things were done right?

2. What things should we have done differently?

3. What lessons do we take away from solving this problem?

8.5 Roles

Problem Manager Role

The role of Problem Manager is responsible for:

• Developing and maintaining Problem Control and Error Control

• Assessing the efficiency and effectiveness of Problem Control and Error Control

• Providing management information

• Managing Problem Management personnel

• Obtaining the resources for the required activities

• Developing and improving Problem Control and Error Control systems

• Analyzing and evaluating the effectiveness of Proactive Problem Management

Problem Support role

The Problem Support team is responsible for:

• Identification of Problems

• Investigation of Problems leading to the Known Errors

• Monitoring the Process of eliminating Known Errors

• Raise RFC’s when necessary

• Identify trends

• Communicate Work arounds and quick fixes to Incident Management

Relationships

The Problem Management process has a close connection with the following ITIL processes:

Incident Management:

Incident Control provides accurate information so that Problem Control can resolve Known Errors more easily. Problem Management will supply Incident Management with workarounds and quick fixes where possible.

Change Management:

If Problem Management finds the solution to a Known Error, it has to submit an RFC for the Change. Change Management is responsible for the implementation of the Change. Once the Change is implemented, Change Management and Problem Management together review the Problem to verify that it has been solved by the Change. This is called a Post Implementation Review, after which Problem Management can close the problem record.
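The hand-off described here follows a fixed order: Known Error, RFC, implemented Change, review, closure. A small state machine can make that order explicit. The Python sketch below is an assumption-laden illustration; the state names are invented for this example, and closure is taken to happen only after the Post Implementation Review.

```python
from enum import Enum


class State(Enum):
    KNOWN_ERROR = "known error"
    RFC_SUBMITTED = "RFC submitted"
    CHANGE_IMPLEMENTED = "change implemented"
    CLOSED = "closed"  # reached only after the Post Implementation Review


# Each state may only move to the next step in the sequence.
NEXT = {
    State.KNOWN_ERROR: State.RFC_SUBMITTED,
    State.RFC_SUBMITTED: State.CHANGE_IMPLEMENTED,
    State.CHANGE_IMPLEMENTED: State.CLOSED,
}


def advance(state: State) -> State:
    """Move a problem record to its next state, or fail if it is already closed."""
    if state not in NEXT:
        raise ValueError(f"{state.value!r} is a final state")
    return NEXT[state]


state = State.KNOWN_ERROR
history = [state]
while state in NEXT:
    state = advance(state)
    history.append(state)
print([s.value for s in history])
```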

Configuration Management:

The information that is provided by Configuration Management is important in diagnosing problems. It includes information about C.I.’s and relationships with other C.I.’s.

Other processes:

Service Level Management, Configuration Management and Availability Management all provide Problem Management with information that helps to define and determine the impact of problems. In return Problem Management provides these and the other processes with relevant information: for example, to SLM if a problem causes an IT Service to operate outside the agreed service levels, or to Capacity Management if a certain hard disk is the cause of a Problem.

8.6 Benefits

Problem Management improves IT service quality by resolving the root cause of incidents. This leads to fewer Incidents, benefiting users, customers, the organization and the IT department.

Advantages are:

• Better quality of IT Service Management

• Reliable IT Service results in a better reputation of the IT service

• Ability to learn from the past

• IT staff will be more productive

• Higher resolution rate for Incidents at the Service Desk first time around

• Fewer Incidents

8.7 Common Problems

Common problems for Problem Management include:

1. Incident Management and Problem Management don’t have well defined interfaces with each other.

2. Known Errors are not communicated to Service Desk/Incident Management

3. No Commitment from Management

4. Unrealistic expectations of the Problem Management process.

The following slide raises these and other points of attention that have to be considered.

8.8 Metrics

Successful Problem Management can be measured by:

• Reduction in Incidents because the underlying causes are removed

• The time that is needed to resolve Problems

• The other costs incurred in resolving Problems

Within Problem Management there is a lot that can be measured. What is relevant depends on the scope of Problem Management.

Some examples are:

• Time spent per organizational unit fixing problems

• Number of RFC's raised

• Ratio of proactive to reactive problem management
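Some of these measures can be computed directly from closed problem records. The Python sketch below is one possible way to do so; the `ClosedProblem` fields and the sample figures are made up for illustration, and a real report would draw on whatever the organization's own records contain.

```python
from dataclasses import dataclass


@dataclass
class ClosedProblem:
    proactive: bool         # True if found by trend analysis rather than by an incident
    days_to_resolve: float
    rfcs_raised: int


def summarize(problems):
    """Compute a few example Problem Management metrics for closed problems."""
    if not problems:
        raise ValueError("at least one closed problem is required")
    proactive = sum(1 for p in problems if p.proactive)
    reactive = len(problems) - proactive
    return {
        "mean_days_to_resolve": sum(p.days_to_resolve for p in problems) / len(problems),
        "rfcs_raised": sum(p.rfcs_raised for p in problems),
        # None when there is no reactive work to compare against
        "proactive_to_reactive": proactive / reactive if reactive else None,
    }


report = summarize([
    ClosedProblem(True, 4.0, 1),
    ClosedProblem(False, 10.0, 2),
    ClosedProblem(False, 1.0, 0),
])
print(report)
```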
