Service Operation Processes

Event Management
1. Purpose/goal/objective
  1. The ability to detect events, make sense of them and determine the action needed
  2. EM is the basis for Operational Monitoring and Control, so automation is desirable
2. Scope
  1. It can be applied to any aspect of Service Management that needs to be controlled and which can be automated
  2. Monitoring is broader than EM; it also seeks out conditions that do not generate events
3. Value to Business
  1. It provides mechanisms for early detection of incidents; an incident may be detected before any actual service outage occurs
  2. It makes it possible for some types of automated activity to be monitored by exception, reducing the need for expensive real-time monitoring
  3. Integration with other processes (availability/capacity) can enable an early response
  4. EM provides a basis for automated operations
4. Policies/Principles/basic concepts
  1. Events that signify regular operation (e.g. user has logged on)
  2. Events that signify an exception (user types a wrong password / CPU too hot)
  3. Events that signify unusual but not exceptional operations (memory utilization above 90%)
5. Process activities, methods and techniques
  1. Event occurs
    1. Events occur continuously but not all of them are detected or registered
  2. Event notification
    1. A device can be interrogated by a management tool, which collects certain targeted data
    2. Other option: the CI is capable of generating a notification
  3. Event detection
    1. Once generated, an event is detected by an agent running on the same system or sent directly to a predefined tool
  4. Event filtering
    1. The purpose is to decide what kind of communication is needed (e.g. just place the event in a log file, or send a warning if it is more critical)
  5. Significance of events
    1. Informational (does not require action, e.g. user logs on)
    2. Warning (reaching a threshold)
    3. Exception (could represent a total failure or degraded performance e.g. server is down)
  6. Event correlation
    1. The meaning and significance of the event is defined (based on e.g. number of similar events, number of CIs generating similar events, etc.)
  7. Trigger
    1. If the correlation activity recognizes an event, a response will be required. The mechanism used to initiate a response is called a trigger.
    2. Examples: Incident triggers generate a record in the IM system. Paging systems that will notify a person or team by mobile phone.
  8. Response selection
    1. Event logged, e.g. a record of the event is logged
    2. Auto Response e.g. restarting a service
    3. Alert and human intervention e.g. SMS sent
    4. Open an RFC, Incident or Problem
  9. Review Actions
    1. Evaluate e.g. the handover to other processes
  10. Close event
6. Triggers, input and output/inter-process interfaces
7. Information Management
  1. SNMP messages
  2. Management Information Bases (MIB) of IT devices
  3. Vendors' monitoring tools and agent software
  4. Correlation Engines
8. Challenges, Critical Success Factors and risks
  1. Challenge: e.g. funding, setting the correct level of filtering, skills
  2. CSF: e.g. achieving the correct level of filtering. There are three keys to the correct level:
    1. Integrate EM into all Service Management processes
    2. Design new services with EM in mind
    3. Trial and error
  3. Risk: not achieving the above
9. Designing for Event Management
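
The significance and response-selection steps of Event Management listed above can be illustrated with a minimal Python sketch. The `Event` fields, the 90% threshold and the mapping from significance to response are assumptions made for illustration only; real thresholds and responses are defined per CI and per organization.

```python
from dataclasses import dataclass
from enum import Enum


class Significance(Enum):
    INFORMATIONAL = "informational"  # no action required, e.g. user logs on
    WARNING = "warning"              # a threshold is being reached
    EXCEPTION = "exception"          # failure or degraded performance


@dataclass
class Event:
    ci: str               # configuration item that raised the event
    metric: str           # hypothetical metric name, e.g. "memory_utilization"
    value: float          # measured value in percent (assumption for this sketch)
    failed: bool = False  # True if the CI reports a failure, e.g. server is down


WARNING_THRESHOLD = 90.0  # hypothetical threshold, taken from the 90% example


def classify(event: Event) -> Significance:
    """Assign a significance level to a detected event."""
    if event.failed:
        return Significance.EXCEPTION
    if event.value >= WARNING_THRESHOLD:
        return Significance.WARNING
    return Significance.INFORMATIONAL


def select_response(significance: Significance) -> str:
    """Pick a response for a significance level (illustrative mapping only)."""
    responses = {
        Significance.INFORMATIONAL: "log event",
        Significance.WARNING: "alert operator",
        Significance.EXCEPTION: "open incident",
    }
    return responses[significance]
```

For example, `select_response(classify(Event("srv01", "memory_utilization", 95.0)))` selects the alert response, because the value is above the assumed warning threshold but the CI has not failed.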
Incident Management
1. Purpose/goal/objective
  1. Primary goal is to restore normal service operation as quickly as possible and minimize the impact on business operations.
2. Scope
  1. Includes any event which disrupts, or which could disrupt a service
3. Value to Business
  1. The ability to detect and resolve incidents, which results in lower downtime to the business
  2. The ability to align IT activity to real-time business priorities (IM is able to identify business priorities and respond to them)
  3. The ability to identify potential improvements to services
  4. The SD can identify additional service or training requirements found in IT or the business
4. Policies/Principles/basic concepts
  1. Timescales
    1. SLAs, OLAs, UCs
  2. Incident models (should include:)
    1. The steps that should be taken to handle the incident
    2. The chronological order these steps should be taken in, with any dependencies or co-processing defined
    3. Responsibilities; who should do what
    4. Timescales, and thresholds for completion of the actions
    5. Escalation procedures; who should be contacted when
    6. Any necessary evidence-preservation activities (particularly relevant for security- and capacity-related incidents)
  3. Major incidents
    1. A separate procedure with shorter timescales and greater urgency. Definition is needed to make sure it's not abused
5. Process activities, methods and techniques
  1. Incident identification
    1. Work cannot begin on dealing with an incident until it is known that an incident has occurred
    2. Ideally, EM detects failure early so incidents are resolved before they have an impact on users!!
  2. Incident logging
    1. All incidents must be fully logged and date/time stamped
    2. Put in appropriate info: categorization/urgency/impact/prioritization/method of notification/name/department/phone etc.
  3. Incident categorization
    1. Allocate suitable incident categorization coding so that the exact type of the call is recorded
    2. This is important to be able to look later at incident types/frequencies/trends
  4. Incident prioritization
    1. based on urgency and impact
    2. To be discussed and agreed upon with customer
  5. Initial diagnosis
    1. The SD analyst does this, e.g. while the customer is still on the phone, to get more information and tries to resolve the incident at first contact ('hit-in-one')
    2. If that fails, the analyst checks whether the incident can be resolved within the agreed time limit; if not, it is passed on to another support group
  6. Incident escalation
    1. Functional escalation
      1. Passing through to second level support or third level and so on. No matter if this group is internal or external
      2. Note: Incident Ownership remains with the Service Desk!
    2. Hierarchic escalation
      1. Passing through to management when solution takes too much time or proves to be too difficult or when discussion
      2. The SD should keep the user informed of any relevant escalation and ensure the Incident Record is updated to keep a full history of actions
  7. Investigation and diagnosis
    1. All investigation and diagnosis actions should be documented in parallel with the activities to prevent loss of valuable time
  8. Resolution and Recovery
    1. When a potential resolution has been identified, it should be applied and tested to ensure that the service has been fully restored
    2. In some cases it may be necessary for two or more groups to take separate, though coordinated, recovery actions for an overall solution
  9. Incident closure
    1. Closure categorization => check and confirm that the initial incident categorization was correct
    2. User satisfaction survey
    3. Incident documentation (to have a full historical record, which can be used by other processes to improve service)
    4. Ongoing or recurring problem? If so, raise a Problem Record
    5. Formal Closure
6. Triggers, input and output/inter-process interfaces
  1. IM has an interface with Problem Management
  2. IM has an interface with Configuration Management
  3. IM has an interface with Change Management
  4. IM has an interface with Capacity Management
  5. IM has an interface with Availability Management
  6. IM has an interface with SLM
7. Information Management
  1. Incident Management tool
  2. Incident Records
  3. Known error Database
  4. CMS
8. Metrics
  1. Total numbers of incidents
  2. Breakdown of incidents at each stage
  3. Size of current incident backlog
  4. Number and percentage of Major incidents
  5. Percentage of incidents handled within agreed response time
  6. Average cost per incident
  7. etc
9. Challenges, Critical Success Factors and risks
  1. Challenges
    1. Ability to detect incidents as early as possible
    2. Convincing all staff that all incidents must be logged and kept up-to-date
    3. Availability of information about problems and known errors
    4. Integration into the CMS
    5. Integration into the SLM
  2. CSFs
    1. A good SD is key to successful IM
    2. Clearly defined targets to work to, as defined in SLAs
    3. Adequate customer-oriented and technically trained support staff with the correct skill levels
    4. Integrated support tools to drive and control the process
    5. OLAs and UCs that are capable of influencing and shaping the correct behaviour of all support staff
  3. Risks
    1. The opposite of the above-mentioned Challenges and CSFs
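
Incident prioritization is described above as based on urgency and impact. A minimal sketch of one common way to combine the two is shown below. The three-level scale, the additive formula and the target times are assumptions for illustration; the actual matrix and timescales are agreed with the customer and recorded in the SLA.

```python
# Levels for impact and urgency: 1 = high, 2 = medium, 3 = low (assumed scale).
LEVELS = {"high": 1, "medium": 2, "low": 3}

# Hypothetical target resolution times in minutes per priority code.
TARGET_MINUTES = {1: 60, 2: 480, 3: 1440, 4: 2880, 5: 10080}


def priority(impact: str, urgency: str) -> int:
    """Return a priority code from 1 (most pressing) to 5 (least pressing)."""
    return LEVELS[impact] + LEVELS[urgency] - 1


def target_breached(impact: str, urgency: str, minutes_open: int) -> bool:
    """Check whether an open incident has exceeded its assumed target time."""
    return minutes_open > TARGET_MINUTES[priority(impact, urgency)]
```

With these assumptions, an incident with high impact and high urgency gets priority 1, and `target_breached("high", "high", 90)` is true because the assumed target for priority 1 is 60 minutes.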
Request Fulfilment
1. Purpose/goal/objective
  1. To provide a channel for users to request and receive standard services for which a pre-defined approval and qualification process exists
  2. To provide information to users and customers about the availability of services and the procedure for obtaining them
  3. To source and deliver the components of requested standard services (e.g. licences and software media)
  4. To assist with general information, complaints or comments
2. Scope
  1. A service request is usually something that can and should be planned, whereas an incident is usually an unplanned event
3. Value to Business
  1. To provide quick and effective access to standard services
  2. Effective use of RF reduces the bureaucracy involved in requesting and receiving access to existing or new services
  3. RF is a lean process that can help reduce costs.
4. Policies/Principles/basic concepts
  1. Many Service Requests recur, so a predefined process flow (a model) is advised (with e.g. timescales, SLAs, escalation paths)
  2. Ownership of service requests resides with the SD
5. Process activities, methods and techniques
  1. Menu selection
  2. Financial approval
  3. Other approval
  4. Fulfilment
  5. Closure
6. Triggers, input and output/inter-process interfaces
  1. SD/IM: many Service Requests may come in via the SD and may initially be handled through the IM process
  2. A strong link is also needed between RF and Release, Asset and Configuration Management
7. Information Management
  1. Service Request (with information about the kind of request, by whom, for whom and by what process)
  2. RFC, most typical if the SR relates to a CI
  3. Service Portfolio, to enable the scope of agreed SR to be identified
  4. Security policies
8. Challenges, Critical Success Factors and risks
  1. Challenges
    1. Clearly defining and documenting the type of requests that will be handled within the RF process.
    2. Establishing self-help front-end capabilities that allow the users to interface successfully with the RF process
  2. CSF
    1. Agreement on what services will be standardized and who is authorized to request them (cost must also be agreed upon)
    2. Publication of the services to users as part of the Service Catalogue
    3. Definition of standard fulfilment procedure for each of the services being requested.
    4. A SPOC which can be used to request the services, mostly the SD
    5. Self-service tools needed to provide a front-end interface to the users. It is essential that these integrate with the back-end fulfilment tools
  3. Risks
    1. Poorly defined scope, where people are unclear about exactly what the process is expected to handle
    2. Poorly designed or implemented user interfaces so that users have difficulty raising a request
    3. Badly designed or operated back-end fulfilment processes that are unable to deal with volume or nature of requests
    4. Inadequate monitoring capabilities so that accurate metrics cannot be gathered
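
The Request Fulfilment activities listed above (menu selection, financial approval, other approval, fulfilment, closure) can be sketched as a small request model. The `ServiceRequest` class, its fields and the `approve` callback are hypothetical names used only for this illustration; in practice each standard service has its own predefined model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ServiceRequest:
    item: str                               # what the user selected from the menu
    needs_financial_approval: bool = False
    needs_other_approval: bool = False
    history: list[str] = field(default_factory=list)  # steps taken so far


def fulfil(
    request: ServiceRequest,
    approve: Callable[[str, ServiceRequest], bool],
) -> bool:
    """Walk a request through the steps; return True if it was fulfilled."""
    request.history.append("menu selection")
    approvals = (
        ("financial approval", request.needs_financial_approval),
        ("other approval", request.needs_other_approval),
    )
    for step, needed in approvals:
        if not needed:
            continue  # this request model does not require the approval
        request.history.append(step)
        if not approve(step, request):
            request.history.append("closure (rejected)")
            return False
    request.history.append("fulfilment")
    request.history.append("closure")
    return True
```

The `approve` callback stands in for whatever approval mechanism an organization uses; the recorded `history` gives the kind of audit trail from which metrics could later be gathered.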
Problem Management
1. Purpose/goal/objective
  1. Responsible for managing the lifecycle of all problems
  2. Objective is to prevent problems and resulting incidents from happening
  3. To eliminate recurring incidents
  4. Minimize the impact of incidents that cannot be prevented
2. Scope
  1. Activities required to diagnose the root cause of incidents and determine the resolution
  2. Responsible for ensuring that the resolution is implemented through the appropriate control procedures (e.g. CM and Rel Mgt)
  3. PM maintains information about problems and workarounds
3. Value to Business
  1. Higher availability of IT services
  2. Higher productivity of business and IT staff
  3. Reduced expenditure on workarounds or fixes that do not work
  4. Reduction in cost of effort in fire-fighting or resolving repeat incidents
4. Policies/Principles/basic concepts
  1. Problem models are important; see Incident models
5. Process activities, methods and techniques
  1. Reactive
  2. Proactive
  3. Problem detection
  4. Problem logging
  5. Problem categorization
  6. Problem prioritization
  7. Problem investigation and diagnosis
    1. Chronological analysis
    2. Pain Value analysis
    3. Kepner and Tregoe
    4. Brainstorming
    5. Ishikawa diagrams (diagram as a result of brainstorming)
    6. Pareto analysis (ranking chart with most likely cause at top, less likely or trivial at bottom)
  8. Workarounds
  9. Raising a known error record
  10. Problem resolution (if RFC needed: Emergency RFC and ECAB!)
  11. Problem closure
  12. Major problem review
    1. Examine things that went well
    2. Examine things that went wrong
    3. What could be done better in the future
    4. How to prevent recurrence
    5. Whether there has been any third-party responsibility and whether follow-up actions are needed
  13. Errors detected in the development environment
    1. When an application is released but is not error-free, the known errors and workarounds should be recorded in the KEDB. This should be tested!
6. Triggers, input and output/inter-process interfaces
  1. ST - Change mgt
  2. ST - Config mgt
  3. ST - R&D mgt
  4. SD - Availability mgt
  5. SD - Cap. mgt
  6. SD - IT service Continuity
  7. CSI - SLM
  8. SS - Financial mgt
7. Information Management
  1. CMS - details of components and their relationships, previous actions are recorded
  2. KEDB
8. Metrics
  1. Total number of problems recorded in a period of time
  2. Percentage of solved & not-solved within SLA
  3. Average cost of a problem
  4. Number of major problems
  5. Percentage of accuracy of the KEDB
9. Challenges, Critical Success Factors and risks
  1. Linking IM & PM tools
  2. The ability to relate incidents and Problem records
  3. The second- and third-line staff should have a good working relationship with staff on the first line
  4. Making sure that business impact is well understood by all staff working on Problem resolution
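
The Pareto analysis mentioned under problem investigation and diagnosis ranks causes so that the most significant one is at the top. A minimal sketch follows, assuming that each incident has already been tagged with a suspected cause and that frequency is used as the ranking criterion; the function name and the cause labels are hypothetical.

```python
from collections import Counter


def pareto(causes: list[str]) -> list[tuple[str, int, float]]:
    """Rank causes by frequency and add the cumulative percentage.

    Returns (cause, count, cumulative_percent) tuples, most frequent first.
    """
    counts = Counter(causes)
    total = sum(counts.values())
    ranked = []
    cumulative = 0
    for cause, count in counts.most_common():
        cumulative += count
        ranked.append((cause, count, round(100 * cumulative / total, 1)))
    return ranked


# Example input: one suspected cause per recorded incident (made-up labels).
incident_causes = ["disk full", "disk full", "disk full", "network", "bad patch"]
chart = pareto(incident_causes)
```

Reading the cumulative column from the top shows how few causes account for most incidents, which helps decide where problem resolution effort is best spent.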
Access Management
1. Purpose/goal/objective
  1. Provides the right for users to be able to use a service or group of services.
  2. It is therefore the execution of policies and actions defined in Security and Availability mgt
2. Scope
  1. It enables the organization to manage the confidentiality, availability and integrity of its data and intellectual property
  2. Ensuring that users are given the right to use a service
  3. AM does not itself ensure that the service is available at all agreed times; that is the role of Availability Management
3. Value to Business
  1. Controlled access to services ensures that the organization is able to maintain the confidentiality of its information more effectively
  2. Employees have the right level of access to execute their jobs effectively
  3. Less likelihood of errors being made in data entry or in the use of a critical service by an unskilled user
  4. The ability to audit use of services and to trace the abuse of services
  5. The ability more easily to revoke access rights when needed - an important security consideration
  6. May be needed for regulatory compliance (e.g. SOX, COBIT)
4. Policies/Principles/basic concepts
  1. Access - level and extent of a service's functionality or data that a user is entitled to use
  2. Identity - to make sure it's the right user
  3. Rights - actual settings whereby a user is provided access to a service or group of services
  4. Service or service groups - e.g. basic set of applications
  5. Directory services - specific type of tool that is used to manage access and rights
5. Process activities, methods and techniques
  1. Requesting access
  2. Verification - is the user who he claims to be and has he a legitimate reason for wanting access
  3. Providing rights - look out for Role Conflict situations!
  4. Monitoring identity status - a change may require other privileges, e.g. job changes, promotion, transfer, resignation, death
  5. Logging and tracking access - to detect abuse of access rights
  6. Removing or restricting rights
6. Triggers, input and output/inter-process interfaces
  1. RFC
  2. Service request
  3. Request from HRM
  4. Request from manager of department
7. Information Management
  1. Identity - identifying a user
  2. Users, groups, roles and service groups
8. Metrics
  1. Number of requests for access
  2. Instances of access granted
  3. Number of incidents requiring a reset of access rights
  4. Number of incidents caused by incorrect access settings
9. Challenges, Critical Success Factors and risks
  1. The ability to verify the identity of a user
  2. The ability to verify the identity of the approving person
  3. The ability to verify that a user qualifies for access to a specific service
  4. The ability to link multiple access rights to an individual user
  5. The ability to manage changes to a user's access requirements
  6. A database of all users and the rights that they have been granted
  7. The ability to determine the status of the user at any time (are they still an employee at the moment they log on?)
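
The "Providing rights" activity above warns about Role Conflict situations. A minimal sketch of such a check is shown below. The role names and the set of conflicting pairs are hypothetical; a real implementation would take them from the organization's security policy and directory services.

```python
# Hypothetical pairs of roles that one user should not hold at the same time.
CONFLICTING_ROLES = {
    frozenset({"create_purchase_order", "approve_purchase_order"}),
    frozenset({"developer", "production_release_approver"}),
}


def role_conflicts(current_roles: set[str], requested_role: str) -> list[frozenset[str]]:
    """Return the conflicting pairs that granting requested_role would create."""
    proposed = current_roles | {requested_role}
    return [pair for pair in CONFLICTING_ROLES if pair <= proposed]


def grant(current_roles: set[str], requested_role: str) -> set[str]:
    """Return the new role set, or raise ValueError on a role conflict."""
    conflicts = role_conflicts(current_roles, requested_role)
    if conflicts:
        names = [" + ".join(sorted(pair)) for pair in conflicts]
        raise ValueError(f"role conflict: {'; '.join(names)}")
    return current_roles | {requested_role}
```

Raising an error rather than silently refusing the request keeps the decision visible, so it can be logged and escalated to whoever owns the access policy.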
Operational activities of processes covered in other lifecycle phases
1. Change Management
  1. Raising and submitting RFCs as needed to address SO issues
  2. Implementing changes as directed by CM where they involve SO components or services
  3. Helping define and maintain change models relating to SO components or services
  4. Using the CM process for standard, operational-type changes
2. Configuration Management
  1. Informing config mgt of any discrepancies found between any CIs and the CMS
  2. Making any amendments necessary to correct any discrepancies under the authority of Config Mgt
3. Release & Deployment
  1. Actual implementation actions regarding the deployment of new releases, under the direction of R&D mgt
  2. Participation in the planning stages of major new releases to advise on SO issues
  3. The physical handling of CIs from/to the DML as required to fulfil their operational roles.
4. Capacity Management
  1. Capacity & performance monitoring.
  2. Handling capacity- or performance-related incidents
  3. Capacity and Performance trends
  4. Storage of Capacity mgt data
  5. Demand mgt e.g. increasing or decreasing privileges to use a service
  6. Workload mgt
    1. Rescheduling a service (out of peak times)
    2. Moving a service or workload (e.g. another location)
    3. Virtualization
  7. Modelling and application sizing (role e.g. evaluating and feeding back discrepancies)
  8. Capacity planning
5. Availability Management
  1. e.g. collecting data
  2. Review of maintenance activities
  3. Major problem reviews
  4. Involvement in Specific initiatives
6. Knowledge Management
  1. Properly gathering and storing data
7. Financial Management for IT services
  1. Proper planning of resources
  2. Charging of incidents and changes
8. IT Service Continuity Management
  1. Risk assessment - using the knowledge of the infrastructure and techniques
  2. Assistance in writing the actual recovery plans for systems and services under its control
  3. Participation in testing plans