-
Event Management
-
Purpose/goal/objective
- The ability to detect events, make sense of them and determine the action needed
- EM is the basis for Operational Monitoring and Control, so automation is desirable
-
Scope
- It can be applied to any aspect of Service Management that needs to be controlled and which can be automated
- Monitoring is broader than EM: it also seeks out conditions that do not generate events
-
Value to Business
- It provides mechanisms for early detection of incidents, often before any actual service outage occurs
- It makes some types of automated activity possible, which reduces the need for expensive real-time monitoring
- Integration with other processes (availability/capacity) enables an early response
- EM provides a basis for automated operations
-
Policies/Principles/basic concepts
- Events that signify regular operation (e.g. user has logged on)
- Events that signify an exception (user types wrong password/CPU too hot)
- Events that signify unusual but not exceptional operations (memory utilization above 90%)
-
Process activities, methods and techniques
-
Event occurs
- Events occur continuously but not all of them are detected or registered
-
Event notification
- A device can be interrogated by a management tool, which collects certain targeted data
- Other option: the CI is capable of generating a notification
-
Event detection
- Once generated, an event is detected by an agent running on the same system or sent directly to a predefined tool
-
Event filtering
- Purpose is to decide what kind of communication is needed (e.g. just place the event in a logfile, or send a warning if it is more critical)
-
Significance of events
- Informational (does not require action, e.g. user logs on)
- Warning (reaching a threshold)
- Exception (could represent a total failure or degraded performance e.g. server is down)
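The three significance levels can be sketched in code. This is only a minimal illustration: the name `classify` and both threshold parameters are assumptions, not taken from any monitoring tool, and real tools usually define such rules per CI and per metric.

```python
from enum import Enum


class Significance(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


def classify(value: float, warning_at: float, exception_at: float) -> Significance:
    """Classify a metric reading against two hypothetical thresholds."""
    if value >= exception_at:
        return Significance.EXCEPTION
    if value >= warning_at:
        return Significance.WARNING
    return Significance.INFORMATIONAL
```

With `warning_at=90` and `exception_at=100`, a memory utilization reading of 92 would be classified as a warning.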
-
Event correlation
- The meaning and significance of the event is defined (based on e.g. number of similar events, number of CIs generating similar events etc.)
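A correlation rule of this kind might look as follows. It is a sketch under stated assumptions: the `Event` fields, the function name `is_significant` and the default window and limits are all hypothetical, chosen only to show counting similar events and distinct CIs.

```python
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    ci: str           # configuration item that raised the event
    kind: str         # hypothetical event type, e.g. "login_failed"
    timestamp: float  # seconds since some reference point


def is_significant(events: Iterable[Event], kind: str, now: float,
                   window: float = 300.0, min_events: int = 5,
                   min_cis: int = 3) -> bool:
    """Return True when enough similar events, or enough distinct CIs,
    were seen within the last `window` seconds."""
    recent = [e for e in events
              if e.kind == kind and now - window <= e.timestamp <= now]
    distinct_cis = {e.ci for e in recent}
    return len(recent) >= min_events or len(distinct_cis) >= min_cis
```

Five failed logins on one server within five minutes, or the same event on three different CIs, would both count as significant under these example limits.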
-
Trigger
- If the correlation activity recognizes an event, a response will be required. The mechanism used to initiate a response is called a trigger.
- Examples: Incident triggers generate a record in the IM system. Paging systems notify a person or team by mobile phone.
-
Response selection
- Event logged e.g. a record of the event is logged
- Auto Response e.g. restarting a service
- Alert and human intervention e.g. SMS sent
- Open an RFC, Incident or Problem
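The response options above can be combined, and one way to picture the selection is a small lookup function. The mapping from significance to responses below is purely hypothetical, as are the response names; each organization defines its own.

```python
from typing import List


def select_responses(significance: str, auto_fix_available: bool = False) -> List[str]:
    """Pick responses for an event; the mapping itself is hypothetical."""
    responses = ["log_event"]  # every event is at least logged
    if significance == "warning":
        # Prefer an automatic response; otherwise alert a person.
        responses.append("auto_response" if auto_fix_available else "alert_human")
    elif significance == "exception":
        responses += ["alert_human", "open_incident"]
    return responses
```

An informational event is only logged, while an exception is logged, raises an alert and opens an incident in this example.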
-
Review Actions
- Evaluate e.g. the handover to other processes
- Close event
-
Triggers, input and output/inter-process interfaces
-
Information Management
- SNMP messages
- Management Information Bases (MIB) of IT devices
- Vendors' monitoring tools and agent software
- Correlation Engines
-
Challenges, Critical Success Factors and risks
- Challenge: e.g. funding, setting the correct level of filtering, skills
-
CSF: e.g. achieving the correct level of filtering. There are three keys to the correct level:
- Integrate EM into all Service Management processes
- Design new services with EM in mind
- Trial and error
- Risk: not achieving the above
- Designing for Event Management
-
Incident Management
-
Purpose/goal/objective
- Primary goal is to restore normal service operation as quickly as possible and minimize the impact on business operations.
-
Scope
- Includes any event which disrupts, or which could disrupt a service
-
Value to Business
- The ability to detect and resolve incidents, which results in lower downtime to the business
- The ability to align IT activity to real-time business priorities (IM is able to identify business priorities and respond to them)
- The ability to identify potential improvements to services
- The SD can identify additional service or training requirements found in IT or the business
-
Policies/Principles/basic concepts
-
Timescales
- SLAs, OLAs, UCs
-
Incident models (should include:)
- The steps that should be taken to handle the incident
- The chronological order these steps should be taken in, with any dependencies or co-processing defined
- Responsibilities; who should do what
- Timescales, and thresholds for completion of the actions
- Escalation procedures; who should be contacted when
- Any necessary evidence-preservation activities (particularly relevant for security- and capacity-related incidents)
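An incident model with these elements can be held as a simple data structure. The sketch below is an assumption about how one might represent it: the class and field names (`Step`, `IncidentModel`, `max_minutes`, `escalate_to` and so on) are hypothetical. `ordered_steps` derives the chronological order from the declared dependencies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    name: str
    owner: str          # responsibility: who should do what
    max_minutes: int    # threshold for completing the step
    depends_on: List[str] = field(default_factory=list)
    escalate_to: str = ""  # who to contact when the threshold is missed


@dataclass
class IncidentModel:
    name: str
    steps: List[Step]
    preserve_evidence: bool = False  # e.g. for security-related incidents

    def ordered_steps(self) -> List[str]:
        """Return step names so that every step follows its dependencies."""
        done, order = set(), []
        pending = list(self.steps)
        while pending:
            # Steps that become ready together could be co-processed.
            ready = [s for s in pending if all(d in done for d in s.depends_on)]
            if not ready:
                raise ValueError("circular or unknown dependency")
            for s in ready:
                order.append(s.name)
                done.add(s.name)
            pending = [s for s in pending if s.name not in done]
        return order
```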
-
Major incidents
- A separate procedure with shorter timescales and greater urgency. A clear definition of a major incident is needed to make sure the procedure is not abused
-
Process activities, methods and techniques
-
Incident identification
- Work cannot begin on dealing with an incident until it is known that an incident has occurred
- Ideally, EM detects failure early so incidents are resolved before they have an impact on users!!
-
Incident logging
- All incidents must be fully logged and date/time stamped
- Put in appropriate info: categorization/urgency/impact/prioritization/method of notification/name/department/phone etc
-
Incident categorization
- Allocate suitable incident categorization coding so that the exact type of the call is recorded
- This is important to be able to look later at incident types/frequencies/trends
-
Incident prioritization
- based on urgency and impact
- To be discussed and agreed upon with customer
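A simple way to derive a priority code from impact and urgency is a lookup or an additive scheme. The one below is only an illustrative scheme with three levels each and priority codes 1 to 5; the actual matrix and level names are whatever is agreed with the customer.

```python
# Hypothetical ranking of the three levels (1 = most severe).
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str) -> int:
    """Map impact and urgency to a priority code from 1 (highest) to 5 (lowest)."""
    return LEVELS[impact] + LEVELS[urgency] - 1
```

High impact with high urgency gives priority 1, low impact with low urgency gives priority 5, and mixed combinations fall in between.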
-
Initial diagnosis
- The SD analyst does this, e.g. while the customer is still on the phone, to get more information, and tries to resolve the incident in one go ('hit-in-one')
- If that fails, the analyst checks whether the incident can be resolved within the agreed time limit; if not, it is passed on to another support group
-
Incident escalation
-
Functional escalation
- Passing the incident on to second-level support, third-level support and so on, whether the group is internal or external
- Note: Incident Ownership remains with the Service Desk!
-
Hierarchic escalation
- Passing through to management when solution takes too much time or proves to be too difficult or when discussion
- SD should keep the user informed of any relevant escalation and ensure the Incident Record is updated to keep a full history of actions
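The difference between the two escalation types can be pictured as a time-based check. This is a deliberately simplified sketch: the function name and the two time limits are assumptions, and in practice escalation also depends on difficulty and on judgement, not only on elapsed time.

```python
from typing import Optional


def escalation_needed(elapsed_minutes: float, level_limit_minutes: float,
                      target_minutes: float) -> Optional[str]:
    """Suggest an escalation type, or None if no escalation is due yet."""
    if elapsed_minutes >= target_minutes:
        return "hierarchic"  # agreed target is breached: involve management
    if elapsed_minutes >= level_limit_minutes:
        return "functional"  # current support level has used its time: pass on
    return None
```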
-
Investigation and diagnosis
- All actions of investigation and diagnosis should be documented in parallel with the activities to prevent loss of valuable time
-
Resolution and Recovery
- When a potential resolution has been identified, it should be applied and tested to ensure that the service has been fully restored
- In some cases it may be necessary for two or more groups to take separate, though coordinated, recovery actions for an overall solution
-
Incident closure
- Closure categorization => check and confirm that the initial incident categorization was correct
- User satisfaction survey
- Incident documentation (to have a full historic record, which can be used by other processes to improve service)
- Ongoing or recurring problem? If so, raise a Problem Record
- Formal Closure
-
Triggers, input and output/inter-process interfaces
- IM has an interface with Problem Management
- IM has an interface with Configuration Management
- IM has an interface with Change Management
- IM has an interface with Capacity Management
- IM has an interface with Availability Management
- IM has an interface with SLM
-
Information Management
- Incident Management tool
- Incident Records
- Known Error Database (KEDB)
- CMS
-
Metrics
- Total numbers of incidents
- Breakdown of incidents at each stage
- Size of current incident backlog
- Number and percentage of Major incidents
- Percentage of incidents handled within agreed response time
- Average cost per incident
- etc
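Several of these metrics can be computed directly from incident records. The sketch below assumes a hypothetical `Incident` record with just the fields needed; a real Incident Management tool would supply far more detail.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Incident:
    is_open: bool            # still in the backlog
    major: bool
    response_minutes: float  # time taken to respond
    target_minutes: float    # agreed response time
    cost: float


def incident_metrics(incidents: List[Incident]) -> Dict[str, float]:
    """Compute a few summary metrics over a list of incident records."""
    total = len(incidents)
    if total == 0:
        return {"total": 0, "backlog": 0, "major": 0, "pct_major": 0.0,
                "pct_within_target": 0.0, "avg_cost": 0.0}
    major = sum(i.major for i in incidents)
    on_time = sum(i.response_minutes <= i.target_minutes for i in incidents)
    return {
        "total": total,
        "backlog": sum(i.is_open for i in incidents),
        "major": major,
        "pct_major": 100 * major / total,
        "pct_within_target": 100 * on_time / total,
        "avg_cost": sum(i.cost for i in incidents) / total,
    }
```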
-
Challenges, Critical Success Factors and risks
-
Challenges
- Ability to detect incidents as early as possible
- Convincing all staff that all incidents must be logged and kept up-to-date
- Availability of information about problems and known errors
- Integration into the CMS
- Integration into the SLM
-
CSF's
- A good SD is key to successful IM
- Clearly defined targets to work to, as defined in SLAs
- Adequately customer-oriented and technically trained support staff with the correct skill levels
- Integrated support tools to drive and control the process
- OLAs and UCs that are capable of influencing and shaping the correct behaviour of all support staff
-
Risks
- Opposite of above mentioned Challenges and CSF's
-
Request Fulfilment
-
Purpose/goal/objective
- To provide a channel for users to request and receive standard services for which a pre-defined approval and qualification process exists
- To provide information to users and customers about the availability of services and the procedure for obtaining them
- To source and deliver the components of requested standard services (e.g. licences and software media)
- To assist with general information, complaints or comments
-
Scope
- A service request is usually something that can and should be planned, whereas an incident is usually an unplanned event
-
Value to Business
- To provide quick and effective access to standard services
- Effective use of RF reduces the bureaucracy involved in requesting and receiving access to existing or new services
- RF is a lean process that can help reduce costs.
-
Policies/Principles/basic concepts
- Many Service Requests will be recurring, so a predefined process flow (a model) is advised (with e.g. timescales, SLAs, escalation paths)
- Ownership of service requests resides with the SD
-
Process activities, methods and techniques
- Menu selection
- Financial approval
- Other approval
- Fulfilment
- Closure
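The activities above form a fixed sequence in which the approval steps apply only to some requests. A minimal sketch, assuming two hypothetical flags that say which approvals a given request type needs:

```python
from typing import List


def request_path(needs_financial_approval: bool,
                 needs_other_approval: bool) -> List[str]:
    """Return the activities a service request passes through, in order."""
    path = ["menu selection"]
    if needs_financial_approval:
        path.append("financial approval")
    if needs_other_approval:
        path.append("other approval")
    path += ["fulfilment", "closure"]
    return path
```

A request that needs no approval goes straight from menu selection to fulfilment and closure.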
-
Triggers, input and output/inter-process interfaces
- SD/IM: many Service Requests may come in via the SD and may initially be handled through the IM process
- A strong link is also needed between RF and Release, Asset and Configuration Management
-
Information Management
- Service Request (with information about the kind of request, by whom, for whom, by what process)
- RFC, most typical if the SR relates to a CI
- Service Portfolio, to enable the scope of agreed SR to be identified
- Security policies
-
Challenges, Critical Success Factors and risks
-
Challenges
- Clearly defining and documenting the type of requests that will be handled within the RF process.
- Establishing self-help front-end capabilities that allow the users to interface successfully with the RF process
-
CSF
- Agreement of what services will be standardized and who is authorized to request them (cost must also be agreed upon)
- Publication of the services to users as part of the Service Catalogue
- Definition of standard fulfilment procedure for each of the services being requested.
- A SPOC which can be used to request the services, mostly the SD
- Self-service tools needed to provide a front-end interface to the users. It is essential that these integrate with the back-end fulfilment tools
-
Risks
- Poorly defined scope, where people are unclear about exactly what the process is expected to handle
- Poorly designed or implemented user interfaces so that users have difficulty raising a request
- Badly designed or operated back-end fulfilment processes that are unable to deal with volume or nature of requests
- Inadequate monitoring capabilities so that accurate metrics cannot be gathered
-
Problem Management
-
Purpose/goal/objective
- Responsible for managing the lifecycle of all problems
- Objective is to prevent problems and resulting incidents from happening
- To eliminate recurring incidents
- Minimize the impact of incidents that cannot be prevented
-
Scope
- Activities required to diagnose the root cause of incidents and determine the resolution
- Responsible for ensuring that the resolution is implemented through the appropriate control procedures (e.g. CM and Rel Mgt)
- PM maintains information about problems and workarounds
-
Value to Business
- Higher availability of IT services
- Higher productivity of business and IT staff
- Reduced expenditure on workarounds or fixes that do not work
- Reduction in cost of effort in fire-fighting or resolving repeat incidents
-
Policies/Principles/basic concepts
- Problem models are important; see incident models
-
Process activities, methods and techniques
- Reactive
- Proactive
- Problem detection
- Problem logging
- Problem categorization
- Problem prioritization
-
Problem investigation and diagnosis
- Chronological analysis
- Pain Value analysis
- Kepner and Tregoe
- Brainstorming
- Ishikawa diagrams (diagram as a result of brainstorming)
- Pareto analysis (ranking chart with most likely cause at top, less likely or trivial at bottom)
- Workarounds
- Raising a known error record
- Problem resolution (if RFC needed: Emergency RFC and ECAB!)
- Problem closure
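Of the techniques listed above, Pareto analysis is the easiest to show in code: count how often each cause occurs, rank the causes, and add a cumulative percentage. The function name `pareto` and the sample causes in the usage note are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto(causes: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Rank causes by frequency, with the cumulative percentage per row."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows, running = [], 0
    for cause, n in counts.most_common():  # most frequent cause first
        running += n
        rows.append((cause, n, round(100 * running / total, 1)))
    return rows
```

For ten incidents with five caused by "disk", three by "network" and two by "config", the ranking starts with "disk" at a cumulative 50%, then "network" at 80%, then "config" at 100%.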
-
Major problem review
- Examine things that went well
- Examine things that went wrong
- What could be done better in the future
- How to prevent recurrence
- Whether there has been any third-party responsibility and whether follow-up actions are needed
-
Errors detected in the development environment
- When an application is released but is not error-free, workarounds etc. should be available in the KEDB. This should be tested!
-
Triggers, input and output/inter-process interfaces
- ST - Change mgt
- ST - Config mgt
- ST - R&D mgt
- SD - Availability mgt
- SD - Cap. mgt
- SD - IT service Continuity
- CSI - SLM
- SS - Financial mgt
-
Information Management
- CMS - details of components and their relationships, previous actions are recorded
- KEDB
-
Metrics
- Total number of problems recorded in a period of time
- Percentage of solved & not-solved within SLA
- Average cost of a problem
- Number of major problems
- Percentage of accuracy of the KEDB
-
Challenges, Critical Success Factors and risks
- Linking IM & PM tools
- The ability to relate incidents and Problem records
- The second- and third-line staff should have a good working relationship with staff on the first line
- Making sure that business impact is well understood by all staff working on Problem resolution
-
Access Management
-
Purpose/goal/objective
- Provides the right for users to be able to use a service or group of services.
- It is therefore the execution of policies and actions defined in Security and Availability mgt
-
Scope
- It enables the organization to manage the confidentiality, availability and integrity of its data and intellectual property
- Ensuring that users are given the right to use a service
- AM does not itself ensure that this access is available at all agreed times; that is provided by Availability Management
-
Value to Business
- Controlled access to services ensures that the organization is able to maintain the confidentiality of its information more effectively
- Employees have the right level of access to execute their jobs effectively
- Less likelihood of errors being made in data entry or in the use of a critical service by an unskilled user
- The ability to audit use of services and to trace the abuse of services
- The ability to revoke access rights more easily when needed - an important security consideration
- May be needed for regulatory compliance (e.g. SOX, COBIT)
-
Policies/Principles/basic concepts
- Access - level and extent of a service's functionality or data that a user is entitled to use
- Identity - to make sure it's the right user
- Rights - actual settings whereby a user is provided access to a service or group of services
- Services or service groups - e.g. basic set of applications
- Directory services - specific type of tool that is used to manage access and rights
-
Process activities, methods and techniques
- Requesting access
- Verification - is the user who they claim to be and do they have a legitimate reason for wanting access
- Providing rights - look out for Role Conflict situations!
- Monitoring identity status - a change may require other privileges, e.g. job changes, promotion, transfer, resignation, death
- Logging and tracking access - to detect abuse of access rights
- Removing or restricting rights
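The Role Conflict check mentioned under "Providing rights" can be sketched as a lookup against a list of role pairs that should not be combined. The role names and the `CONFLICTING_ROLES` set are hypothetical; in practice the list comes from the security policy.

```python
from typing import FrozenSet, Iterable, List, Set

# Hypothetical pairs of roles that one user should not hold together.
CONFLICTING_ROLES: Set[FrozenSet[str]] = {
    frozenset({"create_payment", "approve_payment"}),
}


def role_conflicts(current_roles: Iterable[str], requested_role: str,
                   conflicts: Set[FrozenSet[str]] = CONFLICTING_ROLES) -> List[str]:
    """Return the roles already held that conflict with the requested one."""
    return sorted(role for role in current_roles
                  if frozenset({role, requested_role}) in conflicts)
```

An empty result means the requested role can be granted without creating a conflict; a non-empty result lists the roles that need review first.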
-
Triggers, input and output/inter-process interfaces
- RFC
- Service request
- Request from HRM
- Request from manager of department
-
Information Management
- Identity - identifying a user
- Users, groups, roles and service groups
-
Metrics
- Number of requests for access
- Instances of access granted
- Number of incidents requiring a reset of access rights
- Number of incidents caused by incorrect access settings
-
Challenges, Critical Success Factors and risks
- The ability to verify the identity of a user
- The ability to verify the identity of the approving person
- The ability to verify that a user qualifies for access to a specific service
- The ability to link multiple access rights to an individual user
- The ability to manage changes to a user's access requirements
- A database of all users and the rights that they have been granted
- The ability to determine the status of the user at any time (are they still an employee at the moment they log on?)
-
Operational activities of processes covered in other lifecycle phases
-
Change Management
- Raising and submitting RFC's as needed to address SO issues
- Implementing changes as directed by CM where they involve SO components or services
- Helping define and maintain change models relating to SO components or services
- Using the CM process for standard, operational-type changes
-
Configuration Management
- Informing config mgt of any discrepancies found between any CIs and the CMS
- Making any amendments necessary to correct any discrepancies under the authority of Config Mgt
-
Release & Deployment
- Actual implementation actions regarding the deployment of new releases, under the direction of R&D mgt
- Participation in the planning stages of major new releases to advise on SO issues
- The physical handling of CIs from/to the DML as required to fulfil their operational roles
-
Capacity Management
- Capacity & performance monitoring.
- Handling capacity- or performance-related incidents
- Capacity and Performance trends
- Storage of Capacity mgt data
- Demand mgt e.g. increasing or decreasing privileges to use a service
-
Workload mgt
- Rescheduling a service (out of peak times)
- Moving a service or workload (e.g. another location)
- Virtualization
- Modelling and application sizing (role e.g. evaluating and feeding back discrepancies)
- Capacity planning
-
Availability Management
- e.g. collecting data
- Review of maintenance activities
- Major problem reviews
- Involvement in specific initiatives
-
Knowledge Management
- Properly gathering and storing data
-
Financial Management for IT services
- Proper planning of resources
- Charging of incidents and changes
-
IT Service Continuity Management
- Risk assessment - using the knowledge of the infrastructure and techniques
- Assistance in writing the actual recovery plans for systems and services under its control
- Participation in testing plans