-
Event occurs
-
Events occur continuously
- It is important to understand which events need to be captured.
-
Many CIs are configured to generate a standard set of events
- based on the designers' experience
-
Event Notification
-
Communication of events
- A device is interrogated by a management tool for some specific data. This is called "Polling".
- A CI generates a notification when certain conditions are met.
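The two communication styles above can be contrasted in a minimal sketch. The `Device` class, its CPU metric and the 90% threshold are hypothetical and exist only for illustration; a real CI would expose this through its own management interface.

```python
from typing import Callable, Dict, List


class Device:
    """Hypothetical CI that supports both polling and notification."""

    def __init__(self, name: str, cpu_threshold: float) -> None:
        self.name = name
        self.cpu_threshold = cpu_threshold
        self.cpu = 0.0
        self._listeners: List[Callable[[Dict], None]] = []

    def read_cpu(self) -> float:
        # Polling: the management tool interrogates the device for a value.
        return self.cpu

    def subscribe(self, listener: Callable[[Dict], None]) -> None:
        # Register a management tool that wants to receive notifications.
        self._listeners.append(listener)

    def set_cpu(self, value: float) -> None:
        # Notification: the CI itself raises an event when its condition is met.
        self.cpu = value
        if value >= self.cpu_threshold:
            event = {"source": self.name, "metric": "cpu", "value": value}
            for listener in self._listeners:
                listener(event)


received: List[Dict] = []
device = Device("server-01", cpu_threshold=90.0)
device.subscribe(received.append)

device.set_cpu(50.0)        # below the threshold: no notification is sent
polled = device.read_cpu()  # the tool polls and reads the current value
device.set_cpu(95.0)        # threshold met: the CI notifies its listeners
```

With polling the management tool decides when to ask; with notification the CI decides when to speak.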
-
Communication Standard
-
Can be proprietary
- Only the manufacturer's management tools can be used to receive notifications
-
Can be open standard
- Can use an open standard such as Simple Network Management Protocol (SNMP) or XML
-
Designing of notification
- Sometimes an agent needs to be installed for monitoring
-
The Service Design phase should define which events need to be generated for monitoring
-
These should be tested during Service Transition
- However, many organisations do this by trial and error
- System managers use a standard set of events as a starting point and tune the CIs over time
- Problems with this approach
- 1. It lacks planning and misses improvement opportunities
- 2. It makes it difficult to manage services across all devices and staff
- A better general approach is to include meaningful data in each event and to define a clear target audience, so that better decisions can be made about events
- Meaningful data and clearly defined roles and responsibilities should be documented during the Service Design and Service Transition phases.
- If roles and responsibilities are not clearly defined, then during a wide alert no one knows who is doing what
- This results in events being missed or duplicated.
-
Event Detection
-
Once an event has been notified, it must be detected, read and interpreted to establish its meaning
- It can be detected either by an agent running on the same system
- or by a management tool running somewhere else.
-
Event Logged
- There should be a record of events and subsequent actions
-
Can be logged in an event management tool
-
or can be left as an entry in the system log
- In this case, there should be a standing order for the appropriate operations management staff to check the log on a regular basis, and clear instructions on how to use each log.
- The event management procedures for each system should define how long event logs are kept before being archived and deleted
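A retention rule of this kind can be sketched as a small function. The 30-day and 365-day periods are assumptions made for the example; the actual values belong in the event management procedure for each system.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods, not taken from any standard.
ARCHIVE_AFTER = timedelta(days=30)
DELETE_AFTER = timedelta(days=365)


def retention_action(logged_at: datetime, now: datetime) -> str:
    """Return what should happen to a log entry of the given age."""
    age = now - logged_at
    if age >= DELETE_AFTER:
        return "delete"
    if age >= ARCHIVE_AFTER:
        return "archive"
    return "keep"


now = datetime(2024, 6, 1)
recent = retention_action(datetime(2024, 5, 20), now)  # 12 days old
older = retention_action(datetime(2024, 3, 1), now)    # about 3 months old
ancient = retention_action(datetime(2023, 1, 1), now)  # more than a year old
```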
-
Event information logs may not be meaningful until an incident occurs
- Technical management staff use the logs to investigate where the incident originated.
-
1st level event correlation & filtering
-
Purpose
- To correlate and filter events
- Decide whether an event needs to be communicated to the event management tool or can be ignored
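As an illustration of the forward-or-ignore decision, the sketch below splits a batch of events in two. The event types in `IGNORED_TYPES` are hypothetical; which events may safely be ignored is a design decision for each CI.

```python
from typing import List, Tuple

# Hypothetical event types that never need to reach the management tool.
IGNORED_TYPES = {"heartbeat", "debug_trace"}


def first_level_filter(events: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split events into those forwarded to the tool and those ignored."""
    forwarded: List[dict] = []
    ignored: List[dict] = []
    for event in events:
        if event["type"] in IGNORED_TYPES:
            ignored.append(event)
        else:
            forwarded.append(event)
    return forwarded, ignored


events = [
    {"type": "heartbeat", "ci": "router-01"},
    {"type": "disk_threshold", "ci": "server-01"},
    {"type": "debug_trace", "ci": "server-01"},
]
forwarded, ignored = first_level_filter(events)
```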
-
Significance of Events
-
Informational
- These events do not require any action
- Kept for a predetermined period
- Usually used to check the status of a device
- Usually recorded in a log file. Examples: a user logs on to an application, a device has come online, a transaction has completed successfully
-
Warning
- These events are generated when a service or a device has reached a threshold
- Indicates a situation that must be checked so that appropriate action can be taken
-
Warnings are not typically raised for a device failure. Examples: the collision rate of a network has increased by 15% over the past hour; memory utilization of a server has reached 80%.
- There is some debate over whether the failure of a redundant device should be treated as a warning or an exception
- A good rule is: every failure should be treated as an exception, because the risk of an incident impacting the business is much greater once redundancy is lost
-
Exception
- These events are generated when a service or a device is operating abnormally
- Typically this means an OLA or SLA has been breached
- The business has been impacted
- Can represent a total failure
- These events are managed by raising an incident record or an RFC, or both. Examples: a server is down; more than 150 users have logged on to the general ledger concurrently
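The three significance levels can be expressed as a small classifier. This is only a sketch: the event fields (`failed`, `memory_utilization`) are hypothetical, the 80% threshold is borrowed from the memory example above, and the rule that every failure is an exception follows the "good rule" stated in the Warning section.

```python
from enum import Enum


class Significance(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


# Threshold borrowed from the memory utilization example; an assumption here.
MEMORY_WARNING_THRESHOLD = 80.0


def classify(event: dict) -> Significance:
    """Assign a significance level to a single event."""
    # Every failure is treated as an exception, even for a redundant device.
    if event.get("failed", False):
        return Significance.EXCEPTION
    utilization = event.get("memory_utilization")
    if utilization is not None and utilization >= MEMORY_WARNING_THRESHOLD:
        return Significance.WARNING
    return Significance.INFORMATIONAL


login = classify({"type": "user_login"})
memory = classify({"type": "memory", "memory_utilization": 80.0})
psu = classify({"type": "power_supply", "failed": True, "redundant": True})
```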
-
2nd Level event correlation & filtering
-
Normally done by a correlation engine
-
Part of a management tool
-
Compares events with a set of
- Criteria
- Often called business criteria, although they are fairly technical
- because events may impact business
- Rules can be used to determine business impact
- Rules
-
Example:
- Number of similar events
- Number of CIs generating similar events
- Whether a specific action is associated with the code or data in the events
- Whether the event represents an exception
- A comparison of utilization information in the event with a maximum or minimum standard
- Whether further action is required
- Categorisation of events
- Assigning a priority level to the events
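A few of the example rules above (number of similar events, number of CIs generating them, whether further action is required, and a priority level) can be sketched as follows. The limits and the `code` and `ci` field names are hypothetical; a real correlation engine would load its rules from configuration.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set

# Hypothetical limits above which further action is required.
MAX_SIMILAR_EVENTS = 3
MAX_AFFECTED_CIS = 2


def correlate(events: List[dict]) -> Dict[str, dict]:
    """Group events by code and apply simple count-based rules."""
    counts: DefaultDict[str, int] = defaultdict(int)
    cis: DefaultDict[str, Set[str]] = defaultdict(set)
    for event in events:
        counts[event["code"]] += 1
        cis[event["code"]].add(event["ci"])

    result: Dict[str, dict] = {}
    for code, count in counts.items():
        affected = len(cis[code])
        action = count >= MAX_SIMILAR_EVENTS or affected >= MAX_AFFECTED_CIS
        result[code] = {
            "events": count,
            "cis": affected,
            "further_action": action,
            "priority": "high" if action else "low",
        }
    return result


summary = correlate([
    {"code": "LINK_DOWN", "ci": "switch-01"},
    {"code": "LINK_DOWN", "ci": "switch-02"},
    {"code": "USER_LOGIN", "ci": "app-01"},
])
```

Here `LINK_DOWN` is flagged because two different CIs reported it, while the single `USER_LOGIN` event needs no further action.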
-
Further action required?
-
If the correlation engine recognizes an event
-
Further action is required, for example
- Initiating the Incident Management process
- Initiating the Change Management process
- Checking with Change Management whether a change has caused the event
- Executing a script for specific types of events
- Notifying a person of the event via mobile phone
-
Response Selection
-
Auto response
- Responses are already identified
- Result of good Service Design or Problem Management
-
The response initiates the action and then evaluates whether it completed successfully.
- If not, an incident or problem record is generated.
- Examples: rebooting a device, or locking a device to protect against unauthorised access
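The "act, then evaluate, then escalate on failure" pattern can be sketched like this. The response functions and the in-memory `incident_records` list are stand-ins invented for the example; a real tool would call the device and the service desk system.

```python
from typing import Callable, List

# Stand-in for the incident management system (hypothetical).
incident_records: List[dict] = []


def auto_respond(event: dict, response: Callable[[dict], bool]) -> str:
    """Run a predefined response and open an incident if it fails."""
    try:
        succeeded = response(event)
    except Exception:
        # A crashing response script counts as an unsuccessful response.
        succeeded = False
    if succeeded:
        return "resolved"
    incident_records.append({"event": event, "reason": "auto response failed"})
    return "incident_opened"


def reboot_succeeds(event: dict) -> bool:
    return True  # pretend the device came back online


def reboot_fails(event: dict) -> bool:
    return False  # pretend the device did not come back online


first = auto_respond({"ci": "router-01", "code": "HUNG"}, reboot_succeeds)
second = auto_respond({"ci": "router-02", "code": "HUNG"}, reboot_fails)
```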
-
Alert and Human Intervention
- If an event requires human intervention, it needs to be escalated
- An appropriately skilled person or team will handle the event
- The event should contain all the information that person or team needs to determine the correct action
- Example: changing a toner cartridge in a printer when the level is low
-
Incident, Problem or Change
- Some events may need action through the Incident, Problem or Change Management process
-
Open an RFC
-
There are two places where an RFC can be created
- When an exception occurs
- For example, when it is identified that an unauthorised change has happened
- Open an RFC
- Investigate the unauthorised change properly
- An unauthorised change implies that the Change Management process is not effective
- When the correlation engine determines that a change is required
- This is determined at the Service Design stage
- Or it has happened before and Problem Management has updated the correlation engine to take this action
-
Open an Incident record
-
An incident can be generated when an exception is detected
- When an incident record is opened, as much information as possible should be included
- with links to the events concerned
- and, if possible, a completed diagnostic script
-
Open or link to a Problem record
- It is rare for a problem record to be opened without a related incident
- This step usually refers to linking an incident to an existing problem record
- This helps the Problem Management team reassess the severity and impact of the problem
- This may result in a changed priority for an outstanding problem
- This also supports root cause analysis
-
Special types of Incident
-
In some cases, an event will indicate an exception that does not directly impact any IT services
- Examples: a redundant air conditioning (AC) unit fails
- or an unauthorised person has entered the data centre
- An incident should be logged using an appropriate incident model
- The incident should be escalated to the specific group that manages that type of incident
- This type of incident should state that it is an operational issue rather than a service issue
- No outage has occurred
- It should not be used to calculate downtime
- It can be used to demonstrate how proactive IT has been in keeping services available
- Some events may require action through a combination of these three processes
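The reporting rule for operational incidents can be illustrated with a short calculation. The `kind` and `minutes` fields are assumptions made for this sketch, not fields of any particular service desk tool.

```python
from typing import List


def downtime_minutes(incidents: List[dict]) -> int:
    """Sum outage time, counting service incidents only."""
    return sum(i["minutes"] for i in incidents if i["kind"] == "service")


def proactive_count(incidents: List[dict]) -> int:
    """Count operational incidents handled without a service outage."""
    return sum(1 for i in incidents if i["kind"] == "operational")


incidents = [
    {"kind": "service", "minutes": 45},       # server down
    {"kind": "operational", "minutes": 120},  # redundant AC unit failed
    {"kind": "service", "minutes": 15},       # application outage
]
total_downtime = downtime_minutes(incidents)
handled_proactively = proactive_count(incidents)
```

The 120 minutes spent on the operational incident are left out of the downtime figure but still show up as proactive work.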
-
Review Actions
-
It is important to check that any significant events or exceptions have been handled appropriately
- In many cases, this is done automatically
-
The review is intended to ensure that any change, incident or problem record created from an event is handled as expected, and is neither duplicated nor lost.
- A proper design should be identified in the Design Phase
- The review should also be used as an input to the Continual Improvement process
-
Close event
-
Informational events are simply logged
- Used as an input for other processes
-
Sometimes it is very difficult to relate open and closed events if they are in different formats
- It is suggested that devices in the infrastructure produce events in the same format
- Events that generated an incident, problem or change should be formally closed with a link to the appropriate record from the other process
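A common event format and the closure rule can be sketched together. The `Event` fields and the record identifier `INC-1001` are hypothetical; the only behaviour taken from the notes is that an event which raised an incident, problem or change must carry a link to that record when it is closed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """Hypothetical common format shared by all devices."""
    event_id: str
    source: str
    significance: str
    raised_record: bool = False
    linked_record: Optional[str] = None
    status: str = "open"


def close_event(event: Event, record_id: Optional[str] = None) -> Event:
    """Close an event, insisting on a link if it raised a record."""
    if event.raised_record:
        if record_id is None:
            raise ValueError("a linked record is required to close this event")
        event.linked_record = record_id
    event.status = "closed"
    return event


info = close_event(Event("e1", "app-01", "informational"))
failure = close_event(
    Event("e2", "server-01", "exception", raised_record=True),
    record_id="INC-1001",
)
```

Because every event shares the same fields, an open event and its closing counterpart can be matched on `event_id` regardless of which device produced them.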