It is important to understand which events need to be captured.
Many CIs are configured to generate a standard set of events,
based on the designers' experience
Event Notification
Communication of events
A device is interrogated by a management tool for some specific data. This is called "Polling".
A CI generates a notification when certain conditions are met.
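The two communication styles above can be contrasted in a small sketch. The `Device` class, its `threshold` field and the `poll` helper are hypothetical names invented for this illustration; they do not correspond to any particular management tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Device:
    """Hypothetical CI exposing one metric and an optional notification hook."""
    name: str
    cpu_percent: float = 0.0
    threshold: float = 90.0  # assumed condition for raising a notification
    listeners: List[Callable[[str], None]] = field(default_factory=list)

    def read_cpu(self) -> float:
        # Polling: the management tool asks, the device answers.
        return self.cpu_percent

    def set_cpu(self, value: float) -> None:
        # Notification: the device itself reports when the condition is met.
        self.cpu_percent = value
        if value >= self.threshold:
            for notify in self.listeners:
                notify(f"{self.name}: cpu at {value:.0f}%")


def poll(devices: List[Device]) -> Dict[str, float]:
    """One polling cycle: interrogate every device for a specific value."""
    return {d.name: d.read_cpu() for d in devices}


received: List[str] = []
router = Device("router-1", listeners=[received.append])

router.set_cpu(40.0)       # below the threshold, so no notification is sent
snapshot = poll([router])  # polling still sees the current value
router.set_cpu(95.0)       # threshold met, so the device notifies its listeners
```

Polling gives the tool control over when data is collected, while notification leaves that decision to the CI and only produces traffic when a condition is met.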
Communication Standard
Can be proprietary
Only the manufacturer's management tools can be used to receive notifications
Can be an open standard
For example Simple Network Management Protocol (SNMP) or XML
Designing of notification
Sometimes an agent needs to be installed for monitoring
The Service Design phase should define which events need to be generated for monitoring
These should be tested during Service Transition
However, many organisations do this by trial and error
System managers use a standard set of events as a starting point and tune the CIs over time
Problems in this approach
1. It lacks planning and misses improvement opportunities
2. It makes it difficult to manage all services and their components
A better general approach is to include meaningful data and a clear target audience in each event, so that better decisions can be made about it
Meaningful data and clearly defined roles and responsibilities should be documented during the Service Design and Service Transition phases.
If roles and responsibilities are not clearly defined, then during a wide alert no one knows who is doing what
This results in events being missed or effort being duplicated.
Event Detection
Once an event notification has been generated, it should be detected, read and interpreted to determine the meaning of the event
It can be detected either by an agent running on the same system
or by a management tool running somewhere else.
Event Logged
There should be a record of events and subsequent actions
Can be logged in an event management tool
or can be left as an entry in the system log
In this case, there should be a standing order for the appropriate operations management staff to check the log on a regular basis, and clear instructions on how to use each log.
The event management procedure for each system needs to define how long event logs are kept before being archived and deleted
Event information in logs may not be meaningful until an incident occurs
Technical management staff use the logs to investigate where the incident originated.
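A retention rule of the kind described above could be expressed as follows. This is only a sketch: the 30-day and 365-day periods and the function name `apply_retention` are assumptions chosen for the example, and the real values belong in the event management procedure for each system.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

LogEntry = Tuple[datetime, str]  # (timestamp, message)


def apply_retention(
    entries: List[LogEntry],
    now: datetime,
    online_days: int = 30,    # assumed period to keep entries in the live log
    archive_days: int = 365,  # assumed period after which entries are deleted
) -> Tuple[List[LogEntry], List[LogEntry], List[LogEntry]]:
    """Split log entries into (keep online, move to archive, delete)."""
    online, archive, expired = [], [], []
    for timestamp, message in entries:
        age = now - timestamp
        if age <= timedelta(days=online_days):
            online.append((timestamp, message))
        elif age <= timedelta(days=archive_days):
            archive.append((timestamp, message))
        else:
            expired.append((timestamp, message))
    return online, archive, expired


now = datetime(2024, 1, 1)
log = [
    (now - timedelta(days=10), "user logged on"),
    (now - timedelta(days=100), "device came online"),
    (now - timedelta(days=400), "transaction completed"),
]
online, archive, expired = apply_retention(log, now)
```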
1st Level event correlation & filtering
Purpose
To correlate and filter events
To decide whether an event needs to be communicated to the event management tool or can be ignored
Significance of Events
Informational
These types of events do not require any action
Kept for a predetermined period
Usually used to check the status of a device or service
Usually recorded in a log file. Examples: a user logs on to an application; a device has come online; a transaction has completed successfully
Warning
These types of events are generated when a service or a device has reached a threshold
Indicates that a situation must be checked and appropriate action taken
Warnings are not typically raised for a device failure. Examples: the collision rate of a network has increased by 15% over the past hour; memory utilization of a server has reached 80%
There is debate about whether the failure of a redundant device should be treated as a warning or an exception
A good rule is: every failure should be treated as an exception, because the risk of an incident impacting the business is much greater
Exception
These types of events are generated when a service or a device is operating abnormally
Typically this means that an OLA or SLA has been breached
The business has been impacted
Can represent a total failure
These events are managed by raising an incident record or an RFC, or both. Examples: a server is down; more than 150 users have logged on to the general ledger concurrently
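The three significance levels can be sketched as a small classifier. The event fields used here (`failed`, `value`, `warn_at`, `limit`) are hypothetical and chosen only to mirror the examples in this section; a real tool would define its own event schema.

```python
from enum import Enum


class Significance(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


def classify(event: dict) -> Significance:
    """Assign a significance level to an event (illustrative rules only)."""
    # Rule from the notes: every failure is an exception,
    # even when the failed device is redundant.
    if event.get("failed", False):
        return Significance.EXCEPTION
    value = event.get("value")
    if value is not None:
        # Exceeding a hard limit means the CI is operating abnormally.
        if "limit" in event and value > event["limit"]:
            return Significance.EXCEPTION
        # Reaching a threshold calls for a check before anything breaks.
        if "warn_at" in event and value >= event["warn_at"]:
            return Significance.WARNING
    # Everything else is kept for the record and needs no action.
    return Significance.INFORMATIONAL


logon = classify({"ci": "app-1", "message": "user logged on"})
memory = classify({"ci": "srv-1", "metric": "memory_percent", "value": 80, "warn_at": 80})
ledger = classify({"ci": "general-ledger", "metric": "users", "value": 151, "limit": 150})
redundant = classify({"ci": "psu-b", "failed": True, "redundant": True})
```

Checking `failed` first keeps the "every failure is an exception" rule from being overridden by a threshold that happens to look healthy.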
2nd Level event correlation & filtering
Normally done by a correlation engine
Part of a management tool
Compares events with a set of
Criteria
Called business criteria, although they are fairly technical
because events may impact the business
Rules can be used to determine business impact
Rules
Example:
Number of similar events
Number of CIs generating similar events
Whether a specific action is associated with the code or data in the events
Whether the event represents an exception
A comparison of utilization information in the events with a maximum or minimum standard
Whether further action is required
Categorisation of events
Assigning a priority level to the events
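Two of the example rules above, the number of similar events and the number of CIs generating them, can be sketched as follows. The function name `correlate`, the `code` and `ci` fields, and the threshold values are assumptions made for the illustration, not part of any specific correlation engine.

```python
from collections import defaultdict
from typing import Dict, List


def correlate(events: List[dict], min_events: int = 3, min_cis: int = 2) -> Dict[str, dict]:
    """Group events by code and flag groups that need further action.

    A group is flagged when it has at least `min_events` similar events
    or when at least `min_cis` distinct CIs generated them.
    """
    cis_by_code = defaultdict(list)
    for event in events:
        cis_by_code[event["code"]].append(event["ci"])

    results = {}
    for code, cis in cis_by_code.items():
        distinct = len(set(cis))
        results[code] = {
            "count": len(cis),
            "cis": distinct,
            "action_required": len(cis) >= min_events or distinct >= min_cis,
        }
    return results


events = [
    {"code": "LINK_DOWN", "ci": "sw-1"},
    {"code": "LINK_DOWN", "ci": "sw-2"},
    {"code": "LINK_DOWN", "ci": "sw-1"},
    {"code": "LOGIN", "ci": "app-1"},
]
summary = correlate(events)
```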
Further action required?
If the correlation engine recognizes an event
Further action is required
Initiation of the Incident Management process
Initiation of the Change Management process
Referral to Change Management for investigation if a change has caused the event
Execution of a script for specific types of events
Notification of the event to a person via mobile phone
Response Selection
Auto response
Responses are already identified
Result of good Service Design or Problem Management
The response will initiate action and then evaluate whether completed successfully.
If not, an incident or problem record will be generated.
Example: Rebooting a device or locking a device to protect against unauthorised access
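The auto response pattern (act, evaluate, raise a record on failure) might look like the sketch below. All names here (`auto_respond`, `reboot`, `is_running`, the `state` dictionary) are hypothetical and stand in for whatever the management tool and the device actually provide.

```python
from typing import Callable, Dict, List


def auto_respond(
    event: Dict,
    action: Callable[[Dict], None],
    verify: Callable[[Dict], bool],
    incidents: List[Dict],
) -> bool:
    """Run a predefined response, then evaluate whether it completed successfully."""
    action(event)
    if verify(event):
        return True
    # The response did not complete successfully, so raise an incident record.
    incidents.append({
        "source_event": event["id"],
        "summary": f"auto response failed for {event['ci']}",
    })
    return False


# Hypothetical device state used only for this demonstration.
state = {"kiosk-7": "hung", "kiosk-8": "hung"}
incidents: List[Dict] = []


def reboot(event: Dict) -> None:
    # Only kiosk-7 recovers; kiosk-8 stays hung to show the failure path.
    if event["ci"] == "kiosk-7":
        state[event["ci"]] = "running"


def is_running(event: Dict) -> bool:
    return state[event["ci"]] == "running"


recovered = auto_respond({"id": 1, "ci": "kiosk-7"}, reboot, is_running, incidents)
escalated = auto_respond({"id": 2, "ci": "kiosk-8"}, reboot, is_running, incidents)
```

Keeping the verification step separate from the action makes it explicit that a response only counts as successful once the outcome has been checked.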
Alert and Human Intervention
If an event requires human intervention, it will need to be escalated
An appropriately skilled person or team will handle the event
The alert should contain all the information the skilled person or team needs to determine the correct action
Example: Changing a toner cartridge in a printer when the level is low
Incident, Problem or Change
Some events may need action through the Incident, Problem or Change Management process
Open an RFC
There are two places where an RFC can be created
When an exception occurs
When it is identified that an unauthorised change has happened
Open an RFC
Investigate the unauthorised change properly
An unauthorised change implies that the Change Management process is not effective
The correlation engine determines that a change is required
This has to be determined at the Service Design stage
Or it has happened before and Problem Management has updated the correlation engine to take this action
Open an Incident record
An incident can be generated when an exception is detected
When an incident record is opened, as much information as possible should be included
with links to possible events concerned
if possible, a completed diagnostic script
Open or link to a Problem record
It is rare for a problem record to be opened without a related incident
This step refers to linking an incident to an existing problem record
This will help the problem management teams to reassess the severity and impact of the problem
This may result in a changed priority for an outstanding problem
This will also support root cause analysis
Special types of Incident
In some cases, an event will indicate an exception that does not directly impact any IT services
Examples: a redundant air conditioning (AC) unit fails
an unauthorised person has entered the data centre
An incident should be logged using an appropriate incident model
The incident should be escalated to the specific group that manages that type of incident
This type of incident should state that it is an operational issue rather than a service issue
There is no outage
It should not be used to calculate any downtime
It can be used to demonstrate how proactive IT has been in keeping services available
Some events may require action through a combination of these three processes
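The point that operational incidents must not count towards downtime can be shown with a short calculation. The `kind` and `duration_minutes` fields and the figures below are invented for the illustration; the formula is the usual availability ratio of agreed service time minus downtime over agreed service time.

```python
from typing import List


def availability_percent(agreed_minutes: int, incidents: List[dict]) -> float:
    """Availability over a period, counting only service incidents as downtime."""
    downtime = sum(
        incident["duration_minutes"]
        for incident in incidents
        if incident["kind"] == "service"  # operational incidents cause no outage
    )
    return (agreed_minutes - downtime) / agreed_minutes * 100


incidents = [
    {"kind": "service", "summary": "server down", "duration_minutes": 216},
    # Open for eight hours, but the service stayed up the whole time.
    {"kind": "operational", "summary": "redundant AC unit failed", "duration_minutes": 480},
]

agreed = 30 * 24 * 60  # a 30-day month of agreed service time, in minutes
result = availability_percent(agreed, incidents)
```

If the operational incident were counted by mistake, the reported availability for the same period would drop noticeably even though no user was affected.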
Review Actions
It is important to check that any significant events or exceptions have been handled appropriately
In many cases, this is done automatically
The review is intended to ensure that the expected follow-up on any incident, problem or change created from an event is neither duplicated nor lost.
The review approach should be defined in the Service Design phase
The review should also be used as an input to the Continual Improvement process
Close event
Informational events are simply logged
Used as an input for other processes
Sometimes it is very difficult to relate open and closed events if they are in different formats
It is suggested that devices in the infrastructure produce events in the same format
Events that generated an incident, problem or change should be formally closed with a link to the appropriate record from the other process
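The difficulty of relating open and closed events in different formats can be illustrated by normalising them first. Both input formats below (a `key=value` text line and a dictionary with `host`, `event_id` and `status`) are invented for this sketch, as are all function names.

```python
from typing import Dict, List, Set, Tuple


def from_keyvalue(line: str) -> Dict[str, str]:
    """Parse a hypothetical 'ci=... id=... state=...' text event."""
    fields = dict(part.split("=", 1) for part in line.split())
    return {"ci": fields["ci"], "event_id": fields["id"], "state": fields["state"]}


def from_record(record: dict) -> Dict[str, str]:
    """Convert a hypothetical dictionary-style event to the common format."""
    return {
        "ci": record["host"],
        "event_id": str(record["event_id"]),
        "state": record["status"].lower(),
    }


def still_open(events: List[Dict[str, str]]) -> Set[Tuple[str, str]]:
    """Return (ci, event_id) pairs that were opened but never closed."""
    opened, closed = set(), set()
    for event in events:
        key = (event["ci"], event["event_id"])
        if event["state"] == "open":
            opened.add(key)
        elif event["state"] == "closed":
            closed.add(key)
    return opened - closed


normalised = [
    from_keyvalue("ci=printer-3 id=42 state=open"),
    from_record({"host": "printer-3", "event_id": 42, "status": "CLOSED"}),
    from_keyvalue("ci=srv-1 id=7 state=open"),
]
unresolved = still_open(normalised)
```

Once every source is mapped to one format, matching an open event to its closing event becomes a simple key comparison.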