Process Flow

Event occurs
1. Events occur all the time
  1. It is important to understand which events need to be captured.
2. Many CIs are configured to generate a standard set of events
  1. based on designers' experiences
Event Notification
1. Communication of events
  1. A device is interrogated by a management tool for some specific data. This is called "Polling".
  2. A CI generates a notification when certain conditions are met.
2. Communication Standard
  1. Can be proprietary
    1. Only the manufacturer's management tools can be used to receive notifications
  2. Can be an open standard
    1. For example, Simple Network Management Protocol (SNMP) or XML
3. Designing of notification
  1. Sometimes an agent needs to be installed for monitoring
  2. The Service Design phase should define which events need to be generated for monitoring
    1. These should be tested during Service Transition
      1. However, many organisations do this by trial and error
      2. System managers use a standard set of events as a starting point and tune the CIs over time
      3. Problems with this approach
        1. It lacks planning and misses improvement opportunities
        2. It makes it difficult to manage all services and staff
      4. A general approach should be to include more meaningful data, with a clear target audience, so that better decisions can be made about the events
      5. Meaningful data and clearly defined roles and responsibilities should be documented during the Service Design and Service Transition phases
      6. If roles and responsibilities are not clearly defined, then in a wide alert no one knows who is doing what
      7. This results in events being missed or duplicated
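The two ways of communicating events described under "Communication of events" can be contrasted in a short Python sketch. The names `poll` and `ConfigurationItem`, the threshold and the message text are all hypothetical; real tools would use a protocol such as SNMP rather than plain function calls.

```python
from typing import Callable, Optional


def poll(read_metric: Callable[[], float], threshold: float) -> Optional[str]:
    """Polling: the management tool interrogates the device for a value."""
    value = read_metric()
    if value >= threshold:
        return f"threshold reached: {value}"
    return None


class ConfigurationItem:
    """Notification: the CI itself reports when a condition is met."""

    def __init__(self, threshold: float, notify: Callable[[str], None]) -> None:
        self.threshold = threshold
        self.notify = notify  # hypothetical callback into the management tool

    def update(self, value: float) -> None:
        # The CI decides on its own when a notification is warranted.
        if value >= self.threshold:
            self.notify(f"threshold reached: {value}")
```

With polling the management tool decides when to look; with notification the CI decides when to speak, so the condition has to be configured on the CI itself.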
Event Detection
1. Once an event has been notified, it should be detected, read and interpreted to determine its meaning
  1. It can be detected either by an agent running on the same system
  2. or by a management tool running somewhere else.
Event Logged
1. There should be a record of events and subsequent actions
2. Can be logged in an event management tool
  1. or can be left as an entry in the system log
    1. In this case, there should be a standing order for the appropriate operations management staff to check the log on a regular basis, and clear instructions on how to use each log.
    2. The event management procedure for each system needs to define how long event logs are kept before being archived and deleted
3. Event information logs may not be meaningful until an incident occurs
  1. Technical management staff use the logs to investigate where the incident originated.
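The retention rule above (how long event logs are kept before being archived and deleted) could be expressed as a small function. This is only a sketch: the 30-day and 365-day periods and the name `retention_action` are assumptions, and the real values would come from each system's event management procedure.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods, not taken from any standard.
ARCHIVE_AFTER = timedelta(days=30)
DELETE_AFTER = timedelta(days=365)


def retention_action(logged_at: datetime, now: datetime) -> str:
    """Return 'keep', 'archive' or 'delete' for a log entry of the given age."""
    age = now - logged_at
    if age >= DELETE_AFTER:
        return "delete"
    if age >= ARCHIVE_AFTER:
        return "archive"
    return "keep"
```

Passing `now` explicitly rather than calling `datetime.now()` inside the function keeps the rule easy to test.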
1st Level event correlation & filtering
1. Purpose
  1. To correlate and filter events
  2. To decide whether an event needs to be communicated to the event management tool or can be ignored
2. Significance of Events
  1. Informational
    1. These types of events do not require any action
    2. Kept for a predetermined period
    3. Usually used to record the status of a device
    4. Usually recorded in a log file
    5. Examples: a user logs on to an application; a device has come online; a transaction has completed successfully
  2. Warning
    1. These types of events are generated when a service or a device has reached a threshold
    2. Indicates a situation that must be checked and for which appropriate action must be taken
    3. A warning is not typically raised for a device failure. Examples: the collision rate of a network has increased by 15% over the past hour; memory utilisation of a server has reached 80%
      1. It is debated whether the failure of a redundant device should be treated as a warning or as an exception
      2. A good rule is that every failure should be treated as an exception, because the risk of an incident impacting the business is much greater
  3. Exception
    1. These types of events are generated when a service or a device is operating abnormally
    2. Typically this means that an OLA or SLA has been breached
    3. The business has been impacted
    4. Can represent a total failure
    5. These events are managed by raising an incident record or an RFC, or both. Examples: a server is down; more than 150 users have logged on to the general ledger concurrently
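The three significance levels above can be illustrated with a minimal classifier. The 80% warning threshold echoes the memory utilisation example in the notes; the 95% exception threshold, the function name `classify` and its parameters are assumptions made for this sketch.

```python
from enum import Enum


class Significance(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


# Hypothetical thresholds for a utilisation metric, in percent.
WARNING_THRESHOLD = 80.0
EXCEPTION_THRESHOLD = 95.0


def classify(utilisation_pct: float, device_failed: bool = False) -> Significance:
    """Classify an event by significance.

    Any failure is treated as an exception, following the rule of thumb
    in the notes; otherwise the utilisation value decides.
    """
    if device_failed or utilisation_pct >= EXCEPTION_THRESHOLD:
        return Significance.EXCEPTION
    if utilisation_pct >= WARNING_THRESHOLD:
        return Significance.WARNING
    return Significance.INFORMATIONAL
```

Checking `device_failed` first means a failed redundant device is never downgraded to a warning, whatever its utilisation reading says.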
2nd Level event correlation & filtering
1. Normally done by correlation engine
  1. Part of a management tool
    1. Compares events with a set of criteria and rules
      1. Criteria
        1. Called business criteria, though they are fairly technical
        2. So called because events may impact the business
      2. Rules
        1. Rules can be used to determine the business impact
  2. Examples of what the correlation engine takes into account:
    1. Number of similar events
    2. Number of CIs generating similar events
    3. Whether a specific action is associated with the code or data in the event
    4. Whether the event represents an exception
    5. A comparison of the utilisation information in the event with a maximum or minimum standard
    6. Whether further action is required
    7. Categorisation of the event
    8. Assigning a priority level to the event
2. Further action required?
  1. If the correlation engine recognises an event
    1. Further action is required
      1. Initiation of the Incident Management process
      2. Initiation of the Change Management process
      3. Investigation by Change Management if a change has caused the event
      4. Executing a script for specific types of events
      5. Notifying a person of the event via mobile phone
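Some of the correlation criteria listed above (number of similar events, number of CIs generating them, whether the event is an exception) can be sketched as simple rules. This is not how any particular correlation engine works: the `Event` fields, the two limits and the function `needs_action` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ci: str                  # configuration item that raised the event
    code: str                # events sharing a code are treated as "similar"
    exception: bool = False  # True if the event represents an exception


# Hypothetical rule parameters.
MAX_SIMILAR_EVENTS = 3  # this many similar events require further action
MAX_AFFECTED_CIS = 2    # so does the same code reported by this many CIs


def needs_action(events: list[Event]) -> set[str]:
    """Return the event codes for which further action is required."""
    counts: dict[str, int] = defaultdict(int)
    cis: dict[str, set[str]] = defaultdict(set)
    flagged: set[str] = set()
    for ev in events:
        counts[ev.code] += 1
        cis[ev.code].add(ev.ci)
        if ev.exception:
            flagged.add(ev.code)  # exceptions always require action
    for code, count in counts.items():
        if count >= MAX_SIMILAR_EVENTS or len(cis[code]) >= MAX_AFFECTED_CIS:
            flagged.add(code)
    return flagged
```

The returned codes would then feed the actions listed under "Further action required?", such as opening an incident or running a script.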
Response Selection
1. Auto response
  1. Responses are already identified
  2. Result of good Service Design or Problem Management
  3. The response initiates the action and then evaluates whether it completed successfully.
    1. If not, an incident or problem record will be generated.
  4. Example: Rebooting a device or locking a device to protect against unauthorised access
2. Alert and Human Intervention
  1. If an event requires human intervention, it will need to be escalated
  2. An appropriately skilled person or team will handle the event
  3. The alert should contain all the information the skilled person or team needs to determine the correct action
  4. Example: Changing a toner cartridge in a printer when the level is low
3. Incident, Problem or Change
  1. Some events may need action through the Incident, Problem or Change Management process
  2. Open an RFC
    1. There are two places where an RFC can be created
      1. When an exception occurs
        1. For example, when it is identified that an unauthorised change has happened
        2. Open an RFC
        3. Carry out a proper investigation of the unauthorised change
        4. An unauthorised change implies that the Change Management process is not effective
      2. When the Correlation Engine determines that a change is required
        1. This response has to be determined at the Service Design stage
        2. Or the event has happened before and Problem Management updated the Correlation Engine to take this action
  3. Open an Incident record
    1. An incident can be generated when an exception is detected
      1. When an incident record is opened, as much information as possible should be included
      2. with links to the events concerned
      3. and, if possible, a completed diagnostic script
    2. Open or link to a Problem record
      1. It is rare for a problem record to be opened without a related incident
      2. This step usually refers to linking an incident to an existing Problem record
      3. This helps the Problem Management team to reassess the severity and impact of the problem
      4. This may result in a changed priority for an outstanding problem
      5. This also supports root cause analysis
  4. Special types of Incident
    1. In some cases, an event will indicate an exception that does not directly impact any IT service
      1. Examples: a redundant air-conditioning unit fails; an unauthorised person has entered the data centre
      2. An incident should be logged using an appropriate incident model
      3. The incident should be escalated to the specific group that manages that type of incident
      4. This type of incident should state that it is an operational issue rather than a service issue
        1. There is no outage
        2. It should not be used to calculate any downtime
        3. It can be used to demonstrate how proactive IT has been in keeping the service available
  5. Some events may require action through a combination of these three processes
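The auto response described earlier in this section (initiate an action, evaluate whether it completed, raise a record if not) might look like the following sketch. The function `auto_respond` and its two callbacks are hypothetical; a real tool would call into its own automation and incident management interfaces.

```python
from typing import Callable


def auto_respond(
    action: Callable[[], bool],
    raise_incident: Callable[[str], None],
) -> bool:
    """Run an automated response and check whether it completed successfully.

    If the action reports failure or raises an error, an incident is
    raised instead and False is returned.
    """
    try:
        succeeded = action()
    except Exception as exc:  # any error in the action counts as a failure
        raise_incident(f"auto response failed: {exc}")
        return False
    if not succeeded:
        raise_incident("auto response did not complete successfully")
        return False
    return True
```

For example, `auto_respond(reboot_device, open_incident)` would open an incident only when the hypothetical `reboot_device` callable returns `False` or raises.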
Review Actions
1. It is important to check that any significant events or exceptions have been handled appropriately
  1. In many cases, this is done automatically
2. The review is intended to ensure that the expected follow-up on any incident, problem or change record created from an event is neither duplicated nor lost
  1. A proper review procedure should be defined in the Service Design phase
  2. The review should also be used as an input to the Continual Improvement Process
Close event
1. Informational events are simply logged
  1. Used as an input for other processes
2. Sometimes it is very difficult to relate open and closed events if they are in different formats
  1. It is suggested that all devices in the infrastructure produce events in the same format
3. Events that generated an incident, problem or change should be formally closed with a link to the appropriate record from the other process
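The closure rule above can be sketched as a check that refuses to close an event that generated a record unless a link to that record is present. The `EventRecord` fields and the `close_event` function are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class EventRecord:
    event_id: str
    raised_record: bool = False  # True if an incident, problem or change was generated
    linked_records: list[str] = field(default_factory=list)  # ids of those records
    closed: bool = False


def close_event(event: EventRecord) -> bool:
    """Close the event if allowed; return whether it is now closed."""
    if event.raised_record and not event.linked_records:
        # Formal closure needs a link to the record in the other process.
        return False
    event.closed = True
    return True
```

Informational events have `raised_record` left at `False`, so they close without any link, matching the first point in this section.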