Cloud Monitoring and Logging

Overview
1. Suite of tools for monitoring, logging, and tracking diagnostics
2. Native multi-cloud monitoring of cloud resources
3. Dynamically discovers all cloud resources
4. Find and fix problems faster, often before they affect users
5. More relevant alerts and better signal-to-noise ratio
Cloud Logging
1. Overview
  1. Single repository for log data and events from multiple sources
  2. Collect platform, system and application logs (with agents)
  3. Enables users to store, search and analyse logs
  4. Delivers realtime and batch monitoring
  5. Tight integration with Cloud Monitoring
  6. Export logs for long term storage and analysis
2. Design
  1. Who did what, where and when
  2. Logs are associated with a project; the viewer shows logs for one project at a time
  3. A log entry records a status or event and includes the log name
  4. Logs are a named collection of log entries
  5. Retention depends on type of log
3. Types
  1. Agent Logs
    1. Agent installed on VMs
    2. Records VM system and 3rd party app logs
    3. Incurs charges beyond free tier
  2. Admin Activity Logs
    1. Always on, immutable, no charge
    2. Administrative actions and API calls
    3. 400 days retention
  3. System Event Logs
    1. Always on, immutable, no charge
    2. Cloud system events, e.g. live migration
    3. 400 days retention
  4. Data Access Logs
    1. Logs API calls that create, modify or read user-provided data
    2. Not on by default. Can become large
    3. Charged beyond the free tier
    4. 30 days retention
  5. Access Transparency Logs
    1. Logs actions taken by Google staff when accessing data
    2. Enabled for the entire organization
    3. Enterprise support is needed
    4. 400 days retention
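
The retention periods listed above can be captured in a small lookup. This is only an illustrative sketch: `RETENTION_DAYS` and `is_retained` are hypothetical names, the figures are the ones quoted in these notes, and agent logs are left out because no retention period is given for them here.

```python
# Retention periods (in days) as quoted in the notes above.
# The keys are illustrative labels, not official log IDs.
RETENTION_DAYS = {
    "admin_activity": 400,
    "system_event": 400,
    "data_access": 30,
    "access_transparency": 400,
}


def is_retained(log_type: str, age_days: int) -> bool:
    """Return True if an entry of this age is still within its retention period."""
    return age_days <= RETENTION_DAYS[log_type]


# A 90-day-old entry survives in Admin Activity logs but not in Data Access logs.
admin_kept = is_retained("admin_activity", 90)
data_access_kept = is_retained("data_access", 90)
```

This is why entries that must outlive their retention period are exported to a longer-term destination.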
4. Export
  1. Cloud Storage for long term retention
  2. BigQuery for big data analysis
  3. Pub/Sub for streaming to other sources
  4. Centralized logging integrates with 3rd party products, e.g. Splunk
5. Operations
  1. Requires a project and destination service
  2. Create filter and select log entries to export
  3. Choose a destination: Cloud Storage, BigQuery or Pub/Sub
  4. The filter selects which entries are copied; the destination determines where they go
  5. Only new entries are exported after sink creation
  6. Use Cloud Audit logs to regularly audit organization
  7. Grant roles to a Google group instead of individual users
  8. Use Google groups to grant the multiple roles a job function needs
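
The sink behaviour described above (a filter plus a destination, with only new entries exported) can be modelled in a few lines. This is a toy in-memory model, not the Cloud Logging API: `LogEntry`, `Sink`, `Project` and the sink name are all hypothetical, and a plain Python list stands in for the destination service.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LogEntry:
    log_name: str
    severity: str
    message: str


@dataclass
class Sink:
    """Toy model of an export sink: a filter plus a destination."""
    name: str
    entry_filter: Callable[[LogEntry], bool]
    destination: List[LogEntry] = field(default_factory=list)

    def route(self, entry: LogEntry) -> None:
        # Copy the entry only if it matches the sink's filter.
        if self.entry_filter(entry):
            self.destination.append(entry)


class Project:
    """Holds log entries and forwards each new one to every sink."""

    def __init__(self) -> None:
        self.entries: List[LogEntry] = []
        self.sinks: List[Sink] = []

    def create_sink(self, sink: Sink) -> None:
        # Entries written before this point are not back-filled.
        self.sinks.append(sink)

    def write(self, entry: LogEntry) -> None:
        self.entries.append(entry)
        for sink in self.sinks:
            sink.route(entry)


project = Project()
project.write(LogEntry("syslog", "ERROR", "disk full"))  # written before the sink exists

errors_sink = Sink("errors-to-storage", lambda e: e.severity == "ERROR")
project.create_sink(errors_sink)

project.write(LogEntry("syslog", "INFO", "service started"))
project.write(LogEntry("syslog", "ERROR", "connection refused"))

exported = [e.message for e in errors_sink.destination]
```

In this model only "connection refused" reaches the destination: the earlier error predates the sink, and the INFO entry does not match the filter.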
6. Roles
  1. Admin has full control and can grant access to other users
  2. Viewer can view logs
  3. Writer role lets service accounts write log entries
  4. Configuration Writer can create metrics and export sinks
Cloud Monitoring
1. Overview
  1. Full stack monitoring with health checks, dashboards and alerts
  2. Indicates what is up, down, overloaded
  3. Native monitoring of GCP, AWS and 3rd party applications
  4. Monitors system and application metrics
  5. Easy to view insights with dashboards and alerts
  6. Uptime checks on external applications
  7. Integrates with Cloud Logging, HipChat, PagerDuty etc.
2. Agent
  1. Without the agent, only CPU, network traffic and uptime info is available
  2. The agent accesses additional resource and application service info
  3. Agent is installed on VM
  4. Monitors 3rd party apps
3. Best Practices
  1. Create a single project for Cloud Monitoring
  2. Enable Metric Scope for monitoring resources across projects
  3. Use separate accounts and metric scopes for data and control isolation
  4. Determine monitoring needs in advance
Cloud Trace
1. Helps to understand how long it takes an application to handle incoming requests (latency)
2. Collects latency data from App Engine, HTTP(S) load balancers, and applications using the Cloud Trace API
3. Integrated with App Engine Standard (automatic)
4. Can be installed on Compute and Kubernetes Engine
5. Can be installed on non-GCP environments
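
The kind of latency data a tracer collects can be sketched with a timing context manager. This is a minimal local illustration, not the Cloud Trace API: `LatencyRecorder`, `span` and the request name are hypothetical, and `time.sleep` stands in for real request handling.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List


class LatencyRecorder:
    """Collects per-operation latencies, in seconds."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.samples: Dict[str, List[float]] = {}

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the handler raised.
            elapsed = self._clock() - start
            self.samples.setdefault(name, []).append(elapsed)


recorder = LatencyRecorder()

with recorder.span("GET /orders"):
    time.sleep(0.01)  # stand-in for real request handling

latency = recorder.samples["GET /orders"][0]
```

The clock is injectable so the recorder can be exercised deterministically; a real tracer would also attach trace and span IDs and ship the samples to a backend.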
Alerting
1. Policies
  1. An alerting policy defines the conditions under which a service is considered unhealthy
  2. When the conditions are met, the policy is triggered, opens a new incident and/or sends notifications
  3. A policy belongs to an individual metric scope
2. Conditions
  1. Determines when an alerting policy is triggered
  2. All conditions watch for some metric, behaving in some way, and for some period of time
  3. Describing a condition includes
    1. A metric to be measured
    2. A test for determining when that metric reaches a state of interest
3. Notification channels
  1. How to be notified
  2. Email, PagerDuty, Slack, SMS etc.
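
How a policy, a condition and notification channels fit together can be sketched as follows. This is a simplified model under stated assumptions, not the Cloud Monitoring API: the class names, the "above threshold for a duration" test and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


@dataclass
class Condition:
    threshold: float   # the test: the value must exceed this...
    duration_s: float  # ...continuously for at least this long

    def is_met(self, samples: Sequence[Sample]) -> bool:
        """Samples must be in time order."""
        breach_start = None
        for ts, value in samples:
            if value > self.threshold:
                if breach_start is None:
                    breach_start = ts
                if ts - breach_start >= self.duration_s:
                    return True
            else:
                breach_start = None  # the breach was interrupted
        return False


@dataclass
class AlertingPolicy:
    name: str
    condition: Condition
    channels: List[str]

    def evaluate(self, samples: Sequence[Sample]) -> List[str]:
        # One notification per channel when the condition is met.
        if self.condition.is_met(samples):
            return [f"{c}: incident opened for {self.name}" for c in self.channels]
        return []


policy = AlertingPolicy("high-cpu", Condition(threshold=0.8, duration_s=120), ["email"])

sustained = [(0, 0.9), (60, 0.95), (120, 0.85), (180, 0.4)]
brief_spike = [(0, 0.9), (60, 0.5), (120, 0.9)]

notifications = policy.evaluate(sustained)
no_notifications = policy.evaluate(brief_spike)
```

The sustained breach opens an incident, while the brief spike does not, which is how the duration window keeps noise out of the alerts.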
Cloud Profiler
1. Profiler collects CPU/memory data to optimize performance
2. Profiles resource intensive application components
3. Collects CPU/RAM usage data to identify high resource usage components
Cloud Debugger
1. Use debugger to find and fix errors in production
2. Capture and inspect call stack and local variables in applications (snapshot)
3. Inspect application state without stopping or slowing application
4. Logpoints allow users to inject logging into running services
5. Does not require adding log statements
6. Can be used with or without access to application source code
7. If repo not local, can be hooked into remote Git repo
8. Can be installed on Compute Engine, Kubernetes Engine, App Engine and Cloud Run
Error Reporting
1. Aggregates errors into a single view
2. Displays errors in a time series
3. Simplifies data visualisation by grouping
4. Provides easy access to stack traces
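
The grouping idea can be sketched as follows. This is an illustrative approximation only: the grouping key used here (exception type plus top stack frame) is an assumption, not the service's actual grouping algorithm, and `ErrorEvent` and `group_errors` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ErrorEvent:
    timestamp: int        # seconds since an arbitrary epoch
    exception_type: str
    top_frame: str        # e.g. "orders.py:42"


def group_errors(events: List[ErrorEvent]) -> Dict[Tuple[str, str], List[int]]:
    """Group events by (exception type, top frame); each group is a time series."""
    groups: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for event in events:
        groups[(event.exception_type, event.top_frame)].append(event.timestamp)
    return {key: sorted(times) for key, times in groups.items()}


events = [
    ErrorEvent(30, "KeyError", "orders.py:42"),
    ErrorEvent(10, "KeyError", "orders.py:42"),
    ErrorEvent(20, "TimeoutError", "db.py:7"),
]

grouped = group_errors(events)
```

Three raw events collapse into two groups, each with its own sorted list of occurrence times that could be plotted as a time series.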