-
Overview
- Suite of tools for monitoring, logging, and tracking diagnostics
- Native multi-cloud monitoring of cloud resources
- Dynamically discovers all cloud resources
- Find and fix problems faster, ideally before they affect users
- More relevant alerts and better signal-to-noise ratio
-
Cloud Logging
-
Overview
- Single repository for log data and events from multiple sources
- Collect platform, system and application logs (with agents)
- Enables users to store, search and analyse logs
- Delivers real-time and batch monitoring
- Tight integration with Cloud Monitoring
- Export logs for long term storage and analysis
-
Design
- Who did what, where and when
- Logs are associated with a project; the viewer shows logs for one project at a time
- A log entry records the log name plus a status or an event
- Logs are a named collection of log entries
- Retention depends on type of log
-
Types
-
Agent Logs
- Agent installed on VMs
- Records VM system and 3rd party app logs
- Incurs charges beyond free tier
-
Admin Activity Logs
- Always on, immutable, no charge
- Administrative actions and API calls
- 400 days retention
-
System Event Logs
- Always on, immutable, no charge
- Cloud system events, e.g. live migration
- 400 days retention
-
Data Access Logs
- Logs API calls that create, modify or read user-provided data
- Off by default; can grow large once enabled
- Charged beyond the free tier
- 30 days retention
-
Access Transparency Logs
- Logs actions taken by Google staff when accessing data
- Enabled for the entire organization
- Enterprise support is needed
- 400 days retention
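The retention periods listed above can be captured in a small lookup table. This is only a sketch for working out whether an entry would still be retained: the dictionary keys and the `is_retained` helper are hypothetical names, and the day counts are copied from these notes rather than read from any API. Agent logs are left out because no retention figure is given for them here.

```python
from datetime import datetime, timedelta, timezone

# Retention in days, as stated in the notes above (keys are made-up labels).
RETENTION_DAYS = {
    "admin_activity": 400,
    "system_event": 400,
    "data_access": 30,
    "access_transparency": 400,
}


def is_retained(log_type: str, written_at: datetime, now: datetime) -> bool:
    """Return True if an entry of this type is still within its retention window."""
    return now - written_at <= timedelta(days=RETENTION_DAYS[log_type])


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
written = now - timedelta(days=45)

data_access_kept = is_retained("data_access", written, now)        # False: 45 > 30
admin_activity_kept = is_retained("admin_activity", written, now)  # True: 45 <= 400
```

A 45-day-old Data Access entry has aged out while an Admin Activity entry of the same age has not, which is why exports matter for long-term retention.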
-
Export
- Cloud Storage for long term retention
- BigQuery for big data analysis
- Pub/Sub for streaming to other sources
- Centralized logging integrates with 3rd party products, e.g. Splunk
-
Operations
- Requires a project and destination service
- Create filter and select log entries to export
- Choose a destination: Cloud Storage, BigQuery or Pub/Sub
- The filter decides which entries are copied; the destination decides where they go
- Only new entries are exported after sink creation
- Use Cloud Audit logs to audit the organization regularly
- Grant roles to a Google group instead of individual users
- Use Google groups to grant the multiple roles a job function needs
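The export behaviour described above (a sink is a filter plus a destination, and only entries written after the sink exists are exported) can be illustrated with a toy in-memory model. This is not the Cloud Logging API: `LogEntry`, `Sink` and `Router` are hypothetical classes invented for this sketch, and the destination is just a Python list standing in for Cloud Storage, BigQuery or Pub/Sub.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LogEntry:
    log_name: str
    severity: str
    payload: str


@dataclass
class Sink:
    # Hypothetical stand-in for an export sink: a filter plus a destination.
    name: str
    matches: Callable[[LogEntry], bool]
    destination: List[LogEntry] = field(default_factory=list)


class Router:
    """Toy log router: every newly written entry is offered to every sink."""

    def __init__(self) -> None:
        self.stored: List[LogEntry] = []
        self.sinks: List[Sink] = []

    def add_sink(self, sink: Sink) -> None:
        # Entries already in self.stored are not back-filled into the new sink.
        self.sinks.append(sink)

    def write(self, entry: LogEntry) -> None:
        self.stored.append(entry)
        for sink in self.sinks:
            if sink.matches(entry):
                sink.destination.append(entry)


router = Router()
router.write(LogEntry("syslog", "ERROR", "disk full"))  # written before the sink exists

errors = Sink("errors-to-storage", lambda e: e.severity == "ERROR")
router.add_sink(errors)

router.write(LogEntry("syslog", "INFO", "boot ok"))  # filtered out
router.write(LogEntry("syslog", "ERROR", "oom"))     # matches and is exported

exported = [e.payload for e in errors.destination]
```

Here `exported` holds only the "oom" entry: the earlier "disk full" error matched the filter but predates the sink, so it is never copied.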
-
Roles
- Admin has full control and can grant access to other users
- Viewer can view logs
- Writer gives a member, typically a service account, the ability to write log entries
- Configuration Writer can create metrics and export sinks
-
Cloud Monitoring
-
Overview
- Full stack monitoring with health checks, dashboards and alerts
- Indicates what is up, down, overloaded
- Native monitoring of GCP, AWS and 3rd party applications
- Monitors system and application metrics
- Easy to view insights with dashboards and alerts
- Uptime checks on external applications
- Integrates with Cloud Logging, Hipchat, PagerDuty etc
-
Agent
- Without the agent, only CPU, network traffic and uptime info is available
- The agent collects additional resource and application service info
- Agent is installed on VM
- Monitors 3rd party apps
-
Best Practices
- Create a single project for Cloud Monitoring
- Enable Metric Scope for monitoring resources across projects
- Use separate accounts and metric scopes for data and control isolation
- Determine monitoring needs in advance
-
Cloud Trace
- Helps to understand how long it takes an application to handle incoming requests (latency)
- Collects latency data from AppEngine, HTTP(S) load balancers, and applications using Cloud Trace API
- Integrated with AppEngine Standard (automatic)
- Can be installed on Compute and Kubernetes Engine
- Can be installed on non-GCP environments
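To make "latency data" concrete, the sketch below times nested units of work in the way a tracing library records spans. It is an illustration only and does not use the Cloud Trace API: the `span` helper, the `spans` list and the `handle_request` function are hypothetical, and results stay in memory instead of being sent anywhere.

```python
import time
from contextlib import contextmanager
from typing import Dict, List

# Collected in memory here; a real tracer would export these to a backend.
spans: List[Dict[str, object]] = []


@contextmanager
def span(name: str):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({"name": name, "latency_s": time.perf_counter() - start})


def handle_request() -> str:
    with span("handle_request"):
        with span("db_query"):
            time.sleep(0.01)  # stand-in for real work
        return "ok"


result = handle_request()
```

The inner span finishes first, so it is recorded first; the outer span's latency includes the inner one, which is how a trace shows where request time is spent.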
-
Alerting
-
Policies
- An alerting policy defines the conditions under which a service is considered unhealthy
- When the conditions are met, the policy is triggered, opens a new incident and/or sends notifications
- A policy belongs to an individual metric scope
-
Conditions
- Determines when an alerting policy is triggered
- All conditions watch for some metric, behaving in some way, and for some period of time
-
Describing a condition includes
- A metric to be measured
- A test for determining when that metric reaches a state of interest
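The three parts of a condition (a metric, a test, and a period of time) can be sketched as a small evaluator. This is a simplified model, not how Cloud Monitoring is implemented: the `Condition` class, the `is_triggered` function and the `cpu_utilization` metric name are assumptions made for illustration, and data points are plain `(timestamp_seconds, value)` tuples.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Condition:
    metric: str                    # name of the metric to watch
    test: Callable[[float], bool]  # the state of interest
    duration_s: int                # how long the test must hold continuously


def is_triggered(cond: Condition, points: List[Tuple[int, float]]) -> bool:
    """points: (timestamp_seconds, value) pairs sorted by timestamp."""
    start = None  # timestamp at which the current violating run began
    for ts, value in points:
        if cond.test(value):
            if start is None:
                start = ts
            if ts - start >= cond.duration_s:
                return True
        else:
            start = None  # the run was broken, so start over
    return False


high_cpu = Condition("cpu_utilization", lambda v: v > 0.8, duration_s=120)

spike = [(0, 0.9), (60, 0.5), (120, 0.9), (180, 0.85)]
sustained = [(0, 0.9), (60, 0.95), (120, 0.85)]

spike_fires = is_triggered(high_cpu, spike)          # False: never above 0.8 for 120 s
sustained_fires = is_triggered(high_cpu, sustained)  # True: above 0.8 from t=0 to t=120
```

Requiring the test to hold for a duration is what filters out short spikes and improves the signal-to-noise ratio of alerts.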
-
Notification channels
- How to be notified
- Email, PagerDuty, Slack, SMS etc
-
Cloud Profiler
- Collects CPU and memory usage data from running applications
- Profiles resource-intensive application components
- Identifies the components with the highest resource usage so they can be optimized
-
Cloud Debugger
- Use debugger to find and fix errors in production
- Capture and inspect call stack and local variables in applications (snapshot)
- Inspect application state without stopping or slowing application
- Logpoints allow users to inject logging into running services
- Logpoints do not require adding log statements to the code and redeploying
- Can be used with or without access to application source code
- If the source is not local, Debugger can be connected to a remote Git repository
- Can be installed on Compute Engine, Kubernetes Engine, AppEngine and Cloud Run
-
Error Reporting
- Aggregates errors into a single view
- Displays errors in a time series
- Simplifies data visualisation by grouping
- Provides easy access to stacktrace
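Grouping errors and showing them as a time series can be sketched as follows. The grouping key used here (exception type plus top stack frame) is an assumption chosen for illustration, since these notes do not say how Error Reporting decides that two errors belong together; the sample events and the `group_errors` helper are likewise made up.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# (timestamp_seconds, exception_type, top_stack_frame): made-up sample data
errors: List[Tuple[int, str, str]] = [
    (5, "KeyError", "cart.py:42"),
    (70, "KeyError", "cart.py:42"),
    (75, "TimeoutError", "pay.py:10"),
    (130, "KeyError", "cart.py:42"),
]


def group_errors(events, bucket_s: int = 60) -> Dict[Tuple[str, str], Counter]:
    """Group events by (type, frame) and count occurrences per time bucket."""
    groups: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for ts, exc_type, frame in events:
        groups[(exc_type, frame)][ts // bucket_s] += 1
    return groups


groups = group_errors(errors)
key_error_series = groups[("KeyError", "cart.py:42")]  # one occurrence in buckets 0, 1 and 2
```

Four raw events collapse into two groups, each with its own per-minute count, which is the single aggregated view the bullets above describe.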