1. Overview
    1. Before designing an HA service, understand the characteristics of the application, the filesystem, and the operating system
    2. These characteristics are the basis for the design and can rule out various approaches
    3. Know the impact of each approach on application performance, especially write performance
    4. Determine the service recovery time objective
    5. Understand how quickly the service must recover from a zonal outage, and the SLA requirements
    6. Understand the cost to build a resilient and reliable service architecture
    7. Consider the implications of application-level synchronous and asynchronous replication, which both use two instances of the database and VM
    8. Determine VM instance costs, persistent disk costs, and the cost of maintaining application replication
    9. To achieve high availability with a regional persistent disk, use the same VM instance and persistent disk components, but add a regional persistent disk for the database's mutable data
    10. Regional persistent disks are double the cost per byte compared to zonal persistent disks because they are replicated in two zones
    11. Using regional persistent disks can reduce maintenance costs because the data is automatically replicated to two replicas without the need to maintain application-level replication
    12. Host costs can be reduced further by starting the backup VM on demand during failover rather than maintaining it as a hot standby (a back-of-the-envelope cost sketch follows this list)
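To make the cost comparison above concrete, here is a minimal back-of-the-envelope sketch. All prices and sizes are hypothetical placeholders, not actual Compute Engine pricing; only the ratio (regional persistent disks cost double per byte) comes from the notes above.

```python
# Back-of-the-envelope cost comparison (all prices are hypothetical placeholders).
ZONAL_PD_GB_MONTH = 0.04      # assumed $/GB/month for a zonal persistent disk
REGIONAL_PD_GB_MONTH = 0.08   # double the zonal price per byte (two replicas)
VM_MONTH = 50.0               # assumed $/month for one running VM instance
DISK_GB = 500

# Option A: application-level replication -> two running VMs, two zonal disks,
# plus the (unpriced) operational cost of maintaining the replication itself.
option_a = 2 * VM_MONTH + 2 * DISK_GB * ZONAL_PD_GB_MONTH

# Option B: regional persistent disk with a cold standby -> one running VM,
# one regional disk; the backup VM is started on demand during failover.
option_b = 1 * VM_MONTH + DISK_GB * REGIONAL_PD_GB_MONTH

print(f"app replication: ${option_a:.2f}/month, regional PD: ${option_b:.2f}/month")
```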
  2. Regional persistent disks
    1. Regional persistent disk is a storage option that provides synchronous replication of data between two zones in a region
    2. Regional persistent disks can be a good building block to use when implementing HA services in Compute Engine
    3. The benefit of regional persistent disks is that in the event of a zonal outage, where a VM instance might become unavailable, the disk can be force-attached to a VM instance in a secondary zone in the same region
    4. To perform this task, either start another VM instance in the zone to which the regional persistent disk is being force-attached, or maintain a hot standby VM instance in that zone (a code sketch of the force-attach step follows this list)
    5. A "hot standby" is a running VM instance that is identical to the one in use. The two instances have the same data
    6. The force-attach operation typically executes in less than one minute, which makes a recovery time objective (RTO) of minutes achievable
    7. The total RTO depends not only on the storage failover (the force-attachment of the regional persistent disk), but also on whether a secondary VM instance must be created first, how long the underlying filesystem takes to detect the newly attached disk, the recovery time of the corresponding applications, and other factors
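A minimal sketch of the force-attach step using the google-cloud-compute Python client. The project, region, zone, and resource names are placeholders, and error handling and retries are omitted; treat this as an illustration of the API shape rather than production code.

```python
from google.cloud import compute_v1

def force_attach_regional_disk(project: str, region: str, disk: str,
                               standby_zone: str, standby_vm: str) -> None:
    """Force-attach a regional persistent disk to the standby VM."""
    client = compute_v1.InstancesClient()
    attached_disk = compute_v1.AttachedDisk(
        # Regional disks are addressed by a regional URL, not a zonal one.
        source=f"projects/{project}/regions/{region}/disks/{disk}",
    )
    request = compute_v1.AttachDiskInstanceRequest(
        project=project,
        zone=standby_zone,
        instance=standby_vm,
        attached_disk_resource=attached_disk,
        # force_attach=True attaches the disk even if it is still attached
        # to the VM in the failed zone.
        force_attach=True,
    )
    operation = client.attach_disk(request=request)
    operation.result()  # block until the attach operation completes
```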
  3. HA Database
    1. An application might still become unavailable in case of broader outages, for example, if a whole region becomes unavailable
    2. Depending on your availability needs, you might need to consider cross-regional replication techniques for higher availability
    3. Database HA configurations typically have at least two VM instances:
    4. A primary VM instance in the primary zone
    5. A standby VM instance in a secondary zone
    6. Preferably, both instances are part of one or more managed instance groups
    7. The primary VM instance has at least two persistent disks: a boot disk and a regional persistent disk
    8. The regional persistent disk contains the database data and any other mutable data that must be preserved in another zone in case of an outage (a disk-creation sketch follows this list).
    9. The standby VM instance requires its own separate boot disk so that it can recover from configuration-related outages, such as those caused by an operating system upgrade.
    10. A boot disk cannot be force-attached to another VM during a failover.
    11. The primary and standby VM instances are configured behind a load balancer, with traffic directed to the primary VM based on health check signals.
    12. This configuration is also known as a hot standby.
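As an illustration of the disk component in this architecture, here is a sketch that creates the regional persistent disk replicated across the primary and secondary zones, again using the google-cloud-compute Python client with placeholder names.

```python
from google.cloud import compute_v1

def create_regional_data_disk(project: str, region: str, disk_name: str,
                              zone_a: str, zone_b: str, size_gb: int) -> None:
    """Create a regional persistent disk replicated across two zones."""
    client = compute_v1.RegionDisksClient()
    disk = compute_v1.Disk(
        name=disk_name,
        size_gb=size_gb,
        # The disk is synchronously replicated between these two zones.
        replica_zones=[
            f"projects/{project}/zones/{zone_a}",
            f"projects/{project}/zones/{zone_b}",
        ],
    )
    operation = client.insert(project=project, region=region, disk_resource=disk)
    operation.result()  # block until the disk is created
```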
  4. Health checks
    1. Health checks are implemented by the health check agent
    2. The health check agent resides within the primary and standby VMs, monitors the instances, and communicates with the load balancer to direct traffic (a minimal endpoint sketch follows this list)
    3. This is particularly useful with instance groups
    4. The health check agent syncs with the application-specific regional control plane and makes failover decisions based on control plane behavior
    5. The control plane must be in a zone that differs from the instance whose health it is monitoring
    6. The health check agent itself must be fault-tolerant
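For illustration, here is a minimal health endpoint of the kind such an agent might expose to the load balancer, using only the Python standard library. The is_primary_healthy probe is a hypothetical stand-in for the application-specific check (for example, querying the database).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def is_primary_healthy() -> bool:
    """Hypothetical application-specific probe (e.g. query the database)."""
    return True  # placeholder

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The load balancer treats 200 as healthy and anything else as unhealthy.
        status = 200 if is_primary_healthy() else 503
        self.send_response(status)
        self.end_headers()

if __name__ == "__main__":
    # Serve on the port the load balancer's health check is configured to probe.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```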
  5. Failover
    1. When a failure is detected within a primary VM or database, the application control plane can initiate failover to the standby VM in the secondary zone
    2. During failover, the application control plane force-attaches the regional persistent disk, which is synchronously replicated to the secondary zone, to the standby VM, and the load balancer directs all traffic to that VM based on health check signals (a sketch of this sequence follows this list).
    3. Overall failover latency, excluding failure-detection time, depends on the time required for application initialization and crash recovery
    4. Regional persistent disk adds another key building block for architecting HA solutions by providing disk-level replication.
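Putting the pieces together, here is a hedged sketch of the failover sequence a control plane might run. force_attach_regional_disk is the helper sketched in the regional persistent disk section, and mount_and_start_database is a placeholder for the application-specific recovery step.

```python
def mount_and_start_database(standby_vm: str) -> None:
    """Placeholder: mount the filesystem, run crash recovery, start the database."""
    ...

def fail_over(project: str, region: str, disk: str,
              standby_zone: str, standby_vm: str) -> None:
    # 1. Storage failover: force-attach the synchronously replicated regional
    #    disk to the standby VM (helper sketched earlier).
    force_attach_regional_disk(project, region, disk, standby_zone, standby_vm)

    # 2. Application recovery: this, not the storage step, usually dominates
    #    the overall failover latency.
    mount_and_start_database(standby_vm)

    # 3. Traffic shift: once the standby's health checks pass, the load
    #    balancer redirects traffic to it; no explicit call is required here.
```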
  6. Live Migration
    1. Compute Engine offers live migration to keep virtual machine instances running even when a host system event occurs, such as a software or hardware update
    2. Compute Engine live migrates running instances to another host in the same zone rather than requiring VMs to be rebooted
    3. This allows Google to perform maintenance that is integral to keeping infrastructure protected and reliable without interrupting any VMs
    4. Live migration keeps instances running during regular infrastructure maintenance and upgrades
    5. Live migration keeps instances running through hardware failures, such as memory, CPU, network interface cards, disks, and power, as well as through host OS and BIOS upgrades
    6. For failing hardware, this is done on a best-effort basis
    7. If hardware fails completely or otherwise prevents live migration, the VM crashes and restarts automatically, and a hostError is logged
    8. Live migration keeps instances running during security-related updates that require a quick response
    9. Live migration keeps instances running during system configuration changes, such as changing the size of the host root partition, which stores the host image and packages
    10. Live migration does not change any attributes or properties of the VM itself
    11. The live migration process just transfers a running VM from one host machine to another host machine within the same zone
    12. All VM properties and attributes remain unchanged, including internal and external IP addresses, instance metadata, block storage data and volumes, OS and application state, network settings, and network connections
    13. Instances with GPUs attached cannot be live migrated
    14. They must be set to terminate, and optionally to restart automatically (a scheduling sketch follows this list)
    15. Compute Engine offers a 60-minute notice before a VM instance with a GPU attached is terminated
    16. Compute Engine can also live migrate instances with local SSDs attached, moving the VMs along with their local SSD to a new machine in advance of any planned maintenance
    17. Preemptible instances cannot be live migrated
    18. The maintenance behavior for preemptible instances is always set to TERMINATE by default, and can't be changed
    19. It is not possible to set the automatic restart option for preemptible instances, but they can be restarted manually
    20. If an instance needs to be changed to no longer be preemptible, detach the boot disk from the preemptible instance and attach it to a new instance that is not configured to be preemptible
    21. Alternatively, a snapshot of the boot disk can be created and used to create a new, non-preemptible instance
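A minimal sketch of configuring a VM's maintenance behavior with the google-cloud-compute Python client, with placeholder names. MIGRATE enables live migration; GPU and preemptible instances must use TERMINATE instead, and automatic restart cannot be enabled for preemptible instances.

```python
from google.cloud import compute_v1

def set_maintenance_behavior(project: str, zone: str, instance: str) -> None:
    """Set how the VM responds to host maintenance events."""
    client = compute_v1.InstancesClient()
    scheduling = compute_v1.Scheduling(
        # "MIGRATE" live-migrates the VM during maintenance; instances with
        # GPUs attached and preemptible instances must use "TERMINATE".
        on_host_maintenance="TERMINATE",
        # Restart the VM automatically after termination; this option cannot
        # be enabled for preemptible instances.
        automatic_restart=True,
    )
    operation = client.set_scheduling(
        project=project, zone=zone, instance=instance,
        scheduling_resource=scheduling,
    )
    operation.result()  # block until the scheduling update completes
```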