-
Failover for Internal TCP/UDP Load Balancing
- Configure an internal TCP/UDP load balancer to distribute connections among virtual machine (VM) instances in primary backends, and then, if needed, switch to using failover backends
- Failover provides one method of increasing availability, while also giving you greater control over how to manage your workload when primary backend VMs aren't healthy
- Configuring failover modifies the internal TCP/UDP load balancer's standard traffic distribution algorithm
- By default, when a backend is added to an internal TCP/UDP load balancer's backend service, that backend is a primary backend
- A backend can be designated to be a failover backend when it is added to the load balancer's backend service, or by editing the backend service later
- Failover backends receive connections from the load balancer only after the ratio of healthy primary VMs falls below a configurable failover ratio (see the sketch below)
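As a rough illustration of how these pieces fit together, the sketch below models a backend service with one primary backend, one failover backend, and a failover policy as a plain Python dict. The field names mirror the Compute API's backendServices resource, but the service name, zones, and values are assumptions for illustration, not a real configuration call.

```python
# Minimal sketch: a plain dict that models how failover might be expressed on
# a backend service. The field names (failover, failoverRatio,
# dropTrafficIfUnhealthy, disableConnectionDrainOnFailover) mirror the
# Compute API's backendServices resource, but nothing here calls Google Cloud.
backend_service = {
    "name": "be-ilb",  # hypothetical backend service name
    "backends": [
        # Primary backend: the default role when a backend is added
        {"group": "zones/us-central1-a/instanceGroups/ig-primary", "failover": False},
        # Failover backend: designated at add time or by editing the backend later
        {"group": "zones/us-central1-b/instanceGroups/ig-failover", "failover": True},
    ],
    "failoverPolicy": {
        "failoverRatio": 0.5,             # healthy-primary threshold for failover
        "dropTrafficIfUnhealthy": False,  # keep last-resort distribution (default)
        "disableConnectionDrainOnFailover": False,
    },
}

# A backend's role is read from its failover flag
failover_groups = [b["group"] for b in backend_service["backends"] if b["failover"]]
print(failover_groups)  # ['zones/us-central1-b/instanceGroups/ig-failover']
```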
-
Supported instance groups
- Managed and unmanaged instance groups are supported as backends
- Using managed instance groups with autoscaling and failover might cause the active pool to repeatedly fail over and fail back between the primary and failover backends
- Google Cloud does not prevent users from configuring failover with managed instance groups
-
Backend instance groups and VMs
- The backend instance groups in Internal TCP/UDP Load Balancing are either primary backends or failover backends
- You can designate a backend to be a failover backend when it is added to the backend service or by editing the backend after it is added
- Otherwise, backends are primary by default
- Multiple primary backends and multiple failover backends can be configured in a single internal TCP/UDP load balancer by adding them to the load balancer's backend service
- A primary VM is a member of an instance group defined to be a primary backend
- The VMs in a primary backend participate in the load balancer's active pool, unless the load balancer switches to using its failover backends
- A backup VM is a member of an instance group defined to be a failover backend
- The VMs in a failover backend participate in the load balancer's active pool when primary VMs become unhealthy
- The proportion of primary VMs that must be unhealthy before failover is triggered is configurable (see Failover ratio below)
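The role of a VM follows from the role of its instance group, as the minimal sketch below (with hypothetical group and VM names) makes explicit.

```python
# Minimal sketch: a VM is a primary VM or a backup VM depending on whether its
# instance group was added as a primary or a failover backend. Names are
# hypothetical.
GROUP_ROLE = {
    "ig-primary": "primary",   # added as a primary backend (the default)
    "ig-failover": "backup",   # added as a failover backend
}

VM_GROUP = {
    "vm-a": "ig-primary",
    "vm-b": "ig-primary",
    "vm-c": "ig-failover",
}

def vm_role(vm: str) -> str:
    """A VM's role is inherited from its backend instance group."""
    return GROUP_ROLE[VM_GROUP[vm]]

print({vm: vm_role(vm) for vm in VM_GROUP})
# {'vm-a': 'primary', 'vm-b': 'primary', 'vm-c': 'backup'}
```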
-
Active pool
- The active pool is the collection of backend VMs to which an internal TCP/UDP load balancer sends new connections
- Membership of backend VMs in the active pool is computed automatically, based on which backends are healthy and on the conditions you specify in the failover policy
- The active pool never combines primary VMs and backup VMs
- During failover, the active pool contains only backup VMs
- During normal operation (failback), the active pool contains only primary VMs
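A minimal sketch of the active-pool rule: the pool holds either the primary VMs or the backup VMs, never both at once. How the failed-over state is decided is covered under Failover ratio below; here it is just a boolean input, and the VM names are placeholders.

```python
# Minimal sketch: the active pool never mixes primary and backup VMs.
def active_pool(primary_vms, backup_vms, failed_over: bool):
    """Return the VMs that receive new connections."""
    return list(backup_vms) if failed_over else list(primary_vms)

print(active_pool(["vm-a", "vm-b"], ["vm-c"], failed_over=False))  # ['vm-a', 'vm-b']
print(active_pool(["vm-a", "vm-b"], ["vm-c"], failed_over=True))   # ['vm-c']
```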
-
Failover and failback
- Failover and failback are the automatic processes that switch backend VMs into or out of the load balancer's active pool
- When Google Cloud removes primary VMs from the active pool and adds healthy failover VMs to the active pool, the process is called failover
- When Google Cloud reverses this, the process is called failback
-
Failover policy
- A failover policy is a collection of parameters that Google Cloud uses for failover and failback
- Each internal TCP/UDP load balancer has one failover policy with multiple settings: the failover ratio, whether to drop traffic when all backend VMs are unhealthy, and connection draining on failover and failback
-
Failover ratio
- A configurable failover ratio determines when Google Cloud performs a failover or failback, changing membership in the active pool
- A failover ratio of 1.0 requires that all primary VMs be healthy
- When at least one primary VM becomes unhealthy, Google Cloud performs a failover, moving the backup VMs into the active pool
- A failover ratio of 0.1 requires that at least 10% of the primary VMs be healthy; otherwise, Google Cloud performs a failover
- A failover ratio of 0.0 means that Google Cloud performs a failover only when all the primary VMs are unhealthy
- Failover doesn't happen if at least one primary VM is healthy
- An internal TCP/UDP load balancer distributes connections among VMs in the active pool according to the traffic distribution algorithm
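A small sketch of the failover-ratio thresholds described above. It assumes the rule is "fail over when the fraction of healthy primary VMs drops below the configured ratio, or when no primary VM is healthy" (the latter covers the 0.0 case); the exact boundary behavior is an assumption here.

```python
# Minimal sketch of the failover-ratio decision described above.
def should_fail_over(healthy_primaries: int, total_primaries: int, ratio: float) -> bool:
    if healthy_primaries == 0:
        return True                      # ratio 0.0: fail over only in this case
    return healthy_primaries / total_primaries < ratio

# Worked examples matching the bullets above (10 primary VMs):
print(should_fail_over(9, 10, 1.0))   # True  — ratio 1.0: one unhealthy VM triggers failover
print(should_fail_over(1, 10, 0.1))   # False — ratio 0.1: 10% healthy is still enough
print(should_fail_over(0, 10, 0.1))   # True  — fewer than 10% healthy
print(should_fail_over(1, 10, 0.0))   # False — ratio 0.0: fail over only when none are healthy
```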
-
Dropping traffic when all backend VMs are unhealthy
- By default, when all primary and backup VMs are unhealthy, Google Cloud distributes new connections among all primary VMs as a last resort
- You can instead configure the internal TCP/UDP load balancer to drop new connections when all primary and backup VMs are unhealthy
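A minimal sketch of the two behaviors described above. The flag name follows the Compute API's dropTrafficIfUnhealthy field, but the function is only a local model of the rule, not load balancer code.

```python
# Minimal sketch of the "all backends unhealthy" behavior described above.
def targets_for_new_connections(healthy_active_pool, all_primary_vms,
                                drop_traffic_if_unhealthy: bool):
    if healthy_active_pool:
        return list(healthy_active_pool)  # normal case
    if drop_traffic_if_unhealthy:
        return []                         # drop new connections
    return list(all_primary_vms)          # default last resort: all primary VMs

print(targets_for_new_connections([], ["vm-a", "vm-b"], drop_traffic_if_unhealthy=False))
# ['vm-a', 'vm-b'] — last-resort distribution
print(targets_for_new_connections([], ["vm-a", "vm-b"], drop_traffic_if_unhealthy=True))
# [] — new connections are dropped
```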
-
Connection draining on failover and failback
- Connection draining allows existing TCP sessions to remain active for up to a configurable time period even after backend VMs become unhealthy
-
If the load balancer's protocol is TCP
- By default, connection draining is enabled
- Existing TCP sessions can persist on a backend VM for up to 300 seconds (5 minutes), even if the backend VM becomes unhealthy or isn't in the load balancer's active pool
- You can disable connection draining during failover and failback events
- Disabling connection draining ensures that all TCP sessions, including established ones, are terminated quickly
- Connections to backend VMs might be closed with a TCP reset (RST) packet
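A minimal sketch of how long existing TCP sessions may persist on a VM that has just left the active pool, assuming the 300-second default described above. The flag name follows the Compute API's disableConnectionDrainOnFailover field; the function is only an illustration of the documented behavior.

```python
# Minimal sketch: how long existing TCP sessions may persist on a VM that has
# left the active pool, under the 300-second default drain timeout.
def drain_window_seconds(disable_connection_drain_on_failover: bool,
                         drain_timeout_sec: int = 300) -> int:
    if disable_connection_drain_on_failover:
        return 0                    # sessions are terminated quickly (possibly with RST)
    return drain_timeout_sec        # sessions may persist up to the drain timeout

print(drain_window_seconds(False))  # 300
print(drain_window_seconds(True))   # 0
```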
-
Disabling connection draining on failover and failback is useful for scenarios such as
- Patching backend VMs. Prior to patching, configure your primary VMs to fail health checks so that the load balancer performs a failover (one way to do this is sketched after this list)
- Disabling connection draining ensures that all connections are moved to the backup VMs quickly and in a planned fashion
- This allows users to install updates and restart the primary VMs without existing connections persisting
- After patching, Google Cloud can perform a failback when a sufficient number of primary VMs (as defined by the failover ratio) pass their health check
- If you need to ensure that only one primary VM is the destination for all connections, disable connection draining so that switching from a primary VM to a backup VM does not allow existing connections to persist on both
- This reduces the possibility of data inconsistencies by keeping just one backend VM active at any given time
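For the patching scenario above, one hypothetical way to make primary VMs fail their health checks on purpose is a health endpoint that returns 503 while a local maintenance flag file exists. The port, path, and file name below are placeholders, not anything prescribed by the load balancer.

```python
# Hypothetical sketch: an HTTP health endpoint that fails on demand. Touching
# the maintenance flag file makes the load balancer's health check fail, which
# can trigger a planned failover before patching; removing the file allows
# failback once enough primary VMs pass health checks again.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

MAINTENANCE_FLAG = "/var/run/maintenance"  # create this file to fail health checks

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status = 503 if os.path.exists(MAINTENANCE_FLAG) else 200
            self.send_response(status)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # The load balancer's health check would be pointed at this port and path.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```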