-
Overview
- Maintenance windows and maintenance exclusions provide fine-grained control over when cluster maintenance, such as auto-upgrades, can and cannot occur
- A maintenance window is an arbitrary, repeating window of time during which automatic maintenance is permitted
- A maintenance exclusion is an arbitrary non-repeating window of time during which automatic maintenance is forbidden
- Maintenance windows and maintenance exclusions can be configured separately and independently
- Multiple maintenance exclusions can be configured
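- A sketch of configuring both with gcloud (cluster name, times, and the recurrence rule are illustrative):

    # Recurring maintenance window: Saturdays and Sundays, 04:00-08:00 UTC
    gcloud container clusters update example-cluster \
        --maintenance-window-start=2023-01-07T04:00:00Z \
        --maintenance-window-end=2023-01-07T08:00:00Z \
        --maintenance-window-recurrence='FREQ=WEEKLY;BYDAY=SA,SU'

    # One-off maintenance exclusion, e.g. covering a holiday sales weekend
    gcloud container clusters update example-cluster \
        --add-maintenance-exclusion-name=black-friday \
        --add-maintenance-exclusion-start=2023-11-24T00:00:00Z \
        --add-maintenance-exclusion-end=2023-11-27T00:00:00Z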
- Google performs maintenance tasks on clusters as needed, or when configuration changes that re-create nodes or networks in the cluster are made
- Configuration changes include auto-upgrades to cluster masters in accordance with GKE's version policy and node auto-upgrades, if enabled
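- A sketch of enabling node auto-upgrade on an existing node pool (cluster and pool names are illustrative):

    # Turn on auto-upgrade for one node pool; --no-enable-autoupgrade disables it
    gcloud container node-pools update default-pool \
        --cluster=example-cluster \
        --enable-autoupgrade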
- User-initiated configuration changes such as optimizing IP address allocation can fundamentally change the cluster's internal network topology
- A zonal cluster cannot be modified while its master is being upgraded; this includes deploying workloads
- Upgrades can cause temporary disruptions while moving workloads off each node as it is re-created
- Maintenance windows allow users to control when automatic upgrades of masters and nodes can occur, to mitigate potential transient disruptions to workloads
- Maintenance windows can minimize the chance of downtime by scheduling automatic upgrades during off-peak hours, when traffic is reduced
- They can also ensure that upgrades happen during working hours, so that someone can monitor the upgrades and manage any unanticipated issues
- They can stagger upgrades across multiple clusters in different regions, rolling them out one at a time at specified intervals
- In addition to automatic upgrades, Google may occasionally need to perform other maintenance tasks, and honors a cluster's maintenance window if at all possible
- If a task runs beyond the maintenance window, GKE attempts to pause the operation and resume it during the next maintenance window
- GKE reserves the right to roll out unplanned, emergency upgrades outside of maintenance windows
- Mandatory upgrades away from deprecated or outdated software might occur automatically outside of maintenance windows
- You can also manually upgrade clusters at any time
- Manually-initiated upgrades begin immediately and ignore any maintenance windows
- Users can configure a maintenance window for a new or existing cluster
- Maintenance windows and exclusions can cause security patches to be delayed
- GKE reserves the right to override maintenance policies for critical security vulnerabilities
- GKE clusters and workloads can also be impacted by automatic maintenance on other, dependent services, such as Compute Engine
- Maintenance windows and exclusions do not affect automatic maintenance on other services
- GKE performs automated repairs on cluster masters
- This includes processes like upscaling the master VM to an appropriate size or restarting the master VM to resolve issues
- Most repairs ignore maintenance windows and exclusions because failing to perform the repairs can result in non-functional clusters
- Repairing masters cannot be disabled
- Nodes also have auto-repair functionality that cannot be disabled
- When a feature or option is enabled or modified, such as one that affects networking between the cluster masters and nodes, the nodes are re-created to apply the new configuration
- If maintenance windows are in use, a configuration change that requires nodes to be re-created is applied only during a maintenance window
- If a user prefers not to wait, they can manually "upgrade" the node pool to the same version it is already using
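- A sketch of that workaround (names and version are illustrative; the version must match the pool's current one):

    # Re-create the pool's nodes now by "upgrading" to the version already in use
    gcloud container clusters upgrade example-cluster \
        --node-pool=default-pool \
        --cluster-version=1.27.3-gke.100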
- Users can only configure a single maintenance window per cluster
- Configuring a new maintenance window overwrites the previous one
-
Upgrades
- Google initiates auto-upgrades
- Google observes automatic and manual upgrades across all GKE clusters, and intervenes if problems are observed
- A cluster's control plane (master) and nodes do not necessarily run the same version at all times
- A cluster master is upgraded before its nodes
- Zonal and multi-zonal clusters have only a single control plane (master)
- During the upgrade, workloads continue to run, but workloads cannot be deployed or modified, and the cluster configuration cannot be changed, until the upgrade is complete
- Regional clusters have multiple replicas of the control plane, and only one replica is upgraded at a time, in an undefined order
- During the upgrade, the cluster remains highly available, and each control plane replica is unavailable only while it is being upgraded
- Where a maintenance window or exclusion is configured, it is honored if possible
- A cluster and its node pools do not necessarily run the same version of GKE
- Node pools are upgraded one at a time
- Within a node pool, nodes are upgraded one at a time, in an undefined order
- Unless surge upgrades are configured, the number of nodes upgraded at a time cannot be changed
- Where a maintenance window or exclusion is configured, it is honored if possible
- If surge upgrades are enabled, GKE creates a new surge node with the upgraded version and waits for it to be registered with the master
- GKE selects an existing node (the target node) to upgrade
- GKE cordons and starts draining the target node; new Pods can no longer be scheduled on it
- Pods on the target node are rescheduled onto other nodes
- If a Pod can't be rescheduled, it remains in the Pending state until it can be
- If a surge node was created, the target node is deleted
- If a surge node wasn't created, GKE upgrades the target node in place, then waits for the node to be registered with the master
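- Per node, this is roughly the managed equivalent of cordoning and draining by hand with kubectl (node name is illustrative):

    # Mark the node unschedulable so no new Pods land on it
    kubectl cordon gke-example-node-1

    # Evict the node's Pods so they are rescheduled onto other nodes;
    # --ignore-daemonsets is needed because DaemonSet Pods are not rescheduled
    kubectl drain gke-example-node-1 --ignore-daemonsets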
- If a significant number of node auto-upgrades to a given version result in unhealthy nodes across the GKE fleet, upgrades to that version are halted while the problem is investigated
- Google upgrades clusters when a new GKE version is selected for auto-upgrade
- Configure maintenance windows and exclusions for more control over when an auto-upgrade can occur or must not occur
- A cluster's node pools can be no more than two minor versions behind the control plane version, to maintain compatibility with the cluster API (for example, a 1.27 control plane supports node pools no older than 1.25)
- The node pool version also determines the versions of software packages installed on each node
- It is recommended to keep node pools updated to the cluster version
- Where clusters are enrolled in a release channel, nodes always run the same version of GKE as the cluster, except during a brief period between completing the cluster's control plane upgrade and beginning to upgrade a given node pool
- New GKE versions are released regularly, but a version is not selected for auto-upgrade right away
- When a GKE version has accumulated enough cluster usage to prove stability over time, Google selects it as an auto-upgrade target for clusters running a subset of older versions
- New auto-upgrade targets are announced in the release notes
- Until an available version is selected for auto-upgrade, you can upgrade to it manually
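- A sketch of checking which versions are currently available in a location (the zone is illustrative):

    # Lists default and valid master/node versions, plus release channel info
    gcloud container get-server-config --zone=us-central1-a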
- Soon after a new minor version becomes generally available, the oldest available minor version typically becomes unsupported
- Clusters running minor versions that become unsupported are automatically upgraded to the next minor version
- Release channels allow users to control cluster and node pool version based on a version's stability rather than managing the version directly
- Node auto-upgrade is not available for Alpha clusters
- Also, Alpha clusters cannot be enrolled in release channels
- By default, auto-upgrades can occur at any time
- Auto-upgrades are minimally disruptive, especially for regional clusters
- Users can configure maintenance windows and exclusions to manage when auto-upgrades can and must not occur
- If a user configures maintenance windows and exclusions, the upgrade does not occur until the current time is within a maintenance window
- If a maintenance window expires before the upgrade completes, an attempt is made to pause it
- During the next occurrence of the maintenance window, an attempt is made to resume the upgrade
- Users can request to manually upgrade a cluster or its node pools to an available and compatible version at any time
- Manual upgrades bypass any configured maintenance windows and maintenance exclusions
- For zonal and multi-zonal clusters, the control plane is unavailable while it is being upgraded
- For the most part, workloads run normally but cannot be modified during the upgrade
- For regional clusters, one replica of the control plane is unavailable at a time while it is upgraded, but the cluster remains highly available during the upgrade
- Users can manually initiate a node upgrade to a version compatible with the control plane
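- A sketch of both manual paths (cluster, pool, and version are illustrative):

    # Manually upgrade the control plane (master) first
    gcloud container clusters upgrade example-cluster \
        --master \
        --cluster-version=1.27.3-gke.100

    # Then upgrade a node pool to a compatible version
    gcloud container clusters upgrade example-cluster \
        --node-pool=default-pool \
        --cluster-version=1.27.3-gke.100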
- Surge upgrades enable control over the number of nodes GKE upgrades at a time, and over how disruptive upgrades are to workloads
- Users can change how many nodes GKE attempts to upgrade at once by changing the surge upgrade parameters on a node pool
- Surge upgrades also work with the Cluster Autoscaler to prevent changes to nodes that are being upgraded
- Surge upgrade behavior is determined by max-surge-upgrade, the number of additional nodes that can be added to the node pool during an upgrade
- Nodes created by surge upgrade are subject to Google Cloud resource quotas or reservations, such as the quota for Compute Engine VMs
- If there is not enough quota, or additional nodes cannot be provisioned, the upgrade fails
- max-unavailable-upgrade is the number of nodes that can be simultaneously unavailable during an upgrade
- Increasing max-unavailable-upgrade raises the number of nodes that can be upgraded in parallel
- Even if max-unavailable-upgrade is set to 0, users might still experience downtime during an upgrade while workloads restart after moving between nodes
- The maximum number of nodes upgraded simultaneously is the sum of max-surge-upgrade and max-unavailable-upgrade
- Surge upgrade parameters can be configured for node pools that use auto-upgrades and manual upgrades
- While recreating nodes does not require additional Compute Engine resources, surge upgrading nodes does
- Depending on configuration, resource quota can limit the number of parallel upgrades or even cause the upgrade to fail
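- A sketch of tuning surge parameters on a node pool (names and values are illustrative):

    # Allow up to 2 extra surge nodes and 1 node unavailable at once;
    # by the sum rule above, at most 2 + 1 = 3 nodes are upgraded in parallel
    gcloud container node-pools update default-pool \
        --cluster=example-cluster \
        --max-surge-upgrade=2 \
        --max-unavailable-upgrade=1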
-
Releases
- Kubernetes releases updates often, to deliver security updates, fix known issues, and introduce new features
- Release channels provide control over which automatic updates a given cluster receives based on the stability requirements of the cluster and its workloads
- When a new cluster is enrolled into a release channel, Google automatically manages the version and upgrade cadence for the cluster and its node pools
- A version must meet increasing stability requirements to be eligible for a more stable channel, and more stable channels receive fewer, less frequent updates
- Rapid channel releases are made weekly
- Useful for non-production clusters that want to try out new Google Kubernetes Engine or Kubernetes features
- Not covered by the GKE SLA
- The latest features, before any other channel
- Potentially more unresolved issues than other channels, including the possibility of issues with no known workarounds
- Regular channel releases are made multiple times per month
- Useful for production clusters that need features not yet offered in the Stable channel
- These versions are considered production-quality
- Known issues generally have known workarounds
- Stable channel releases are made every few months
- Production clusters that require stability above all else, and for which frequent upgrades are too risky
- These versions are considered production-quality, with historical data to indicate that they are stable and reliable in production
- When a cluster is enrolled in a release channel, that cluster is upgraded automatically when a new version is available in that channel
- The Rapid channel allows early access to test and validate new minor versions of GKE
- When a minor version has accumulated enough usage and demonstrated stability in the Rapid channel, its new patch releases are promoted to the Regular channel, where updates happen less frequently
- Eventually, the minor version is promoted to the Stable channel, which only receives high-priority updates
- Each promotion signals an increasing level of stability and production-readiness, based on the observed performance of clusters running that version
- Critical security patches are delivered to all release channels, to protect clusters and Google's infrastructure
- Exact release schedules depend on multiple factors and cannot be guaranteed
- A cluster can be created that uses release channels to manage its version instead of using the default version or choosing a specific version
- The cluster only receives updates from that release channel
- An existing cluster cannot be enrolled in a release channel
- When using release channels, a version is not specified because the version is managed automatically within the channel
- Auto-upgrade is enabled (and cannot be disabled), so the cluster is updated automatically from releases available in the chosen release channel
- It is currently not possible to change the release channel for a given cluster or disable release channels on a cluster where they are enabled
- To stop using release channels and go back to specifying an exact version, recreate the cluster without the --release-channel flag
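- A sketch of enrolling a new cluster in a channel at creation time (name and channel are illustrative):

    # The version is chosen by the channel; auto-upgrade is enabled and stays on
    gcloud container clusters create example-cluster \
        --release-channel=regular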
- Clusters created using the Rapid release channel are not alpha clusters
- Clusters that use release channels can be upgraded, and auto-upgrade is enabled and cannot be disabled
- Alpha clusters cannot be upgraded
- Clusters that use release channels do not expire
- Alpha clusters expire after 30 days
- Alpha Kubernetes APIs are not enabled on clusters that use release channels