1. Overview
    1. Maintenance windows and maintenance exclusions provide fine-grained control over when automatic cluster maintenance, such as auto-upgrades, can and cannot occur
    2. They constrain automatic maintenance only; manually-initiated operations are not restricted by them
    3. A maintenance window is an arbitrary, repeating window of time during which automatic maintenance is permitted
    4. A maintenance exclusion is an arbitrary, non-repeating window of time during which automatic maintenance is forbidden
    5. Maintenance windows and maintenance exclusions can be configured separately and independently
    6. Multiple maintenance exclusions can be configured
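       A minimal sketch of configuring both with gcloud, assuming an existing cluster named my-cluster (a hypothetical name):
         # Weekly recurring maintenance window (RFC 5545 RRULE syntax)
         gcloud container clusters update my-cluster \
             --maintenance-window-start 2023-01-01T04:00:00Z \
             --maintenance-window-end 2023-01-01T08:00:00Z \
             --maintenance-window-recurrence 'FREQ=WEEKLY;BYDAY=SA,SU'
         # Non-repeating exclusion; repeat with different names to add more
         gcloud container clusters update my-cluster \
             --add-maintenance-exclusion-name holiday-freeze \
             --add-maintenance-exclusion-start 2023-12-23T00:00:00Z \
             --add-maintenance-exclusion-end 2023-12-27T00:00:00Z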
    7. Google performs maintenance tasks on clusters as needed, or when configuration changes that re-create nodes or networks in the cluster are made
    8. Such maintenance includes auto-upgrades to cluster masters in accordance with GKE's version policy and, if enabled, node auto-upgrades
    9. User-initiated configuration changes such as optimizing IP address allocation can fundamentally change the cluster's internal network topology
    10. A zonal cluster cannot be modified (including deploying workloads) while its master is being upgraded
    11. Upgrades can cause temporary disruptions while moving workloads off each node as it is re-created
    12. Maintenance windows allow users to control when automatic upgrades of masters and nodes can occur, to mitigate potential transient disruptions to workloads
    13. Scheduling automatic upgrades during off-peak hours, when traffic is reduced, minimizes the chance of downtime
    14. Alternatively, scheduling upgrades during working hours ensures that someone is available to monitor them and manage any unanticipated issues
    15. Maintenance windows can also stagger multi-cluster upgrades, rolling out upgrades across multiple clusters in different regions one at a time, at specified intervals
    16. In addition to automatic upgrades, Google may occasionally need to perform other maintenance tasks, and honors a cluster's maintenance window if at all possible
    17. If tasks run beyond the maintenance window, GKE attempts to pause the operation, and attempts to resume it during the next maintenance window
    18. GKE reserves the right to roll out unplanned, emergency upgrades outside of maintenance windows
    19. Mandatory upgrades to upgrade from deprecated or outdated software might automatically occur outside of maintenance windows
    20. You can also manually upgrade clusters at any time
    21. Manually-initiated upgrades begin immediately and ignore any maintenance windows
    22. Users can configure a maintenance window for a new or existing cluster
    23. Maintenance windows and exclusions can cause security patches to be delayed
    24. GKE reserves the right to override maintenance policies for critical security vulnerabilities
    25. GKE clusters and workloads can also be impacted by automatic maintenance on other, dependent services, such as Compute Engine
    26. Maintenance windows and exclusions do not affect automatic maintenance on other services
    27. GKE performs automated repairs on cluster masters
    28. This includes processes like upscaling the master VM to an appropriate size or restarting the master VM to resolve issues
    29. Most repairs ignore maintenance windows and exclusions because failing to perform the repairs can result in non-functional clusters
    30. Repairing masters cannot be disabled
    31. Nodes also have auto-repair functionality, which, unlike master repairs, can be disabled
    32. When a feature or option is enabled or modified, such as one that impacts networking between the cluster masters and nodes, the nodes are recreated to apply the new configuration
    33. If maintenance windows are in use and a feature or option that requires node recreation is enabled or modified, the new configuration is applied to the nodes only during a maintenance window
    34. If a user prefers not to wait, they can manually "upgrade" the node pool to the same version it is already using
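       One way to force this with gcloud; a sketch assuming hypothetical names my-cluster and my-pool:
         # Find the node pool's current version...
         gcloud container node-pools describe my-pool --cluster my-cluster \
             --format 'value(version)'
         # ...and "upgrade" to that same version to recreate the nodes now
         gcloud container clusters upgrade my-cluster --node-pool my-pool \
             --cluster-version 1.27.3-gke.100   # the version reported above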
    35. Users can only configure a single maintenance window per cluster
    36. Configuring a new maintenance window overwrites the previous one
  2. Upgrades
    1. Google initiates auto-upgrades
    2. Google observes automatic and manual upgrades across all GKE clusters, and intervenes if problems are observed
    3. A cluster's control plane (master) and nodes do not necessarily run the same version at all times
    4. A cluster master is upgraded before its nodes
    5. Zonal and multi-zonal clusters have only a single control plane (master)
    6. During the upgrade, workloads continue to run, but they cannot be modified or deployed, and no changes can be made to the cluster configuration, until the upgrade is complete
    7. Regional clusters have multiple replicas of the control plane, and only one replica is upgraded at a time, in an undefined order
    8. During the upgrade, the cluster remains highly available, and each control plane replica is unavailable only while it is being upgraded
    9. Where a maintenance window or exclusion is configured, it is honored if possible
    10. A cluster and its node pools do not necessarily run the same version of GKE
    11. Node pools are upgraded one at a time
    12. Within a node pool, nodes are upgraded one at a time, in an undefined order
    13. The number of nodes upgraded at a time cannot be changed
    14. Where a maintenance window or exclusion is configured, it is honored if possible
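       To compare the control plane and node versions for a cluster (a sketch; my-cluster is a hypothetical name):
         gcloud container clusters describe my-cluster \
             --format 'value(currentMasterVersion,currentNodeVersion)'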
    15. If surge upgrades are enabled, GKE creates a new surge node with the upgraded version and waits for it to register with the master
    16. GKE selects an existing node (the target node) to upgrade
    17. GKE cordons and starts draining the target node; new Pods can no longer be scheduled on it
    18. Pods on the target node are rescheduled onto other nodes
    19. If a Pod can't be rescheduled, it stays in the Pending state until it can be rescheduled
    20. If a surge node was created, the target node is deleted
    21. If a surge node wasn't created, GKE upgrades the target node and then waits for it to register with the master
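       The cordon-and-drain steps above mirror what can be done manually with kubectl; a sketch, assuming a node named my-node:
         # Mark the node unschedulable so no new Pods land on it
         kubectl cordon my-node
         # Evict Pods so they are rescheduled onto other nodes
         kubectl drain my-node --ignore-daemonsets --delete-emptydir-data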
    22. If a significant number of node auto-upgrades to a given version result in unhealthy nodes across the GKE fleet, upgrades to that version are halted while the problem is investigated
    23. Google upgrades clusters when a new GKE version is selected for auto-upgrade
    24. Configure maintenance windows and exclusions for more control over when an auto-upgrade can occur or must not occur
    25. A cluster's node pools can be no more than two minor versions behind the control plane version, to maintain compatibility with the cluster API
    26. The node pool version also determines the versions of software packages installed on each node
    27. It is recommended to keep node pools updated to the cluster version
    28. Where clusters are enrolled in a release channel, nodes always run the same version of GKE as the cluster, except during a brief period between completing the cluster's control plane upgrade and beginning to upgrade a given node pool
    29. New GKE versions are released regularly, but a version is not selected for auto-upgrade right away
    30. When a GKE version has accumulated enough cluster usage to prove stability over time, Google selects it as an auto-upgrade target for clusters running a subset of older versions
    31. New auto-upgrade targets are announced in the release notes
    32. Until an available version is selected for auto-upgrade, you can upgrade to it manually
    33. Soon after a new minor version becomes generally available, the oldest available minor version typically becomes unsupported
    34. Clusters running minor versions that become unsupported are automatically upgraded to the next minor version
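       To list the versions currently available as upgrade targets (a sketch; the location is an assumption):
         gcloud container get-server-config --region us-central1 \
             --format 'yaml(validMasterVersions,validNodeVersions)'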
    35. Release channels allow users to control cluster and node pool version based on a version's stability rather than managing the version directly
    36. Node auto-upgrade is not available for Alpha clusters
    37. Alpha clusters cannot be enrolled in release channels
    38. By default, auto-upgrades can occur at any time
    39. Auto-upgrades are minimally disruptive, especially for regional clusters
    40. Users can configure maintenance windows and exclusions to manage when auto-upgrades can and must not occur
    41. If a user configures maintenance windows and exclusions, the upgrade does not occur until the current time is within a maintenance window
    42. If a maintenance window expires before the upgrade completes, an attempt is made to pause it
    43. During the next occurrence of the maintenance window, an attempt is made to resume the upgrade
    44. Users can request to manually upgrade a cluster or its node pools to an available and compatible version at any time
    45. Manual upgrades bypass any configured maintenance windows and maintenance exclusions
    46. For zonal and multi-zonal clusters, the control plane is unavailable while it is being upgraded
    47. For the most part, workloads run normally but cannot be modified during the upgrade
    48. For regional clusters, one replica of the control plane is unavailable at a time while it is upgraded, but the cluster remains highly available during the upgrade
    49. Users can manually initiate a node upgrade to a version compatible with the control plane
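       A sketch of manually initiating both kinds of upgrade (the cluster, pool, and version names are hypothetical):
         # Upgrade the control plane (master) to a specific available version
         gcloud container clusters upgrade my-cluster --master \
             --cluster-version 1.27.3-gke.100
         # Upgrade a node pool; omitting --cluster-version moves the
         # nodes to the control plane's current version
         gcloud container clusters upgrade my-cluster --node-pool my-pool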
    50. Surge upgrades enable control over the number of nodes GKE can upgrade at a time and over how disruptive upgrades are to workloads
    51. Users can change how many nodes GKE attempts to upgrade at once by changing the surge upgrade parameters on a node pool
    52. Surge upgrades reduce disruption to workloads during cluster maintenance while also allowing control over the number of nodes upgraded in parallel
    53. Surge upgrades also work with the Cluster Autoscaler to prevent changes to nodes that are being upgraded
    54. Surge upgrade behavior is determined by max-surge-upgrade, the number of additional nodes that can be added to the node pool during an upgrade
    55. Nodes created by surge upgrade are subject to Google Cloud resource quotas or reservations, such as the quota for Compute Engine VMs
    56. If there is not enough quota, or additional nodes cannot be provisioned, the upgrade fails
    57. max-unavailable-upgrade is the number of nodes that can be simultaneously unavailable during an upgrade
    58. Increasing max-unavailable-upgrade raises the number of nodes that can be upgraded in parallel
    59. If max-unavailable-upgrade is set to 0, users might still experience downtime during an upgrade while workloads restart after moving between nodes
    60. The number of nodes upgraded simultaneously is the sum of max-surge-upgrade and max-unavailable-upgrade
    61. Surge upgrade parameters can be configured for node pools that use auto-upgrades and manual upgrades
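       A sketch of tuning both parameters on an existing node pool (hypothetical names); with max-surge-upgrade=1 and max-unavailable-upgrade=2, up to 3 nodes can be changed simultaneously:
         gcloud container node-pools update my-pool --cluster my-cluster \
             --max-surge-upgrade 1 \
             --max-unavailable-upgrade 2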
    62. While recreating nodes does not require additional Compute Engine resources, surge upgrading nodes does
    63. Depending on configuration, Compute Engine quota can limit the number of parallel upgrades or even cause the upgrade to fail
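       One way to inspect the relevant Compute Engine quota before a wide surge upgrade (a sketch; the region is an assumption):
         gcloud compute regions describe us-central1 --format 'yaml(quotas)'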
  3. Releases
    1. Kubernetes releases updates often, to deliver security updates, fix known issues, and introduce new features
    2. Release channels provide control over which automatic updates a given cluster receives based on the stability requirements of the cluster and its workloads
    3. When a new cluster is enrolled into a release channel, Google automatically manages the version and upgrade cadence for the cluster and its node pools
    4. A version must meet increasing stability requirements to be eligible for a more stable channel, and more stable channels receive fewer, less frequent updates
    5. Rapid channel releases are made weekly
    6. Useful for non-production clusters that want to try out new Google Kubernetes Engine or Kubernetes features
    7. Not covered by the GKE SLA
    8. The latest features, before any other channel
    9. Potentially more unresolved issues than other channels, including the possibility of issues with no known workarounds
    10. Regular channel releases are made multiple times per month; useful for production clusters that need features not yet offered in the Stable channel
    11. These versions are considered production-quality
    12. Known issues generally have known workarounds
    13. Stable channel releases are made every few months
    14. Production clusters that require stability above all else, and for which frequent upgrades are too risky
    15. These versions are considered production-quality, with historical data to indicate that they are stable and reliable in production
    16. When a cluster is enrolled in a release channel, that cluster is upgraded automatically when a new version is available in that channel
    17. The Rapid channel allows early access to test and validate new minor versions of GKE
    18. When a minor version has accumulated enough usage and demonstrated stability in the Rapid channel, its new patch releases are promoted to the Regular channel, where updates happen less frequently
    19. Eventually, the minor version is promoted to the Stable channel, which only receives high-priority updates
    20. Each promotion signals an increasing level of stability and production-readiness, based on observed performance of clusters running that version
    21. Critical security patches are delivered to all release channels, to protect clusters and Google's infrastructure
    22. Exact release schedules depend on multiple factors and cannot be guaranteed
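       To inspect the channels and the versions they currently offer (a sketch; the location is an assumption):
         gcloud container get-server-config --region us-central1 \
             --format 'yaml(channels)'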
    23. A cluster can be created that uses release channels to manage its version instead of using the default version or choosing a specific version
    24. The cluster only receives updates from that release channel
    25. An existing cluster cannot be enrolled in a release channel
    26. When using release channels, a version is not specified because the version is managed automatically within the channel
    27. Auto-upgrade is enabled (and cannot be disabled), so the cluster is updated automatically from releases available in the chosen release channel
    28. It is currently not possible to change the release channel for a given cluster or disable release channels on a cluster where they are enabled
    29. To stop using release channels and go back to specifying an exact version, recreate the cluster without the --release-channel flag
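       A minimal sketch of enrolling at creation time (hypothetical name); note that no version is specified, since the channel manages it:
         gcloud container clusters create my-channel-cluster \
             --region us-central1 \
             --release-channel regular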
    30. Clusters created using the Rapid release channel are not alpha clusters
    31. Clusters that use release channels can be upgraded, and auto-upgrade is enabled and cannot be disabled
    32. Alpha clusters cannot be upgraded
    33. Clusters that use release channels do not expire
    34. Alpha clusters expire after 30 days
    35. Alpha Kubernetes APIs are not enabled on clusters that use release channels
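       For contrast, a sketch of creating an alpha cluster; the assumption here is that node auto-upgrade and auto-repair must be disabled explicitly, consistent with the constraints above:
         gcloud container clusters create alpha-test \
             --zone us-central1-a \
             --enable-kubernetes-alpha \
             --no-enable-autoupgrade \
             --no-enable-autorepair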