-
Overview
- Maintenance windows and maintenance exclusions provide fine-grained control over when cluster maintenance, such as auto-upgrades, can and cannot occur
- A maintenance window is an arbitrary, repeating window of time during which automatic maintenance is permitted
- A maintenance exclusion is an arbitrary non-repeating window of time during which automatic maintenance is forbidden
- Maintenance windows and maintenance exclusions can be configured separately and independently
- Multiple maintenance exclusions can be configured
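- A sketch of configuring both with gcloud (cluster name, times, and the recurrence rule are illustrative):

    # Recurring maintenance window: Saturdays and Sundays, 04:00-08:00 UTC
    gcloud container clusters update example-cluster \
        --maintenance-window-start=2023-01-07T04:00:00Z \
        --maintenance-window-end=2023-01-07T08:00:00Z \
        --maintenance-window-recurrence='FREQ=WEEKLY;BYDAY=SA,SU'

    # One-off maintenance exclusion, e.g. covering a holiday sales weekend
    gcloud container clusters update example-cluster \
        --add-maintenance-exclusion-name=black-friday \
        --add-maintenance-exclusion-start=2023-11-24T00:00:00Z \
        --add-maintenance-exclusion-end=2023-11-27T00:00:00Z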
- Google performs maintenance tasks on clusters as needed, or when configuration changes that re-create nodes or networks in the cluster are made
- Configuration changes include auto-upgrades to cluster masters in accordance with GKE's version policy and node auto-upgrades, if enabled
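- A sketch of enabling node auto-upgrade on an existing node pool (cluster and pool names are illustrative):

    # Turn on auto-upgrade for one node pool; --no-enable-autoupgrade disables it
    gcloud container node-pools update default-pool \
        --cluster=example-cluster \
        --enable-autoupgrade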
- User-initiated configuration changes such as optimizing IP address allocation can fundamentally change the cluster's internal network topology
- A zonal cluster cannot be modified while its master is being upgraded; this includes deploying workloads
- Upgrades can cause temporary disruptions while moving workloads off each node as it is re-created
- Maintenance windows allow users to control when automatic upgrades of masters and nodes can occur, to mitigate potential transient disruptions to workloads
- Maintenance windows can minimize the chance of downtime by scheduling automatic upgrades during off-peak hours, when traffic is reduced
- They can also ensure that upgrades happen during working hours, so that someone can monitor the upgrades and manage any unanticipated issues
- They can stagger upgrades across multiple clusters in different regions, rolling them out one at a time at specified intervals
- In addition to automatic upgrades, Google may occasionally need to perform other maintenance tasks, and honors a cluster's maintenance window if at all possible
- If a task runs beyond the maintenance window, GKE attempts to pause the operation and resume it during the next maintenance window
- GKE reserves the right to roll out unplanned, emergency upgrades outside of maintenance windows
- Mandatory upgrades away from deprecated or outdated software might occur automatically outside of maintenance windows
- You can also manually upgrade clusters at any time
- Manually-initiated upgrades begin immediately and ignore any maintenance windows
- Users can configure a maintenance window for a new or existing cluster
- Maintenance windows and exclusions can cause security patches to be delayed
- GKE reserves the right to override maintenance policies for critical security vulnerabilities
- GKE clusters and workloads can also be impacted by automatic maintenance on other, dependent services, such as Compute Engine
- Maintenance windows and exclusions do not affect automatic maintenance on other services
- GKE performs automated repairs on cluster masters
- This includes processes like upscaling the master VM to an appropriate size or restarting the master VM to resolve issues
- Most repairs ignore maintenance windows and exclusions because failing to perform the repairs can result in non-functional clusters
- Repairing masters cannot be disabled
- Nodes also have auto-repair functionality that cannot be disabled
- When a feature or option is enabled or modified, such as one that affects networking between the cluster masters and nodes, the nodes are re-created to apply the new configuration
- If maintenance windows are in use, a configuration change that requires nodes to be re-created is applied only during a maintenance window
- If a user prefers not to wait, they can manually "upgrade" the node pool to the same version it is already using
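- A sketch of that workaround (names and version are illustrative; the version must match the pool's current one):

    # Re-create the pool's nodes now by "upgrading" to the version already in use
    gcloud container clusters upgrade example-cluster \
        --node-pool=default-pool \
        --cluster-version=1.27.3-gke.100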
- Users can only configure a single maintenance window per cluster
- Configuring a new maintenance window overwrites the previous one
-
Upgrades
- Google initiates auto-upgrades
- Google observes automatic and manual upgrades across all GKE clusters, and intervenes if problems are observed
- A cluster's control plane (master) and nodes do not necessarily run the same version at all times
- A cluster master is upgraded before its nodes
- Zonal and multi-zonal clusters have only a single control plane (master)
- During the upgrade, workloads continue to run, but workloads cannot be deployed or modified, and the cluster configuration cannot be changed, until the upgrade is complete
- Regional clusters have multiple replicas of the control plane, and only one replica is upgraded at a time, in an undefined order
- During the upgrade, the cluster remains highly available, and each control plane replica is unavailable only while it is being upgraded
- Where a maintenance window or exclusion is configured, it is honored if possible
- A cluster and its node pools do not necessarily run the same version of GKE
- Node pools are upgraded one at a time
- Within a node pool, nodes are upgraded one at a time, in an undefined order
- Unless surge upgrades are configured, the number of nodes upgraded at a time cannot be changed
- Where a maintenance window or exclusion is configured, it is honored if possible
- If surge upgrades are enabled, GKE creates a new surge node with the upgraded version and waits for it to be registered with the master
- GKE selects an existing node (the target node) to upgrade
- GKE cordons and starts draining the target node; new Pods can no longer be scheduled on it
- Pods on the target node are rescheduled onto other nodes
- If a Pod can't be rescheduled, it remains in the Pending state until it can be
- If a surge node was created, the target node is deleted
- If a surge node wasn't created, GKE upgrades the target node in place, then waits for the node to be registered with the master
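- Per node, this is roughly the managed equivalent of cordoning and draining by hand with kubectl (node name is illustrative):

    # Mark the node unschedulable so no new Pods land on it
    kubectl cordon gke-example-node-1

    # Evict the node's Pods so they are rescheduled onto other nodes;
    # --ignore-daemonsets is needed because DaemonSet Pods are not rescheduled
    kubectl drain gke-example-node-1 --ignore-daemonsets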
- If a significant number of node auto-upgrades to a given version result in unhealthy nodes across the GKE fleet, upgrades to that version are halted while the problem is investigated
- Google upgrades clusters when a new GKE version is selected for auto-upgrade
- Configure maintenance windows and exclusions for more control over when an auto-upgrade can occur or must not occur
- A cluster's node pools can be no more than two minor versions behind the control plane version, to maintain compatibility with the cluster API (for example, a 1.27 control plane supports node pools no older than 1.25)
- The node pool version also determines the versions of software packages installed on each node
- It is recommended to keep node pools updated to the cluster version
- Where clusters are enrolled in a release channel, nodes always run the same version of GKE as the cluster, except during a brief period between completing the cluster's control plane upgrade and beginning to upgrade a given node pool
- New GKE versions are released regularly, but a version is not selected for auto-upgrade right away
- When a GKE version has accumulated enough cluster usage to prove stability over time, Google selects it as an auto-upgrade target for clusters running a subset of older versions
- New auto-upgrade targets are announced in the release notes
- Until an available version is selected for auto-upgrade, you can upgrade to it manually
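- A sketch of checking which versions are currently available in a location (the zone is illustrative):

    # Lists default and valid master/node versions, plus release channel info
    gcloud container get-server-config --zone=us-central1-a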
- Soon after a new minor version becomes generally available, the oldest available minor version typically becomes unsupported
- Clusters running minor versions that become unsupported are automatically upgraded to the next minor version
- Release channels allow users to control cluster and node pool version based on a version's stability rather than managing the version directly
- Node auto-upgrade is not available for Alpha clusters
- Also, Alpha clusters cannot be enrolled in release channels
- By default, auto-upgrades can occur at any time
- Auto-upgrades are minimally disruptive, especially for regional clusters
- Users can configure maintenance windows and exclusions to manage when auto-upgrades can and must not occur
- If a user configures maintenance windows and exclusions, the upgrade does not occur until the current time is within a maintenance window
- If a maintenance window expires before the upgrade completes, an attempt is made to pause it
- During the next occurrence of the maintenance window, an attempt is made to resume the upgrade
- Users can request to manually upgrade a cluster or its node pools to an available and compatible version at any time
- Manual upgrades bypass any configured maintenance windows and maintenance exclusions
- For zonal and multi-zonal clusters, the control plane is unavailable while it is being upgraded
- For the most part, workloads run normally but cannot be modified during the upgrade
- For regional clusters, one replica of the control plane is unavailable at a time while it is upgraded, but the cluster remains highly available during the upgrade
- Users can manually initiate a node upgrade to a version compatible with the control plane
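- A sketch of both manual paths (cluster, pool, and version are illustrative):

    # Manually upgrade the control plane (master) first
    gcloud container clusters upgrade example-cluster \
        --master \
        --cluster-version=1.27.3-gke.100

    # Then upgrade a node pool to a compatible version
    gcloud container clusters upgrade example-cluster \
        --node-pool=default-pool \
        --cluster-version=1.27.3-gke.100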
- Surge upgrades enable control over the number of nodes GKE upgrades at a time, and over how disruptive upgrades are to workloads
- Users can change how many nodes GKE attempts to upgrade at once by changing the surge upgrade parameters on a node pool
- Surge upgrades also work with the Cluster Autoscaler to prevent changes to nodes that are being upgraded
- Surge upgrade behavior is determined by max-surge-upgrade, the number of additional nodes that can be added to the node pool during an upgrade
- Nodes created by surge upgrade are subject to Google Cloud resource quotas or reservations, such as the quota for Compute Engine VMs
- If there is not enough quota, or additional nodes cannot be provisioned, the upgrade fails
- max-unavailable-upgrade is the number of nodes that can be simultaneously unavailable during an upgrade
- Increasing max-unavailable-upgrade raises the number of nodes that can be upgraded in parallel
- Even if max-unavailable-upgrade is set to 0, users might still experience downtime during an upgrade while workloads restart after moving between nodes
- The maximum number of nodes upgraded simultaneously is the sum of max-surge-upgrade and max-unavailable-upgrade
- Surge upgrade parameters can be configured for node pools that use auto-upgrades and manual upgrades
- While recreating nodes does not require additional Compute Engine resources, surge upgrading nodes does
- Depending on configuration, resource quota can limit the number of parallel upgrades or even cause the upgrade to fail
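- A sketch of tuning surge parameters on a node pool (names and values are illustrative):

    # Allow up to 2 extra surge nodes and 1 node unavailable at once;
    # by the sum rule above, at most 2 + 1 = 3 nodes are upgraded in parallel
    gcloud container node-pools update default-pool \
        --cluster=example-cluster \
        --max-surge-upgrade=2 \
        --max-unavailable-upgrade=1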
-
Releases
- Kubernetes releases updates often, to deliver security updates, fix known issues, and introduce new features
- Release channels provide control over which automatic updates a given cluster receives based on the stability requirements of the cluster and its workloads
- When a new cluster is enrolled into a release channel, Google automatically manages the version and upgrade cadence for the cluster and its node pools
- A version must meet increasing stability requirements to be eligible for a more stable channel, and more stable channels receive fewer, less frequent updates
- Rapid channel releases are made weekly
- Useful for non-production clusters that want to try out new Google Kubernetes Engine or Kubernetes features
- Not covered by the GKE SLA
- The latest features, before any other channel
- Potentially more unresolved issues than other channels, including the possibility of issues with no known workarounds
- Regular channel releases are made multiple times per month
- Useful for production clusters that need features not yet offered in the Stable channel
- These versions are considered production-quality
- Known issues generally have known workarounds
- Stable channel releases are made every few months
- Production clusters that require stability above all else, and for which frequent upgrades are too risky
- These versions are considered production-quality, with historical data to indicate that they are stable and reliable in production
- When a cluster is enrolled in a release channel, that cluster is upgraded automatically when a new version is available in that channel
- The Rapid channel allows early access to test and validate new minor versions of GKE
- When a minor version has accumulated enough usage and demonstrated stability in the Rapid channel, its new patch releases are promoted to the Regular channel, where updates happen less frequently
- Eventually, the minor version is promoted to the Stable channel, which only receives high-priority updates
- Each promotion signals an increasing level of stability and production-readiness, based on the observed performance of clusters running that version
- Critical security patches are delivered to all release channels, to protect clusters and Google's infrastructure
- Exact release schedules depend on multiple factors and cannot be guaranteed
- A cluster can be created that uses release channels to manage its version instead of using the default version or choosing a specific version
- The cluster only receives updates from that release channel
- An existing cluster cannot be enrolled in a release channel
- When using release channels, a version is not specified because the version is managed automatically within the channel
- Auto-upgrade is enabled (and cannot be disabled), so the cluster is updated automatically from releases available in the chosen release channel
- It is currently not possible to change the release channel for a given cluster or disable release channels on a cluster where they are enabled
- To stop using release channels and go back to specifying an exact version, recreate the cluster without the --release-channel flag
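- A sketch of enrolling a new cluster in a channel at creation time (name and channel are illustrative):

    # The version is chosen by the channel; auto-upgrade is enabled and stays on
    gcloud container clusters create example-cluster \
        --release-channel=regular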
- Clusters created using the Rapid release channel are not alpha clusters
- Clusters that use release channels can be upgraded, and auto-upgrade is enabled and cannot be disabled
- Alpha clusters cannot be upgraded
- Clusters that use release channels do not expire
- Alpha clusters expire after 30 days
- Alpha Kubernetes APIs are not enabled on clusters that use release channels