Comprehensive EKS Upgrade Guide to V1.25


I was recently on the job market in Tel Aviv and Jerusalem, looking for a position as a Senior DevOps engineer. I found a great job at Coro Cybersecurity.

Another company, whose name I do not remember, also interviewed me. I do remember that it was on a high floor and had a great play room for kids!

They gave me an assignment asking how to upgrade Kubernetes.

Build an Upgrade Process for Kubernetes on EKS with Two Node Groups to Avoid Downtime

Objective: Create a comprehensive upgrade plan for Kubernetes on Amazon EKS that involves two node groups, ensuring zero downtime during the upgrade process.
Upgrade is from version 1.24 to 1.25

Deliverables:

  • Upgrade Plan Document: A detailed document outlining the steps to upgrade Kubernetes on Amazon EKS with two node groups. The plan should include:
    • Pre-upgrade checks and prerequisites.
    • A step-by-step guide to safely upgrade the Kubernetes version on EKS.
    • Strategies for handling the two node groups, including draining nodes, applying the upgrade, and validating the upgrade process.
    • Rollback strategies in case of failures.

Kubernetes EKS Upgrade Plan from 1.24 to 1.25

Introduction

Read this whole document before starting the upgrade, then use it as a checklist while proceeding with the upgrade steps.

Embarking on a Kubernetes version upgrade is a stressful but rewarding opportunity for your EKS cluster to harness the latest features, enhancements, and security improvements. This document presents a meticulously crafted upgrade plan for transitioning your cluster from version 1.24 to the newer 1.25, while ensuring zero downtime and smooth coordination between two node groups.

Our goal is to empower your SRE/Devops team with the knowledge and tools necessary to maintain business continuity, boost overall performance, and foster a culture of innovation within your organization.

By following this comprehensive guide, you will:

  1. Understand the importance of application compatibility, backups, and cluster health in ensuring a successful upgrade.
  2. Gain insights into the significance of permissions, kubectl version, and the AWS Management Console, AWS CLI, or eksctl for a smooth transition.
  3. Master the art of managing two node groups, including canary node group selection, draining nodes, upgrading nodes, and validating the upgrade process.
  4. Develop foolproof rollback strategies, enabling your team to confidently handle failures and minimize downtime.
  5. Learn about Amazon EKS’s latest updates, best practices, and potential issues, ensuring your cluster remains secure, efficient, and up-to-date.

So, gather your SRE/Devops team, and let’s embark on this panicky journey of upgrading your Kubernetes cluster on Amazon EKS from version 1.24 to 1.25. Together, we’ll ensure zero downtime, maintain business continuity, and unlock the full potential of your technology infrastructure.

Pre-upgrade Checks and Prerequisites

  • Take note of every version of everything involved, especially the AMI release version of the node groups' instances; you might need these if you have to roll back (example commands follow after this list).
  • Application Compatibility: Ensure all applications are compatible with Kubernetes 1.25, referring to the release notes and application logs for deprecation warnings.
  • This includes:
    • Making sure the network overlay works with 1.25. Calico was notorious for wreaking unexpected havoc (and causing missed lunches) during this upgrade.
    • Make sure that all daemons and tools work with 1.25, for example:
    • FluxCD: A user encountered an issue upgrading the Prometheus Operator CRD to 0.65.1, which is a dependency of FluxCD 46.6.0 (Issue #3953).
    • Longhorn: A bug was reported with Longhorn 1.4.0 on Kubernetes 1.25 when using the enablePSP: false setting in the Helm chart (Issue #5185).
    • Rook-Ceph: Users faced an issue while upgrading rook-ceph to 1.9.10 on Kubernetes 1.25 (Issue #10826).
    • Make sure that your cluster did not force 1.24 to use docker instead of containerd.
    • Check the AWS release notes for specific issues – https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions-standard.html
    • Make sure that the node groups do not use P2 instances, as these have issues with the new NVIDIA drivers.
    • Backup: Back up all important data and configurations. This includes Kubernetes objects (on EKS the etcd instances are managed by AWS, so use a backup tool such as Velero or export the manifests rather than relying on etcd snapshots), persistent volumes, any Helm charts you may have, and any custom resource definitions (CRDs). CRDs extend the Kubernetes API with custom resources.
    • Cluster Health: Ensure that your EKS cluster is running smoothly. This includes checking that all nodes are ready, all system pods are running, and there are no ongoing resource constraints. Any existing errors are like a disease and will spread during the upgrade.
    • Permissions: Make sure you have the necessary IAM permissions to perform the upgrade. This includes permissions to update the EKS cluster and the node groups.
    • Beware of Cluster Security changes
      • Removal of policy/v1beta1: the policy/v1beta1 API is no longer served in 1.25. Any manifests or Helm charts that still create PodDisruptionBudgets via policy/v1beta1 must be migrated to policy/v1.
      • PodSecurityPolicy is removed in EKS 1.25. If you used PodSecurityPolicy, all of that work needs to be replaced before commencing the upgrade. Use the new Pod Security Admission (PSA), which replaces and simplifies cluster security (a namespace label example is shown after this list).
    • Additional Considerations from AWS Documentation
      • Instance Type Compatibility: Migrate from P2 to P3, P4, or P5 instances before upgrading due to NVIDIA driver changes.
      • API Priority and Fairness: Adjust workloads and APF settings to accommodate changes in LIST request handling.
      • AWS Load Balancer Controller: Upgrade to version 2.4.7 or later prior to the EKS cluster upgrade.
    • kubectl Version: Update kubectl to a version that supports both Kubernetes 1.24 and 1.25 (kubectl is supported within one minor version of the control plane). Best practice is to update kubectl after upgrading the control plane and before upgrading the node groups; skipping this for a single minor release from 1.24 to 1.25 should not break the cluster.
      • Also update kubectl in any automation, for example Jenkins or other pipelines that control Kubernetes by running kubectl in scripts.
    • Do a dry run upgrade with a dev cluster
    • For extra recoverability, configure a spare EKS cluster with all the same infrastructure options. In the event of a control plane upgrade failure or critical post-upgrade issues, you would otherwise need to provision a new EKS cluster with the previous Kubernetes version (1.24) and migrate workloads.
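
To make the version inventory and the PSP migration concrete, here is a minimal sketch using the AWS CLI and kubectl. The cluster name (my-cluster), node group name (ng-blue) and namespace (my-namespace) are placeholders, and "baseline" is only an example; pick the Pod Security Admission level that matches your old PodSecurityPolicy.

aws eks describe-cluster --name my-cluster --query 'cluster.version' --output text  records the current control plane version

aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name ng-blue --query 'nodegroup.[version,releaseVersion,amiType]' --output text  records the node group's Kubernetes version and AMI release version, which you will want for any rollback

kubectl get psp  lists any PodSecurityPolicies still defined; these stop working on 1.25

kubectl label namespace my-namespace pod-security.kubernetes.io/enforce=baseline  example of a Pod Security Admission label replacing PSP enforcement on one namespace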

Before starting the upgrade to EKS 1.25, make sure that the whole cluster is doing well.

Run

kubectl get nodes  shows the version on each node

kubectl get pods --all-namespaces  to check pods are up

kubectl version --short  checks the current client and control plane versions
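
If you want a quick way to surface anything unhealthy before you start, the following sketch is one option; the field selector simply hides pods that are Running or Succeeded, so an empty result is what you want.

kubectl get nodes -o wide  every node should be Ready and reporting v1.24

kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded  lists pods that are neither Running nor Succeeded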

Step-by-step Upgrade of Kubernetes on EKS

  1. Upgrade the EKS control plane to version 1.25 using the AWS Management Console, AWS CLI, or eksctl (a CLI sketch follows this list). For most of this document we assume the upgrade to EKS 1.25 is being done manually through the AWS Console UI. Hopefully this part goes well, because it is impossible to roll back an EKS control plane upgrade, even one that died halfway. So if you do not have a backup cluster, the only remaining option is to create a new control plane.
  2. Upgrade worker nodes sequentially, starting with one node group as a canary test. More exact rules are below:
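
If you would rather drive step 1 from the CLI than from the console, a minimal sketch looks like this; the cluster name is a placeholder and the update id comes from the first command's output.

aws eks update-cluster-version --name my-cluster --kubernetes-version 1.25  starts the control plane upgrade and returns an update id

aws eks describe-update --name my-cluster --update-id <update-id>  poll until the status is Successful

aws eks describe-cluster --name my-cluster --query 'cluster.version' --output text  confirm the control plane now reports 1.25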

Strategies for Handling the Two Node Groups

If the control plane upgrade completed successfully, you will see an “Update now” link next to each node group in the EKS console.

  • Canary Group: Select one node group for the initial upgrade and testing. This will serve as the canary in the coal mine: if it succeeds, you can upgrade the other.
  • Make sure that all pods and nodes are still in a good state. Re-run the kubectl commands from above.
  • Drain Nodes: Drain the nodes of the canary group one by one (see the sketch after this list) using:
    • kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  • Upgrade Nodes: Apply the Kubernetes version upgrade to the drained nodes. For this single task, using the AWS Console UI is wise.
  • To upgrade the node groups in Amazon EKS using the AWS Management Console, follow these steps:
    • Navigate to the Amazon EKS console and select your cluster.
    • Select Your Cluster: In the Amazon EKS console, you will see a list of your clusters. Click on the name of the cluster that contains the node group you want to upgrade.
    • Navigate to Compute: On the cluster’s overview page, look for the “Compute” section in the left-hand navigation pane.
      Click on “Compute” to expand the options.
    • Select Node Groups: Under “Compute”, you should see an option for “Node groups”. Click on “Node groups” to view all node groups associated with your cluster.
    • Click on the “Actions” button and select “Upgrade nodegroup”. You will see this ONLY if the control plane upgrade succeeded.
    • In the “Upgrade nodegroup” dialog box, select the desired version of Kubernetes for the node group and then click on “Upgrade nodegroup”.
    • Monitor the progress of the upgrade in the “Events” tab. Continue only after the upgrade has succeeded. As with any process like this in AWS, the amount of time it takes is unpredictable.
  • Uncordon Nodes: Allow scheduling of new pods on the upgraded nodes.
    • kubectl uncordon <node-name>
  • Validate Upgrade: Confirm nodes are running the new version and pods are healthy.
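
For the Drain Nodes step above: EKS managed node groups label their nodes with eks.amazonaws.com/nodegroup=<node-group-name>, so you can target only the canary group. A minimal sketch, assuming the canary node group is named ng-blue:

kubectl get nodes -l eks.amazonaws.com/nodegroup=ng-blue  lists only the canary group's nodes

for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=ng-blue -o name); do kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data; done  drains them one at a time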

Make sure that the whole cluster is doing well. Run:

kubectl get nodes  shows the new version on each node

kubectl get pods --all-namespaces  to check pods are up

kubectl version --short  checks the new client and control plane versions

  • Repeat for Second Group: Follow the same process for the second node group. If you prefer the CLI to the console, a sketch follows below.
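
If you prefer the CLI to the console for the node group upgrade itself, a hedged sketch (cluster and node group names are placeholders):

aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name ng-blue --kubernetes-version 1.25  starts a rolling update of the managed node group

aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name ng-blue --query 'nodegroup.[status,version,releaseVersion]' --output text  poll until the status is ACTIVE again and the version shows 1.25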

Rollback Strategies in Case of Failures

 

  • Immediate Detection: It is best to have great detection abilities to immediately see whether an upgrade has failed.
    • Implement monitoring and alerts to quickly identify issues post-upgrade.
    • Have a clear set of health checks for the cluster, nodes, and workloads.
    • If any issues occur during the upgrade of the first node group, you can roll back by cordoning the upgraded nodes and draining them, so that workloads move back onto the second, not-yet-upgraded node group.
  • Control Plane Rollback: In case of control plane upgrade failure, provision a new EKS cluster with version 1.24. Hopefully you had created a backup cluster to save recovery time.
  • Node Group Rollback: Cordon the upgraded nodes, drain the node group that failed, and bring its nodes back to version 1.24:
    • Drain Nodes: Drain the nodes in the first group one by one. This can be done using:
      • kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
      • If using managed node groups, you can roll back to the earlier version by selecting the previous Kubernetes version in the AWS Management Console or through the CLI (a hedged sketch follows below). Personally, I have not tried this, only read about it.
      • If there are issues downgrading the node groups, you can switch them back to the old AMI, which is effectively a rollback.
      • Uncordon the rolled-back nodes: once the nodes are back on the old version and verified to be stable, uncordon them to allow scheduling of pods again.

kubectl uncordon <node-name>
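
For the managed node group rollback mentioned above (which, as noted, I have not tried myself), the CLI call would presumably use the AMI release version you recorded before the upgrade. Whether EKS accepts moving back to the older release version may depend on your node group configuration, so treat this purely as a sketch; the cluster name, node group name and release version are placeholders.

aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name ng-blue --release-version <previous-1.24-ami-release-version>  attempts to roll the node group back to the recorded 1.24 AMI release version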

  • Data and Workload Rollback:
    • Ensure that any data or stateful services have rollback procedures in place, such as restoring from backups or snapshots (one example sketch follows after this list).
  • Post-Rollback Validation:
    • Run all health checks and monitoring to ensure that the rollback was successful and that the cluster is fully operational.
    • Validate that all services are running correctly and that there is no data corruption or loss.
  • Documentation and Communication:
    • Document the rollback process thoroughly for each component of the cluster.
    • Communicate with stakeholders about the rollback and any potential impacts.
  • Review and Analyze:
    • After the rollback, conduct a thorough review to understand what caused the failure.
    • Update the upgrade and rollback plans to incorporate lessons learned.
  • Have a second pre-configured cluster up, so that if the upgrade fails and destroys your cluster, you will be hours closer to getting it back up.
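
For the data and workload rollback item above, one concrete option (not prescribed by this plan, just an example) is Velero. Assuming it is installed and a backup was taken before the upgrade, the commands would look roughly like this; the backup name and namespace are placeholders.

velero backup create pre-upgrade-1-24 --include-namespaces my-app  taken before the upgrade starts

velero restore create --from-backup pre-upgrade-1-24  restores the backed-up objects and volumes if a rollback is needed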

This whole process can also be done with Terraform, the CLI, or even via the API; I have given instructions only for the UI.

Conclusion

This plan is designed to minimize downtime and ensure a smooth transition to Kubernetes 1.25. It is crucial to test the upgrade process in a staging environment before proceeding in production.

 
