Kubernetes — A Journey Has Just Begun
It was the summer of 2023 when the Infrastructure Team had just finished a long discussion about future plans, visions, dreams, and OKRs. Exhausted but hopeful and longing for new horizons, they conceived the project idea of "Container Journey."
Deciding that the first stop in the Journey should be a CI cluster, the team started drafting plans. Plans on how they would get there, what kind of underlying infrastructure they would need, how each Container would be loaded and executed, and what tooling would ensure that stress levels would be kept as low as possible during the Journey.
However, it would not be until September of that same year that the Journey actually started. In just a few days, a management ArgoCD cluster and many new VPCs were created, peering connections with VPN VPCs were established, and “cargo” — tooling like Datadog Agent & Fluent Bit — was loaded onto the cluster. The EKS (Elastic Kubernetes Service) cluster was thus ready to embark! Coincidentally, it was around the same time I arrived at HackerOne, thirsty for adventure.
First Few Days at the Helm
We were just out of port when the challenge ahead of us started looking colossal! How would the GitLab Runners be properly installed in the Kubernetes cluster, and how could we ensure that they had all the permissions they required? We were in need of proper navigation techniques in this new and endless landscape.
At first, we thought we should set this up manually so that we could gain a deeper understanding of all the different components that needed to be glued together: Containers, Pods, IAM roles, EC2 instances, Kubernetes RBAC, and more. We sailed in circles for a few days, attempting to poke at the issue from different angles and keeping notes on all the experiments we were conducting.
Having made some progress, but only a little, we were exhausted and frustrated. I remember thinking, “Creating a simple GitLab Runner Container shouldn’t be this hard! I’ve done it before; this isn’t optimal.”
Thus, we shifted our focus. “We have been looking at this the wrong way all along!” we concluded. “We should just be using the official Helm Chart,” we said, and agreed to make our lives simpler by offloading the manual work to what GitLab itself provides and what a whole community out there already uses.
We regrouped, swarmed, and pair-programmed. We dived deep into impromptu Slack huddles and pre-arranged Zoom meetings, and took lots of notes that we passed around to the whole team. We quickly started seeing progress. A couple of days in, we had our first Runner up and running. A couple of days later, we started looking into making our new Runner as secure as possible. Two weeks after we pivoted, we had our first job, from the infrastructure repository, running on Kubernetes. We wrote proper documentation, then opened the champagne and kicked back for a few days.
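For anyone curious what that offloading looks like in practice, here is a minimal sketch of a values file for the official GitLab Runner Helm chart. Everything below is illustrative rather than our actual configuration: the URL, namespace, and numbers are placeholders, and the runner token is supplied separately (for example, through a Kubernetes Secret) and omitted here.

```yaml
# Illustrative values.yaml for the official gitlab-runner Helm chart.
# A typical installation (assuming a repo alias "gitlab" and namespace "gitlab-runner"):
#   helm repo add gitlab https://charts.gitlab.io
#   helm upgrade --install ci-runner gitlab/gitlab-runner \
#     --namespace gitlab-runner --create-namespace -f values.yaml
gitlabUrl: https://gitlab.example.com/   # placeholder GitLab instance URL
rbac:
  create: true        # let the chart create the Kubernetes RBAC objects the Runner needs
concurrent: 30        # illustrative cap on how many jobs this Runner handles at once
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab-runner"
        image = "alpine:latest"   # default job image; jobs normally bring their own
        privileged = false        # the "simple jobs" Runner stays unprivileged
```

With something like this, the chart takes care of the Deployment, ServiceAccount, and RBAC plumbing we had been stitching together by hand.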
Catching the Wind
Soon after, we started wondering. Sure, yes, we had managed to create a complete and secure GitLab Runner, capable of running simple jobs. But was it seaworthy? We knew the answer was no. We had to come up with something better, something bigger, something that would take the Core project’s intricate jobs and run them like there’s no tomorrow.
So, we created another Runner. This one was almost identical to the first, but not created equal: it was designed to run complex jobs and thus had more privileges. We would, of course, keep both Runners, as each would serve a different purpose.
And then it was time to look at scaling our operations. A single MR (Merge Request) pipeline of the Core project runs more than 180 jobs, around 170 of which start at the same time. Each of these jobs translates to a Kubernetes Pod. All of these Pods need CPU and Memory, which Nodes (EC2 instances) provide. Moreover, we’ve often observed 6 or more pipelines running at the same time. Thus, our next challenge was clear: how could we provide compute power that — very importantly — scales up quickly when a pipeline starts and — equally importantly — scales down when no jobs are running? One of our goals as a team has always been to optimize our infrastructure, instead of throwing money at performance problems.
We decided to add another tool to our cluster: Karpenter. Even though Karpenter was in pre-beta at the time, it seemed very promising, and the community had already started seeing great value in it thanks to its architectural decisions and its seamless integration with AWS. It is able to create new Nodes in less than a minute and lets us fine-tune scaling to our heart’s content. However, we still struggled to upgrade it to the latest beta version — it’s one of the challenges we often face as Infrastructure Engineers in our never-ending attempt to keep all the tooling up-to-date, as bugs are frequently introduced and then patched in subsequent minor version updates.
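As a rough illustration of the kind of configuration this involves, here is a sketch of a Karpenter NodePool using the beta API. The instance categories, CPU limit, and consolidation settings below are placeholders for illustration, not our actual setup.

```yaml
# Illustrative Karpenter NodePool (beta API); all names and values are placeholders.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ci-jobs
spec:
  template:
    spec:
      nodeClassRef:
        name: ci-jobs                 # hypothetical EC2NodeClass holding AMI/subnet/SG details
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m"]          # illustrative instance families
  limits:
    cpu: "1000"                       # cap on the total vCPUs this NodePool may provision
  disruption:
    consolidationPolicy: WhenEmpty    # remove Nodes once no jobs are running on them
    consolidateAfter: 30s
```

Pending CI Pods that match these requirements prompt Karpenter to launch suitable EC2 instances, and the disruption settings take them away again once the pipelines quiet down.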
Eventually, already two months into our Journey towards a modernized CI, we had our EKS cluster, the GitLab Runners, Nodes scaling up and down based on demand, and good documentation to go with it all. We saw land on the horizon. It was time to take a short break and prepare for port. Avast ye scallywags! ’Twas not for the sea to be calm for long!
The Calm Before the Storm and the Storm Before the Calm
Nearing our Christmas holidays, we all longed for quiet seas. That’s not the life we’ve chosen, though. And there it was on the horizon, a storm brewing — IP exhaustion! It turned out that a small thing we had overlooked in our initial design had created trouble in our cluster. Trying to be frugal, we had created a small VPC with just a few available IPs in its two subnets, but now that we wanted to scale to hundreds of Nodes and thousands of Pods, there were not enough IPs to hand out. We had to recreate the whole cluster!
It sounds scary, but having gotten our sea legs a long time ago, we had everything in IaC (Infrastructure as Code). Be it Terraform, Helm Chart configuration, or straight YAML manifests, everything is documented, peer-reviewed code. Thus, we managed to have the whole VPC, its peering connections, and the whole EKS cluster with its tooling up and running in half a day! That was an affirmation of the progress we had been making.
We got through the storm unscathed and in a pristine state. For the next couple of weeks, we shifted our attention to other urgent matters, as we were also nearing the third month of Q4, and OKRs needed to be pushed past the finish line.
Life at the Docks
The winter holidays came and went, and we all managed to relax for a few days. With renewed enthusiasm and spirit, we started putting Core’s jobs on the Kubernetes Runners, slowly but surely. MR after MR was merged into the develop branch, and soon enough, we had almost reached our OKR, too.
We docked at port, but did not rest. We knew we needed to do maintenance on our ship after the long Journey it had just pulled us through. We created alerts for the EKS cluster and its components. We wrote incident response playbooks to go with these alerts. We wrote a patch management procedure to ensure we keep our tools updated. We wrote even more documentation to easily propagate what we learned during our Journey to anyone interested in the topic: troubleshooting guides, explanations of concepts, reasoning behind decisions, and more.
And then we also started making fixes. By continuously examining the setup’s behavior in action, we had found holes in our design. Yes, we had measured & examined the performance of the cluster and were satisfied with it, but we also depend on a continuous feedback loop to keep improving ourselves and the company’s infrastructure. Our aim is to have a slim, fit, and performant setup.
We revised our Karpenter configurations once more, to further cut down on costs and, at the same time, shorten pipeline duration. We did this by rightsizing instances based on the Core project’s pipeline jobs, and by rightsizing the jobs themselves, allocating just enough CPU and Memory to each. This made our Nodes handle workloads even more efficiently: CPU and Memory usage now increase proportionally, and Nodes are utilized to the maximum of their capacity.
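For a concrete (and entirely made-up) example of what rightsizing looks like at the Pod level, the sketch below shows the requests and limits that end up on a job’s Pod; in a GitLab Runner setup these typically come from the Kubernetes executor’s settings (such as cpu_request and memory_request) rather than a hand-written manifest.

```yaml
# Illustrative Pod-level view of a rightsized CI job; the numbers are made up.
apiVersion: v1
kind: Pod
metadata:
  name: core-rspec-job     # hypothetical job name
spec:
  restartPolicy: Never
  containers:
    - name: build
      image: registry.example.com/ci-image:latest   # placeholder job image
      resources:
        requests:
          cpu: "2"          # what the scheduler reserves on the Node for this job
          memory: 4Gi
        limits:
          memory: 6Gi       # guardrail against a runaway job starving its neighbours
```

Keeping requests close to what a job actually uses is what lets the Nodes pack jobs tightly without either CPU or Memory becoming the lone bottleneck.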
We tackled completely new and exciting issues: some with resource allocation (CPU and Memory) that required lots of exact calculations, and others with the network bandwidth limits of ENA (Elastic Network Adapter) interfaces that made us dive deep into AWS capacity planning strategies.
There are still a couple of things we want to do to ensure a higher-standard Developer Experience and a more fault-tolerant setup, and we are tirelessly working on them.
However, we have observed that job queuing time has decreased. We have observed that the average job duration has decreased too. And, after a few rounds of fixes and improvements, daily AWS costs have also started going down, now sitting below the average daily costs of the old runners.
And thus we also reached our goals and OKR, almost without realizing it. It felt bittersweet. A Journey had ended.
Longing for Adventure
Even though we’ve now been docked for a month, still busy fixing and improving not only our CI cluster but the whole platform, our longing for adventure whispers in our ears.
We reminisce about our Journey, realizing it was a tough and unpredictable one. Not only did we dive deep into a completely new, huge, and modern topic, but we also chose the CI cluster as the first stop. The CI cluster that needs to instantly scale from 0 to 100, quite literally in terms of Nodes. The CI cluster that handles far more than one type of workload, and needs to be extra secure because of that. The CI cluster that will, at peak times, have more Pods than any other cluster we will create in the future, probably even all of them combined.
However, reflecting now and realizing that we handled this challenge well, we bravely look to the horizons again. The horizons that hold not only more jobs and pipelines for our CI cluster, but also exciting new ideas, methodologies, and tools!
We know what’s out there to discover: Graviton instances and the ARM architecture that will further increase efficiency and drop costs, horizontal autoscaling based on sensible metrics like queue size or latency, blue-green and canary deployments for improved releases, dynamic development environments that will further enable software development, vertical autoscaling based on resource usage, admission controllers and service meshes for added security constraints, and much more.
We know the next Kubernetes journey will be challenging too. And the one after that as well. But we feel confident in taking on these challenges, trusting in our teamwork, our eagerness, and our capacity for deep work to overcome them.
We long to board our ship once more, hoist the sails, climb the masts, and face what the endless ocean of technology will throw against us.
We shall soon set sail again, and we invite everyone to come aboard our ship.