With time, any active Kubernetes cluster goes through a recurring series of deployments and periodic scale-out, which translates to adding or removing pods and nodes from the cluster. This cycle generally introduces several inefficiencies in the clusters, affecting both performance and costs. Of these, one of the most significant, but rarely diagnosed problems, is that of resource fragmentation.
Inadvertently, pods end up getting scheduled across nodes in such a way that for any new pod, ALL of the resources requested by it are unavailable at any single node, making the pod un-schedulable. Even though your overall Kubernetes cluster might have much more capacity available across the nodes, a scale-up is still needed. It creates a "pseudo" resource crunch that could be avoided by rescheduling the pods across the nodes. Especially in large-scale clusters, this fragmentation can waste a lot of resources and incur unnecessary infrastructure costs.
In this article, I’ll explain how inefficient and fragmented Kubernetes clusters can develop, how to identify them, and a practical solution to rebalance your clusters in order to improve Kubernetes performance and reduce cloud costs.
To understand the problem, let's look at Kubernetes schedulers. Kubernetes schedulers determine the optimal node where your pod should be placed, honoring the resource requests, as well as achieving a tight packing of pods (to reduce unused fragments). It’s a two-step process:
- Filtering: Kubernetes schedulers filter the nodes that can fit the new pod in terms of their available resources (called candidate nodes).
- Scoring: Schedulers score all the above nodes according to various metrics, such as memory pressure, disk usage, docker image availability, and several others. The node with the highest score is selected and the pod is placed there.
The scoring step effectively determines the node selection among multiple candidates. Hence, the entire scope of any optimization in the scheduling process lies with the scoring algorithm, which can be modified by the user to create their own custom schedulers for their specific needs.
However, if you observe closely, optimization is only possible when there are a sufficient number of candidate nodes to place the workload. Without any options, scoring is irrelevant. As resources decrease or the number of pods increases, the available choices to place the pods reduce significantly.
Consequently, all optimizations start getting applied on a best effort basis, since the pod has to be placed somewhere. Once this calculated way of filling the nodes with pods starts getting deviations in bin packing, it leaves small fragments of unused resources, which has an avalanche effect on the upcoming pods.
Problem one: Placement failure
Let’s take a look at a real-world example. In the image below, you can see we have three nodes with a total of 1000m of CPU and 2GB RAM each. Nine pods (blue) are currently running with their respective resource “requests.” A new pod (orange) is requesting 300m of CPU and 600MB of RAM, but remains un-schedulable. This is because none of the nodes has both 300m CPU and 600MB RAM. This is in spite of the fact that the entire cluster has a total of 600m and 1200MB of RAM available.
What happens next?
Problem two: Imbalanced placement
The pod remains pending for resources until more capacity is added to the cluster. If your clusters have Auto Scaling Groups (ASGs) configured, Kubernetes would trigger a scale-up, adding more nodes to the cluster, resulting in additional infrastructure costs.
In general, such a cost increment is anticipated, as more workloads need to be run. However, your cluster has many more resources already present, albeit in a fragmented fashion. If we consolidated, we could’ve placed the pod in the cluster without having to add additional capacity.
In this case, capacity addition is not only futile and incurs more costs, but it also creates an imbalance in the system—where three of the existing nodes have significant resources consumed but newly added ones don't. An extreme example of this could look something like the below.
As we see, due to the deviations from the original and ideal placement plans, we get “pseudo” resource crunches. Trying to fix this by adding nodes results in imbalances in the system, which can cause even more deviations from the most ideal placements possible, and this causes a runaway condition, causing more and more imbalances and deviations.
Deviations → Pseudo Resource Crunch → Imbalance → Deviations → and on...
So, what can be done?
Rebalancing fragmented Kubernetes clusters
Both of the problems we’ve just covered (placement failure and imbalanced placement) are a result of the same issue: schedulers can’t predict the future. Kubernetes will never know all the nodes that will be added or all the resource footprint pods you’re going to run, which means no scheduler can plan for it. As such, there are bound to be inconsistencies in scheduling. What we need is a solution that can periodically identify these situations and reorganize outside of Kubernetes.
For example, just by moving Pod A from Node1 to Node2, we’d have consolidated our required resources (400m, 900MB) in Node1, and the pending Pod X would be comfortably placed on Node1 by the scheduler.
We could use the same migration approach by swapping CPU and memory-intensive pods for a healthier balance of resources across all nodes.
Rebalancing fragmented Kubernetes clusters at scale with automation
Now it may seem like a fairly straightforward approach, but identifying such options is almost certainly never this easy even in medium-scale clusters. For example,
- We might need to migrate two or three small pods across multiple nodes to fit a larger pod.
- We might not be able to fit the new pod at all in some cases.
- Identifying imbalances across multiple resources and thousands of pods is nearly impossible.
As such, we’ve devised a couple of algorithms—the placement engine and the balancing engine—that can automatically generate migration plans to solve for placement failure and imbalanced placement within your Kubernetes clusters. Both algorithms can be run as another pod in the cluster or connected separately via exposed APIs. They interact with Kubernetes via kube-proxy to get the cluster state information and perform an in-memory simulation of the migrations, returning the best possible migration plans.
Users have full flexibility in excluding the pods and nodes that don’t need to be included in this process by appropriately tagging them. Approved migrations are then executed by the system in a step-by-step way by altering the node affinities, keeping the track in a stack to revert in case of failures.
The placement engine
As the name suggests, the placement engine provides a step-by-step migration plan of pods across nodes in order to fit a new pod that cannot be placed anywhere in the pod. Now, as we know, bin packing is an NP-Hard problem, we’ve had to use a few approximations in order to get the algorithm running reasonably fast (points three and four). So how does the placement engine work?
- Combines multiple resource values into a single dimension via pairing functions. This is called “effective capacity.”
- In order to place any pod, the engine always removes any existing pod/group of pods with a total “effective“ capacity lesser than the incoming pod. This guarantees that in the next recursive call, we’ll have a lesser capacity to place in the cluster, and as such, the problem has a reasonable chance to converge.
- Uses dynamic programming to avoid futile computations. The result of each recursive call is cached and used if required.
- The engine can stop at a pre-defined depth of the branch to reduce runtime. However, whether this is required depends on the way the algorithm is run (like a background task periodically in off-peak hours).
- A record is kept of the number of migrations in each solution that converges and is able to place the pod. The migration plan with the least number of migrations is suggested.
The balancing engine
The balancing engine reduces the inconsistencies and imbalances in the clusters by carrying out a series of swaps—swapping out two pods on different nodes. The intention is to have a healthy balance of different types of resource-consuming applications together so that all types of resources are consumed fairly (e.g. we don’t want a node with multiple CPU-intensive applications, since they will starve among themselves). So how does the balancing engine work?
- Calculates the most ideal scenario of pod placement with respect to resource balancing. In the case of two resources (CPU and memory) it can simply be a ratio. Based on the overall requests of the pods, it calculates what should be the ideal ratio for each node to have. Placing them on the number line, node placement will be something like the above image.
It then calculates a mathematical construct called Entropy that determines the overall imbalance in the system, for each node and the system. This is the total distance of all the nodes from the system pivot.
Then the engine carries out a series of swaps of pods on nodes on both sides of the system pivot, to see if it reduces the overall deviation (entropy) in the system. The idea is to swap the most memory-intensive pod from the rightmost node with the max CPU-intensive pod on the leftmost node, as these are the ones contributing to the imbalance, and a swap is guaranteed to reduce overall entropy. This is continued for a pre-defined number of times or until no significant reduction in entropy is observed. The following figure shows the step-by-step entropy reduction in a cluster with 50 iterations.
We’ve used this methodology in one of our test environments, where we successfully avoided several node scale-outs. To get an estimate of the costs we saved, we calculated the total hours of the hypothetical node(s) we saved by avoiding the scale-out (until the time it was absolutely necessary). Then using CloudHealth Perspectives, we calculated the total costs we would have had to spend otherwise. It turned out to be about $5,900 for our cluster projected yearly—a significant 17%!
This methodology was presented in more depth during the Kubernetes Forum Bengaluru 2020 earlier this year, and we’re planning to open-source these solutions in the near future. Stay tuned!