Kubernetes Through Control Theory Glasses — Part 1: HPA
A walk through control theory and the math hiding in the HPA reconcile loop.

One Saturday morning I was fixing my coffee grinder while watching James Hoffmann explain how a PID controller works in an espresso machine — using a ship navigation system as the example.
Something clicked.
When I'm not brewing coffee, part of my day job is writing Kubernetes operators. And somewhere between the proportional term and the integral term, I realized I'm facing the same problems the best engineering minds started tackling in the 18th century. It wouldn't have been such a surprise if I'd paid more attention in my final years at university. But we are where we are.
The Kubernetes reconcile loop is a feedback control loop. You declare a desired state (the setpoint), the system continuously observes actual state (the process variable), computes the difference (the error), and takes corrective action (actuation). This is the same pattern that's been studied since James Watt put a centrifugal governor on a steam engine.
I went looking to see whether control theory could offer something Kubernetes was missing. Something I was the first to think about.
I was not the first one, of course. So this is an attempt to look at Kubernetes autoscaling through the lens of control theory.
Quick Recap
If you haven't watched the video I'm referring to, here's the ship navigation example from it.
You're steering a ship toward a heading. You're 30 degrees off course. The proportional (P) controller says: "I'm 30 degrees off, so I'll turn the wheel proportionally to the error." Big error, big correction. As you get closer to the heading, the correction shrinks.
But the ship has inertia. By the time you've reached the correct heading, you're still turning — so you overshoot. Now you're 10 degrees off the other way. You correct again, overshoot again. The ship oscillates around the heading, zigzagging forever. And if there's a constant crosswind pushing you off course, the proportional controller settles at a small persistent error — never quite reaching the heading, because it needs some error to generate the corrective force that counteracts the wind.
The integral (I) term fixes the crosswind problem. It accumulates the error over time. Even if the error is tiny, if it persists, the integral grows until the controller pushes hard enough to eliminate it completely. "We've been 2 degrees off for the last 10 minutes — that's not noise, that's a real offset. Apply more rudder."
The derivative (D) term fixes the overshoot. It looks at how fast the error is changing. If the ship is approaching the correct heading rapidly, the derivative says "we're converging fast — ease off the rudder now, before we overshoot." It's a brake. It also works in reverse: if the error is growing quickly (a sudden gust), the derivative reacts to the rate of change before the proportional term has fully caught up.
P reacts to the present, I remembers the past, D anticipates the future. With that in my head, I opened the Kubernetes codebase trying to see it from a different angle.
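For reference, the textbook form — the thing I was hoping to find hiding in the code — is just the three terms added together:

output = Kp * error + Ki * integral(error) + Kd * d(error)/dt

where Kp, Ki, and Kd are the gains you tune. Keep that shape in mind.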
First Approximation: HPA as a Proportional Controller
The first thing that comes to mind when you think about the feedback loop in Kubernetes is the HPA. The Horizontal Pod Autoscaler has a core formula:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
To me it looked like a proportional-only controller. The output is proportional to the utilization ratio between the current metric and the target metric.
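Transcribed into Go — a toy version of the published formula, not the controller's actual code path:

package main

import (
    "fmt"
    "math"
)

// desiredReplicas scales the current replica count by the ratio of
// observed metric to target metric, always rounding up.
func desiredReplicas(current int32, currentMetric, targetMetric float64) int32 {
    return int32(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
    fmt.Println(desiredReplicas(4, 80, 50)) // ceil(6.4) = 7
}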
If that framing holds, control theory predicts specific problems.
Drawback 1: The Deadband Trade-off
In a classical P-controller, the output is proportional to the error. No error, no output. In a system with constant disturbances — varying traffic, noisy metrics, changing load patterns — the controller settles into a compromise: enough output to counteract the disturbance, but not enough to fully eliminate the error. The residual gap is called "offset" or "droop."
In Kubernetes, two distinct phenomena force a similar trade-off: quantization creates a real persistent offset, and metric noise would create oscillation if reacted to.
Quantization. Replicas are integers. Say the target is 50% CPU, and you have 3 pods at 70%:
usageRatio = 70 / 50 = 1.4
desiredReplicas = ceil(1.4 * 3) = ceil(4.2) = 5
With 5 pods the actual utilization becomes ~42%, not 50% — an 8 percentage-point gap. Now run the formula again: the ratio is 42/50 = 0.84, and ceil(5 × 0.84) = 5. The replica count is frozen — not because the error is small, but because ceil() always rounds up and there's no smaller integer the formula will pick. You're stuck with an offset the control law itself can't correct.
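You can watch the freeze happen by iterating the formula. A toy simulation, assuming total demand holds constant at 210 utilization-points (3 pods × 70%):

package main

import (
    "fmt"
    "math"
)

func main() {
    const totalDemand = 210.0 // 3 pods at 70% CPU, held constant
    const target = 50.0
    replicas := 3.0
    for i := 0; i < 4; i++ {
        utilization := totalDemand / replicas // per-pod utilization after scaling
        replicas = math.Ceil(replicas * utilization / target)
        fmt.Println(replicas) // 5, 5, 5, 5 — frozen, 8 points under target
    }
}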
Metric noise. CPU utilization isn't a steady signal. Garbage collection pauses, bursty request patterns, OS scheduling jitter — real workloads fluctuate between, say, 48% and 53% every cycle. Without some tolerance, the controller would react to every jitter: scale up on 53%, scale down on 48%, scale up again.
Two different problems, one shared answer: a deadband. It accepts the quantization offset (the control law can't fix it anyway) and ignores the noise (so we don't react to jitter). One mechanism, two purposes.
The HPA introduces a tolerance that deliberately ignores changes when the ratio between current and target metrics falls within ±10%. If your target is 50% CPU and you're at 54%, the ratio is 1.08 — within tolerance. No action.
In control theory, this is a deadband. The controller stops reacting to small errors, which prevents noise amplification and limit cycling — the constant scale-up/scale-down thrashing that would otherwise burn through your quota and your team's patience. The trade-off is explicit: small errors persist within the band. Give up on perfection to gain stability.
Since Kubernetes 1.33, the HPAConfigurableTolerance feature lets you tune this per direction. You might set scale-up tolerance to 5% (react sooner when load increases) and keep scale-down tolerance at 15% (tolerate over-provisioning longer). Smaller tolerance means less offset but more risk of oscillation.
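If I'm reading the alpha API correctly, the per-direction knobs hang off the behavior stanza — a sketch, using my numbers from above:

behavior:
  scaleUp:
    tolerance: 0.05   # react once the usage ratio climbs past 1.05
  scaleDown:
    tolerance: 0.15   # hold off until it drops below 0.85

Tighten these and you trade offset for oscillation risk — which leads to the next drawback.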
Drawback 2: High Gain Leads to Oscillation
High proportional gain means the controller reacts aggressively to small errors. The output overshoots, which creates an error in the opposite direction, which causes another correction, and the system oscillates around the setpoint.
Let's trace through an example. Target CPU 50%, starting from 4 pods.
Cycle 1: 4 pods running at 80% CPU average.
desiredReplicas = ceil(4 * (80 / 50)) = ceil(6.4) = 7
Scale to 7 pods. Makes sense — you need more capacity.
Cycle 2: The 7 pods are running, but the original spike has subsided and each pod is at 35%.
desiredReplicas = ceil(7 * (35 / 50)) = ceil(4.9) = 5
Scale down to 5. Overshot.
Cycle 3: Load returns, and with only 5 pods, CPU climbs back to 65%.
desiredReplicas = ceil(5 * (65 / 50)) = ceil(6.5) = 7
Back to 7. Each cycle looks only at the current error, with no memory of what happened last cycle and no awareness that it's been bouncing between 5 and 7.
Even with the default 10% tolerance, real workloads still produce spikes that exceed it. A GC pause might push CPU to 80% for a few seconds, triggering a scale-up. By the time new pods are ready, the spike is gone and the system wants to scale back down.
Kubernetes addresses this with two layers.
The stabilization window is a moving-maximum filter. Here's something that isn't obvious if you only look at the documentation: the HPA doesn't directly scale your deployment every cycle. Each cycle it computes a recommendation — the desired replica count based on current metrics. But before acting on it, HPA looks back at all recommendations it has computed over the last stabilization window and picks the most conservative one:
// From the HPA controller's stabilization pass: scan every recommendation
// recorded inside each direction's window, keep the most conservative one.
for _, rec := range a.recommendations[key] {
    if rec.timestamp.After(upCutoff) {
        // scale-up window: the lowest recent recommendation wins
        upRecommendation = min(rec.recommendation, upRecommendation)
    }
    if rec.timestamp.After(downCutoff) {
        // scale-down window: the highest recent recommendation wins
        downRecommendation = max(rec.recommendation, downRecommendation)
    }
}
For scale-down, the default behavior is important: HPA keeps the highest recommendation from the last 5 minutes, which acts like a rolling maximum and prevents immediate downscale after a short-lived drop. The same mechanism exists for scale-up too, where it would choose the lowest recent recommendation inside the configured scale-up window. But by default that window is 0 seconds, so scale-up is effectively immediate unless you configure otherwise.
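Concretely, take the 4→7→5→7 trace from earlier: if the recommendations 7, 5, 7 all land inside a 5-minute scale-down window, the controller picks max(7, 5, 7) = 7 — the dip to 5 never reaches the deployment.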
Rate limiting is slew rate clamping. Even after stabilization, HPA caps how fast you can scale:
func calculateScaleUpLimit(currentReplicas int32) int32 {
    return int32(math.Max(scaleUpLimitFactor*float64(currentReplicas), scaleUpLimitMinimum))
}
Where scaleUpLimitFactor is 2.0 and scaleUpLimitMinimum is 4. You can at most double your replica count per cycle, with a floor of 4 pods. With behavior policies, you get finer control: "add at most 4 pods per 60 seconds" or "remove at most 10% per minute." It's worth noting that rate limiting and derivative control aren't equivalent mechanisms — a derivative term opposes rapid changes in the error signal (it reacts to what's happening in the system), while rate limiting caps changes in the controller output (it doesn't care why, it just enforces a speed limit).
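In the autoscaling/v2 behavior stanza, those two quoted policies look like this:

behavior:
  scaleUp:
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
  scaleDown:
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60

Each policy caps the change over its period; when multiple policies are set for a direction, the most permissive one wins by default.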
Drawback 3: Low Gain Leads to Slow Response
Low proportional gain reduces oscillation but makes the system sluggish. If load doubles suddenly, a low-gain controller takes many cycles to catch up.
The HPA's tolerance band is exactly this trade-off. Inside the ±10% band, the controller's effective gain is zero — it ignores the error entirely. That's what kills oscillation, but it's also why a deviation of 8% sits there indefinitely. The same mechanism that buys stability buys sluggishness. Tighten the band and you regain responsiveness at the cost of jitter; widen it and you're slow.
On top of that, the stabilization window adds a second layer of damping: even when the controller does react, it picks the most conservative recommendation from the last several minutes rather than the freshest one. And rate limits cap how fast that recommendation can translate into actual pods. Three layers, each one trading speed for stability.
What saves HPA from being uniformly slow is asymmetric defaults. Scale-up stabilization is 0 seconds — react immediately. Scale-down is 5 minutes — wait and be sure. The K8s logic: "I'd rather waste money on extra pods than risk an outage from having too few." The behavior API lets operators tune each direction independently: aggressive scale-up paired with conservative scale-down.
What If HPA Were a PID Controller?
All three drawbacks — offset, oscillation, slow response — are exactly what the integral and derivative terms were designed to fix. The 4→7→5→7 oscillation kept nagging me. What would it take to converge smoothly?
I started sketching a PID-based autoscaler CRD. A Prometheus query as the metric source, an explicit setpoint, and three gain knobs — kp, ki, kd — controlling the proportional, integral, and derivative response:
spec:
  scaleTargetRef:
    name: payment-event-worker
  metric: "sum(kafka_consumergroup_lag{consumergroup='payment-events'})"
  setpoint: 500
  pid:
    kp: 0.15
    ki: 0.05
    kd: 0.10
The obvious question I couldn't answer: what should kp, ki, and kd actually be? These aren't configuration in the usual Kubernetes sense — they're control parameters. A bad value isn't merely suboptimal; it can make the system unstable. I put in placeholders for now.
Then I started thinking about what happens when you actually run it.
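For reference, the reconcile math I had in mind — a minimal textbook sketch, not production code:

package pid

import "time"

// pidState is the controller's memory between reconciles.
type pidState struct {
    integral  float64   // accumulated error — the I term's memory
    prevError float64   // previous error — feeds the D term
    prevTime  time.Time // wall clock of the last reconcile
}

// step returns the suggested change in replicas for one cycle.
// A positive error (e.g. consumer lag above the setpoint) pushes the
// output up. kp, ki, kd come straight from the CRD spec.
func (s *pidState) step(measured, setpoint, kp, ki, kd float64, now time.Time) float64 {
    err := measured - setpoint
    dt := now.Sub(s.prevTime).Seconds()
    if dt <= 0 {
        dt = 1 // first reconcile, or clock skew: assume one second
    }
    s.integral += err * dt
    derivative := (err - s.prevError) / dt
    s.prevError, s.prevTime = err, now
    return kp*err + ki*s.integral + kd*derivative
}

Every problem in the next section attacks some line of this function.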
The Walls
Dead Time
There's a delay between requesting more pods and seeing their effect in the metrics. In control theory, this is dead time — and of all the problems a controller faces, it's the one that most often defeats naive PID tuning.
Think of a shower with old plumbing. You turn the hot water knob. Nothing changes. You wait. Still cold. You turn it more. Nothing. You crank it — and all at once, scalding water arrives. You jerk it back. Now you're oscillating between extremes, and you can't stabilize because every adjustment takes seconds to show up. The problem isn't your technique. It's the gap between action and result.
In Kubernetes, dead time is everywhere:
| Scenario | Dead time |
|---|---|
| Image is cached on node | ~2 seconds |
| Image needs to be pulled | ~30 seconds |
| Node exists, pod needs scheduling | ~5 seconds |
| Cluster is full, Cluster Autoscaler requests a new VM | 2-5 minutes |
| Cloud provider hits quota limit | Infinite (pod stays Pending forever) |
Aaaaand — dead time is non-deterministic. You don't know which scenario you're in until you're in it. A controller tuned for 5-second dead time will overshoot catastrophically when the actual delay is 4 minutes.
How does HPA handle this? We've already seen part of the answer — the stabilization window reduces the controller's bandwidth, making it respond more slowly so it doesn't overreact during the delay.
But HPA also does something domain-specific. Before computing the desired replica count, it classifies every pod into categories:
func groupPods(pods []*v1.Pod, metrics metricsclient.PodMetricsInfo, ...
) (readyPodCount int, unreadyPods, missingPods, ignoredPods sets.Set[string]) {
    for _, pod := range pods {
        if pod.DeletionTimestamp != nil || pod.Status.Phase == v1.PodFailed {
            ignoredPods.Insert(pod.Name) // deleted or failed — skip entirely
        } else if pod.Status.Phase == v1.PodPending {
            unreadyPods.Insert(pod.Name) // still scheduling — in the pipeline
        } else if _, found := metrics[pod.Name]; !found {
            missingPods.Insert(pod.Name) // running but no metrics yet
        } else {
            readyPodCount++ // ready — count normally
        }
    }
}
A detail worth noting: the HPA formula uses readyPodCount — the number of pods actually reporting metrics — not the total replica count. When all pods are healthy, they're equal. The distinction matters precisely here, when pods are in flight.
There's also a CPU-specific rule: pods in their initialization window (the kube-controller-manager's --horizontal-pod-autoscaler-cpu-initialization-period, 5 minutes by default) may have their metrics ignored entirely, because startup spikes — JVM warmup, cache loading, dependency initialization — would corrupt the autoscaling signal.
Then HPA injects assumptions based on scaling direction:
if len(missingPods) > 0 {
    if usageRatio < 1.0 {
        // on a scale-down, treat missing pods as using 100% of the resource request
        for podName := range missingPods {
            metrics[podName] = metricsclient.PodMetric{Value: requests[podName]}
        }
    } else if usageRatio > 1.0 {
        // on a scale-up, treat missing pods as using 0% of the resource request
        for podName := range missingPods {
            metrics[podName] = metricsclient.PodMetric{Value: 0}
        }
    }
}

if scaleUpWithUnready {
    // on a scale-up, treat unready pods as using 0% of the resource request
    for podName := range unreadyPods {
        metrics[podName] = metricsclient.PodMetric{Value: 0}
    }
}

// re-run the utilization calculation with our new numbers
newUsageRatio, _, _, err := metricsclient.GetResourceUtilizationRatio(metrics, requests, targetUtilization)
On scale-up, unready and missing pods get 0% CPU — "they're coming, don't panic." On scale-down, missing pods get 100% CPU — "assume the worst, don't kill pods that might be busy." Then HPA re-runs the entire utilization calculation with these injected values. The effect: when you request 10 new pods, HPA folds them into the average at 0% utilization, the usage ratio drops, and the controller doesn't request more pods while the pipeline is still filling.
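To see the damping with round numbers: 10 ready pods at 80% CPU against a 50% target gives ceil(10 × 80/50) = 16, so 6 new pods are requested. Next cycle, those 6 are still unready and get counted at 0%:

averageUtilization = (10 × 80% + 6 × 0%) / 16 = 50%  →  usageRatio = 1.0  →  hold at 16

No panic scaling while the pipeline fills.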
It's not a mathematical model of the delay. It's practical, domain-specific knowledge — and it works.
My PID CRD would need something too. I added a behavior section:
behavior:
  syncPeriodSeconds: 15
  expectedPodStartupSeconds: 30
expectedPodStartupSeconds would tell the controller to dampen its response during the estimated delay window. syncPeriodSeconds sets the reconcile interval, and the PID calculation would compute dt dynamically from wall clock time rather than assuming fixed steps, since reconcile loops don't fire at precise intervals.
But expectedPodStartupSeconds: 30 is a fiction. Pod startup time ranges from 2 seconds to several minutes depending on image caching, node availability, and cloud provider mood.
And the sensor has its own delay. Prometheus scrapes metrics every 15 to 60 seconds, so the derivative term would be computing the rate of change of a staircase function — zero derivative for one cycle, then a massive spike when the next scrape lands. That makes the D-term actively harmful on discretized signals. Another knob that has to be tuned against the same delays the controller is already fighting.
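The standard patch is to low-pass filter the derivative input, so a single scrape landing doesn't spike the D term — which is exactly the kind of extra knob I mean. A sketch, where alpha is one more made-up constant to tune:

// lowPass blends the new raw derivative with the previous filtered value.
// alpha in (0, 1]: smaller means smoother, but laggier.
func lowPass(prevFiltered, raw, alpha float64) float64 {
    return alpha*raw + (1-alpha)*prevFiltered
}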
Pod Quantization
The PID controller computes a continuous output — say, 7.3 replicas. I round to 7. Next cycle the accumulated error is slightly positive, the output is 7.2 — still 7. The integral slowly accumulates this rounding error until it pushes the output to 7.5, which rounds to 8. One cycle later we're over-provisioned, the error flips, and the same process runs in reverse.
This is a classic source of limit cycling. When the actuator is quantized, especially at low replica counts, a controller can keep accumulating error until it crosses the next integer boundary, overshoot, and then repeat in the opposite direction. PID gains alone do not make this disappear; you usually need extra mechanisms.
At 50+ pods, quantization is noise. At 2–5 pods, it dominates. Going from 2 to 3 is a 50% capacity jump — not a rounding error, a step change. PID was designed for continuous actuators: valves that open 43.7%, motors that spin at 1847 RPM. Pods are stairsteps, not a ramp.
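One of those extra mechanisms, sketched: hysteresis around the rounding step, so the integer output only moves once the continuous output clears the boundary by a margin (the margin is yet another invented knob):

import "math"

// quantize turns the controller's continuous output into a replica count,
// but refuses to leave the current integer until the output clears the
// rounding boundary by an extra margin — trading offset for stability.
func quantize(current int32, output, margin float64) int32 {
    if output > float64(current)+0.5+margin || output < float64(current)-0.5-margin {
        return int32(math.Round(output))
    }
    return current
}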
The "Aha" Moment
Every individual issue has known solutions. Anti-windup clamps, dead-time compensators, derivative filters. But the same problems that led K8s to build its tolerance bands, stabilization windows, and rate limits would force me to add similar mechanisms to my PID controller. At which point — what have I gained? A more complex controller that arrives at the same engineering trade-offs, except now your SRE team also needs to understand integral calculus to debug a scaling incident at 3 AM.
I went back to the formula.
Rewrite it as a change in replicas rather than an absolute count:
desiredReplicas - currentReplicas = currentReplicas * (currentMetric / targetMetric - 1)
                                  = (currentReplicas / targetMetric) * (currentMetric - targetMetric)
The error is currentMetric - targetMetric, and the effective gain multiplying it is currentReplicas / targetMetric. With 100 pods targeting 50% CPU, the gain is 2.0 — aggressive. With 10 pods targeting the same, it's 0.2 — cautious. HPA automatically scales its aggressiveness with deployment size. That's gain scheduling — exactly the thing my CRD hardcoded as kp: 0.15.
And that's not all I've missed. A textbook proportional controller computes an output proportional to the error: output = Kp * (setpoint - measured). The current error drives the current correction. That's the definition.
HPA does something different. It computes the absolute desired state:
desiredReplicas = ceil((currentUtilization / targetUtilization) * readyPodCount)
The formula embeds an assumption about the plant: scaling is linear. If 10 pods produce 80% utilization, then 16 pods should produce 50%. It inverts the plant model to compute the exact replica count needed.
If that linear model is correct, the formula converges in one step — setting aside the dead time we just discussed, which delays when that step's effect actually shows up in the metrics. A classical P-controller never converges in one step — it always leaves a residual error. HPA's offset doesn't come from the control law itself — it comes from ceil() rounding and the deliberate tolerance band.
So HPA isn't a naive proportional controller. It's a one-step model inversion with built-in gain scheduling — assuming linear scaling, it computes the exact answer in a single cycle. The tolerance band, stabilization windows, and rate limits aren't patches for a bad controller. They're handling the things the model can't and the fact that real systems aren't linear.
I came in thinking I could improve on HPA with textbook control theory. What I found is that the people who built it solved the same problems — they just started from the Kubernetes side. Their solutions are more robust precisely because they're shaped by the actual failure modes: pods that take unpredictable time to start, metrics that arrive in 15-second steps, actuators that only accept integers. A PID controller fights these constraints. HPA was built around them.
What's Next
HPA gets to cheat. Replicas are cheap, fungible, and additive — doubling the count roughly doubles the capacity, and the actuator is at least monotonic.
VPA isn't so lucky. Changing a pod's CPU or memory request means restarting it. The actuator is quantized, and every move is destructive. What does the control theory lens say about a system where actuation itself causes the disturbance?
That's Part 2.



