Capacity Planning Never Retires

Last month a service I operate started returning 340ms p99 responses where 18ms was the baseline. CPU utilization on the dashboard: a steady 38%. No memory pressure, no deployment, no obvious change. The signal was a single Prometheus ratio: container_cpu_cfs_throttled_periods_total divided by container_cpu_cfs_periods_total showed 94% of CFS periods throttled, on a pod whose average CPU usage looked healthy.

The runtime was sized for the node. The pod owned a slice of it. Nobody had told the runtime that.

I have been building distributed software for 25 years, mostly in financial services: FIX protocol gateways on Windows Server, WCF services moving millions of messages a day, Lightstreamer streaming market data and portfolio updates to trading desks, CI/CD pipelines for trading infrastructure. The symptom changes with each platform: MSMQ queue depth, JVM GC pauses, CFS throttle events. The root cause has not changed once. We provision one thing and tell the runtime another.

The Robusta Analogy

Natan Yellin at Robusta.dev explains Kubernetes CPU limits through a desert survival story. Marcus and Teresa are traveling. They have a magical water bottle that produces 3 liters a day. Each person needs 1 liter a day to survive. Water is like CPU: a renewable resource. Consuming 100% at one moment does not deplete what arrives the next.

Three stories follow. In Story 1, there are no requests and no limits: Marcus drinks everything before Teresa can reach the bottle; she dies of thirst. In Story 2, CPU limits are set: Teresa falls ill and needs extra water, but her limit prevents her from drinking the available surplus; she dies despite water sitting untouched.

Story 3 is the model to build from:

Story 3 — without limits, with requests: Marcus gets very ill and needs extra water one day. He tries to drink the entire bottle but is stopped when only 1 liter remains in the bottle. This is saved for Teresa because she needs 1 liter a day. She drinks her 1 liter. Nothing remains. They both live. This is what happens when you have no CPU limits but you do have requests. All is good.

No limits. Accurate requests. The scheduler protects Teresa’s reservation while Marcus uses what remains. Eric Khun’s post from his time at Buffer puts numbers on the benefit: Buffer’s main landing page loaded 22x faster after removing CPU limits. His first recommendation is to upgrade the Linux kernel, because the CFS throttling behavior that causes the worst harm is a known kernel bug. The principle holds regardless: limits penalize work that could run on available cycles.

Story 4: The Runtime That Didn’t Know

Extend the story a year. Marcus and Teresa are still traveling. Same bottle, 3 liters a day. Each has a request of 1 liter. Day to day, both sip well under their allocation. Everything is fine.

Then one morning, Marcus receives 500,000 requests from other travelers. Teresa receives 1,000.

To handle the volume, Marcus and Teresa each install a water distribution framework: a network of pipes and valves designed to deliver water to every visitor. Each framework was sized to the bottle’s full 3-liter capacity the first time they looked at it. Sixty valves now compete for the same narrow spout.

The physics breaks down. Each valve opens expecting flow, finds contention instead, and waits. The next valve does the same. Pressure drops to zero. Water sits in the bottle while every pipe in both frameworks is blocked, waiting for capacity that no individual valve can acquire. Marcus’s 500,000 visitors receive nothing. Teresa’s 1,000 receive nothing.

The bottle still has water. Both stayed within their requests. The frameworks were sized for a resource pool that neither of them owned.

This is runtime oversubscription: the pod spec sets the allocation, but each runtime sizes its own parallelism. Both numbers have to agree.

Production follows the same pattern. A Go service built before Go 1.25, running on a 64-core node with a 2-core CPU limit, starts 64 scheduler processors (Ps), one per node CPU, because runtime.NumCPU() reports 64. A JVM without container awareness builds its thread pools and GC worker count from availableProcessors(), which returns 64 for the same reason. Under load, both runtimes drive 64 parallel workers into 2 cores of CFS quota. The kernel throttles them. Latency climbs. CPU utilization, which counts running time rather than throttled time, reports a healthy 38%. The throttle counter tells a different story.

The Governance Tax

Linux enforces CPU limits by slicing time into 100ms CFS periods. If a pod’s quota is exhausted within a period, every thread waits until the next period begins. A request that should complete in 20ms at full CPU can instead block for 80ms, not because the work is slow but because the pod hit its quota boundary mid-computation. That is the governance tax.
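The arithmetic behind the tax is worth working through once. What follows is a simplified model in Python, not a kernel simulation: it assumes every worker is CPU-bound for the whole period, that the node has at least as many cores as workers, and it ignores scheduler slice granularity. The function name and parameters are illustrative.

```python
def cfs_period_breakdown(limit_cores, busy_threads, period_ms=100.0):
    """Return (running_ms, throttled_ms) of wall time within one CFS period.

    Simplified model: every thread is CPU-bound and runs on its own core
    until the pod's quota for the period is spent.
    """
    budget_ms = limit_cores * period_ms  # CPU time granted per period
    # Wall time until the budget is spent when all threads run in parallel.
    running_ms = min(period_ms, budget_ms / busy_threads)
    return running_ms, period_ms - running_ms


# 64 workers on a 2-core limit: the quota is gone after about 3 ms.
running, throttled = cfs_period_breakdown(limit_cores=2, busy_threads=64)
print(f"running {running:.3f} ms, throttled {throttled:.3f} ms")

# 2 workers on the same limit never exhaust the quota.
print(cfs_period_breakdown(limit_cores=2, busy_threads=2))
```

Under these assumptions, 64 busy workers burn the 200ms budget in 3.125ms of wall time and then sit idle for the remaining 96.875ms of every period. Two workers on the same limit are never throttled.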

Accurate requests without limits eliminate the tax. Eric Khun’s 22x speedup at Buffer shows how much the tax can cost.

Removing CPU limits requires accurate requests. Without requests, you are back in Story 1: Marcus drinks everything. Set requests based on observed peak usage with at least a 20% margin. KRR from Robusta sizes them from historical Prometheus data.
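The peak-plus-margin rule can be sketched in a few lines. This is only the rule stated above, not how KRR computes its recommendations. The function name, the sample values, and the choice to work in integer millicores are my assumptions.

```python
def recommended_cpu_request_millicores(usage_millicores, margin_percent=20):
    """Peak observed usage plus a safety margin, rounded up to a millicore."""
    if not usage_millicores:
        raise ValueError("need at least one usage sample")
    peak = max(usage_millicores)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-peak * (100 + margin_percent) // 100)


# Hypothetical per-pod CPU usage samples, in millicores.
samples = [420, 610, 1500, 980]
print(f"cpu request: {recommended_cpu_request_millicores(samples)}m")
```

A peak of 1500m with a 20% margin gives a request of 1800m. How long a window the samples should cover depends on how bursty the workload is.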

Market data feeds, FIX engine sessions, and order acknowledgement paths cannot tolerate even CFS scheduling jitter. For those, Kubernetes CPU Manager provides a stronger option. With --cpu-manager-policy=static on the kubelet, containers in Guaranteed QoS pods with whole-integer CPU requests get exclusive CPU cores. The kubelet pins them to those cores through the cpuset cgroup and keeps other pods’ threads off them, which removes contention with neighbors.

The qualifying pod spec:

resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "2"
    memory: "4Gi"

For both CPU and memory, requests and limits must be equal, and the CPU value must be a whole integer. The kubelet needs --cpu-manager-policy=static. Setting limits equal to requests is required to reach Guaranteed QoS, which gates static CPU assignment. On exclusive cores, the limit acts mainly as a classification label rather than a CFS enforcement boundary. The CFS quota is still configured, but a container confined to two pinned cores can consume at most two cores of CPU time per period, so the quota has almost nothing left to throttle. This is a different animal from the CFS limits the rest of this post argues against.
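These rules are easy to get subtly wrong in a manifest, so a pre-deploy check can help. The sketch below is my own simplification for a single container. It is stricter than Kubernetes in one respect, because it requires requests to be written out rather than defaulted from limits, and it compares memory quantities as literal strings. Both function names are hypothetical.

```python
def cpu_millicores(quantity):
    """Parse a Kubernetes CPU quantity such as "2", "0.5" or "1500m"."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def qualifies_for_exclusive_cores(resources):
    """True if one container's resources meet the rules described above."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for name in ("cpu", "memory"):
        if name not in requests or name not in limits:
            return False  # this sketch wants both sides spelled out
    if requests["memory"] != limits["memory"]:
        return False  # literal comparison: "4Gi" vs "4096Mi" would fail here
    cpu_request = cpu_millicores(requests["cpu"])
    if cpu_request != cpu_millicores(limits["cpu"]):
        return False  # not Guaranteed QoS
    # Static CPU assignment also needs a whole, non-zero number of cores.
    return cpu_request > 0 and cpu_request % 1000 == 0


spec = {
    "requests": {"cpu": "2", "memory": "4Gi"},
    "limits": {"cpu": "2", "memory": "4Gi"},
}
print(qualifies_for_exclusive_cores(spec))
```

The spec from this section passes. Change the CPU value to 1500m on both sides and the pod is still Guaranteed QoS, but this check rejects it because the core count is no longer whole.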

For a FIX gateway or a low-latency matching engine, this is the right tool. For a web API under variable load, requests without limits is the better fit. You want the burst headroom.

Telling the Runtime the Truth

CPU Manager handles the kernel side. You still need to tell each language runtime its actual allocation.

Go

Before Go 1.25, GOMAXPROCS defaulted to runtime.NumCPU(), which reports node CPUs rather than pod CPUs. Under load, a pod limited to 2 cores would run 64 scheduler processors (Ps) contending for 2 cores of quota. The standard fix for older codebases is uber-go/automaxprocs, imported with a blank identifier for its side effect. At startup it reads the CFS quota from the cgroup filesystem (/sys/fs/cgroup/cpu/cpu.cfs_quota_us on cgroup v1, cpu.max on cgroup v2) and sets GOMAXPROCS to match. Go 1.25 built this behavior into the runtime. For anything older, the library is a one-line import.
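The idea is the same in any runtime: read the quota the kernel will enforce, not the core count of the node. Here is a minimal sketch in Python, whose os.cpu_count() also reports node CPUs. It assumes cgroup v2 with the controller file at /sys/fs/cgroup/cpu.max, and it rounds fractional quotas up. Both are my assumptions, and real implementations differ on rounding and on cgroup v1 support. The function names are hypothetical.

```python
import math
import os
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max ("<quota> <period>") into cores; None if unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_cpu_count(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """CPUs this process can use: the cgroup quota if one is set, else the node count."""
    node_cpus = os.cpu_count() or 1
    try:
        quota_cores = parse_cpu_max(Path(cpu_max_path).read_text())
    except (OSError, ValueError):
        quota_cores = None  # no cgroup v2 file, or unexpected contents
    if quota_cores is None:
        return node_cpus
    # Round up so a 500m limit still gets one worker; never exceed the node.
    return max(1, min(node_cpus, math.ceil(quota_cores)))


print(effective_cpu_count())
```

Size worker pools from effective_cpu_count() rather than os.cpu_count(). On a pod with a 2-core limit the file reads "200000 100000" and the function returns 2. With no limit it reads "max 100000" and the node count comes back unchanged.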

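Wire the CPU limit through the Downward API so the runtime reads its allocation from the pod spec. The value is rounded up to whole cores. If the container sets no CPU limit, limits.cpu resolves to the node’s allocatable CPU, so a requests-only pod can reference requests.cpu instead: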

env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu

Java / OpenJDK

The JVM sizes thread pools, GC workers, and JIT compilation threads from Runtime.getRuntime().availableProcessors(). Before JDK 8u191, this returned node CPU count regardless of cgroup context. -XX:+UseContainerSupport, the flag that gave the JVM cgroup awareness, became the default in 8u191 and OpenJDK 10. JDK-8281181 later removed a CPU-shares-based approximation that misread Kubernetes requests as hard limits; modern JDKs read the CFS quota directly.

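For explicit control, -XX:ActiveProcessorCount=N overrides the processor count the JVM detects. Set it through the Downward API to pin JVM concurrency to the pod’s allocation: expose limits.cpu as an environment variable and reference that variable from JAVA_TOOL_OPTIONS.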

.NET

System.Environment.ProcessorCount drives the .NET thread pool’s initial worker count, GC server threads, and Parallel.ForEach partition sizing. Without container awareness, it returns node CPU count. .NET 6 improved cgroup awareness. The safest baseline is DOTNET_PROCESSOR_COUNT set from the pod spec:

env:
  - name: DOTNET_PROCESSOR_COUNT
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu

The .NET thread pool grows and shrinks dynamically and corrects itself over time through its own throughput feedback, but initial sizing matters at startup. A pod that receives a surge within seconds of first deployment can spend its first minutes configured for 64 cores.

The One Metric That Exposes This

Track container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total as its own alert, separate from CPU utilization. Throttled time does not count as utilized time, which is why standard CPU graphs hide the problem.

In Prometheus, the alert rule looks like this:

- alert: HighCFSThrottling
  expr: >
    rate(container_cpu_cfs_throttled_periods_total[5m])
    / rate(container_cpu_cfs_periods_total[5m]) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.pod }} is CFS-throttled at {{ $value | humanizePercentage }}"

When this ratio approaches 1.0 while utilization reads normal, you have found the runtime oversubscription failure before the on-call queue does.
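The same check can be sketched outside Prometheus, for example in a script that compares two scrapes of the cAdvisor counters. The 0.5 ratio threshold mirrors the alert rule. The 0.6 utilization ceiling, the function names, and the sample values are assumptions of mine.

```python
def throttle_ratio(prev, curr):
    """Fraction of CFS periods throttled between two counter samples.

    Each sample is a (throttled_periods, total_periods) pair read at the
    same instant.
    """
    throttled = curr[0] - prev[0]
    periods = curr[1] - prev[1]
    if periods <= 0:
        return 0.0  # container idle, or counters were reset
    return throttled / periods


def looks_oversubscribed(ratio, cpu_utilization,
                         ratio_threshold=0.5, util_ceiling=0.6):
    """Heavy throttling while utilization still looks unremarkable."""
    return ratio > ratio_threshold and cpu_utilization < util_ceiling


# Hypothetical scrapes: 3000 periods elapsed, 2820 of them throttled.
ratio = throttle_ratio(prev=(1000, 2000), curr=(3820, 5000))
print(f"throttled {ratio:.0%} of periods")
print(looks_oversubscribed(ratio, cpu_utilization=0.38))
```

With those samples the ratio is 0.94. Paired with 38% utilization, that matches the signature from the opening incident.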

Twenty-five years later, the tools have new names. The failure has not. Kubernetes defines the pod’s allocation through requests. Each runtime reads the node and sizes its concurrency to match. Production finds out when those two numbers disagree.