Pod priority and preemption under node pressure

k8s-operator-notes · field notes on scheduler and kubelet interactions

Pod priority is one of the more subtle knobs in a multi-tenant Kubernetes cluster. The headline behavior is simple: higher-priority Pods can preempt lower-priority Pods when the scheduler cannot place them. The less-obvious behavior is what happens when node pressure enters the picture — at that point you have two independent eviction systems (the scheduler's preemption logic and the kubelet's node-pressure eviction) acting on the same Pods, sometimes in opposing directions. These notes capture how the two interact in practice on a typical production cluster.

PriorityClass basics

A PriorityClass is a cluster-scoped object that defines an integer priority value plus a few policy fields. Pods reference it by name via spec.priorityClassName; the Priority admission controller resolves the name into an integer and writes it to spec.priority, which is what the scheduler reads. Two built-in classes ship with the cluster: system-cluster-critical (2000000000) and system-node-critical (2000001000), both at or above 2×10^9. User-defined classes are capped at 1000000000 (10^9), and ordinary workloads should stay well below that.

The pattern most teams settle on is a three-tier setup:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-high
value: 10000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: Latency-sensitive request-path workloads.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-medium
value: 1000
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: Default for general application workloads.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-low
value: 100
globalDefault: false
preemptionPolicy: Never
description: Batch and best-effort workloads. Will not preempt others.

Two details worth flagging. First, globalDefault: true applies only to Pods admitted after the PriorityClass is created — existing Pods are not retroactively updated. Second, preemptionPolicy: Never still allows the Pod itself to be preempted by higher-priority Pods; it only prevents the Pod from preempting others. Confusing the direction here is a recurring source of incident reports.
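The resolution step and the direction of preemptionPolicy can be sketched in a few lines of Python. This is a simplified model, not the admission controller's code: the PriorityClass dataclass, resolve_priority, and can_preempt are hypothetical names, and the fallback of priority 0 when no class applies follows the documented upstream behavior.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriorityClass:
    # Hypothetical stand-in for the scheduling.k8s.io/v1 object.
    name: str
    value: int
    global_default: bool = False
    preemption_policy: str = "PreemptLowerPriority"


# The three tiers from the manifests above.
TIERS = [
    PriorityClass("tier-high", 10000),
    PriorityClass("tier-medium", 1000, global_default=True),
    PriorityClass("tier-low", 100, preemption_policy="Never"),
]


def resolve_priority(class_name: Optional[str],
                     classes: list[PriorityClass]) -> tuple[int, str]:
    """Return (priority, preemptionPolicy) roughly as admission would set them."""
    by_name = {c.name: c for c in classes}
    if class_name:
        if class_name not in by_name:
            # Upstream rejects a Pod that names a PriorityClass that does not exist.
            raise ValueError(f"no PriorityClass named {class_name!r}")
        chosen = by_name[class_name]
        return chosen.value, chosen.preemption_policy
    for c in classes:
        if c.global_default:
            return c.value, c.preemption_policy
    # No class named and no global default: priority falls back to 0.
    return 0, "PreemptLowerPriority"


def can_preempt(preemptor: tuple[int, str], victim: tuple[int, str]) -> bool:
    """Only the preemptor's policy matters; the victim's policy is never consulted."""
    priority, policy = preemptor
    return policy != "Never" and priority > victim[0]
```

With these tiers, a tier-low Pod can never act as a preemptor, yet a tier-high Pod can still preempt it: can_preempt(resolve_priority("tier-high", TIERS), resolve_priority("tier-low", TIERS)) is True.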

How preemption decides which Pod to evict

When a higher-priority Pod is unschedulable, the scheduler walks the candidate nodes and asks: would removing some set of lower-priority Pods make the pending Pod fit here? On each node it looks for a minimal victim set that frees enough requested resources and satisfies the Pod's other scheduling constraints, trying first to avoid victims whose removal would violate a PodDisruptionBudget. If several nodes produce candidate victim sets, the scheduler prefers the node with the fewest PDB violations first, and then the node whose highest-priority victim has the lowest priority. Victims get no advance notice: each receives a normal termination signal and its terminationGracePeriodSeconds applies, so the preemptor may have to wait for the victims to exit before it can run.

PodDisruptionBudgets are respected on a best-effort basis only. The scheduler first looks for victims whose removal keeps every PDB intact; if no such victim set exists, it preempts anyway and violates the PDB rather than leave the higher-priority Pod pending. This is documented but routinely surprises operators who assume PDBs are absolute.
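A rough sketch of how the scheduler might choose between nodes once each has a candidate victim set. The Candidate type and pick_node are hypothetical names, and the tie-break order (PDB violations, then highest victim priority, then sum of victim priorities, then victim count) is a simplified reading of the upstream scheduler, which applies further tie-breakers not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical summary of one node's proposed victim set.
    node: str
    victim_priorities: tuple[int, ...]
    pdb_violations: int


def preference_key(c: Candidate) -> tuple[int, int, int, int]:
    """Smaller is better; each field is consulted only to break ties in the previous one."""
    return (
        c.pdb_violations,                              # fewest PDB violations first
        max(c.victim_priorities, default=-(2 ** 31)),  # then lowest "highest victim"
        sum(c.victim_priorities),                      # then lowest total priority
        len(c.victim_priorities),                      # then fewest victims
    )


def pick_node(candidates: list[Candidate]) -> str:
    """Return the node whose victim set the sketch would preempt."""
    return min(candidates, key=preference_key).node


candidates = [
    Candidate("node-a", (100,), pdb_violations=1),
    Candidate("node-b", (1000,), pdb_violations=0),
    Candidate("node-c", (100, 100), pdb_violations=0),
]
print(pick_node(candidates))  # node-c
```

Note that node-a loses even though it offers a single low-priority victim: avoiding the PDB violation is weighed before anything else, and it is violated only when every candidate would violate one.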

Node-pressure eviction sits outside the scheduler

The kubelet runs its own eviction loop based on node conditions: memory pressure, disk pressure (imagefs or nodefs), and PID pressure. Thresholds are configured at kubelet start with flags (--eviction-hard, --eviction-soft) or in the KubeletConfiguration object (evictionHard, evictionSoft). When a threshold is breached, the kubelet ranks Pods with its own ordering. Under memory pressure, Pods using more than they requested go first; this always includes BestEffort Pods, which request nothing, and any Burstable Pod running above its requests. Guaranteed Pods, along with Burstable Pods running within their requests, go last. Priority only orders Pods within each of those two groups; it is not the primary axis.

The practical consequence: a high-priority Burstable Pod that is using more memory than it requested can be evicted by the kubelet under memory pressure while a low-priority Guaranteed Pod on the same node survives. This is correct behavior, since QoS is about resource guarantees and priority is about scheduling order, but the asymmetry is easy to miss.
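The asymmetry is easier to see with a small model of the memory-pressure ranking. This is a sketch based on the upstream documentation's description (usage above requests first, then priority, then size of the overage); PodStat and eviction_order are hypothetical names, and the real kubelet works from live container stats rather than a static list.

```python
from __future__ import annotations

from dataclasses import dataclass

MI = 1024 * 1024


@dataclass(frozen=True)
class PodStat:
    # Hypothetical snapshot of one Pod's memory accounting, in bytes.
    name: str
    priority: int
    memory_request: int  # 0 for a BestEffort Pod
    memory_usage: int


def eviction_order(pods: list[PodStat]) -> list[str]:
    """Return Pod names in the order this sketch would evict them."""
    def key(p: PodStat) -> tuple[bool, int, int]:
        overage = p.memory_usage - p.memory_request
        return (
            overage <= 0,  # Pods above their request sort first (False < True)
            p.priority,    # then lower priority first
            -overage,      # then the largest overage first
        )
    return [p.name for p in sorted(pods, key=key)]


pods = [
    PodStat("api", priority=10000, memory_request=512 * MI, memory_usage=700 * MI),
    PodStat("batch", priority=100, memory_request=1024 * MI, memory_usage=900 * MI),
    PodStat("scratch", priority=100, memory_request=0, memory_usage=50 * MI),
]
print(eviction_order(pods))  # ['scratch', 'api', 'batch']
```

Here the tier-high api Pod is evicted before the tier-low batch Pod, purely because api is above its request and batch is not.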

What this means for capacity planning

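Set memory requests to cover real peak usage for any Pod you cannot afford to lose; a request that is too low makes the Pod an early eviction candidate as soon as its usage exceeds it. A Pod whose containers all have CPU and memory requests equal to their limits (Guaranteed QoS) is the category the kubelet defends longest under pressure, because it can never run above its requests. Use priority for scheduling order and capacity overcommit, not as a substitute for QoS guarantees. If you want both high priority and high QoS, set both.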
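For auditing manifests, the QoS class can be approximated from the resources stanza alone. The function below is a minimal sketch of the documented classification rules; qos_class is a hypothetical helper, it compares quantity strings literally (so "500m" and "0.5" would not match, which the real API server handles), and it assumes the caller passes every container in the Pod.

```python
from __future__ import annotations

RESOURCES = ("cpu", "memory")


def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' 'requests' and 'limits' mappings."""
    any_set = False
    all_guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        for r in RESOURCES:
            limit = limits.get(r)
            # An omitted request defaults to the limit, as in the API server.
            request = requests.get(r, limit)
            if limit is not None or request is not None:
                any_set = True
            # Guaranteed needs a limit for both resources, with request == limit.
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


guaranteed = [{"requests": {"cpu": "500m", "memory": "1Gi"},
               "limits": {"cpu": "500m", "memory": "1Gi"}}]
burstable = [{"requests": {"cpu": "250m", "memory": "512Mi"},
              "limits": {"memory": "1Gi"}}]
print(qos_class(guaranteed), qos_class(burstable), qos_class([{}]))
# Guaranteed Burstable BestEffort
```

A check like this pairs naturally with a priority audit: a Pod that carries tier-high but classifies as Burstable or BestEffort is the combination these notes warn about.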

For background reading on the upstream behavior, the Kubernetes documentation page on pod priority and preemption covers the scheduler side in detail.