
Kubernetes Node Selector vs Node Affinity vs Pod Affinity vs Taints & Tolerations
The Kubernetes scheduler's default behavior works well for homogeneous clusters. In practice, production clusters are not homogeneous: there are spot instances for batch jobs, GPU nodes for ML workloads, Graviton nodes for cost optimization, and on-demand nodes for latency-sensitive services. Without scheduling constraints, the scheduler makes placement decisions that are correct by its scoring model but wrong for your workload's actual requirements.
This guide covers the four scheduling primitives, when each one fits, the failure modes to watch for, and how to combine them to get predictable placement.
Prerequisites
- A running Kubernetes cluster (any version 1.24+)
- kubectl configured and connected to your cluster
- Basic familiarity with Kubernetes Pods, Deployments, and node concepts
Goals
- Understand how the Kubernetes scheduler decides where to place pods
- Learn the differences between nodeSelector, nodeAffinity, podAffinity, and taints/tolerations
- Know when to use each strategy and how to combine them
- Have reusable YAML manifests you can adapt for your own clusters
How the Kubernetes Scheduler Works
Before diving into customization, it helps to understand the scheduling process itself. When a pod needs to be placed, the scheduler follows two steps:
- Filtering: it eliminates all nodes that don't meet the pod's requirements (e.g., not enough CPU or memory). The remaining candidates are called feasible nodes.
- Scoring: it runs a set of scoring functions against the feasible nodes and picks the one with the highest score. If there's a tie, it selects one at random.
Once a node is chosen, the scheduler notifies the API server of its decision in a final step called binding.
Every scheduling strategy covered in this guide works by influencing one or both of these steps. Some add hard constraints during filtering, others nudge the scoring to prefer certain nodes.
Labeling Your Nodes
All the scheduling strategies covered here rely on label selectors. Before anything else, you need to label your nodes. You can check existing labels with:
kubectl get nodes --show-labels
To add a custom label to a node:
kubectl label nodes <node-name> disktype=ssd
Manually labeling nodes with kubectl label is fine for testing, but labels applied this way are lost when a node is replaced by the autoscaler or after a node group rolling update. In production, assign labels at the node group level via Terraform, EKS managed node group configuration, or the Karpenter NodePool spec. Labels defined there survive node replacement.
Kubernetes also populates a standard set of labels on all nodes automatically, including kubernetes.io/hostname, topology.kubernetes.io/zone, and kubernetes.io/os. You can use these built-in labels in your scheduling rules without adding anything yourself.
nodeSelector: Simple and Effective
The simplest approach is nodeSelector. It lets you constrain a pod to only run on nodes that match specific key-value label pairs. Suppose you have a set of nodes with fast local SSDs and you've labeled them with disktype: ssd. You can constrain your deployment to run only on those nodes like this:
deployment-ssd.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database
spec:
  replicas: 1
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      nodeSelector:
        disktype: ssd
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
If no node matches the labels, the pod stays in a Pending state until a matching node becomes available.
nodeSelector is easy to understand and covers many use cases, but it has a limitation: you can only match on exact key-value pairs. There's no way to express "prefer this node type but fall back to another" or "schedule on any node in zone A or zone B." For that, you need nodeAffinity.
nodeAffinity: Flexible Node Targeting
nodeAffinity does everything nodeSelector does, but with much more expressive power. It supports operators like In, NotIn, Exists, DoesNotExist, Gt, and Lt, allows multiple conditions, and supports soft constraints.
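As a minimal sketch of that extra expressiveness, the fragment below restates the disktype: ssd selector from the previous section as a nodeAffinity rule and adds a second condition that nodeSelector cannot express. It is a pod spec fragment, not a complete manifest, and it uses the hard-constraint field described in the next section. The dedicated label key is an assumption for illustration, standing in for any label that marks reserved nodes.

```yaml
# Pod spec fragment: goes under spec.template.spec of a Deployment.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Equivalent to nodeSelector "disktype: ssd".
            - key: disktype
              operator: In
              values:
                - ssd
            # Not expressible with nodeSelector: skip any node that
            # carries a "dedicated" label at all (assumed label key).
            - key: dedicated
              operator: DoesNotExist
```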
Required vs. Preferred Scheduling
There are two main modes:
- requiredDuringSchedulingIgnoredDuringExecution is a hard constraint. The pod will only be scheduled on matching nodes. If no node matches, the pod stays Pending.
- preferredDuringSchedulingIgnoredDuringExecution is a soft constraint. Kubernetes will try to schedule on a matching node, but if it can't, it will place the pod elsewhere rather than leave it pending.
The IgnoredDuringExecution part means that if a pod is already running and a node's labels change afterward, the pod is not evicted. The rule only applies at scheduling time.
Hard Constraint Example
Here's a deployment that requires nodes to be in a specific availability zone:
deployment-zone-required.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-west-1a
                      - eu-west-1b
      containers:
        - name: api-server
          image: my-api:latest
          ports:
            - containerPort: 8080
Multiple Conditions
Under matchExpressions, you can define multiple conditions:
- When conditions are in the same matchExpressions array, they act as AND, meaning all must be true.
- When using multiple values for the same key with the In operator, they act as OR, meaning at least one must match.
- When using multiple nodeSelectorTerms, they act as OR, meaning the pod can be scheduled if any one term is satisfied.
Note: If you specify both nodeSelector and nodeAffinity, both must be satisfied for the pod to be scheduled onto a node.
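The AND/OR rules above can be sketched in a single fragment. This example assumes the disktype=ssd label and the zone names used earlier in this guide, and it reads as "an SSD node in eu-west-1a, or any node in eu-west-1b":

```yaml
# Pod spec fragment: goes under spec.template.spec of a Deployment.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        # Term 1: both expressions must match (AND).
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - eu-west-1a
            - key: disktype
              operator: In
              values:
                - ssd
        # Term 2: an alternative to term 1 (terms are ORed).
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - eu-west-1b
```

A node is feasible as soon as it satisfies either term, so an HDD node in eu-west-1b qualifies while an HDD node in eu-west-1a does not.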
Weighted Preferences
With preferredDuringSchedulingIgnoredDuringExecution, you can assign weights (1 to 100) to different conditions. For each feasible node, the scheduler sums the weights of the preferences that node matches and adds the total to the node's score, nudging placement toward the preferred nodes without making it a hard requirement.
deployment-spot-preference.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 5
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - linux
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: price
                    operator: In
                    values:
                      - spot
            - weight: 50
              preference:
                matchExpressions:
                  - key: instance-type
                    operator: In
                    values:
                      - compute-optimized
      containers:
        - name: worker
          image: my-worker:latest
This tells Kubernetes: "This pod must run on Linux. I'd strongly prefer a spot instance (weight 100), and I'd also like a compute-optimized node (weight 50), but don't block scheduling if neither is available." This is a great default approach for cost optimization.
podAffinity and podAntiAffinity: Scheduling Based on Other Pods
Sometimes you want to schedule pods relative to other pods, not just nodes. That's where podAffinity and podAntiAffinity come in.
These rules use a topology key to define what "same location" means. For example, kubernetes.io/hostname means "same node", while topology.kubernetes.io/zone means "same availability zone." The topology key label must be present on all nodes involved, or the behavior becomes unpredictable.
Note: Pod affinity and anti-affinity require substantial processing and can slow down scheduling in large clusters. The official docs recommend caution in clusters with more than a few hundred nodes.
Anti-Affinity: Spread Pods Across Nodes
The most common use case is spreading replicas across different nodes so that if one node goes down, your service stays up:
deployment-spread.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-ingress
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx-ingress
              topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80
This pattern is widely used for ingress controllers like Nginx, where spreading instances across nodes is critical for handling load and ensuring availability.
Note: Using required anti-affinity means pods will go Pending if there aren't enough distinct nodes. Use preferred if you want a best-effort spread without blocking scheduling.
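For comparison, here is a sketch of the preferred variant for the same nginx-ingress pods. The structure differs slightly from the required form: each entry carries a weight, and the selector and topology key are nested under podAffinityTerm. The weight of 100 is an illustrative choice, not a recommendation.

```yaml
# Pod spec fragment: goes under spec.template.spec of the Deployment.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100              # 1 to 100; higher means stronger preference
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - nginx-ingress
          topologyKey: kubernetes.io/hostname
```

With this form, a fourth replica on a three-node cluster is still scheduled; it simply shares a node with another replica instead of staying Pending.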
Affinity: Co-locate Related Pods
podAffinity does the opposite: it places pods together on the same node or in the same topology domain. This is useful to reduce network latency between tightly coupled services. A classic example is placing a web server alongside its in-memory cache (like Redis) since they communicate heavily.
Here's the pattern from the official docs. First, deploy the Redis cache with anti-affinity to spread replicas:
deployment-redis.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  replicas: 3
  selector:
    matchLabels:
      app: store
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - store
              topologyKey: kubernetes.io/hostname
      containers:
        - name: redis-server
          image: redis:7-alpine
Then deploy the web server with affinity toward the cache and anti-affinity against other web servers:
deployment-webserver.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-store
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web-store
              topologyKey: kubernetes.io/hostname
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - store
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web-app
          image: nginx:1.27-alpine
The result is a cluster layout where each web server is co-located with a cache instance, on three separate nodes:
| node-1 | node-2 | node-3 |
|---|---|---|
| web-server-1 | web-server-2 | web-server-3 |
| redis-cache-1 | redis-cache-2 | redis-cache-3 |
This minimizes both latency and load skew across nodes.
Namespace Scope
By default, pod affinity rules only apply within the same namespace. To extend them across all namespaces, add an empty namespaceSelector:
namespaceSelector: {}
Taints and Tolerations: Repelling Unwanted Pods
While affinity is about attraction, taints are about repulsion. You can taint a node to prevent pods from being scheduled on it unless they explicitly tolerate the taint.
This is particularly useful for specialized node groups, like spot instances or GPU nodes, where you don't want general workloads accidentally landing.
Adding a Taint
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule
This marks the node so that no pod will be scheduled on it unless it has a matching toleration. To remove the taint later:
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule-
Taint Effects
There are three possible effects:
- NoSchedule prevents new pods from being scheduled on the node. Pods already running are not evicted.
- PreferNoSchedule is the soft version. The scheduler will try to avoid the node, but will place pods there if no better option exists.
- NoExecute is the strictest. It prevents new scheduling and evicts pods that are already running on the node if they don't tolerate the taint.
Tolerating a Taint
To schedule a pod on a tainted node, add a toleration to the pod spec:
deployment-gpu.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"
      nodeSelector:
        dedicated: gpu
      containers:
        - name: trainer
          image: my-ml-image:latest
          resources:
            limits:
              nvidia.com/gpu: 1
Note: This manifest combines a toleration with a nodeSelector. The toleration allows the pod to land on the GPU node, while the nodeSelector ensures it goes there. Without the nodeSelector, the pod could still end up on a non-GPU node, which is not what you want.
Tolerating All Taints
For monitoring or logging agents like Fluent Bit that need to run on every node regardless of taints, you can tolerate all taints at once:
tolerations:
  - operator: "Exists"
This is common in DaemonSet specs.
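To show where that toleration sits in a full manifest, here is a minimal DaemonSet sketch. The log-agent name, label, and image are placeholders assumed for illustration; substitute your actual agent.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent                 # hypothetical name
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      tolerations:
        # No key and no effect: this matches every taint, including
        # NoExecute taints, so the agent is never evicted by a taint.
        - operator: "Exists"
      containers:
        - name: agent
          image: my-log-agent:latest   # placeholder image
```

Because this also tolerates NoExecute taints, reserve it for agents that genuinely need to stay on unhealthy or dedicated nodes.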
Built-in Taints
A real-world example you've likely already encountered: Kubernetes control plane nodes are tainted by default with node-role.kubernetes.io/control-plane:NoSchedule to prevent user workloads from running on them.
Kubernetes also automatically taints nodes when problems occur. For example, if a node becomes unreachable, the control plane adds a node.kubernetes.io/unreachable:NoExecute taint. By default, pods tolerate this for 300 seconds (5 minutes) before being evicted, giving the node a window to recover.
The full list of automatic taints includes: node.kubernetes.io/not-ready, node.kubernetes.io/unreachable, node.kubernetes.io/memory-pressure, node.kubernetes.io/disk-pressure, node.kubernetes.io/pid-pressure, and node.kubernetes.io/network-unavailable.
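If the default 300-second window is too long for a latency-sensitive service, a pod can declare its own tolerations for these taints with a shorter tolerationSeconds. The sketch below uses 60 seconds purely as an illustrative value; the right number depends on how quickly your nodes typically recover.

```yaml
# Pod spec fragment: goes under spec.template.spec of a Deployment.
tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60     # evict after 60s instead of the default 300s
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
```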
Combining Approaches
Taints, tolerations, and affinity rules are complementary. In practice, you'll often use them together:
- Taints prevent unwanted pods from landing on specialized nodes
- Tolerations allow the right pods to bypass those taints
- nodeAffinity or nodeSelector guides the right pods toward those nodes
- podAntiAffinity spreads replicas for high availability
For example, to properly isolate GPU workloads:
- Taint the GPU nodes: dedicated=gpu:NoSchedule
- Label them: dedicated=gpu
- In your GPU workload, add both a toleration for the taint and a nodeSelector (or nodeAffinity) targeting the label
This combination is the recommended pattern. Taints alone only prevent the wrong pods from arriving; they don't guarantee the right pods will go to the right place. Affinity rules alone don't prevent other pods from consuming resources on your specialized nodes. Used together, they give you precise, predictable control.
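As a sketch of how the pieces fit in one pod spec, the fragment below uses the nodeAffinity variant of the GPU pattern and adds a best-effort spread across nodes. It assumes the dedicated=gpu taint and label from the steps above and the app: ml-training pod label from the earlier GPU manifest.

```yaml
# Pod spec fragment: goes under spec.template.spec of a Deployment.
tolerations:
  # Lets the pod onto nodes tainted dedicated=gpu:NoSchedule.
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
affinity:
  nodeAffinity:
    # Forces the pod onto nodes labeled dedicated=gpu.
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: dedicated
              operator: In
              values:
                - gpu
  podAntiAffinity:
    # Best-effort spread of replicas across GPU nodes.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: ml-training
          topologyKey: kubernetes.io/hostname
```

The anti-affinity is deliberately soft here: GPU nodes are usually scarce, and a hard spread would leave replicas Pending once every GPU node already hosts one.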
Summary
Here's a quick comparison of when to use each scheduling strategy:
| Strategy | Type | Best For | Limitation |
|---|---|---|---|
| nodeSelector | Hard constraint | Simple label matching (e.g., SSD, GPU) | No soft preferences, exact match only |
| nodeAffinity (required) | Hard constraint | Expressive node targeting with operators | Pod stays Pending if no match |
| nodeAffinity (preferred) | Soft constraint | Cost optimization, preferred zones | No guarantee of preferred placement |
| podAffinity | Hard or soft | Co-locating related services for low latency | Expensive at scale (hundreds of nodes) |
| podAntiAffinity | Hard or soft | Spreading replicas for high availability | Requires enough nodes for all replicas |
| Taints + Tolerations | Repulsion | Isolating specialized nodes (GPU, spot) | Only prevents scheduling, doesn't attract pods |
The two patterns that come up most often in production are podAntiAffinity for spreading replicas across nodes or zones, and the taint-plus-nodeSelector combination for isolating specialized node groups. The rest of the primitives are useful but situational. The key operational point: affinity rules are only evaluated when a pod is scheduled (the IgnoredDuringExecution part of the field names is not a typo), so changing a node's labels later does not move pods that are already running. Taints behave the same way by default: if you add a taint to a node after pods are running on it, those pods are not evicted unless you use NoExecute as the effect.