Kubernetes Node Selector vs Node Affinity vs Pod Affinity vs Taints & Tolerations

The Kubernetes scheduler's default behavior works well for homogeneous clusters. In practice, production clusters are not homogeneous: there are spot instances for batch jobs, GPU nodes for ML workloads, Graviton nodes for cost optimization, and on-demand nodes for latency-sensitive services. Without scheduling constraints, the scheduler makes placement decisions that are correct by its scoring model but wrong for your workload's actual requirements.

This guide covers the four scheduling primitives, when each one fits, the failure modes to watch for, and how to combine them to get predictable placement.

Prerequisites

  • A running Kubernetes cluster (any version 1.24+)
  • kubectl configured and connected to your cluster
  • Basic familiarity with Kubernetes Pods, Deployments, and node concepts

Goals

  • Understand how the Kubernetes scheduler decides where to place pods
  • Learn the differences between nodeSelector, nodeAffinity, podAffinity, and taints/tolerations
  • Know when to use each strategy and how to combine them
  • Have reusable YAML manifests you can adapt for your own clusters

How the Kubernetes Scheduler Works

Before diving into customization, it helps to understand the scheduling process itself. When a pod needs to be placed, the scheduler follows two steps:

  1. Filtering: It eliminates all nodes that don't meet the pod's requirements (e.g., not enough CPU or memory). The remaining candidates are called feasible nodes.
  2. Scoring: It runs a set of scoring functions against the feasible nodes and picks the one with the highest score. If there's a tie, it selects one at random. Once a node is chosen, the scheduler reports the decision to the API server in a process called binding.

Every scheduling strategy covered in this guide works by influencing one or both of these steps. Some add hard constraints during filtering, others nudge the scoring to prefer certain nodes.
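To make the filtering step concrete, here is a minimal sketch of a pod whose resource requests the scheduler must satisfy. The pod name and the request sizes are illustrative assumptions, not values from a real workload. Any node with less than 500m of unreserved CPU or 1Gi of unreserved memory is filtered out before scoring begins.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo            # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:
          cpu: "500m"          # nodes without this much unreserved CPU are not feasible
          memory: "1Gi"        # same rule applies to memory
```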

Labeling Your Nodes

All the scheduling strategies covered here rely on label selectors. Before anything else, you need to label your nodes. You can check existing labels with:

kubectl get nodes --show-labels

To add a custom label to a node:

kubectl label nodes <node-name> disktype=ssd

Manually labeling nodes with kubectl label is fine for testing, but labels applied this way are lost when a node is replaced by the autoscaler or after a node group rolling update. In production, assign labels at the node group level via Terraform, EKS managed node group configuration, or the Karpenter NodePool spec. Labels defined there survive node replacement.

Kubernetes also populates a standard set of labels on nodes automatically, including kubernetes.io/hostname, kubernetes.io/os, and, on most cloud providers, topology.kubernetes.io/zone. You can use these built-in labels in your scheduling rules without adding anything yourself.

nodeSelector: Simple and Effective

The simplest approach is nodeSelector. It lets you constrain a pod to only run on nodes that match specific key-value label pairs. Suppose you have a set of nodes with fast local SSDs and you've labeled them with disktype: ssd. You can constrain your deployment to run only on those nodes like this:

deployment-ssd.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: database
spec:
  replicas: 1
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      nodeSelector:
        disktype: ssd
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432

If no node matches the labels, the pod stays in a Pending state until a matching node becomes available.
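A nodeSelector may list more than one label, in which case a node must carry all of them. The fragment below is a sketch that pairs the custom disktype label with the built-in kubernetes.io/os label; it would take the place of the nodeSelector block in the pod template above.

```yaml
# Pod template fragment: every listed label must be present on the node
nodeSelector:
  disktype: ssd              # custom label, added with kubectl label
  kubernetes.io/os: linux    # built-in label, set automatically
```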

nodeSelector is easy to understand and covers many use cases, but it has a limitation: you can only match on exact key-value pairs. There's no way to express "prefer this node type but fall back to another" or "schedule on any node in zone A or zone B." For that, you need nodeAffinity.

nodeAffinity: Flexible Node Targeting

nodeAffinity does everything nodeSelector does, but with much more expressive power. It supports operators like In, NotIn, Exists, DoesNotExist, Gt, and Lt, allows multiple conditions, and supports soft constraints.

Required vs. Preferred Scheduling

There are two main modes:

  • requiredDuringSchedulingIgnoredDuringExecution is a hard constraint. The pod will only be scheduled on matching nodes. If no node matches, the pod stays Pending.
  • preferredDuringSchedulingIgnoredDuringExecution is a soft constraint. Kubernetes will try to schedule on a matching node, but if it can't, it will place the pod elsewhere rather than leave it pending.

The IgnoredDuringExecution part means that if a pod is already running and a node's labels change afterward, the pod is not evicted. The rule only applies at scheduling time.

Hard Constraint Example

Here's a deployment that requires nodes to be in a specific availability zone:

deployment-zone-required.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-west-1a
                      - eu-west-1b
      containers:
        - name: api-server
          image: my-api:latest
          ports:
            - containerPort: 8080

Multiple Conditions

Under matchExpressions, you can define multiple conditions:

  • When conditions are in the same matchExpressions array, they act as AND, meaning all must be true.
  • When using multiple values for the same key with the In operator, they act as OR, meaning at least one must match.
  • When using multiple nodeSelectorTerms, they act as OR, meaning the pod can be scheduled if any one term is satisfied.

Note: If you specify both nodeSelector and nodeAffinity, both must be satisfied for the pod to be scheduled onto a node.
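The fragment below sketches how these rules compose. The maintenance and gpu-count labels are hypothetical and exist only to illustrate the operators; substitute labels that your nodes actually carry.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        # Term 1: both expressions must hold (AND)
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In             # any one of the listed values matches (OR)
              values:
                - eu-west-1a
                - eu-west-1b
            - key: maintenance         # hypothetical label
              operator: DoesNotExist   # node must not carry this label at all
        # Term 2: an alternative; satisfying either term is enough (OR)
        - matchExpressions:
            - key: gpu-count           # hypothetical label holding an integer
              operator: Gt
              values:
                - "1"                  # Gt and Lt take a single integer, written as a string
```

With this rule, a node in eu-west-1a that has no maintenance label is eligible, and so is any node whose gpu-count label is greater than 1, regardless of zone.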

Weighted Preferences

With preferredDuringSchedulingIgnoredDuringExecution, you can assign a weight (1 to 100) to each preference. For every feasible node, the scheduler sums the weights of the preferences that node satisfies and adds the result to the node's score, nudging placement toward preferred nodes without making it a hard requirement.

deployment-spot-preference.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 5
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - linux
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: price
                    operator: In
                    values:
                      - spot
            - weight: 50
              preference:
                matchExpressions:
                  - key: instance-type
                    operator: In
                    values:
                      - compute-optimized
      containers:
        - name: worker
          image: my-worker:latest

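This tells Kubernetes: "This pod must run on Linux. I'd strongly prefer a spot instance (weight 100), and I'd also like a compute-optimized node (weight 50), but don't block scheduling if neither is available." A node that satisfies both preferences earns a combined weight of 150, a spot-only node 100, and a compute-optimized-only node 50. Note that price and instance-type are custom labels here, so you need to apply them to your nodes yourself. This is a sensible default approach for cost optimization.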

podAffinity and podAntiAffinity: Scheduling Based on Other Pods

Sometimes you want to schedule pods relative to other pods, not just nodes. That's where podAffinity and podAntiAffinity come in.

These rules use a topology key to define what "same location" means. For example, kubernetes.io/hostname means "same node", while topology.kubernetes.io/zone means "same availability zone." The topology key label must be present on all nodes involved, or the behavior becomes unpredictable.

Note: Pod affinity and anti-affinity require substantial processing and can slow down scheduling in large clusters. The official docs recommend against using them in clusters larger than several hundred nodes.

Anti-Affinity: Spread Pods Across Nodes

The most common use case is spreading replicas across different nodes so that if one node goes down, your service stays up:

deployment-spread.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-ingress
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx-ingress
              topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80

This pattern is widely used for ingress controllers like Nginx, where spreading instances across nodes is critical for handling load and ensuring availability.

Note: Using required anti-affinity means pods will go Pending if there aren't enough distinct nodes. Use preferred if you want a best-effort spread without blocking scheduling.
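A best-effort version of the same spread could look like the fragment below, which would take the place of the affinity block in the deployment above. The weight of 100 is an assumption; any value from 1 to 100 is valid.

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:               # preferred terms wrap the selector in podAffinityTerm
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - nginx-ingress
          topologyKey: kubernetes.io/hostname
```

With this variant, a fourth replica on a three-node cluster is scheduled next to an existing replica instead of staying Pending.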

podAffinity does the opposite: it places pods together on the same node or in the same topology domain. This is useful to reduce network latency between tightly coupled services. A classic example is placing a web server alongside its in-memory cache (like Redis) since they communicate heavily.

Here's the pattern from the official docs. First, deploy the Redis cache with anti-affinity to spread replicas:

deployment-redis.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  replicas: 3
  selector:
    matchLabels:
      app: store
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - store
              topologyKey: kubernetes.io/hostname
      containers:
        - name: redis-server
          image: redis:7-alpine

Then deploy the web server with affinity toward the cache and anti-affinity against other web servers:

deployment-webserver.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-store
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web-store
              topologyKey: kubernetes.io/hostname
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - store
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web-app
          image: nginx:1.27-alpine

The result is a cluster layout where each web server is co-located with a cache instance, on three separate nodes:

| node-1        | node-2        | node-3        |
| ------------- | ------------- | ------------- |
| web-server-1  | web-server-2  | web-server-3  |
| redis-cache-1 | redis-cache-2 | redis-cache-3 |

This minimizes both latency and load skew across nodes.

Namespace Scope

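By default, pod affinity and anti-affinity rules match only pods in the same namespace as the pod being scheduled. To match pods in all namespaces, add an empty namespaceSelector to the affinity term, alongside labelSelector and topologyKey: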

namespaceSelector: {}
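In context, the field sits inside an individual affinity term. The fragment below is a sketch that reuses the app: store label from the Redis example.

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: store
        namespaceSelector: {}          # empty selector: match pods in every namespace
        topologyKey: kubernetes.io/hostname
```

To limit the rule to specific namespaces, list them by name in the term's namespaces field, or give namespaceSelector a non-empty label selector that matches namespace labels.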

Taints and Tolerations: Repelling Unwanted Pods

While affinity is about attraction, taints are about repulsion. You can taint a node to prevent pods from being scheduled on it unless they explicitly tolerate the taint.

This is particularly useful for specialized node groups, like spot instances or GPU nodes, where you don't want general workloads accidentally landing.

Adding a Taint

kubectl taint nodes <node-name> dedicated=gpu:NoSchedule

This marks the node so that no pod will be scheduled on it unless it has a matching toleration. To remove the taint later:

kubectl taint nodes <node-name> dedicated=gpu:NoSchedule-

Taint Effects

There are three possible effects:

  • NoSchedule prevents new pods from being scheduled on the node. Pods already running are not evicted.
  • PreferNoSchedule is the soft version. The scheduler will try to avoid the node, but will place pods there if no better option exists.
  • NoExecute is the strictest. It prevents new scheduling and evicts pods that are already running on the node if they don't tolerate the taint.
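Taints are stored under spec.taints on the Node object, so they can be inspected declaratively as well. The fragment below is a sketch of what that section might contain; the lifecycle key and its value are illustrative assumptions.

```yaml
# Fragment of a Node object, as shown by: kubectl get node <node-name> -o yaml
spec:
  taints:
    - key: dedicated
      value: gpu
      effect: NoSchedule         # hard: blocks new pods that lack a matching toleration
    - key: lifecycle             # hypothetical key
      value: spot
      effect: PreferNoSchedule   # soft: the scheduler avoids the node when it can
```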

Tolerating a Taint

To schedule a pod on a tainted node, add a toleration to the pod spec:

deployment-gpu.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"
      nodeSelector:
        dedicated: gpu
      containers:
        - name: trainer
          image: my-ml-image:latest
          resources:
            limits:
              nvidia.com/gpu: 1

Note: This manifest combines a toleration with a nodeSelector. The toleration allows the pod to land on the GPU node, while the nodeSelector ensures it goes there. A toleration only permits placement on a tainted node; it never requires it, so without the nodeSelector nothing steers the pod toward the nodes you tainted.

Tolerating All Taints

For monitoring or logging agents like Fluent Bit that need to run on every node regardless of taints, you can tolerate all taints at once:

tolerations:
  - operator: "Exists"

This is common in DaemonSet specs.
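A minimal DaemonSet using this toleration might look like the sketch below. The log-agent name and the image are placeholders, not a real published image.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent                  # hypothetical name
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      tolerations:
        - operator: "Exists"       # no key and no effect: tolerates every taint
      containers:
        - name: agent
          image: my-log-agent:1.0  # placeholder image
```

Because the toleration also covers NoExecute taints, these pods are not evicted from nodes that become tainted later.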

Built-in Taints

A real-world example you've likely already encountered: control plane nodes are typically tainted with node-role.kubernetes.io/control-plane:NoSchedule (kubeadm applies this by default) to prevent user workloads from running on them.

Kubernetes also automatically taints nodes when problems occur. For example, if a node becomes unreachable, the control plane adds a node.kubernetes.io/unreachable:NoExecute taint. By default, pods tolerate this for 300 seconds (5 minutes) before being evicted, giving the node a window to recover.
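The 300-second window can be shortened per workload by declaring the tolerations explicitly in the pod spec. The fragment below is a sketch; the 30-second value is an arbitrary assumption, and a shorter window trades faster failover for more evictions during brief network interruptions.

```yaml
tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30          # assumed value; the default is 300
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30
```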

The automatic taints include node.kubernetes.io/not-ready, node.kubernetes.io/unreachable, node.kubernetes.io/memory-pressure, node.kubernetes.io/disk-pressure, node.kubernetes.io/pid-pressure, node.kubernetes.io/network-unavailable, node.kubernetes.io/unschedulable, and node.cloudprovider.kubernetes.io/uninitialized.

Combining Approaches

Taints, tolerations, and affinity rules are complementary. In practice, you'll often use them together:

  • Taints prevent unwanted pods from landing on specialized nodes
  • Tolerations allow the right pods to bypass those taints
  • nodeAffinity or nodeSelector guides the right pods toward those nodes
  • podAntiAffinity spreads replicas for high availability

For example, to properly isolate GPU workloads:

  1. Taint the GPU nodes: dedicated=gpu:NoSchedule
  2. Label them: dedicated=gpu
  3. In your GPU workload, add both a toleration for the taint and a nodeSelector (or nodeAffinity) targeting the label

This combined approach is the recommended pattern. Taints alone only prevent the wrong pods from arriving; they don't guarantee the right pods will go to the right place. Affinity rules alone don't prevent other pods from consuming resources on your specialized nodes. Used together, they give you precise, predictable control.
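Put together, the pod template for such a workload could look like the sketch below. It expresses the dedicated=gpu label match as a nodeAffinity rule instead of a nodeSelector and adds a preferred anti-affinity term to spread replicas; the app: ml-training label and the weight of 100 are assumptions for illustration.

```yaml
# Pod template spec fragment
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule               # permits placement on the tainted GPU nodes
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated
                operator: In
                values:
                  - gpu                # requires the labeled GPU nodes
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: ml-training
            topologyKey: kubernetes.io/hostname   # best-effort spread across nodes
```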

Summary

Here's a quick comparison of when to use each scheduling strategy:

| Strategy | Type | Best For | Limitation |
| --- | --- | --- | --- |
| nodeSelector | Hard constraint | Simple label matching (e.g., SSD, GPU) | No soft preferences, exact match only |
| nodeAffinity (required) | Hard constraint | Expressive node targeting with operators | Pod stays Pending if no match |
| nodeAffinity (preferred) | Soft constraint | Cost optimization, preferred zones | No guarantee of preferred placement |
| podAffinity | Hard or soft | Co-locating related services for low latency | Expensive at scale (hundreds of nodes) |
| podAntiAffinity | Hard or soft | Spreading replicas for high availability | Requires enough nodes for all replicas |
| Taints + Tolerations | Repulsion | Isolating specialized nodes (GPU, spot) | Only prevents scheduling, doesn't attract pods |

The two patterns that come up most often in production are podAntiAffinity for spreading replicas across nodes or zones, and the taint-plus-nodeSelector combination for isolating specialized node groups. The rest of the primitives are useful but situational.

The key operational point: affinity rules are evaluated only when a pod is scheduled (the IgnoredDuringExecution part of the field names is not a typo), so changing node labels later does not move pods that are already running. Taints behave the same way unless the effect is NoExecute: if you add a NoSchedule or PreferNoSchedule taint to a node after pods are running on it, those pods are not evicted.