August 13, 2024

Kubernetes Autoscaling: HPA vs VPA vs KEDA vs CA vs Karpenter vs Fargate

Autoscaling in Kubernetes operates at two layers: pod replicas and cluster nodes. Getting both right matters for cost and reliability. Over-provisioned clusters waste money. Under-provisioned ones cause scheduling failures at the worst time. The tooling spans HPA for replica scaling, VPA for resource right-sizing, KEDA for event-driven workloads, and Karpenter or Cluster Autoscaler for node capacity.

This article covers each of them, their operational constraints, and why combining them correctly is what makes the difference in practice.

Horizontal Pod Autoscaler (HPA)

The most common approach to autoscaling is the HorizontalPodAutoscaler (HPA), which automatically adjusts the replica count on a Deployment or StatefulSet based on resource metrics like CPU or memory.

The HPA ships as a built-in Kubernetes API resource and controller, so no extra installation is needed. It does, however, require a metrics source. The most common choice is the metrics-server, which scrapes kubelet data from each node and exposes it through the resource metrics API (metrics.k8s.io).

Some managed Kubernetes services like GKE include the metrics-server by default. For others, like EKS, you'll need to install it separately (via Helm or Terraform).

You can verify your setup with:

kubectl top pods

Configuring HPA

For the HPA to work, your pods must have resource requests defined. The HPA uses requests (not limits) to calculate utilization percentages.
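
As a minimal sketch, requests are declared per container in the pod template. The image name and the request values below are hypothetical placeholders, not recommendations; derive real values from observed usage.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  # replicas is left unset so an autoscaler can own the count
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0.0  # hypothetical image
          resources:
            requests:
              cpu: 250m      # utilization is measured against this value
              memory: 256Mi
```

With a 250m CPU request, an 80% utilization target corresponds to an average usage of roughly 200m per pod.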

A basic HPA configuration targeting CPU and memory might look like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
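
To see how targets like these become a replica count, it helps to look at the formula the Kubernetes documentation gives for the HPA algorithm. The numbers in the worked example are assumed for illustration.

```latex
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil

% Worked example (assumed values): 4 replicas averaging 120% CPU
% utilization against an 80% target.
\left\lceil 4 \times \frac{120}{80} \right\rceil = \lceil 6 \rceil = 6
```

When several metrics are configured, as in the CPU and memory example above, the controller evaluates each one and uses the largest resulting replica count.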

GitOps note: If you use ArgoCD or FluxCD, avoid setting a static replicas field on your Deployment. The HPA and your GitOps tool will constantly fight over the replica count.

HPA Behavior: Scale-Down Delay

By default, the HPA applies a 5-minute stabilization window before scaling down: it acts on the highest replica recommendation from the last 300 seconds. This is intentional. It prevents flapping, where a brief traffic dip triggers a scale-down only for load to return immediately and force a scale-up. The behavior field in autoscaling/v2 lets you tune these stabilization windows per direction:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
  scaleUp:
    stabilizationWindowSeconds: 0

For bursty workloads, aggressive scale-up (stabilization 0) with conservative scale-down (300s+) is a common production pattern. Scale-down flapping is a real operational pain during incident response when you need replicas to stay up.
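
Beyond stabilization windows, the same behavior field accepts rate-limiting policies. The sketch below is a fragment that sits under an HPA's spec; the specific numbers are assumptions chosen to illustrate the pattern, not tuned values.

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 50          # remove at most half of the current replicas...
        periodSeconds: 60  # ...per 60-second period
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Pods
        value: 4           # add at most 4 pods...
        periodSeconds: 15  # ...per 15-second period
```

Capping the scale-down rate means that even after the window expires, capacity is released gradually instead of all at once.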

Custom Metrics with Prometheus

CPU and memory aren't always the best signals for scaling. A better approach is to use application-level metrics, with the four golden signals being a solid starting point: latency, traffic, errors, and saturation.

For example, if a single pod can handle 100 requests per second, you'd want to scale based on request rate, not CPU. To do this, you need:

Prometheus Operator: manages the lifecycle of Prometheus instances and converts ServiceMonitors/PodMonitors into native Prometheus config.
Prometheus instance: stores your metrics.
Prometheus Adapter: translates Prometheus metrics and serves them through the custom metrics API (custom.metrics.k8s.io) so the HPA can consume them.

Once set up, you can also replace the metrics-server entirely with Prometheus + cAdvisor for pod metrics, and Node Exporter for node metrics.
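
For the 100 requests-per-second example, an HPA driven by a per-pod custom metric might look like the sketch below. The metric name http_requests_per_second is an assumption: the actual name depends on the rules configured in your Prometheus Adapter.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-rps-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second  # assumed name, set by adapter rules
        target:
          type: AverageValue
          averageValue: "100"             # aim for ~100 req/s per pod
```

Running kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 is one way to check which metric names the adapter actually exposes before wiring them into an HPA.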

Vertical Pod Autoscaler (VPA)

Some applications, particularly stateful ones like standalone Postgres or MySQL, can't scale horizontally. For these, you need to scale up the existing pod by adding more CPU or memory. That's where the Vertical Pod Autoscaler (VPA) comes in.

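VPA is not included with Kubernetes and must be installed separately. Its update mode controls how recommendations are applied. The three modes you are most likely to choose between are listed below; a fourth, Auto, has historically behaved like Recreate.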

Mode	Behavior
`Recreate`	Evicts and recreates pods with new resource recommendations. Risky for databases.
`Initial`	Sets requests/limits only at pod creation time.
`Off` (recommendation only)	Provides recommendations without taking action. Most useful for planning.

The recommendation-only mode is especially handy: describe the VPA object, review its suggestions, and apply them during a maintenance window. It is also useful during the initial deployment of a new service where you do not yet have a good baseline for resource requests.
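
A recommendation-only VPA could be declared roughly as follows. The StatefulSet name postgres is a hypothetical target used for illustration.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres      # hypothetical workload
  updatePolicy:
    updateMode: "Off"   # compute recommendations, never evict or mutate pods
```

After the recommender has observed the workload for a while, kubectl describe vpa postgres-vpa shows a lower bound, target, and upper bound per container, which you can copy into the pod spec by hand.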

HPA vs VPA: Don't Mix Them

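Do not let HPA and VPA act on the same resource metric (CPU or memory) for the same Deployment or StatefulSet. VPA changes the requests that HPA uses to compute utilization, so the two controllers work against each other and may disrupt your workloads. As a rule of thumb: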

Use HPA for stateless, horizontally scalable applications.
Use VPA (in recommendation mode) for stateful apps that can't scale horizontally.

Event-Driven Autoscaling with KEDA

Many modern architectures rely on messaging systems, like Apache Kafka and RabbitMQ, to decouple microservices. In these setups, the right scaling signal isn't CPU. It's the number of messages in a queue.

KEDA (Kubernetes Event-driven Autoscaling) solves this. It monitors an event source and scales your application accordingly. One major advantage: KEDA can scale your application down to zero when there are no messages, which is great for cost savings.

KEDA supports a wide range of scalers including AWS DynamoDB, Apache Kafka, MySQL, etcd, RabbitMQ, and many more.

Getting Started with KEDA

Deploy via a single Helm chart (no extra dependencies), then configure a ScaledObject:

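apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-scaledobject
spec:
  scaleTargetRef:
    name: myapp  # Deployment in the same namespace
  triggers:
    - type: rabbitmq
      metadata:
        hostFromEnv: RABBITMQ_HOST  # assumed env var on the target container holding the AMQP URL
        queueName: myqueue
        mode: QueueLength  # scale on queue depth; takes the place of the older queueLength field
        value: "5"  # 1 replica per 5 messages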

With this setup, publishing messages to the queue will trigger KEDA to scale your app up automatically. Once the queue is empty and the cooldown period has passed (300 seconds by default), it scales back down to zero.
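
The scale-to-zero behavior can be tuned with a few spec-level fields on the ScaledObject. The fragment below is a sketch; the maxReplicaCount of 20 is an assumed value, while the other numbers match KEDA's documented defaults.

```yaml
spec:
  scaleTargetRef:
    name: myapp
  pollingInterval: 30   # seconds between checks of the event source
  cooldownPeriod: 300   # seconds to wait after the last activity before scaling to zero
  minReplicaCount: 0    # allow scale-to-zero
  maxReplicaCount: 20   # assumed upper bound for this example
  # triggers: unchanged from the ScaledObject above
```

Setting minReplicaCount to 1 instead keeps one warm replica, which trades a little cost for avoiding cold-start latency on the first message.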

Node Autoscaling

Scaling pods is only half the picture. If your cluster doesn't have enough nodes to schedule the new pods, you need to scale the infrastructure itself.

Cluster Autoscaler

The Cluster Autoscaler watches for pending pods that can't be scheduled. When it detects one, it increases the size of the node group (an Auto Scaling group on AWS), and the cloud provider spins up a new node.

On AWS, you need to configure permissions and deploy the controller yourself.
On Azure and GCP, it's often a checkbox in the managed service.

The operational downside: if you use large instance types in your node groups, even a small pod that cannot fit on existing nodes triggers a full-sized new node, which may be 90% idle. The Cluster Autoscaler also has a scale-down delay (default: 10 minutes) and will not remove a node if it cannot safely reschedule the pods on it. As a result, nodes can stay up longer than expected after a scale-down event.
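
The scale-down timing is controlled by flags on the Cluster Autoscaler itself. The fragments below are a sketch rather than a complete manifest: the flag values shown are the documented defaults, and the cloud provider is assumed to be AWS.

```yaml
# Fragment of the cluster-autoscaler container spec (not a complete manifest)
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --scale-down-unneeded-time=10m          # how long a node must be unneeded before removal
  - --scale-down-delay-after-add=10m        # pause scale-down evaluation after a scale-up
  - --scale-down-utilization-threshold=0.5  # below this requested/allocatable ratio a node is a removal candidate
---
# Fragment of a pod template: this annotation tells the autoscaler not to evict the pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```

A single pod carrying that annotation is enough to keep its node alive, which is a common reason nodes linger after load drops.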

Karpenter

Karpenter (developed by AWS, but compatible with other clouds) takes a smarter approach. Instead of scaling a fixed node group, it analyzes the pending pods and provisions instances (EC2 instances on AWS) whose CPU and memory closely fit the workload.

This is generally more efficient than the Cluster Autoscaler, though it's worth noting that if you rely heavily on DaemonSets (for logging, monitoring, etc.) and prefer fewer, larger nodes to minimize agent overhead, the Cluster Autoscaler approach may still be preferable.
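
As a rough sketch, Karpenter's constraints are declared in a NodePool. The example assumes the karpenter.sh/v1 API on AWS and an existing EC2NodeClass named default; field names differ in older Karpenter releases, so check the version you run.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumed to exist already
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: "100"                   # assumed cap on total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

Leaving instance types unconstrained lets Karpenter choose among many sizes. If you prefer fewer, larger nodes for DaemonSet-heavy clusters, adding a requirement on instance size is one way to express that.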

Serverless Kubernetes (Fargate)

If you want to avoid node management altogether, cloud providers offer serverless Kubernetes options. AWS Fargate, for example, spins up a dedicated node for each pod on demand.

This eliminates infrastructure management and wasted resources, but it comes at a significantly higher cost per CPU/memory unit compared to traditional EC2 instances.

Conclusion

Here's a quick summary of when to use each autoscaling strategy:

Strategy	What it Scales	Best For	Key Consideration
HPA	Pod replicas	Stateless apps with variable traffic	Requires resource requests and a metrics source
VPA	Pod resources (CPU/memory)	Stateful apps that can't scale horizontally	Use recommendation mode to avoid unexpected restarts
KEDA	Pod replicas (event-driven)	Queue-based and event-driven workloads	Can scale to zero for cost savings
Cluster Autoscaler	Nodes (via ASG/node groups)	Clusters with predictable instance types	May over-provision with large instance types
Karpenter	Nodes (right-sized instances)	Clusters with diverse workload sizes	More efficient node selection, less wasted capacity
Fargate	Serverless pods	Teams that want zero node management	Higher cost per unit of compute

In practice these tools are combined. A common production setup pairs HPA for pod replica scaling with Karpenter for right-sized node provisioning, and KEDA for any queue-driven workloads that need scale-to-zero. The important constraint is that node autoscaling only works when pods have resource requests defined. Without requests, the scheduler has no signal and the autoscaler cannot make informed decisions about whether a new node is actually needed.