
Kubernetes Autoscaling: HPA vs VPA vs KEDA vs CA vs Karpenter vs Fargate
Autoscaling in Kubernetes operates at two layers: pod replicas and cluster nodes. Getting both right matters for cost and reliability. Over-provisioned clusters waste money. Under-provisioned ones cause scheduling failures at the worst time. The tooling spans HPA for replica scaling, VPA for resource right-sizing, KEDA for event-driven workloads, and Karpenter or Cluster Autoscaler for node capacity.
This article covers each of them, their operational constraints, and why combining them correctly is what makes the difference in practice.
Horizontal Pod Autoscaler (HPA)
The most common approach to autoscaling is the HorizontalPodAutoscaler (HPA), which automatically adjusts the replica count on a Deployment or StatefulSet based on resource metrics like CPU or memory.
The HPA ships as a built-in Kubernetes API resource and controller, so no extra installation is needed. However, it does require a metrics source. The most common choice is metrics-server, which scrapes kubelet data from each node and exposes it through the Metrics API (metrics.k8s.io).
Some managed Kubernetes services like GKE include the metrics-server by default. For others, like EKS, you'll need to install it separately (via Helm or Terraform).
You can verify your setup with:
```shell
kubectl top pods
```
Configuring HPA
For the HPA to work, your pods must have resource requests defined. The HPA uses requests (not limits) to calculate utilization percentages.
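As a minimal sketch of what "requests defined" means, the Deployment below sets CPU and memory requests on its container. The image name, labels, and request values are placeholders chosen for illustration, not recommendations.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  # No static `replicas` field: the HPA owns the replica count.
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0.0  # placeholder image
          resources:
            requests:
              cpu: 250m      # utilization is measured against this value
              memory: 256Mi
```

With a 250m CPU request, an 80% utilization target corresponds to an average of 200m per pod. The HPA then computes the desired replica count as ceil(currentReplicas × currentUtilization / targetUtilization), so 4 replicas averaging 120% against an 80% target would scale to ceil(4 × 120 / 80) = 6.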
A basic HPA configuration targeting CPU and memory might look like this:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```
GitOps note: If you use ArgoCD or FluxCD, avoid setting a static replicas field on your Deployment. The HPA and your GitOps tool will constantly fight over the replica count.
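If removing the replicas field from the manifest is not practical, Argo CD can instead be told to ignore differences on that field. The fragment below is a sketch of that option, assuming an Argo CD Application that manages the myapp Deployment; the other required Application fields (project, source, destination) are omitted.

```yaml
# Fragment of an Argo CD Application spec; other required fields omitted.
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      name: myapp            # assumed to match the Deployment the HPA targets
      jsonPointers:
        - /spec/replicas     # do not treat HPA-driven replica changes as drift
```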
HPA Behavior: Scale-Down Delay
By default, the HPA waits 5 minutes before scaling down after a load drop. This is intentional: it prevents flapping, where a brief traffic dip triggers a scale-down only for load to return immediately and force a scale-up. The behavior field in autoscaling/v2 lets you tune these stabilization windows per direction:
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
  scaleUp:
    stabilizationWindowSeconds: 0
```
For bursty workloads, aggressive scale-up (stabilization 0) with conservative scale-down (300s+) is a common production pattern. Scale-down flapping is a real operational pain during incident response when you need replicas to stay up.
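Alongside stabilization windows, the behavior field also accepts rate policies that cap how fast replicas are added or removed. The sketch below goes under spec in the HPA; the specific limits (4 pods per 15 seconds up, 50% per minute down) are illustrative assumptions, not tuned values.

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Pods
        value: 4             # add at most 4 pods...
        periodSeconds: 15    # ...in any 15-second period
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 50            # remove at most 50% of current replicas...
        periodSeconds: 60    # ...in any 60-second period
```

Capping scale-down by percentage keeps a large replica set from collapsing in one step once the stabilization window expires.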
Custom Metrics with Prometheus
CPU and memory aren't always the best signals for scaling. A better approach is to use application-level metrics, with the four golden signals being a solid starting point: latency, traffic, errors, and saturation.
For example, if a single pod can handle 100 requests per second, you'd want to scale based on request rate, not CPU. To do this, you need:
- Prometheus Operator: manages the lifecycle of Prometheus instances and converts ServiceMonitors/PodMonitors into native Prometheus config.
- Prometheus instance: stores your metrics.
- Prometheus Adapter: converts Prometheus metrics and exposes them through the custom metrics API (custom.metrics.k8s.io) so the HPA can consume them.
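To make the adapter's role concrete, here is a rough sketch of a Prometheus Adapter rule that turns a request counter into a per-second rate. It assumes the application exports a counter named http_requests_total carrying namespace and pod labels; the metric name and the 2-minute rate window are illustrative assumptions.

```yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'  # assumed metric name
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"   # exposed as http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```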
Once set up, you can also replace the metrics-server entirely with Prometheus + cAdvisor for pod metrics, and Node Exporter for node metrics.
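For the 100 requests per second example, an HPA that scales on request rate might look like the sketch below. The metric name http_requests_per_second is hypothetical; the real name depends on how your adapter rules expose the metric.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second  # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "100"             # target 100 req/s per pod
```

Because the target is an average value per pod rather than a utilization percentage, this metric does not depend on resource requests.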
Vertical Pod Autoscaler (VPA)
Some applications, particularly stateful ones like standalone Postgres or MySQL, can't scale horizontally. For these, you need to scale up the existing pod by adding more CPU or memory. That's where the Vertical Pod Autoscaler (VPA) comes in.
VPA is not included with Kubernetes and must be installed separately. It offers three modes:
| Mode | Behavior |
|---|---|
| Recreate | Evicts and recreates pods with new resource recommendations. Risky for databases. |
| Initial | Sets requests/limits only at pod creation time. |
| Off (recommendation only) | Provides recommendations without taking action. Most useful for planning. |
The recommendation-only mode is especially handy: describe the VPA object, review its suggestions, and apply them during a maintenance window. It is also useful during the initial deployment of a new service where you do not yet have a good baseline for resource requests.
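A recommendation-only VPA can be sketched as follows, assuming the VPA components are installed and a StatefulSet named postgres exists (both the object name and the target are hypothetical).

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres            # hypothetical target
  updatePolicy:
    updateMode: "Off"         # recommend only; never evict or resize pods
```

Running kubectl describe vpa postgres-vpa should then show the suggested requests in the object's status once the recommender has collected enough usage data.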
HPA vs VPA: Don't Mix Them
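Do not let HPA and VPA both act on CPU or memory for the same Deployment or StatefulSet. VPA changes the requests that the HPA uses to compute utilization, so the two controllers work against each other and may disrupt your workloads. As a rule of thumb: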
- Use HPA for stateless, horizontally scalable applications.
- Use VPA (in recommendation mode) for stateful apps that can't scale horizontally.
Event-Driven Autoscaling with KEDA
Many modern architectures rely on messaging systems, like Apache Kafka and RabbitMQ, to decouple microservices. In these setups, the right scaling signal isn't CPU. It's the number of messages in a queue.
KEDA (Kubernetes Event-driven Autoscaling) solves this. It monitors an event source and scales your application accordingly. One major advantage: KEDA can scale your application down to zero when there are no messages, which is great for cost savings.
KEDA supports a wide range of scalers including AWS DynamoDB, Apache Kafka, MySQL, etcd, RabbitMQ, and many more.
Getting Started with KEDA
Deploy via a single Helm chart (no extra dependencies), then configure a ScaledObject:
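```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-scaledobject
spec:
  scaleTargetRef:
    name: myapp
  triggers:
    - type: rabbitmq
      metadata:
        queueName: myqueue
        # mode/value replace the deprecated queueLength parameter
        mode: QueueLength
        value: "5"                  # 1 replica per 5 messages
        # Connection string is read from this env var on the target
        # container; the variable name is an assumption.
        hostFromEnv: RABBITMQ_HOST
```
With this setup, publishing messages to the queue will trigger KEDA to scale your app up automatically. Once all messages are processed, it scales back down to zero.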
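Scale-to-zero and the scaling bounds can also be made explicit on the ScaledObject. The sketch below spells out fields that otherwise fall back to KEDA defaults; the values and the RABBITMQ_HOST environment variable name are assumptions for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-scaledobject
spec:
  scaleTargetRef:
    name: myapp
  pollingInterval: 30      # seconds between checks of the queue
  cooldownPeriod: 300      # seconds to wait after the last activity before scaling to zero
  minReplicaCount: 0       # allow scale-to-zero
  maxReplicaCount: 20      # illustrative upper bound
  triggers:
    - type: rabbitmq
      metadata:
        queueName: myqueue
        mode: QueueLength
        value: "5"
        hostFromEnv: RABBITMQ_HOST  # assumed env var on the target container
```

KEDA manages an HPA for the target behind the scenes, so avoid defining a separate HPA for the same Deployment.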
Node Autoscaling
Scaling pods is only half the picture. If your cluster doesn't have enough nodes to schedule the new pods, you need to scale the infrastructure itself.
Cluster Autoscaler
The Cluster Autoscaler watches for pending pods that can't be scheduled. When it detects one, it increases the desired size of the underlying node group (an Auto Scaling group on AWS), and the cloud provider spins up a new node.
- On AWS, you need to configure permissions and deploy the controller yourself.
- On Azure and GCP, it's often a checkbox in the managed service.
The operational downside: if your node groups use large instance types, even a small pod that cannot fit on existing nodes triggers a full-sized new node, which may then sit mostly idle. The Cluster Autoscaler also waits before scaling down (10 minutes by default) and will not remove a node if it cannot safely reschedule the pods running on it, so nodes can stay up longer than expected after load drops.
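These delays map to Cluster Autoscaler command-line flags. The fragment below is a sketch of the relevant container arguments, assuming a self-managed deployment on AWS; the two timing values shown are the documented defaults, and the expander choice is illustrative.

```yaml
# Fragment of the cluster-autoscaler container spec; other fields omitted.
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --scale-down-unneeded-time=10m      # how long a node must be unneeded before removal
  - --scale-down-delay-after-add=10m    # pause scale-down evaluation after a scale-up
  - --expander=least-waste              # prefer the node group that leaves the least idle capacity
```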
Karpenter
Karpenter (developed by AWS, but compatible with other clouds) takes a smarter approach. Instead of scaling a fixed node group, it analyzes the pending pods and provisions EC2 instances whose CPU and memory closely match what the workload requests.
This is generally more efficient than the Cluster Autoscaler, though it's worth noting that if you rely heavily on DaemonSets (for logging, monitoring, etc.) and prefer fewer, larger nodes to minimize agent overhead, the Cluster Autoscaler approach may still be preferable.
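A minimal NodePool sketch shows how this is expressed in configuration. It assumes the karpenter.sh/v1 API on AWS and an existing EC2NodeClass named default; the requirements, CPU limit, and consolidation timing are illustrative, and field names may differ in older Karpenter releases.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default               # assumed to exist already
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
  limits:
    cpu: "100"                      # cap total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

Leaving instance types unconstrained lets Karpenter pick from a wide range of sizes; teams that prefer fewer, larger nodes can narrow the requirements accordingly.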
Serverless Kubernetes (Fargate)
If you want to avoid node management altogether, cloud providers offer serverless Kubernetes options. AWS Fargate, for example, spins up a dedicated node for each pod on demand.
This eliminates infrastructure management and wasted resources, but it comes at a significantly higher cost per CPU/memory unit compared to traditional EC2 instances.
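On EKS, pods are routed to Fargate through Fargate profiles that select namespaces (and optionally labels). The sketch below assumes the cluster is managed with eksctl; the cluster name, region, and namespace are placeholders.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster          # placeholder cluster name
  region: us-east-1         # placeholder region
fargateProfiles:
  - name: fp-serverless
    selectors:
      - namespace: serverless   # pods in this namespace run on Fargate
```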
Conclusion
Here's a quick summary of when to use each autoscaling strategy:
| Strategy | What it Scales | Best For | Key Consideration |
|---|---|---|---|
| HPA | Pod replicas | Stateless apps with variable traffic | Requires resource requests and a metrics source |
| VPA | Pod resources (CPU/memory) | Stateful apps that can't scale horizontally | Use recommendation mode to avoid unexpected restarts |
| KEDA | Pod replicas (event-driven) | Queue-based and event-driven workloads | Can scale to zero for cost savings |
| Cluster Autoscaler | Nodes (via ASG/node groups) | Clusters with predictable instance types | May over-provision with large instance types |
| Karpenter | Nodes (right-sized instances) | Clusters with diverse workload sizes | More efficient node selection, less wasted capacity |
| Fargate | Serverless pods | Teams that want zero node management | Higher cost per unit of compute |
In practice these tools are combined. A common production setup pairs HPA for pod replica scaling with Karpenter for right-sized node provisioning, and KEDA for any queue-driven workloads that need scale-to-zero. The important constraint is that node autoscaling only works when pods have resource requests defined — without requests, the scheduler has no signal and the autoscaler cannot make informed decisions about whether a new node is actually needed.