Karpenter has quietly become the default node autoscaler for production Kubernetes on AWS, and now that the v1 API is finally stable, teams that fully lean into its consolidation, disruption budgets, and Spot-first NodePools are routinely cutting EKS compute bills by 40% to 70%. The catch? Most clusters still run on the Helm chart defaults, leaving 20% to 30% of compute spend on the table. So let's fix that.
This guide walks through the exact NodePool, EC2NodeClass, and disruption configuration that produces real savings in 2026 — with copy-paste YAML and the gotchas that quietly inflate your bill.
If you're migrating from Cluster Autoscaler (or limping along on an early Karpenter v0.32 setup), this is the upgrade path that pays for itself in the first billing cycle. Honestly, I've seen it happen — one team I worked with shaved nearly $18k off their monthly EKS bill within two weeks, and they hadn't even touched their pod requests yet.
Why Karpenter Beats Cluster Autoscaler on Cost
Cluster Autoscaler scales pre-defined Auto Scaling Groups. You pick instance families up front, the ASG scales nodes of that family, and bin-packing is constrained by whatever you guessed at provisioning time. Karpenter takes the opposite approach: it reads pending pods, computes the cheapest EC2 instance that fits the resource requests, and provisions that instance directly through the EC2 Fleet API.
The cost implications are pretty significant:
- Better bin-packing. Karpenter consolidates workloads onto fewer, larger nodes when that's cheaper, then switches to smaller nodes when demand drops.
- Mixed instance types per pool. A single NodePool can pull from dozens of instance families and sizes, so Karpenter always picks the cheapest matching SKU at provision time.
- Spot-first scheduling. Karpenter prioritizes Spot capacity by default and gracefully falls back to On-Demand only when Spot is unavailable.
- Faster scale-up. Provisioning takes seconds rather than minutes, which means you can run leaner steady-state capacity without sacrificing burst performance.
According to AWS EKS Best Practices and case studies from production users, Karpenter v1 typically delivers 30% to 50% EC2 reductions through consolidation alone — and up to 90% on burstable CI/CD workloads when paired with Spot.
Step 1: Install Karpenter v1 on EKS
Karpenter v1.0 shipped a stable API, so you no longer need to chase the pre-v1 alpha and beta CRDs (small mercy, but a real one). Install via the official Helm chart, and run the controller on Fargate so node disruption never takes Karpenter itself offline.
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
--version "1.2.0" \
--namespace kube-system \
--set "settings.clusterName=${CLUSTER_NAME}" \
--set "settings.interruptionQueue=${CLUSTER_NAME}" \
--set controller.resources.requests.cpu=1 \
--set controller.resources.requests.memory=1Gi \
--wait
Then create a Fargate profile so the controller pods land on Fargate, not on Karpenter-managed nodes:
eksctl create fargateprofile \
--cluster ${CLUSTER_NAME} \
--name karpenter \
--namespace kube-system \
--labels app.kubernetes.io/name=karpenter
This eliminates the chicken-and-egg failure mode where Karpenter can't provision a replacement for the node it was running on. Trust me, you do not want to debug that one at 2am.
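A quick sanity check that the controller actually landed on Fargate and not on a node it manages (the label selector matches the Helm chart values above; the NODE column should show fargate-* names):
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter -o wide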
Step 2: Build a Cost-Optimized EC2NodeClass
The EC2NodeClass defines the AWS-specific properties of nodes Karpenter creates. The biggest cost levers here are the AMI family and the root volume — gp3 at 50 GiB is plenty for most stateless workloads.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: AL2023
amiSelectorTerms:
- alias: al2023@latest
role: "KarpenterNodeRole-${CLUSTER_NAME}"
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: ${CLUSTER_NAME}
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: ${CLUSTER_NAME}
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 50Gi
volumeType: gp3
iops: 3000
throughput: 125
encrypted: true
deleteOnTermination: true
metadataOptions:
httpEndpoint: enabled
httpProtocolIPv6: disabled
httpPutResponseHopLimit: 1
httpTokens: required
tags:
cost-center: platform
managed-by: karpenter
A couple of cost notes worth flagging: gp3 is roughly 20% cheaper than gp2 at equivalent IOPS, and capping IOPS/throughput at the gp3 baseline avoids surprise charges. And tag every node — your cost allocation reports are basically useless if you can't break spend down by team or environment.
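Before building NodePools on top of it, it's worth confirming the node class resolved its AMIs, subnets, and security groups. A minimal check, reusing the same jq style as later in this guide (condition names can shift slightly between Karpenter versions):
kubectl get ec2nodeclass default -o json | jq '.status.conditions'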
Step 3: The Spot-First NodePool That Saves 60%+
This is where most of the savings actually come from. Three rules separate cost-effective NodePools from defaults:
- Allow a wide range of instance families and sizes — diversity drives Spot availability.
- Prefer Spot, fall back to On-Demand only when needed.
- Use WhenEmptyOrUnderutilized consolidation with disruption budgets.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: default
spec:
template:
metadata:
labels:
workload-tier: standard
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["5"]
- key: karpenter.k8s.aws/instance-cpu
operator: In
values: ["2", "4", "8", "16", "32"]
expireAfter: 720h
terminationGracePeriod: 5m
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 1m
budgets:
- nodes: "20%"
- nodes: "0"
schedule: "0 9 * * mon-fri"
duration: 8h
reasons: ["Underutilized"]
limits:
cpu: 1000
memory: 1000Gi
weight: 100
A few of these choices are deliberate, so let me walk through them:
- Both amd64 and arm64 are allowed. Karpenter will pick Graviton when your image supports it, saving an additional 20% on price-performance.
- Requiring an instance generation greater than 5 excludes ancient instance types that no longer offer the best price-performance, but still leaves the c6, m6, r6, c7, m7, and r7 families in play.
- The 20% disruption budget caps how many nodes Karpenter can consolidate at once — protecting against thundering-herd reschedules.
- The time-based budget with nodes: "0" from 9am Monday to Friday for 8 hours blocks underutilization-driven consolidations during business hours, when traffic spikes are most likely.
Step 4: A Stable NodePool for Single-Replica and Stateful Workloads
Aggressive consolidation works beautifully for stateless replicas. For databases, stateful sets, and single-pod controllers? Not so much. Build a second pool with conservative disruption settings, then taint it so only opted-in workloads land there.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: stable
spec:
template:
metadata:
labels:
workload-tier: stable
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
taints:
- key: workload-tier
value: stable
effect: NoSchedule
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["m", "r"]
expireAfter: 2160h
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 5m
budgets:
- nodes: "10%"
weight: 50
Workloads that need this pool just add a matching toleration:
tolerations:
- key: workload-tier
operator: Equal
value: stable
effect: NoSchedule
nodeSelector:
workload-tier: stable
This dual-pool pattern lets you run the cheap, aggressively-consolidated default for ~80% of workloads and reserve the more expensive, less-disruptive pool for the 20% that genuinely need it. It's the single most impactful pattern I've shipped in the last year.
Step 5: Enable Spot Interruption Handling
Spot only works if your cluster reacts gracefully to AWS's two-minute interruption notice. Karpenter listens to interruption events through an SQS queue, cordons the node, and provisions a replacement before pods are evicted.
Provision the queue with Terraform:
resource "aws_sqs_queue" "karpenter_interruption" {
name = var.cluster_name
message_retention_seconds = 300
sqs_managed_sse_enabled = true
}
resource "aws_cloudwatch_event_rule" "spot_interruption" {
name = "${var.cluster_name}-spot-interruption"
description = "Capture EC2 Spot Instance Interruption Warnings"
event_pattern = jsonencode({
source = ["aws.ec2"]
detail-type = ["EC2 Spot Instance Interruption Warning"]
})
}
resource "aws_cloudwatch_event_target" "spot_interruption" {
rule = aws_cloudwatch_event_rule.spot_interruption.name
target_id = "KarpenterInterruptionQueue"
arn = aws_sqs_queue.karpenter_interruption.arn
}
Add equivalent rules for EC2 Instance Rebalance Recommendation, EC2 Instance State-change Notification, and AWS Health Event to handle scheduled maintenance and AZ rebalancing. Skipping any of these is a real "future-you problem."
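As a sketch of what those additions look like, here is the rebalance-recommendation rule plus the SQS queue policy EventBridge needs before it can deliver anything into the queue (resource names are illustrative; adjust to your naming conventions):
resource "aws_cloudwatch_event_rule" "rebalance_recommendation" {
  name        = "${var.cluster_name}-rebalance-recommendation"
  description = "Capture EC2 Instance Rebalance Recommendations"
  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Instance Rebalance Recommendation"]
  })
}

resource "aws_cloudwatch_event_target" "rebalance_recommendation" {
  rule      = aws_cloudwatch_event_rule.rebalance_recommendation.name
  target_id = "KarpenterInterruptionQueue"
  arn       = aws_sqs_queue.karpenter_interruption.arn
}

# Without a queue policy, EventBridge cannot deliver messages into the queue.
resource "aws_sqs_queue_policy" "karpenter_interruption" {
  queue_url = aws_sqs_queue.karpenter_interruption.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "AllowEventBridge"
      Effect    = "Allow"
      Principal = { Service = ["events.amazonaws.com", "sqs.amazonaws.com"] }
      Action    = "sqs:SendMessage"
      Resource  = aws_sqs_queue.karpenter_interruption.arn
    }]
  })
}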
Step 6: Right-Size Pod Requests Before Trusting Consolidation
Karpenter's bin-packing depends entirely on your resources.requests values. If pods over-request, Karpenter provisions oversized nodes and consolidation has nothing to compress. If pods under-request, consolidation packs them too tight and triggers OOMKills. Either way, you're leaving money on the table — or worse, paging your on-call.
Two practical rules I keep coming back to:
- For memory, set requests == limits. Memory is non-compressible. Without this, an aggressive consolidation can pack a node so tightly that any burst pushes a pod over its working set and triggers an OOM.
- For CPU, set requests based on observed p95 usage. Use VPA in recommendation mode or Goldilocks to surface the right values, then apply them to your manifests.
A minimal Goldilocks install:
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace
kubectl label namespace your-app-namespace goldilocks.fairwinds.com/enabled=true
After 24 to 48 hours, the Goldilocks dashboard surfaces request recommendations grounded in actual usage. It's worth the wait.
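Once you trust the numbers, applying them is just editing the container spec. A minimal sketch following the two rules above (the name, image, and values are illustrative, not recommendations):
containers:
  - name: app                          # hypothetical container
    image: your-registry/app:1.4.2     # placeholder image
    resources:
      requests:
        cpu: 250m                      # from observed p95 CPU usage
        memory: 512Mi                  # memory requests == limits
      limits:
        memory: 512Mi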
Step 7: Monitor Karpenter Cost Impact
Without a feedback loop, you can't tell whether your tuning is actually helping (or quietly making things worse). The minimum monitoring set:
- The Karpenter metrics endpoint exposes karpenter_nodes_created_total, karpenter_nodes_terminated_total, and karpenter_pods_startup_duration_seconds. Scrape them into Prometheus and chart them in Grafana.
- The AWS Cost and Usage Report (CUR), filtered by the karpenter.sh/nodepool tag, shows spend per pool over time.
- Kubecost or OpenCost attributes node spend back to namespaces, deployments, and labels — really the only way to prove which app actually saved money.
A Grafana panel that compares karpenter_nodes_created_total{capacity_type="spot"} versus capacity_type="on-demand" tells you immediately whether your Spot strategy is working. A healthy production cluster should see 70% to 90% of new nodes provisioned as Spot.
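If you want that as a single number rather than two lines on a chart, here is a PromQL sketch, assuming the metric and label names quoted above match your Karpenter version:
# fraction of new nodes provisioned as Spot over the trailing 7 days
sum(increase(karpenter_nodes_created_total{capacity_type="spot"}[7d]))
  /
sum(increase(karpenter_nodes_created_total[7d]))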
Step 8: Drift Management for Continuous Optimization
When you change an EC2NodeClass — say, switching to a newer AMI or a smaller root volume — Karpenter detects the drift and replaces the affected nodes within your disruption budgets. This is how cost optimizations roll out automatically once they merge:
kubectl get nodeclaims -o json | jq '.items[] | {name: .metadata.name, drifted: .status.conditions[] | select(.type=="Drifted").status}'
Pair drift with monthly AMI updates and a quarterly review of allowed instance families to keep your cluster on the price-performance frontier.
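For example, pinning the AMI alias in the EC2NodeClass to a specific release instead of latest is enough to mark every existing node as drifted and let Karpenter roll the fleet inside your budgets (the version string below is illustrative; substitute a real AL2023 release):
amiSelectorTerms:
  - alias: al2023@v20240807   # pinned release instead of @latest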
Common Pitfalls That Kill Karpenter Savings
- Restricting to a single instance family. Limiting the pool to m5.large alone kills Spot availability and consolidation flexibility. Always allow at least three families and four sizes.
- PodDisruptionBudgets blocking everything. A PDB with maxUnavailable: 0 permanently pins a node. Use percentage-based PDBs and verify Karpenter can still drain.
- Running Karpenter on Karpenter-managed nodes. A consolidation event that removes the controller pod prevents Karpenter from provisioning its replacement. Always use Fargate or a managed node group for the controller.
- Forgetting do-not-disrupt annotations on long jobs. Batch jobs that take longer than the consolidation interval should set karpenter.sh/do-not-disrupt: "true" to avoid mid-job termination; see the sketch after this list.
- Setting consolidateAfter too low. Values under 30 seconds cause thrashing during traffic spikes. Start at 1m and tune from there.
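Here is the do-not-disrupt sketch referenced above: a batch Job whose pods opt out of voluntary disruption for their lifetime (the job name, image, and resource sizes are hypothetical):
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  backoffLimit: 2
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"   # Karpenter will not voluntarily disrupt this pod
    spec:
      restartPolicy: Never
      containers:
        - name: report
          image: your-registry/report-runner:latest   # placeholder image
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              memory: 4Gi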
Real-World Savings Benchmarks for 2026
Based on aggregated case studies from production EKS clusters running Karpenter v1 in 2026:
- Stateless web workloads: 40% to 55% reduction versus Cluster Autoscaler on-demand baseline.
- CI/CD runners on Spot: 75% to 90% reduction. Build queues tolerate Spot interruptions almost perfectly.
- Mixed microservice clusters: 30% to 50% reduction with the dual NodePool pattern described above.
- ML inference (Graviton): 20% additional savings on top of Spot when migrating to ARM64.
The path to the upper end of these ranges is the same in every case: broad instance diversity, Spot-first capacity types, aggressive consolidation with disruption budgets, and accurate pod requests. There's no secret sauce — just a handful of knobs turned in the right direction.
Frequently Asked Questions
What is the difference between Karpenter and Cluster Autoscaler?
Cluster Autoscaler scales pre-configured EC2 Auto Scaling Groups, which means you have to define instance families and sizes up front. Karpenter provisions nodes directly through the EC2 Fleet API based on the actual scheduling needs of pending pods, so it can pick the cheapest matching instance type from a wide pool, mix Spot and On-Demand intelligently, and consolidate workloads automatically. In practice, Karpenter scales faster (seconds vs minutes) and reduces compute costs by 30% to 50% on typical EKS clusters.
How much can Karpenter save on AWS EKS in 2026?
Production deployments report 30% to 70% EC2 cost reductions, with 75% to 90% savings on Spot-eligible workloads such as CI/CD pipelines, batch jobs, and stateless microservices. Savings depend on three factors: how broadly you allow instance type diversity, how aggressively you enable consolidation, and how accurate your pod resource requests are.
Is Karpenter safe for production with Spot instances?
Yes — when properly configured. Enable interruption handling through an SQS queue so Karpenter receives the AWS two-minute Spot interruption notice, cordons the node, and provisions a replacement before pods are evicted. Pair this with PodDisruptionBudgets, multiple replicas, and a separate stable NodePool for stateful or single-replica workloads. Major companies run mission-critical workloads on Karpenter with Spot at 80%+ of total capacity.
What does the WhenEmptyOrUnderutilized consolidation policy do?
It tells Karpenter to actively replace underutilized nodes with smaller or cheaper alternatives, in addition to deleting fully empty nodes. This is the most aggressive cost-saving consolidation policy and was renamed from WhenUnderutilized in the v1 API. Use it on stateless workload pools combined with disruption budgets to control the rate of node replacement.
Should I use Karpenter on Azure AKS or GCP GKE?
Karpenter started as an AWS-specific project but has been generalized into a Kubernetes SIG. Azure AKS now offers Node Auto Provisioning (NAP), which is built on Karpenter under the hood and provides similar capabilities for Azure spot VMs. GCP doesn't yet have a first-class Karpenter provider — GKE Autopilot remains the recommended autoscaler there. If you run multi-cloud, expect Karpenter on EKS, Karpenter-based NAP on AKS, and Autopilot on GKE in 2026.