Why Automated Cloud Waste Cleanup Matters in 2026
Cloud waste is a stubborn problem. Despite years of FinOps advocacy and an ever-growing ecosystem of cost management tools, commonly cited estimates still put waste at 30–40% of cloud budgets, spent on idle resources, unattached storage volumes, orphaned snapshots, and dev/test environments running 24/7. On a global cloud spending base on the order of $1 trillion, that waste adds up to hundreds of billions of dollars every year.
The root cause is honestly pretty simple: cloud resources are easy to create and easy to forget.
A developer spins up an EC2 instance for a proof-of-concept, finishes the test, and moves on — leaving the instance running indefinitely. A database snapshot accumulates daily for months after the migration project wraps up. An Elastic IP sits unattached, quietly generating charges. Multiply those scenarios across hundreds of engineers and thousands of resources, and you've got a waste problem that no amount of monthly billing reviews can keep up with.
This is where Cloud Custodian comes in. It's an open-source, CNCF Incubating project that lets you define cost governance policies in simple YAML files and enforces them automatically — across AWS, Azure, and GCP. Instead of writing ad hoc cleanup scripts, you describe the state you want (no unattached volumes, no untagged instances, no dev environments running at 2 AM) and Custodian handles the rest.
In this guide, we'll install Cloud Custodian, write practical cost optimization policies for all three major clouds, deploy them as serverless functions for continuous enforcement, and set up multi-account governance with c7n-org. Treat each policy as a starting point: validate it and dry-run it against your own accounts before enabling its actions.
What Is Cloud Custodian?
Cloud Custodian (also known as c7n) is a rules engine for cloud security, cost optimization, and governance. Originally created at Capital One and now a CNCF Incubating project under the Apache 2.0 license, it's been adopted by thousands of organizations including Intuit, Microsoft, Amazon, Procter & Gamble, and Cox Automotive.
What makes Custodian unique is its declarative, YAML-based policy language. Each policy has three parts:
- Resource: the cloud resource type to target (e.g., aws.ec2, azure.vm, gcp.instance)
- Filters: conditions that select which resources the policy applies to (age, tags, utilization metrics, configuration attributes)
- Actions: what to do with matched resources (stop, terminate, delete, tag, notify, snapshot)
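To make the three parts concrete, here is a minimal Python sketch of the resource, filters, actions flow. It is an illustrative model only, not Custodian's internal API; `run_policy` and the sample inventory are hypothetical names invented for this example.

```python
# Illustrative model only: this is NOT Cloud Custodian's internal API.
from typing import Callable

Resource = dict
Filter = Callable[[Resource], bool]
Action = Callable[[Resource], str]


def run_policy(resources: list[Resource], filters: list[Filter],
               actions: list[Action], dryrun: bool = True) -> list[Resource]:
    """Select resources that pass every filter, then apply each action in order."""
    matched = [r for r in resources if all(f(r) for f in filters)]
    if not dryrun:
        for r in matched:
            for act in actions:
                print(act(r))
    return matched


# Hypothetical inventory standing in for a cloud API response.
volumes = [
    {"VolumeId": "vol-1", "State": "available", "Size": 500},
    {"VolumeId": "vol-2", "State": "in-use", "Size": 100},
]

matched = run_policy(
    volumes,
    filters=[lambda r: r["State"] == "available"],
    actions=[lambda r: f"would delete {r['VolumeId']}"],
)
print([r["VolumeId"] for r in matched])  # → ['vol-1']
```

Filters are combined with AND by default, which is why the sketch uses `all(...)`; the `dryrun` flag mirrors the idea of evaluating filters without running actions.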
Custodian supports over 500 resource types across AWS, Azure, GCP, Kubernetes, and OCI. For cost optimization specifically, it really shines at:
- Garbage collection — deleting unattached volumes, orphaned snapshots, unused Elastic IPs, and stale load balancers
- Off-hours scheduling — automatically stopping dev/test resources outside business hours
- Tag enforcement — identifying untagged resources and applying mark-for-deletion workflows
- Utilization-based cleanup — finding instances with consistently low CPU or network activity
- Right-sizing recommendations — flagging overprovisioned resources
Installation and Setup
Cloud Custodian is distributed as Python packages. The core package (c7n) covers AWS, with separate packages for Azure (c7n_azure) and GCP (c7n_gcp).
Install via pip
# Create a virtual environment (recommended)
python3 -m venv custodian
source custodian/bin/activate
# Install core (AWS support)
pip install c7n
# Install Azure support
pip install c7n_azure
# Install GCP support
pip install c7n_gcp
# Install the notification mailer (Slack, email, webhooks)
pip install c7n_mailer
# Verify installation
custodian version
Install via Docker
If you'd rather not mess with Python environments (totally fair), Docker works great:
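# Pull the official image
docker pull cloudcustodian/c7n
# Run a policy using Docker; mount an output directory so results persist.
# The image runs as the "custodian" user, so mount under /home/custodian.
docker run -it \
  -v $(pwd)/output:/home/custodian/output \
  -v $(pwd)/policies:/home/custodian/policies \
  -v ~/.aws/credentials:/home/custodian/.aws/credentials \
  cloudcustodian/c7n run -s /home/custodian/output /home/custodian/policies/cost-cleanup.yml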
Authentication
Cloud Custodian uses your standard cloud provider credentials — nothing proprietary to configure:
- AWS: Reads from ~/.aws/credentials, environment variables (AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY), or IAM instance roles
- Azure: Uses Azure CLI authentication (az login), service principals, or managed identities
- GCP: Uses Application Default Credentials or a service account JSON key via GOOGLE_APPLICATION_CREDENTIALS
Essential AWS Cost Optimization Policies
The following policies target the most common sources of AWS cloud waste. Each one is a standalone YAML file you can validate, dry-run, and deploy immediately.
Delete Unattached EBS Volumes
Unattached EBS volumes are one of the most common forms of cloud waste — and one of the easiest wins. When an EC2 instance is terminated, its attached EBS volumes often persist, racking up storage charges with no workload to justify them. At $0.08/GB/month for gp3, a forgotten 500 GB volume costs $40/month doing absolutely nothing.
# aws-delete-unattached-ebs.yml
policies:
  - name: delete-unattached-ebs-volumes
    resource: aws.ebs
    description: |
      Find EBS volumes that are unattached and were created more than
      14 days ago, then delete them. Volume age is only a proxy for a
      grace period: the EC2 API does not report when a volume was
      detached.
    filters:
      # "available" means the volume is not attached to any instance
      - type: value
        key: State
        value: available
      - type: value
        key: CreateTime
        value_type: age
        op: greater-than
        value: 14
    actions:
      - type: delete
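Before enabling the delete action, it helps to put a dollar figure on what the policy would reclaim. The sketch below is a rough estimate that applies the $0.08/GB/month gp3 figure quoted above to a list of volume records shaped like the EC2 API output; `unattached_monthly_cost` and the sample data are hypothetical, and real prices vary by region and volume type.

```python
GP3_PRICE_PER_GB_MONTH = 0.08  # figure quoted in the text; varies by region


def unattached_monthly_cost(volumes, price=GP3_PRICE_PER_GB_MONTH):
    """Sum the monthly storage cost of volumes in the 'available' state."""
    return sum(v["Size"] * price for v in volumes if v.get("State") == "available")


# Hypothetical sample records; Size is in GiB, as in the EC2 API.
sample = [
    {"VolumeId": "vol-0aaa", "State": "available", "Size": 500},
    {"VolumeId": "vol-0bbb", "State": "in-use", "Size": 200},
]

print(f"${unattached_monthly_cost(sample):.2f}/month")  # → $40.00/month
```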
Remove Orphaned Snapshots
EBS snapshots accumulate silently. Automated backup scripts, AMI creation workflows, and manual snapshots all generate copies that persist long after the original volume or instance is gone. Snapshot storage costs $0.05/GB/month — and I've seen organizations with thousands of orphaned snapshots easily wasting $5,000–$10,000 monthly without realizing it.
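# aws-cleanup-orphaned-snapshots.yml
policies:
  - name: delete-orphaned-snapshots
    resource: aws.ebs-snapshot
    description: |
      Delete snapshots older than 90 days that carry the placeholder
      volume ID vol-ffffffff and do not back a registered AMI.
    filters:
      - type: age
        days: 90
      # vol-ffffffff is the placeholder AWS reports when a snapshot has
      # no usable source volume ID (for example, copied snapshots). It
      # does not catch every snapshot whose source volume was deleted.
      - type: value
        key: VolumeId
        value: vol-ffffffff
      # Skip snapshots that still back an AMI
      - type: skip-ami-snapshots
        value: true
    actions:
      - delete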
Release Unattached Elastic IPs
Since February 2024, AWS charges $0.005/hour (about $3.65/month) for every public IPv4 address, and an Elastic IP that is allocated but not associated with anything incurs that charge while doing no work. It's a small amount per IP, but organizations with dozens of orphaned IPs see it add up surprisingly fast.
# aws-release-unused-eips.yml
policies:
  - name: release-unused-elastic-ips
    resource: aws.elastic-ip
    description: |
      Release Elastic IPs that are not associated with any instance.
    filters:
      - type: value
        key: AssociationId
        value: absent
    actions:
      - release
Stop Idle EC2 Instances
Instances with consistently low CPU utilization are strong candidates for stopping or right-sizing. This policy uses CloudWatch metrics to find instances averaging less than 5% CPU over the past 7 days. Note that it explicitly excludes production — you don't want to accidentally stop something important.
# aws-stop-idle-instances.yml
policies:
  - name: stop-idle-ec2-instances
    resource: aws.ec2
    description: |
      Stop running EC2 instances with average CPU utilization below
      5% over the past 7 days. Tags them first for visibility.
    filters:
      # Match on the instance state reported by the EC2 API
      - "State.Name": running
      - type: metrics
        name: CPUUtilization
        statistics: Average
        days: 7
        value: 5
        op: less-than
      # Require an Environment tag, and skip anything tagged production
      - "tag:Environment": present
      - not:
          - "tag:Environment": production
    actions:
      - type: tag
        key: CustodianAction
        value: "stopped-idle-instance"
      - stop
Off-Hours Scheduling for Dev/Test Environments
Off-hours scheduling is one of the highest-impact Custodian policies you can deploy. Dev/test environments that run 24/7 but are only used during business hours sit idle for most of the week: on a 7 AM to 7 PM weekday schedule, an instance is needed for 60 of 168 hours, so about 64% of its compute cost is waste. Off-hours policies stop instances outside working hours and start them again in the morning.
# aws-offhours-dev-stop.yml
policies:
  - name: offhours-stop-dev-instances
    resource: aws.ec2
    description: |
      Stop dev/test instances at 7 PM ET on weekdays and all day
      on weekends. Instances must be tagged with
      OffHours: "off=(M-F,19);on=(M-F,7);tz=et"
    filters:
      - type: offhour
        default_tz: et
        offhour: 19
        tag: OffHours
    actions:
      - stop
  - name: onhours-start-dev-instances
    resource: aws.ec2
    description: |
      Start dev/test instances at 7 AM ET on weekdays.
    filters:
      - type: onhour
        default_tz: et
        onhour: 7
        tag: OffHours
    actions:
      - start
To enroll an instance in off-hours scheduling, just add this tag:
Key: OffHours
Value: off=(M-F,19);on=(M-F,7);tz=et
What I like about this approach is that it gives individual teams control: they opt into scheduling by tagging their resources, rather than having a blanket policy imposed on them. For a team with 10 dev instances averaging $150/month each in compute, the schedule above saves roughly $960/month (about 64% of $1,500). Attached EBS volumes keep billing while an instance is stopped, so storage is not part of that saving.
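The savings estimate is sensitive to the exact schedule. Here is a quick sketch of the arithmetic for the 7 AM to 7 PM weekday schedule, assuming the monthly figure is pure compute cost; `offhours_savings` is a hypothetical helper, not part of Custodian.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def offhours_savings(on_hour, off_hour, days_per_week, monthly_cost):
    """Return (fraction of the week stopped, estimated monthly saving)."""
    running_hours = (off_hour - on_hour) * days_per_week
    fraction_off = 1 - running_hours / HOURS_PER_WEEK
    return fraction_off, monthly_cost * fraction_off


# 10 instances at $150/month, running 07:00-19:00 Monday to Friday
fraction, saved = offhours_savings(7, 19, 5, 10 * 150)
print(f"{fraction:.1%} off, about ${saved:,.0f}/month saved")
# → 64.3% off, about $964/month saved
```

Widening the on-hours window (say 6 AM to 8 PM) or adding a weekend day shrinks the saving quickly, so it is worth running the numbers for each team's real schedule.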
Tag Enforcement with Mark-and-Reap
Untagged resources are basically ungovernable. If you can't attribute a resource to a team or project, you can't hold anyone accountable for its cost. Custodian's mark-and-reap pattern gives resource owners a grace period to add required tags before the resource gets stopped or terminated.
# aws-tag-enforcement.yml
policies:
  - name: tag-compliance-mark
    resource: aws.ec2
    description: |
      Mark running instances missing required tags. They have 4 days
      to add the tags before being stopped.
    filters:
      - "State.Name": running
      # Skip instances that are already marked so the deadline is not reset
      - "tag:custodian_cleanup": absent
      - or:
          - "tag:Owner": absent
          - "tag:Environment": absent
          - "tag:CostCenter": absent
    actions:
      - type: mark-for-op
        tag: custodian_cleanup
        op: stop
        days: 4
      - type: notify
        template: default.html
        subject: "[Action Required] EC2 instance missing required tags"
        to:
          - resource-owner
        transport:
          type: sqs
          queue: https://sqs.us-east-1.amazonaws.com/123456789012/custodian-mailer
  - name: tag-compliance-reap
    resource: aws.ec2
    description: |
      Stop instances that were marked 4+ days ago and still lack
      required tags.
    filters:
      - type: marked-for-op
        tag: custodian_cleanup
        op: stop
      # Re-check the tags so instances fixed during the grace period are spared
      - or:
          - "tag:Owner": absent
          - "tag:Environment": absent
          - "tag:CostCenter": absent
    actions:
      - stop
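The mark-and-reap handshake works through the tag value itself: the mark step writes the pending operation and its due date into the custodian_cleanup tag, and the reap step acts only once that date has arrived. The sketch below models that logic in plain Python. The message format is an assumption based on Custodian's default (`Resource does not meet policy: <op>@<date>`); check the tag values in your own account, since the message can be customized. `mark_value` and `is_due` are hypothetical helpers.

```python
from datetime import date, datetime, timedelta


def mark_value(op: str, days: int, today: date) -> str:
    """Build a tag value in the assumed default mark-for-op format."""
    action_date = today + timedelta(days=days)
    return f"Resource does not meet policy: {op}@{action_date:%Y/%m/%d}"


def is_due(tag_value: str, op: str, today: date) -> bool:
    """True when the tag names this op and its action date has arrived."""
    _, _, action = tag_value.rpartition(": ")
    marked_op, _, date_str = action.partition("@")
    if marked_op != op or not date_str:
        return False
    try:
        due = datetime.strptime(date_str, "%Y/%m/%d").date()
    except ValueError:
        return False  # unparseable date: never treat as due
    return due <= today


tag = mark_value("stop", 4, date(2026, 1, 10))
print(tag)                                     # → Resource does not meet policy: stop@2026/01/14
print(is_due(tag, "stop", date(2026, 1, 13)))  # → False
print(is_due(tag, "stop", date(2026, 1, 14)))  # → True
```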
Azure Cost Optimization Policies
Cloud Custodian's Azure support (c7n_azure) covers Virtual Machines, Managed Disks, SQL Databases, Storage Accounts, and dozens of other resource types. So, let's look at the most impactful cost policies for Azure.
Delete Unattached Managed Disks
# azure-delete-unattached-disks.yml
policies:
  - name: delete-unattached-azure-disks
    resource: azure.disk
    description: |
      Delete Azure Managed Disks that are not attached to any VM
      and were created more than 14 days ago. Creation age is a
      proxy: the filter does not measure how long the disk has
      been unattached.
    filters:
      - type: value
        key: properties.diskState
        value: Unattached
      - type: value
        key: properties.timeCreated
        value_type: age
        op: greater-than
        value: 14
    actions:
      - type: delete
Stop Idle Azure VMs
# azure-stop-idle-vms.yml
policies:
  - name: stop-idle-azure-vms
    resource: azure.vm
    description: |
      Deallocate Azure VMs with average CPU below 5% over
      the past 7 days in non-production resource groups.
    filters:
      - type: instance-view
        key: statuses[].code
        op: in
        value_type: swap
        value: PowerState/running
      - type: metric
        metric: Percentage CPU
        aggregation: average
        timeframe: 168  # hours (7 days)
        op: less-than
        threshold: 5
      - type: value
        key: resourceGroup
        op: regex
        value: "(?i).*(dev|staging|test|sandbox).*"
    actions:
      - type: tag
        tags:
          CustodianAction: stopped-idle-vm
      # "stop" deallocates the VM, which ends compute billing;
      # "poweroff" only shuts it down and compute charges continue.
      - type: stop
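The resource-group regex in this policy deserves a second look before it runs against real subscriptions, because a substring match is looser than it appears. The sketch below tests the pattern against a few hypothetical group names with Python's `re.match`, on the assumption that Custodian's regex operator behaves the same way (anchored at the start of the value), and contrasts it with a stricter, token-based alternative.

```python
import re

# Pattern copied from the policy above
LOOSE = re.compile(r"(?i).*(dev|staging|test|sandbox).*")
# Stricter alternative (a suggestion, not from the original policy):
# the keyword must be a whole token delimited by '-' or '_'.
STRICT = re.compile(r"(?i)(.*[-_])?(dev|staging|test|sandbox)([-_].*)?$")

# Hypothetical resource group names
groups = ["rg-dev-api", "Staging-Web", "rg-prod-core", "latest-prod-rg"]

for name in groups:
    print(f"{name:15} loose={bool(LOOSE.match(name))} strict={bool(STRICT.match(name))}")
# rg-dev-api      loose=True strict=True
# Staging-Web     loose=True strict=True
# rg-prod-core    loose=False strict=False
# latest-prod-rg  loose=True strict=False
```

The last row is the one to watch: "latest" contains "test", so the loose pattern would treat a production group as non-production.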
Auto-Tag Azure Resources with Creator
This one is really handy for cost attribution — it automatically tags VMs with whoever created them:
# azure-auto-tag-creator.yml
policies:
  - name: azure-auto-tag-creator
    resource: azure.vm
    description: |
      Automatically tag VMs with the email address of the user
      who created them. Enables cost attribution by creator.
    mode:
      type: azure-event-grid
      events:
        # resourceProvider names the full resource type, not just the namespace
        - resourceProvider: Microsoft.Compute/virtualMachines
          event: write
    actions:
      - type: auto-tag-user
        tag: CreatorEmail
        days: 1
GCP Cost Optimization Policies
The GCP provider (c7n_gcp) supports Compute Engine instances, persistent disks, snapshots, Cloud SQL, GKE clusters, and more. The syntax stays consistent across clouds, which is one of Custodian's biggest strengths.
Stop Old GCP Instances
# gcp-stop-old-instances.yml
policies:
  - name: stop-old-gcp-instances
    resource: gcp.instance
    description: |
      Stop Compute Engine instances older than 30 days that lack
      the 'keep' label. Targets forgotten dev/test VMs.
    filters:
      # Only consider instances that are currently running
      - type: value
        key: status
        value: RUNNING
      - type: value
        key: creationTimestamp
        value_type: age
        op: greater-than
        value: 30
      - not:
          - type: value
            key: labels.keep
            value: "true"
    actions:
      - type: stop
Delete Unused GCP Persistent Disks
# gcp-delete-unused-disks.yml
policies:
  - name: delete-unused-gcp-disks
    resource: gcp.disk
    description: |
      Delete persistent disks that are not attached to any instance
      and were created more than 14 days ago.
    filters:
      - type: value
        key: users
        value: absent
      - type: value
        key: creationTimestamp
        value_type: age
        op: greater-than
        value: 14
    actions:
      - type: delete
Running Policies: Validate, Dry-Run, Execute
Cloud Custodian gives you a safe workflow for testing policies before they touch anything. Always follow this sequence — seriously, don't skip the dry-run step.
Step 1: Validate the YAML Syntax
# Catch syntax errors and invalid resource/filter references
custodian validate aws-delete-unattached-ebs.yml
Step 2: Dry-Run to See Matched Resources
# Dry-run executes filters but skips actions
# Output goes to the specified directory
custodian run --dryrun -s /tmp/custodian-output aws-delete-unattached-ebs.yml
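# Preview the matched resources
python3 -m json.tool /tmp/custodian-output/delete-unattached-ebs-volumes/resources.json | head -20
# Count how many resources matched
python3 -c "import json,sys; print(len(json.load(sys.stdin)))" < /tmp/custodian-output/delete-unattached-ebs-volumes/resources.json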
Review the resources.json file carefully. It contains the full details of every resource that would be affected. Dry-running in a staging account first is always a good idea — it costs nothing and prevents costly mistakes.
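For anything beyond a handful of matches, a short script makes the review easier than scrolling through JSON. The sketch below summarizes a dry-run's resources.json; it assumes the EBS policy from earlier, so it reads the `Size` and `VolumeId` fields, and `summarize` is a hypothetical helper. The demo writes a throwaway file so the example is self-contained; in practice you would point it at your real output directory.

```python
import json
import tempfile
from pathlib import Path


def summarize(output_dir: str, policy_name: str) -> dict:
    """Count matched resources and total their size from a dry-run output."""
    path = Path(output_dir) / policy_name / "resources.json"
    resources = json.loads(path.read_text())
    return {
        "count": len(resources),
        "total_gb": sum(r.get("Size", 0) for r in resources),
        "ids": [r.get("VolumeId", "?") for r in resources[:5]],  # first few only
    }


# Demo against a temporary directory standing in for /tmp/custodian-output
POLICY = "delete-unattached-ebs-volumes"
with tempfile.TemporaryDirectory() as out:
    policy_dir = Path(out) / POLICY
    policy_dir.mkdir()
    (policy_dir / "resources.json").write_text(json.dumps([
        {"VolumeId": "vol-0aaa", "State": "available", "Size": 500},
        {"VolumeId": "vol-0bbb", "State": "available", "Size": 100},
    ]))
    summary = summarize(out, POLICY)

print(summary)
# → {'count': 2, 'total_gb': 600, 'ids': ['vol-0aaa', 'vol-0bbb']}
```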
Step 3: Execute the Policy
# Run for real — this will take actions on matched resources
custodian run -s /tmp/custodian-output aws-delete-unattached-ebs.yml
Deploying Policies as Serverless Functions
Running policies manually is fine for one-off cleanups, but real cost governance requires continuous enforcement. The good news? Custodian can deploy policies as serverless functions that run on a schedule or react to cloud events in real time.
AWS Lambda (Periodic Mode)
Add a mode block to deploy the policy as a Lambda function that runs on a cron schedule:
# aws-lambda-offhours.yml
policies:
  - name: lambda-offhours-stop-instances
    resource: aws.ec2
    mode:
      type: periodic
      schedule: "rate(1 hour)"
      role: arn:aws:iam::{account_id}:role/CloudCustodianRole
      timeout: 300
      memory: 256
    filters:
      - type: offhour
        default_tz: et
        offhour: 19
        tag: OffHours
    actions:
      - stop
When you run custodian run against a policy with a mode block, Custodian packages and deploys the Lambda function and creates the EventBridge (CloudWatch Events) rule that invokes it on schedule. The IAM role named in role must already exist with the permissions the policy needs; Custodian does not create it. The {account_id} placeholder gets replaced at runtime, which makes the policy reusable across accounts. Pretty slick.
AWS CloudTrail (Event-Driven Mode)
For policies that need to react instantly — like tagging newly created resources — use CloudTrail mode:
# aws-cloudtrail-tag-on-create.yml
policies:
  - name: auto-tag-ec2-creator
    resource: aws.ec2
    mode:
      type: cloudtrail
      role: arn:aws:iam::{account_id}:role/CloudCustodianRole
      events:
        - RunInstances
    actions:
      - type: auto-tag-user
        tag: CreatedBy
Azure Functions (Event Grid Mode)
# azure-event-grid-tag.yml
policies:
  - name: azure-tag-vm-creator
    resource: azure.vm
    mode:
      type: azure-event-grid
      events:
        # resourceProvider names the full resource type, not just the namespace
        - resourceProvider: Microsoft.Compute/virtualMachines
          event: write
    actions:
      - type: auto-tag-user
        tag: CreatedBy
        days: 1
GCP Cloud Functions (Audit Log Mode)
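# gcp-audit-log-policy.yml
policies:
  - name: gcp-tag-instance-creator
    resource: gcp.instance
    mode:
      type: gcp-audit
      methods:
        - compute.instances.insert
    actions:
      # GCP label values allow only lowercase letters, digits, underscores,
      # and dashes, so a raw email address is not a valid value. Confirm how
      # your c7n_gcp version expands and sanitizes {user} before relying on it.
      - type: set-labels
        labels:
          created-by: "{user}"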
Multi-Account Governance with c7n-org
Most organizations run dozens or even hundreds of cloud accounts. Cloud Custodian includes c7n-org, a multi-account orchestration tool that runs policies across all accounts in an AWS Organization, Azure subscriptions, or GCP projects from a single command.
Setting Up c7n-org for AWS
# Install c7n-org
pip install c7n_org
# Create an accounts config file (accounts.yml).
# The Custodian repository ships a helper script that generates it from
# AWS Organizations (tools/c7n_org/scripts/orgaccounts.py):
python orgaccounts.py -f accounts.yml
The generated accounts.yml lists all accounts with their IDs and the IAM role to assume:
# accounts.yml
accounts:
  - name: production
    account_id: "111111111111"
    role: arn:aws:iam::111111111111:role/CloudCustodianRole
    regions:
      - us-east-1
      - eu-west-1
  - name: development
    account_id: "222222222222"
    role: arn:aws:iam::222222222222:role/CloudCustodianRole
    regions:
      - us-east-1
  - name: staging
    account_id: "333333333333"
    role: arn:aws:iam::333333333333:role/CloudCustodianRole
    regions:
      - us-east-1
Running Policies Across All Accounts
# Dry-run a policy across all accounts
c7n-org run -c accounts.yml -s /tmp/org-output \
-u aws-delete-unattached-ebs.yml --dryrun
# Execute the policy across all accounts
c7n-org run -c accounts.yml -s /tmp/org-output \
-u aws-delete-unattached-ebs.yml
# Review results per account
ls /tmp/org-output/
# Output: production/ development/ staging/
The output is organized by account name, so you can review which resources were matched and acted on in each account. For enterprise deployments, pipe the output to an S3 bucket for centralized reporting and integrate with your FinOps dashboards.
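A small script can roll the per-account output up into one view. The sketch below assumes c7n-org writes each run to `<output>/<account>/<region>/<policy>/resources.json`; verify that layout against your own output directory before relying on it. `tally_by_account` is a hypothetical helper, and the demo builds a throwaway directory tree so the example runs on its own.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def tally_by_account(output_dir: str) -> dict:
    """Count matched resources per account across all regions and policies."""
    root = Path(output_dir)
    counts = Counter()
    for path in root.rglob("resources.json"):
        account = path.relative_to(root).parts[0]  # first path component
        counts[account] += len(json.loads(path.read_text()))
    return dict(counts)


# Demo with a temporary directory that mimics the assumed layout
POLICY = "delete-unattached-ebs-volumes"
fake_runs = {
    ("production", "us-east-1"): [{"VolumeId": "vol-0aaa"}],
    ("production", "eu-west-1"): [],
    ("development", "us-east-1"): [{"VolumeId": "vol-0bbb"}, {"VolumeId": "vol-0ccc"}],
}
with tempfile.TemporaryDirectory() as out:
    for (account, region), resources in fake_runs.items():
        policy_dir = Path(out) / account / region / POLICY
        policy_dir.mkdir(parents=True)
        (policy_dir / "resources.json").write_text(json.dumps(resources))
    totals = tally_by_account(out)

print(sorted(totals.items()))  # → [('development', 2), ('production', 1)]
```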
CI/CD Pipeline Integration
Treating Cloud Custodian policies as code means they should go through the same review and deployment pipeline as your infrastructure code. Here's a GitHub Actions workflow that validates, dry-runs, and deploys Custodian policies on every commit to the main branch.
# .github/workflows/custodian-deploy.yml
name: Deploy Cloud Custodian Policies
on:
  push:
    branches: [main]
    paths: ['policies/**']
  pull_request:
    branches: [main]
    paths: ['policies/**']
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install Cloud Custodian
        run: pip install c7n c7n_azure c7n_gcp
      - name: Validate all policies
        run: |
          for f in policies/*.yml; do
            echo "Validating $f..."
            custodian validate "$f"
          done
  dryrun:
    needs: validate
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.CUSTODIAN_ROLE_ARN }}
          aws-region: us-east-1
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install Cloud Custodian
        run: pip install c7n
      # Dry-runs every AWS policy in the repo, not only the changed files
      - name: Dry-run AWS policies
        run: |
          for f in policies/aws-*.yml; do
            echo "Dry-running $f..."
            custodian run --dryrun -s /tmp/output "$f"
          done
      - name: Report matched resources
        run: |
          echo "## Custodian Dry-Run Results" >> $GITHUB_STEP_SUMMARY
          for dir in /tmp/output/*/; do
            policy=$(basename "$dir")
            count=$(cat "$dir/resources.json" | python3 -c \
              "import sys,json; print(len(json.load(sys.stdin)))")
            echo "- **$policy**: $count resources matched" >> $GITHUB_STEP_SUMMARY
          done
  deploy:
    needs: validate
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.CUSTODIAN_ROLE_ARN }}
          aws-region: us-east-1
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install Cloud Custodian
        run: pip install c7n c7n_org
      - name: Deploy policies to all accounts
        run: |
          c7n-org run -c accounts.yml \
            -s /tmp/deploy-output \
            -u policies/aws-lambda-offhours.yml
On pull requests, the pipeline validates and dry-runs policies, writing the number of matched resources to the workflow run's job summary. Engineers can review the blast radius before policies go live. On merge to main, the deploy job pushes Lambda-mode policies across all accounts.
Cloud Custodian vs Native Cloud Tools
Every cloud provider has its own governance tools. So when should you reach for Custodian instead of — or alongside — native options?
| Capability | Cloud Custodian | AWS Config Rules | Azure Policy | GCP Org Policies |
|---|---|---|---|---|
| Multi-cloud support | AWS, Azure, GCP, K8s, OCI | AWS only | Azure only | GCP only |
| Policy language | YAML (declarative) | JSON / Lambda (custom) | JSON (ARM) | JSON (constraints) |
| Automated remediation | Built-in actions | Via SSM Automation | DeployIfNotExists | Limited |
| Pre-deployment blocking | No (post-deploy) | No (post-deploy) | Yes (deny effect) | Yes (constraints) |
| Cost | Free (open source) | $0.001/rule evaluation (first tier), plus $0.003/configuration item recorded | Free (built-in) | Free (built-in) |
| Utilization metrics | CloudWatch / Monitor | Not directly | Not directly | Not directly |
| Off-hours scheduling | Built-in filter | Custom Lambda | Azure Automation | Cloud Scheduler |
The pragmatic approach is to use them together. Use Azure Policy or GCP Organization Policies to block non-compliant deployments at creation time. Use Cloud Custodian for everything that happens after — utilization monitoring, off-hours scheduling, garbage collection, and cross-cloud governance under a single policy framework.
For AWS specifically, Cloud Custodian can deploy policies as AWS Config Rules (using mode: type: config-rule), giving you the best of both worlds — Custodian's expressive YAML syntax with Config's native change-tracking and compliance dashboard.
Real-World Cost Savings Benchmarks
So what kind of savings can you actually expect? Here's what organizations typically see with a well-implemented Custodian deployment:
| Policy Category | Typical Monthly Savings | Implementation Effort |
|---|---|---|
| Off-hours scheduling (dev/test) | 50–70% of non-prod compute | Low (1–2 hours) |
| Unattached volume cleanup | $500–$5,000 | Low (30 minutes) |
| Orphaned snapshot cleanup | $1,000–$10,000 | Low (30 minutes) |
| Idle instance detection | $2,000–$20,000 | Medium (1–2 hours) |
| Tag enforcement (mark-and-reap) | Enables all other savings | Medium (2–4 hours) |
| Multi-account deployment | Multiplies all savings | High (1–2 days initial) |
For a mid-sized organization spending $200,000/month across AWS, Azure, and GCP, a comprehensive Custodian deployment targeting these areas can plausibly reduce the bill by $30,000–$60,000/month, a 15–30% reduction with minimal ongoing maintenance. Actual results depend on how much idle non-production capacity you start with, so measure a baseline from dry-run output before committing to a target.
FAQ
Is Cloud Custodian free to use?
Yes. Cloud Custodian is fully open-source under the Apache 2.0 license and is a CNCF Incubating project. There are no license fees or per-resource charges. You only pay for the cloud resources Custodian uses to run (Lambda invocations, Cloud Functions executions, etc.), which typically costs under $10/month even for large deployments. Stacklet, the company behind Custodian, offers a commercial SaaS platform for enterprises that want a managed UI, but the core engine is completely free.
Can Cloud Custodian accidentally delete production resources?
This is the most common concern (and a fair one), but Custodian has multiple safeguards. First, every policy should be validated (custodian validate) and dry-run (custodian run --dryrun) before live execution. The dry-run shows exactly which resources match without taking any action. Second, well-written cost policies should include filters that explicitly exclude production — filtering by tags like Environment: production, or by resource group naming conventions. Third, the mark-and-reap pattern adds a grace period (typically 3–7 days) before destructive actions, giving teams time to add missing tags or flag exceptions.
How does Cloud Custodian compare to Terraform and Infracost?
They solve different halves of the same problem. Terraform and Infracost focus on pre-deployment cost governance — estimating costs before resources are provisioned and blocking expensive changes in pull requests. Cloud Custodian focuses on post-deployment governance — monitoring running resources, cleaning up waste, and enforcing policies continuously. A mature FinOps practice uses both: Infracost in the CI/CD pipeline to prevent cost surprises, and Custodian in production to catch waste that accumulates over time.
Does Cloud Custodian work with Kubernetes?
Yes. Cloud Custodian has a Kubernetes provider that can enforce policies on pods, deployments, services, namespaces, and other Kubernetes resources. You can write policies to identify pods without resource limits, namespaces without cost labels, or deployments with excessive replica counts. That said, for Kubernetes-specific cost optimization (node right-sizing, pod autoscaling), dedicated tools like Karpenter, KEDA, or Kubecost may be more appropriate. Custodian really excels at the infrastructure layer beneath Kubernetes.
How long does it take to set up Cloud Custodian for a multi-account organization?
A basic single-account deployment with 3–5 cost policies takes about 2–4 hours including testing. For multi-account deployments using c7n-org, plan for 1–2 days of initial setup: creating cross-account IAM roles, configuring the accounts file, and setting up the CI/CD pipeline. After the initial setup, adding new policies is fast — most take 15–30 minutes to write, validate, and deploy. Most organizations start with off-hours scheduling and unattached volume cleanup (the highest ROI, lowest risk policies) and gradually expand coverage from there.