Your development, staging, and test environments are quietly bleeding money. While your team's asleep, those non-production workloads are running at full tilt — racking up charges for compute, databases, and networking that literally nobody is using. Industry data suggests organizations waste around 32% of their cloud spend on average, and always-on dev/test environments are one of the biggest culprits.
Here's the good news: it's a very fixable problem.
Automated scheduling, right-sizing, and scale-to-zero policies can slash non-production cloud costs by 60–70% across AWS, Azure, and GCP. I'm going to walk you through exactly how to implement these savings with working Terraform configurations, CLI commands, and Kubernetes manifests you can deploy today.
The True Cost of Always-On Dev/Test Environments
A typical development environment runs around the clock: 168 hours per week. If your engineering team works ten hours a day, five days a week, the environment is actually in use for roughly 50 of those hours, just 30% of the time. You're paying full price for the other 70%.
Let that sink in for a second.
Here's what that looks like in real dollar terms across the three major clouds:
- AWS: A modest dev stack — two m6i.large EC2 instances, one db.t3.medium RDS instance, and an Application Load Balancer — costs about $350/month running 24/7. With scheduling, that drops to around $105/month.
- Azure: Two Standard_D2s_v5 VMs plus an Azure SQL Basic tier run about $320/month. With auto-shutdown policies, you're paying closer to $96/month.
- GCP: Two e2-standard-2 VMs and a Cloud SQL db-f1-micro cost roughly $290/month. Scheduled stop/start brings that to approximately $87/month.
Now multiply those numbers by the number of dev teams, feature branch environments, and QA stages in your organization. The waste adds up to thousands — sometimes tens of thousands — of dollars per month. I've seen mid-sized companies discover $15K+ in monthly savings just from this one change.
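The arithmetic behind those numbers is worth making explicit. Here's a quick back-of-the-envelope sketch using the $350/month AWS example from above (your own rates and working hours will differ):

```python
# Back-of-the-envelope savings math for an always-on dev environment.
HOURS_PER_WEEK = 168  # 24 * 7

def scheduled_savings(monthly_cost: float, hours_used_per_week: float) -> dict:
    """Estimate cost under a schedule that only runs during working hours."""
    utilization = hours_used_per_week / HOURS_PER_WEEK
    scheduled_cost = monthly_cost * utilization
    return {
        "utilization_pct": round(utilization * 100, 1),
        "scheduled_cost": round(scheduled_cost, 2),
        "monthly_savings": round(monthly_cost - scheduled_cost, 2),
    }

# The $350/month AWS dev stack, used 10 h/day x 5 days = 50 h/week
result = scheduled_savings(350, 50)
print(result)  # utilization ~29.8%, cost ~$104.17, savings ~$245.83
```

Run that across a fleet of environments and the "multiply by teams and stages" point above becomes very concrete.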
AWS: Automated Instance Scheduling with EventBridge and Lambda
AWS gives you several ways to schedule non-production resources. The most flexible (and honestly, the most production-ready) approach uses Amazon EventBridge rules to trigger a Lambda function that starts and stops tagged instances on a schedule.
Architecture Overview
The setup is pretty straightforward: EventBridge fires two cron-based rules — one to start instances in the morning, another to stop them in the evening. Each rule invokes a Lambda function that filters EC2 and RDS instances by a specific tag (like AutoSchedule = true) and performs the start or stop action.
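Those two cron rules define an on-window of 08:00–18:00 UTC, Monday through Friday. For auditing (say, flagging instances found running outside the window), the window reduces to a small predicate. This is a sketch for monitoring purposes, not part of the AWS setup itself:

```python
from datetime import datetime, timezone

def in_dev_window(dt: datetime) -> bool:
    """True if dt (UTC) falls inside the Mon-Fri 08:00-18:00 on-window
    implied by the start/stop EventBridge rules."""
    return dt.weekday() < 5 and 8 <= dt.hour < 18

# Tuesday 09:30 UTC is inside the window; Saturday 12:00 UTC is not
print(in_dev_window(datetime(2025, 6, 3, 9, 30, tzinfo=timezone.utc)))  # True
print(in_dev_window(datetime(2025, 6, 7, 12, 0, tzinfo=timezone.utc)))  # False
```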
Terraform Configuration
Here's a complete Terraform configuration that deploys the scheduling infrastructure. It references aws_lambda_function.scheduler, which packages the Python handler shown in the next section; define that resource alongside these.
## --- IAM for the Start/Stop Lambda ---
resource "aws_iam_role" "scheduler_lambda_role" {
  name = "dev-env-scheduler-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "scheduler_lambda_policy" {
  name = "dev-env-scheduler-policy"
  role = aws_iam_role.scheduler_lambda_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Start/stop is restricted to resources that opted in via the tag
        Effect = "Allow"
        Action = [
          "ec2:StartInstances", "ec2:StopInstances",
          "rds:StartDBInstance", "rds:StopDBInstance",
        ]
        Resource = "*"
        Condition = {
          StringEquals = {
            "aws:ResourceTag/AutoSchedule" = "true"
          }
        }
      },
      {
        # Describe/list calls don't support resource-level tag conditions
        Effect = "Allow"
        Action = [
          "ec2:DescribeInstances",
          "rds:DescribeDBInstances",
          "rds:ListTagsForResource",
        ]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = "arn:aws:logs:*:*:*"
      }
    ]
  })
}
## --- EventBridge Rules ---
resource "aws_cloudwatch_event_rule" "start_dev" {
  name                = "start-dev-environments"
  description         = "Start dev instances at 8 AM UTC Monday-Friday"
  schedule_expression = "cron(0 8 ? * MON-FRI *)"
}

resource "aws_cloudwatch_event_rule" "stop_dev" {
  name                = "stop-dev-environments"
  description         = "Stop dev instances at 6 PM UTC Monday-Friday"
  schedule_expression = "cron(0 18 ? * MON-FRI *)"
}

resource "aws_cloudwatch_event_target" "start_target" {
  rule      = aws_cloudwatch_event_rule.start_dev.name
  target_id = "start-dev-lambda"
  arn       = aws_lambda_function.scheduler.arn
  input     = jsonencode({ action = "start" })
}

resource "aws_cloudwatch_event_target" "stop_target" {
  rule      = aws_cloudwatch_event_rule.stop_dev.name
  target_id = "stop-dev-lambda"
  arn       = aws_lambda_function.scheduler.arn
  input     = jsonencode({ action = "stop" })
}

resource "aws_lambda_permission" "allow_start" {
  statement_id  = "AllowStartExecution"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.start_dev.arn
}

resource "aws_lambda_permission" "allow_stop" {
  statement_id  = "AllowStopExecution"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.stop_dev.arn
}
Lambda Function (Python)
The Lambda function handles both EC2 and RDS resources. Nothing fancy here — it just filters by tag and performs the action:
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ec2 = boto3.client("ec2")
rds = boto3.client("rds")

def lambda_handler(event, context):
    action = event.get("action", "stop")

    # Handle EC2 instances
    filters = [
        {"Name": "tag:AutoSchedule", "Values": ["true"]},
        {"Name": "instance-state-name",
         "Values": ["running"] if action == "stop" else ["stopped"]},
    ]
    response = ec2.describe_instances(Filters=filters)
    instance_ids = [
        i["InstanceId"]
        for r in response["Reservations"]
        for i in r["Instances"]
    ]
    if instance_ids:
        if action == "stop":
            ec2.stop_instances(InstanceIds=instance_ids)
            logger.info(f"Stopped EC2: {instance_ids}")
        else:
            ec2.start_instances(InstanceIds=instance_ids)
            logger.info(f"Started EC2: {instance_ids}")

    # Handle RDS instances
    rds_response = rds.describe_db_instances()
    for db in rds_response["DBInstances"]:
        tags = rds.list_tags_for_resource(
            ResourceName=db["DBInstanceArn"]
        )["TagList"]
        tag_dict = {t["Key"]: t["Value"] for t in tags}
        if tag_dict.get("AutoSchedule") != "true":
            continue
        if action == "stop" and db["DBInstanceStatus"] == "available":
            rds.stop_db_instance(DBInstanceIdentifier=db["DBInstanceIdentifier"])
            logger.info(f"Stopped RDS: {db['DBInstanceIdentifier']}")
        elif action == "start" and db["DBInstanceStatus"] == "stopped":
            rds.start_db_instance(DBInstanceIdentifier=db["DBInstanceIdentifier"])
            logger.info(f"Started RDS: {db['DBInstanceIdentifier']}")

    return {"statusCode": 200, "body": f"{action} completed"}
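Before wiring this into EventBridge, it's worth sanity-checking the EC2 parsing step against a canned response. The dict below mimics the shape of boto3's describe_instances output; no AWS calls are involved, so you can run this anywhere:

```python
def extract_instance_ids(response: dict) -> list:
    """Flatten a DescribeInstances-shaped response into instance IDs,
    mirroring the list comprehension in the Lambda handler."""
    return [
        i["InstanceId"]
        for r in response["Reservations"]
        for i in r["Instances"]
    ]

# Canned response shaped like boto3's ec2.describe_instances() output
fake_response = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-0abc"}, {"InstanceId": "i-0def"}]},
        {"Instances": [{"InstanceId": "i-0123"}]},
    ]
}
print(extract_instance_ids(fake_response))  # ['i-0abc', 'i-0def', 'i-0123']
```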
Quick CLI Alternative
If you just need something working before the full Terraform setup is ready, you can use the AWS CLI directly with a cron job. It's not as elegant, but it gets the job done:
# Stop all tagged dev instances (add to crontab for 6 PM)
aws ec2 describe-instances \
--filters "Name=tag:AutoSchedule,Values=true" \
"Name=instance-state-name,Values=running" \
--query "Reservations[].Instances[].InstanceId" \
--output text | xargs -r aws ec2 stop-instances --instance-ids
# Start all tagged dev instances (add to crontab for 8 AM)
aws ec2 describe-instances \
--filters "Name=tag:AutoSchedule,Values=true" \
"Name=instance-state-name,Values=stopped" \
--query "Reservations[].Instances[].InstanceId" \
--output text | xargs -r aws ec2 start-instances --instance-ids
Azure: Auto-Shutdown Policies and DevTest Labs
Azure actually makes this easier than the other clouds. It has first-class auto-shutdown support built right into the VM resource, so you don't need to wire up Lambda functions or Cloud Functions. You've got two main approaches: per-VM auto-shutdown schedules and Azure DevTest Labs for centralized management.
Per-VM Auto-Shutdown with Terraform
The azurerm_dev_test_global_vm_shutdown_schedule resource (bit of a mouthful, I know) lets you attach a shutdown schedule to any Azure VM — not just those in DevTest Labs:
resource "azurerm_dev_test_global_vm_shutdown_schedule" "dev_shutdown" {
  virtual_machine_id = azurerm_linux_virtual_machine.dev_vm.id
  location           = azurerm_resource_group.dev_rg.location
  enabled            = true

  daily_recurrence_time = "1800" # 6:00 PM
  timezone              = "UTC"

  notification_settings {
    enabled         = true
    time_in_minutes = 15
    email           = "[email protected]"
  }
}

# Apply to multiple VMs using for_each
resource "azurerm_dev_test_global_vm_shutdown_schedule" "all_dev_vms" {
  for_each = { for vm in azurerm_linux_virtual_machine.dev_vms : vm.name => vm }

  virtual_machine_id = each.value.id
  location           = each.value.location
  enabled            = true

  daily_recurrence_time = "1800"
  timezone              = "UTC"

  notification_settings {
    enabled = false
  }
}
This is important: in Azure, a stopped VM is not the same as a deallocated one. A VM shut down from inside the guest OS (or with az vm stop) keeps its compute allocation, and you keep paying for it. The auto-shutdown schedule above does deallocate, but for VMs stopped any other way you need to deallocate explicitly, via Azure Automation runbooks or the CLI:
# Deallocate (not just stop) to avoid compute charges
az vm deallocate --resource-group dev-rg --name dev-vm-01
# Start the VM back up
az vm start --resource-group dev-rg --name dev-vm-01
# Bulk deallocate all VMs with a specific tag
az vm list --query "[?tags.Environment=='dev'].[resourceGroup, name]" -o tsv | \
while read rg name; do
az vm deallocate --resource-group "$rg" --name "$name" --no-wait
done
Azure Automation Runbook for Full Start/Stop
For a complete start-and-stop cycle (including morning startup), you'll want an Azure Automation account with two runbooks triggered by schedules. Microsoft also provides a prebuilt Start/Stop VMs v2 solution (built on Azure Functions) you can deploy, but honestly, a custom runbook gives you way more control over tagging and notification logic.
Azure DevTest Labs for Centralized Control
If you're managing multiple dev/test environments, Azure DevTest Labs is worth a look. It provides a centralized policy layer where lab owners can set auto-shutdown and auto-start times, cap the number of VMs per user, restrict allowed VM sizes, and set cost thresholds with alerts. To put it in perspective: a dual-core VM with 4 GB RAM costs about $100/month running 24/7. Using DevTest Labs to limit usage to 50 hours/week drops that to under $30/month.
GCP: Cloud Scheduler with Cloud Functions
GCP doesn't have a built-in per-VM auto-shutdown feature like Azure does. Instead, you build a scheduling pipeline using Cloud Scheduler, Pub/Sub, and Cloud Functions. It's a bit more setup, but also more flexible: you get label-based targeting and custom logic out of the box.
Architecture
Cloud Scheduler publishes a message to a Pub/Sub topic on a cron schedule. A Cloud Function subscribes to that topic, reads the desired action (start or stop) from the message, and uses the Compute Engine API to act on VMs matching specific labels. Simple enough.
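The message contract between the two ends is easy to get wrong, so here it is in isolation: what Terraform's base64encode(jsonencode({...})) produces for the scheduler job, and how the Cloud Function decodes it. Pure Python, no GCP calls:

```python
import base64
import json

# What Cloud Scheduler publishes (Terraform: base64encode(jsonencode({...})))
message = base64.b64encode(json.dumps(
    {"action": "stop", "label": "env", "value": "dev"}
).encode("utf-8"))

# What the Cloud Function receives: event["data"] is the base64 payload
payload = json.loads(base64.b64decode(message).decode("utf-8"))
print(payload["action"], payload["label"], payload["value"])  # stop env dev
```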
Terraform Configuration
resource "google_pubsub_topic" "vm_scheduler" {
  name = "vm-scheduler-topic"
}

resource "google_cloud_scheduler_job" "stop_dev_vms" {
  name        = "stop-dev-vms"
  description = "Stop dev VMs at 6 PM UTC weekdays"
  schedule    = "0 18 * * 1-5"
  time_zone   = "UTC"

  pubsub_target {
    topic_name = google_pubsub_topic.vm_scheduler.id
    data = base64encode(jsonencode({
      action = "stop"
      label  = "env"
      value  = "dev"
    }))
  }
}

resource "google_cloud_scheduler_job" "start_dev_vms" {
  name        = "start-dev-vms"
  description = "Start dev VMs at 8 AM UTC weekdays"
  schedule    = "0 8 * * 1-5"
  time_zone   = "UTC"

  pubsub_target {
    topic_name = google_pubsub_topic.vm_scheduler.id
    data = base64encode(jsonencode({
      action = "start"
      label  = "env"
      value  = "dev"
    }))
  }
}
Cloud Function (Python)
import base64
import json

from googleapiclient import discovery

compute = discovery.build("compute", "v1")

def scheduler_handler(event, context):
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    action = payload["action"]
    label_key = payload["label"]
    label_value = payload["value"]
    project = "your-project-id"  # replace with your project ID

    zones = compute.zones().list(project=project).execute()
    for zone in zones.get("items", []):
        zone_name = zone["name"]
        instances = compute.instances().list(
            project=project, zone=zone_name,
            filter=f"labels.{label_key}={label_value}",
        ).execute()
        for instance in instances.get("items", []):
            name = instance["name"]
            status = instance["status"]
            if action == "stop" and status == "RUNNING":
                compute.instances().stop(
                    project=project, zone=zone_name, instance=name
                ).execute()
                print(f"Stopped {name} in {zone_name}")
            elif action == "start" and status == "TERMINATED":
                compute.instances().start(
                    project=project, zone=zone_name, instance=name
                ).execute()
                print(f"Started {name} in {zone_name}")
GCP Native Alternative: Instance Schedules
For simpler use cases, GCP offers google_compute_resource_policy with an instance schedule — no Cloud Functions needed. This is probably where you should start if you just want basic stop/start on a timer:
resource "google_compute_resource_policy" "dev_schedule" {
  name   = "dev-vm-schedule"
  region = "us-central1"

  instance_schedule_policy {
    vm_start_schedule {
      schedule = "0 8 * * 1-5" # 8 AM UTC Mon-Fri
    }
    vm_stop_schedule {
      schedule = "0 18 * * 1-5" # 6 PM UTC Mon-Fri
    }
    time_zone = "UTC"
  }
}

# Attach the policy to a VM
resource "google_compute_instance" "dev_vm" {
  name              = "dev-instance-01"
  machine_type      = "e2-standard-2"
  zone              = "us-central1-a"
  resource_policies = [google_compute_resource_policy.dev_schedule.id]

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }

  labels = {
    env = "dev"
  }
}
Kubernetes: Scale Dev Namespaces to Zero
If your dev/test workloads run on Kubernetes (EKS, AKS, or GKE), you've got an additional lever: scaling deployments and node pools to zero during off-hours. This approach can save 60–73% on dev cluster costs, which is significant when you consider how expensive managed K8s clusters can get.
CronJob-Based Deployment Scaling
The simplest cloud-agnostic approach uses Kubernetes CronJobs with a service account that has permission to scale deployments. I like this approach because it works the same way regardless of which cloud you're on:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: namespace-scaler
  namespace: dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
  namespace: dev
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "patch", "update"]
  - apiGroups: ["apps"]
    resources: ["deployments/scale", "statefulsets/scale"]
    verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-scaler-binding
  namespace: dev
subjects:
  - kind: ServiceAccount
    name: namespace-scaler
    namespace: dev
roleRef:
  kind: Role
  name: deployment-scaler
  apiGroup: rbac.authorization.k8s.io
---
# Scale DOWN at 6 PM UTC weekdays
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-dev
  namespace: dev
spec:
  schedule: "0 18 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: namespace-scaler
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Save current replica counts as annotations before zeroing
                  for deploy in $(kubectl get deployments -n dev -o name); do
                    replicas=$(kubectl get $deploy -n dev -o jsonpath='{.spec.replicas}')
                    kubectl annotate $deploy -n dev scheduler/original-replicas=$replicas --overwrite
                    kubectl scale $deploy -n dev --replicas=0
                  done
          restartPolicy: OnFailure
---
# Scale UP at 8 AM UTC weekdays
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-dev
  namespace: dev
spec:
  schedule: "0 8 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: namespace-scaler
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Restore the replica counts saved at scale-down
                  for deploy in $(kubectl get deployments -n dev -o name); do
                    replicas=$(kubectl get $deploy -n dev -o jsonpath='{.metadata.annotations.scheduler/original-replicas}')
                    if [ -n "$replicas" ] && [ "$replicas" != "null" ]; then
                      kubectl scale $deploy -n dev --replicas=$replicas
                    else
                      kubectl scale $deploy -n dev --replicas=1
                    fi
                  done
          restartPolicy: OnFailure
KEDA for Event-Driven Scale-to-Zero
KEDA (Kubernetes Event-Driven Autoscaler) takes a more sophisticated approach. It can scale deployments to zero based on external metrics — things like HTTP request count, queue depth, or cron schedules. If you're already running KEDA in your cluster, this is probably the cleanest option:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-api-scaler
  namespace: dev
spec:
  scaleTargetRef:
    name: dev-api
  minReplicaCount: 0   # Scale to zero when idle
  maxReplicaCount: 5
  cooldownPeriod: 300  # Wait 5 min before scaling to zero
  triggers:
    - type: cron
      metadata:
        timezone: UTC
        start: "0 8 * * 1-5"   # Scale up at 8 AM
        end: "0 18 * * 1-5"    # Scale down at 6 PM
        desiredReplicas: "2"
Node Pool Scheduling
Here's something people often miss: scaling pods to zero only saves money if the underlying nodes also scale down. Make sure your cluster autoscaler is allowed to remove every node in the dev pool (minimum size zero). On GKE you can additionally wire Cloud Scheduler to resize pools on a timer, a pattern Google documents as a reference architecture; on EKS you can force a managed node group down directly:

# GKE: let the autoscaler drain the dev pool to zero when it's empty
gcloud container clusters update my-cluster \
  --node-pool dev-pool \
  --enable-autoscaling \
  --min-nodes 0 --max-nodes 5

# EKS: force a managed node group to zero
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name dev-nodegroup \
  --scaling-config minSize=0,maxSize=5,desiredSize=0
Tagging Strategy for Environment Identification
None of this works without consistent tagging. Seriously — this is the part that makes or breaks the whole approach. Every non-production resource needs to be tagged so your automation can identify and target it correctly.
Minimum Required Tags
- Environment: dev, staging, qa, or prod — identifies the environment tier
- AutoSchedule: true or false — explicit opt-in for scheduling automation
- Team: the owning team — useful for chargeback and notification routing
- ScheduleOverride: optional — set to always-on for environments that genuinely need to stay running (like overnight integration test suites)
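A small audit script helps enforce this scheme before the automation ever runs against real resources. The sketch below checks one resource's tag dict; in practice you'd feed it from your cloud inventory (the tag values shown are illustrative):

```python
REQUIRED_TAGS = {"Environment", "AutoSchedule", "Team"}
VALID_ENVIRONMENTS = {"dev", "staging", "qa", "prod"}

def tag_violations(tags: dict) -> list:
    """Return a list of human-readable problems with a resource's tags."""
    problems = [f"missing tag: {t}" for t in sorted(REQUIRED_TAGS - tags.keys())]
    env = tags.get("Environment")
    if env is not None and env not in VALID_ENVIRONMENTS:
        problems.append(f"invalid Environment: {env}")
    if tags.get("AutoSchedule") not in ("true", "false", None):
        problems.append("AutoSchedule must be 'true' or 'false'")
    return problems

# A compliant resource and a sloppy one
good = {"Environment": "dev", "AutoSchedule": "true", "Team": "platform"}
bad = {"Environment": "development"}
print(tag_violations(good))  # []
print(tag_violations(bad))   # three problems reported
```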
Enforce Tags with Terraform
# Require tags on all EC2 instances
variable "required_tags" {
  default = {
    Environment  = "dev"
    AutoSchedule = "true"
    Team         = "engineering"
  }
}

resource "aws_instance" "dev_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  tags = merge(var.required_tags, {
    Name = "dev-api-server"
  })

  lifecycle {
    precondition {
      condition     = contains(["dev", "staging", "qa"], var.required_tags["Environment"])
      error_message = "Non-production instances must have a valid Environment tag."
    }
  }
}
Monitoring and Measuring Your Savings
After you've implemented scheduling, you'll want to track your actual savings. This validates the ROI and — just as importantly — catches any resources that somehow slip through the cracks.
AWS Cost Explorer Query
# Compare non-prod costs before and after scheduling
aws ce get-cost-and-usage \
--time-period Start=2026-02-01,End=2026-03-01 \
--granularity MONTHLY \
--metrics "UnblendedCost" \
--filter '{
"Tags": {
"Key": "Environment",
"Values": ["dev", "staging", "qa"]
}
}' \
--group-by Type=TAG,Key=Environment
Azure Cost Analysis
# Query non-prod costs by tag (requires the costmanagement CLI extension)
az costmanagement query --type ActualCost \
  --scope "subscriptions/<subscription-id>" \
  --timeframe MonthToDate \
  --dataset-filter "{\"tags\": {\"Environment\": [\"dev\", \"staging\", \"qa\"]}}" \
  --dataset-grouping name=ResourceType type=Dimension
Key Metrics to Track
- Non-production spend ratio: target less than 20% of total cloud spend going to non-prod environments
- Scheduling compliance: percentage of non-prod resources with AutoSchedule = true — aim for 95%+
- Off-hours utilization: any non-prod compute usage outside business hours should be near zero
- Month-over-month savings: track the actual dollar reduction after implementing scheduling
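The first two metrics fall out of your billing export with a few lines of aggregation. A hedged sketch over illustrative records (the record shape is made up; map it to your actual cost-export schema):

```python
# Illustrative billing records: (environment, monthly_cost, has_autoschedule_tag)
records = [
    ("prod",    12000.0, False),
    ("dev",      1400.0, True),
    ("staging",   900.0, True),
    ("qa",        700.0, False),
]

NON_PROD = {"dev", "staging", "qa"}

total = sum(cost for _, cost, _ in records)
non_prod = [(env, cost, tagged) for env, cost, tagged in records if env in NON_PROD]
non_prod_spend = sum(cost for _, cost, _ in non_prod)

# Non-production spend ratio (target: under 20% of total spend)
ratio = non_prod_spend / total

# Scheduling compliance: share of non-prod resources opted in (target: 95%+)
compliance = sum(1 for _, _, tagged in non_prod if tagged) / len(non_prod)

print(f"non-prod ratio: {ratio:.1%}, compliance: {compliance:.1%}")
```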
Common Pitfalls and How to Avoid Them
I've seen teams trip over these issues more times than I can count. Save yourself the headache:
- Database state loss: RDS instances stopped for more than 7 days get automatically restarted by AWS. It's one of those "features" that catches everyone off guard. Set up a Lambda to re-stop them, or consider Aurora Serverless v2 which scales to near-zero automatically.
- Elastic IPs still billing: Elastic IPs attached to stopped EC2 instances still incur charges. Release unused EIPs or only associate them with running instances.
- Azure stopped vs. deallocated: Stopping a VM in Azure via the OS doesn't deallocate it — you're still paying for compute. Always use az vm deallocate or the portal's Stop button to fully release the resource.
- Persistent disk costs: Disks attached to stopped VMs still incur storage charges on all three clouds. For environments that are off for extended periods (weekends, holidays), consider snapshotting and deleting the disks.
- DNS and service discovery: If your dev services register with service discovery (Consul, Cloud Map), make sure they deregister on shutdown. Otherwise you'll get routing errors that are annoyingly hard to debug.
- Timezone mismatches: Use UTC for all schedules to avoid daylight saving time surprises. Trust me on this one — convert to local time only in notification messages.
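On that last point, Python's zoneinfo makes the UTC-schedule-plus-local-notification split straightforward. A sketch rendering the stop time for a notification message (the timezone name and message wording are examples):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def notification_text(stop_utc: datetime, tz_name: str) -> str:
    """Render a UTC shutdown time in the recipient's local timezone."""
    local = stop_utc.astimezone(ZoneInfo(tz_name))
    return f"Dev environment stops at {local:%H:%M} {tz_name} ({stop_utc:%H:%M} UTC)"

# 18:00 UTC rendered for a New York audience (13:00 in winter, 14:00 in summer)
stop = datetime(2025, 1, 15, 18, 0, tzinfo=timezone.utc)
print(notification_text(stop, "America/New_York"))
```

The schedule itself stays in UTC; only the rendered message changes per audience, so daylight saving shifts never touch your cron rules.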
Frequently Asked Questions
How much can I save by shutting down dev environments at night?
Most organizations save between 50% and 70% on non-production compute costs with automated shutdown schedules; the exact number depends on your working-hours pattern. A typical Monday-through-Friday, 8 AM to 6 PM schedule leaves weekends off entirely and eliminates roughly 70% of running hours. Real-world reports consistently show $1,000–$5,000+ monthly savings per team.
Will shutting down dev environments affect my CI/CD pipelines?
Only if your pipelines depend on always-on dev infrastructure. The better approach is using ephemeral environments for CI/CD — spin up resources at pipeline start, run tests, and tear everything down when complete. For pipelines that need a persistent environment, add a pre-pipeline step that starts the required resources, waits for them to become healthy, runs the pipeline, then triggers shutdown. Most CI/CD tools (GitHub Actions, GitLab CI, Jenkins) support pre/post pipeline hooks for exactly this.
What's the difference between stopping and deallocating a VM on Azure?
This one trips people up constantly. Stopping a VM via the guest OS (or az vm stop) halts the operating system but keeps the compute allocation reserved — you're still paying for the VM compute hours. Deallocating (via az vm deallocate or the portal Stop button) fully releases the compute resources, so you only pay for storage. Always deallocate non-production VMs to get real cost savings.
Can I schedule Kubernetes clusters to scale to zero?
Yes, absolutely. You can scale deployments and StatefulSets to zero replicas using CronJobs or KEDA, and configure the cluster autoscaler to remove empty nodes. On managed Kubernetes services (EKS, AKS, GKE), you can also scale node pools to zero directly. Worth noting: native HPA scale-to-zero (minReplicas: 0) is still behind the HPAScaleToZero feature gate in upstream Kubernetes, so for simple cron-based scheduling, CronJobs or KEDA remain the practical route.
How do I handle environments that need to run overnight for batch jobs or integration tests?
Use the ScheduleOverride tag (set to always-on or a custom schedule name) to exclude specific resources from the default shutdown. Alternatively, restructure long-running jobs to use spot instances or serverless compute (AWS Batch, Azure Batch, GCP Cloud Run jobs) that only cost money while actually processing. For nightly integration suites, the smartest move is starting the environment right when tests kick off and shutting down immediately after — rather than keeping everything running all night "just in case."