Before guessing, get the numbers. I open Cost Explorer and do exactly four things, in order. This takes about ten minutes and tells me roughly 80% of what I need to know.
First, I switch the granularity to Daily and the date range to the last 30 days. Monthly views hide the spike; daily views show me the exact day the slope changed. If the jump is recent enough, I switch to Hourly (it has to be enabled in Billing preferences, it is a paid option, and it only covers the last 14 days), which is invaluable for catching the exact hour something kicked off.
Second, I group by Service. Ninety percent of the time, one or two services account for the entire delta. Write down the top three offenders.
Third, I re-filter to the worst offender and re-group by Usage Type. This is where the real story lives. "EC2 — Other" is the classic gotcha: it bundles NAT Gateway data processing, EBS, elastic IPs and a lot more. "Usage Type" splits those apart into things like NatGateway-Bytes, EBS:SnapshotUsage, DataTransfer-Out-Bytes.
Fourth, I group by Resource (you have to enable resource-level data in Billing settings — do it now if you haven't, it is free and retroactive up to 14 days). This gives you the actual ARNs that are bleeding money.
If I want to script this rather than click around, I use the Cost Explorer API directly. Here is the bash one-liner I keep in my dotfiles:
aws ce get-cost-and-usage \
--time-period Start=2026-05-15,End=2026-05-31 \
--granularity DAILY \
--metrics UnblendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--query 'ResultsByTime[].{Date:TimePeriod.Start,Groups:Groups[?to_number(Metrics.UnblendedCost.Amount) > `50`].[Keys[0],Metrics.UnblendedCost.Amount]}' \
--output table
That filters out the noise floor (anything under $50/day per service) so you only see the things that actually matter. Run it, look at which service's row got fatter, and move on to the suspect list below.
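To make "which row got fatter" less of an eyeball exercise, the JSON response can be post-processed. This is a minimal sketch, assuming the response shape that get-cost-and-usage returns when grouped by SERVICE; the function name `service_deltas`, the helper `_group`, and the sample figures are hypothetical.

```python
def service_deltas(response, floor=50.0):
    """Rank services by cost change between the first and last day.

    `response` is assumed to follow the get-cost-and-usage JSON shape:
    ResultsByTime[].Groups[].{Keys, Metrics.UnblendedCost.Amount}.
    Amount arrives as a string, so it is converted with float().
    """
    days = response["ResultsByTime"]

    def costs(day):
        return {
            g["Keys"][0]: float(g["Metrics"]["UnblendedCost"]["Amount"])
            for g in day["Groups"]
        }

    first, last = costs(days[0]), costs(days[-1])
    deltas = {
        svc: last.get(svc, 0.0) - first.get(svc, 0.0)
        for svc in set(first) | set(last)
        # Skip services that stay under the noise floor on both days.
        if max(first.get(svc, 0.0), last.get(svc, 0.0)) >= floor
    }
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


def _group(service, amount):
    # Hypothetical helper that builds one Groups[] entry for the sample.
    return {"Keys": [service],
            "Metrics": {"UnblendedCost": {"Amount": amount, "Unit": "USD"}}}


# Made-up two-day sample, trimmed to the fields the function reads.
sample = {
    "ResultsByTime": [
        {"TimePeriod": {"Start": "2026-05-29"},
         "Groups": [_group("EC2 - Other", "120.00"),
                    _group("AmazonCloudWatch", "60.00"),
                    _group("Amazon Simple Storage Service", "3.10")]},
        {"TimePeriod": {"Start": "2026-05-30"},
         "Groups": [_group("EC2 - Other", "410.50"),
                    _group("AmazonCloudWatch", "65.00"),
                    _group("Amazon Simple Storage Service", "3.00")]},
    ]
}

for service, delta in service_deltas(sample):
    print(f"{service}: {delta:+.2f} USD/day")
```

Comparing only the first and last day is a deliberate simplification; for a slow creep, comparing weekly averages would be less noisy.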
Suspect 1: NAT Gateway data processing
This is the number one culprit in my notes. NAT Gateway charges $0.045 per GB processed (us-east-1, as of 2026) on top of the hourly fee. A single misconfigured pod that pulls a 2 GB Docker image through a NAT every minute will quietly cost you around $130/day. I have seen this one twice this year alone — once it was a CI runner in a private subnet pulling images from Docker Hub instead of ECR, and once it was a CloudWatch agent looping on a malformed log shipper config.
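The $130/day figure is easy to check. A quick sketch of the arithmetic, using the $0.045/GB rate quoted above; the function name `nat_daily_cost` is mine, not an AWS API:

```python
NAT_PRICE_PER_GB = 0.045  # us-east-1 data processing rate quoted above


def nat_daily_cost(gb_per_pull, pulls_per_day, price_per_gb=NAT_PRICE_PER_GB):
    """Data processing charge for repeated pulls through a NAT Gateway."""
    return gb_per_pull * pulls_per_day * price_per_gb


pulls_per_day = 24 * 60  # one pull per minute
print(f"${nat_daily_cost(2, pulls_per_day):.2f} per day")  # $129.60 per day
```

That excludes the NAT's hourly fee and any data transfer charges on top, so the real number is a little higher.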
The tell in Cost Explorer is a line item called NatGateway-Bytes under EC2 — Other. To find the actual culprit, enable VPC Flow Logs on the NAT's ENI and query them with Athena:
SELECT srcaddr, dstaddr, SUM(bytes) / POWER(1024, 3) AS gb
FROM vpc_flow_logs
WHERE day = '2026-05-30'
AND action = 'ACCEPT'
AND interface_id = 'eni-0abc123def456'
GROUP BY srcaddr, dstaddr
ORDER BY gb DESC
LIMIT 20;
The fix is almost always a VPC Gateway Endpoint (free) for S3 and DynamoDB, or an Interface Endpoint (about $0.01/hour per AZ plus $0.01/GB processed in us-east-1, still far below the NAT's $0.045/GB) for ECR, Secrets Manager, CloudWatch, and so on. If the workload was pulling from S3 through a NAT, adding a single gateway endpoint can drop the bill by 60% the same day.
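Whether an interface endpoint is worth it comes down to a break-even volume. Here is a rough sketch, assuming us-east-1 list prices of $0.01/hour per AZ and $0.01/GB for the endpoint against $0.045/GB for NAT processing, and assuming the NAT Gateway itself stays in place (so its hourly fee cancels out). The function name and the 730-hour month are my own conventions.

```python
def endpoint_break_even_gb(nat_per_gb=0.045, endpoint_per_gb=0.01,
                           endpoint_per_hour=0.01, az_count=1, hours=730):
    """Monthly GB above which an interface endpoint beats NAT processing.

    All prices are assumptions; check the current pricing pages.
    """
    if nat_per_gb <= endpoint_per_gb:
        raise ValueError("endpoint never pays off at these per-GB prices")
    fixed_monthly = endpoint_per_hour * az_count * hours
    return fixed_monthly / (nat_per_gb - endpoint_per_gb)


for azs in (1, 3):
    gb = endpoint_break_even_gb(az_count=azs)
    print(f"{azs} AZ(s): break-even at about {gb:.0f} GB/month")
```

With those assumptions a single-AZ endpoint pays for itself a little past 200 GB a month, which a busy CI runner can exceed in a day.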
Suspect 2: CloudWatch Logs ingestion
CloudWatch is sneaky because the ingest charge ($0.50/GB in us-east-1) doesn't appear on most dashboards. A noisy debug logger left on after a deploy can dump terabytes a day. The line item to look for is DataProcessing-Bytes or just the "CloudWatch" service jumping by 5x.
The fastest way to find the offender:
aws logs describe-log-groups --query 'logGroups[].[logGroupName,storedBytes]' \
--output text | sort -k2 -n -r | head -20
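Then divide storedBytes by the retention window, in days, to get a rough estimate of daily ingest (for a group with no retention set, use the group's age instead). For an exact figure, check the IncomingBytes metric in the AWS/Logs CloudWatch namespace. If a single log group is doing more than 10 GB/day, you almost certainly have a runaway logger. I usually find it is one of: an APM agent in debug mode, an EKS container logging every health-check request at INFO level, or a Lambda function logging the entire event payload on every invocation.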
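The division above is simple enough to script. A minimal sketch, assuming you have already collected each group's name, storedBytes, and retention in days; the function names, the sample groups, and the 30-day month are hypothetical, and the $0.50/GB rate is the us-east-1 ingest price quoted earlier.

```python
GIB = 1024 ** 3


def estimate_daily_ingest_gb(stored_bytes, retention_days):
    """Rough daily ingest: bytes currently stored spread over the window."""
    return stored_bytes / GIB / retention_days


def runaway_groups(groups, threshold_gb=10.0, price_per_gb=0.50):
    """Return (name, GB/day, estimated monthly ingest cost) over threshold.

    `groups` is an iterable of (name, stored_bytes, retention_days).
    """
    flagged = []
    for name, stored_bytes, retention_days in groups:
        daily_gb = estimate_daily_ingest_gb(stored_bytes, retention_days)
        if daily_gb > threshold_gb:
            monthly_cost = daily_gb * price_per_gb * 30
            flagged.append((name, round(daily_gb, 1), round(monthly_cost, 2)))
    return flagged


# Made-up sample: one noisy group, one quiet one.
sample_groups = [
    ("/eks/checkout-api", 900 * GIB, 30),
    ("/aws/lambda/thumbnailer", 14 * GIB, 14),
]
print(runaway_groups(sample_groups))
```

This is only as good as the storedBytes figure, so treat the output as a shortlist to verify rather than a bill.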
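Two fixes that compound: set a retention policy on every log group (the default is "Never expire" and it will haunt you), and ship high-volume logs to S3 instead of CloudWatch. Do that at the source, in the log shipper's output config; a subscription filter only forwards logs that CloudWatch has already ingested and billed. S3 Standard storage costs about $0.023/GB-month with no per-GB ingest fee, against $0.50/GB just to get the data into CloudWatch.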
Suspect 3: EBS snapshots accumulating forever
If your bill is creeping up linearly month over month rather than spiking, this is usually it. EBS snapshots are incremental, so each one stores only the blocks that changed since the previous snapshot, but every stored block keeps billing until the last snapshot that references it is deleted, and backup plans with multi-year retention keep them around for a long time. I worked on one account in March 2026 that had 14,000 snapshots from a deleted dev environment racking up $1,800/month.
Look for EBS:SnapshotUsage under EC2 - Other. To audit, list every snapshot you own and sort by size (VolumeSize is the size of the source volume in GiB, not the billed size of the snapshot, but it is a workable proxy):
aws ec2 describe-snapshots --owner-ids self \
--query 'Snapshots[].[SnapshotId,VolumeSize,StartTime,Description]' \
--output text | sort -k2,2nr | head -50
The fix is a Data Lifecycle Manager policy with a sensible retention (I default to 7 daily, 4 weekly, 3 monthly for production; nothing for dev). For the legacy mess, I usually write a small script that deletes snapshots older than 90 days whose source volumes no longer exist — but always dry-run it first, and never delete snapshots that are the source for an AMI you still use.
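The selection logic for that cleanup script is the part worth getting right, so here is a dry-run sketch of it. It assumes snapshot records shaped like describe-snapshots output (SnapshotId, VolumeId, StartTime as a timezone-aware datetime) and that you have separately collected the IDs of volumes that still exist and of snapshots backing AMIs. The function name and all sample IDs are hypothetical, and nothing here calls AWS or deletes anything.

```python
from datetime import datetime, timedelta, timezone


def deletable_snapshots(snapshots, live_volume_ids, ami_snapshot_ids,
                        now, max_age_days=90):
    """Return IDs of old snapshots whose source volume is gone.

    Snapshots that back a registered AMI are always excluded.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [
        s["SnapshotId"]
        for s in snapshots
        if s["StartTime"] < cutoff
        and s["VolumeId"] not in live_volume_ids
        and s["SnapshotId"] not in ami_snapshot_ids
    ]


def _snap(snapshot_id, volume_id, year, month, day):
    # Hypothetical helper that builds one sample record.
    return {"SnapshotId": snapshot_id, "VolumeId": volume_id,
            "StartTime": datetime(year, month, day, tzinfo=timezone.utc)}


sample_snapshots = [
    _snap("snap-old-orphan", "vol-gone", 2025, 11, 1),
    _snap("snap-old-live", "vol-live", 2025, 11, 1),
    _snap("snap-old-ami", "vol-gone", 2025, 10, 1),
    _snap("snap-new-orphan", "vol-gone", 2026, 5, 1),
]

candidates = deletable_snapshots(
    sample_snapshots,
    live_volume_ids={"vol-live"},
    ami_snapshot_ids={"snap-old-ami"},
    now=datetime(2026, 5, 31, tzinfo=timezone.utc),
)
print("would delete:", candidates)  # would delete: ['snap-old-orphan']
```

Printing the candidate list and reviewing it by hand is the dry run; the actual delete call belongs in a separate step.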
Suspect 4: Cross-AZ and inter-region data transfer
Data Transfer is the line item that confuses everyone because it has about a dozen sub-types. The two that bite hardest in 2026:
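- Cross-AZ traffic at $0.01/GB in each direction, billed on both the sending and the receiving side, so every GB that crosses an AZ boundary effectively costs $0.02. An application reading from an RDS replica in a different AZ, or an EKS pod talking to a service in a different AZ, can pile this up surprisingly fast.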
- Inter-region replication at $0.02 to $0.09/GB depending on the regions. S3 Cross-Region Replication left running on a bucket that grew 10x is a classic.
The line items to grep for are DataTransfer-Regional-Bytes (cross-AZ in the same region) and the inter-region usage types, which are named for the region pair (for example USE1-USW2-AWS-Out-Bytes). If the Regional one is your spike, the fix is usually topology: colocate chatty services in the same AZ, or move the chattiness inside a single pod. If Inter-Region is the spike, audit your replication policies and your CloudFront origin configuration.
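To get a feel for how fast cross-AZ chatter adds up, here is the arithmetic as a sketch. It uses the combined $0.02/GB figure from the list above; the function name, the 500 GB/day example, and the 30-day month are illustrative assumptions.

```python
def cross_az_monthly_cost(gb_per_day, per_gb_each_side=0.01, days=30):
    """Monthly cross-AZ transfer cost at $0.01/GB charged on each side."""
    return gb_per_day * 2 * per_gb_each_side * days


# A service pair exchanging 500 GB/day across AZs (made-up volume).
print(f"${cross_az_monthly_cost(500):.2f} per month")  # $300.00 per month
```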
For deeper inspection I usually fall back to VPC Flow Logs queried with the same Athena pattern as suspect 1. We covered a related angle in our NAT Gateway vs VPC endpoint cost breakdown, which has the math for when an interface endpoint pays for itself.
Suspect 5: RDS storage growth and IOPS
RDS gets bigger in two ways and both hit you. The first is allocated storage growing on its own if you have storage autoscaling on (which you should, but the upper bound matters). The second, and more common spike cause, is backup storage. RDS gives you free backup storage equal to your allocated storage, but anything beyond that is billed per GB-month as backup storage (roughly $0.095/GB-month in us-east-1). A long-running point-in-time-recovery window on a busy database can quietly accumulate hundreds of GB.
Check what you're actually paying for:
aws rds describe-db-instances \
--query 'DBInstances[].[DBInstanceIdentifier,AllocatedStorage,BackupRetentionPeriod,StorageType,Iops]' \
--output table
If BackupRetentionPeriod is 35 (the maximum) on a database that doesn't need it, drop it to 7. If StorageType is io2 on a workload that gets < 5,000 IOPS at peak, switch to gp3 — you can do this online and it can cut storage costs in half.
One subtler trap: if you took a manual snapshot before an upgrade and forgot to delete it, it never expires. Manual snapshots sit outside the retention period, persist until you delete them, and count toward billed backup storage the whole time. I find about one of these per audit.
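The backup overage is worth estimating before you touch retention settings. A minimal sketch, assuming the free allowance equals allocated storage as described above and a placeholder rate of $0.095/GB-month; the function names and sample sizes are hypothetical, and the real allowance is pooled across instances in a region, which this ignores.

```python
def billable_backup_gb(allocated_gb, backup_gb):
    """Backup storage beyond the free allowance (equal to allocated storage)."""
    return max(0, backup_gb - allocated_gb)


def backup_overage_cost(allocated_gb, backup_gb, price_per_gb_month=0.095):
    """Monthly charge for backup storage over the allowance.

    price_per_gb_month is an assumed rate; check current RDS pricing.
    """
    return billable_backup_gb(allocated_gb, backup_gb) * price_per_gb_month


# 500 GB instance carrying 1,300 GB of automated plus manual backups.
print(billable_backup_gb(500, 1300))              # 800
print(f"${backup_overage_cost(500, 1300):.2f}")   # $76.00
```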
Suspect 6: S3 with no lifecycle policy
S3 is rarely the source of a sudden doubling, but it is almost always the source of a slow, grinding monthly increase that nobody noticed. The line items to watch are TimedStorage-ByteHrs (standard storage) and Requests-Tier1/Tier2 (PUT/GET requests).
For one client this year, the spike was actually S3 Inventory running daily on a 400 TB bucket and writing the manifest back into the same bucket without expiration — they were paying for the inventory of the inventory of the inventory. Always set lifecycle rules.
A starter lifecycle policy I use as a default:
{
  "Rules": [
    {
      "Id": "expire-incomplete-uploads",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    },
    {
      "Id": "transition-to-ia",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" }
      ]
    }
  ]
}
Incomplete multipart uploads in particular are an invisible cost — they don't show in the bucket size in the console, but you pay for them. I have seen buckets where 30% of the billed storage was failed uploads from a misconfigured client. Our S3 storage class decision tree walks through when each tier actually pays off versus when the retrieval fees eat your savings.
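To see what the 30-day and 90-day transitions in the starter policy are worth, here is a storage-only sketch. The per-GB-month prices are assumptions based on us-east-1 list prices and should be checked against the current pricing page; the function names and the sample data are hypothetical, and the model ignores request, transition, retrieval, and minimum-duration charges.

```python
# Assumed us-east-1 storage prices in USD per GB-month.
PRICE = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER_IR": 0.004}

# Thresholds from the starter policy, checked oldest first.
TRANSITIONS = [(90, "GLACIER_IR"), (30, "STANDARD_IA")]


def storage_class_for_age(age_days):
    """Storage class an object of this age would be in under the policy."""
    for min_days, storage_class in TRANSITIONS:
        if age_days >= min_days:
            return storage_class
    return "STANDARD"


def monthly_storage_cost(objects, with_policy=True):
    """objects: iterable of (size_gb, age_days). Storage charge only."""
    total = 0.0
    for size_gb, age_days in objects:
        cls = storage_class_for_age(age_days) if with_policy else "STANDARD"
        total += size_gb * PRICE[cls]
    return total


# Made-up bucket: 1 TB each of fresh, month-old, and old data.
bucket = [(1000, 10), (1000, 45), (1000, 200)]
print(f"no policy:   ${monthly_storage_cost(bucket, with_policy=False):.2f}")
print(f"with policy: ${monthly_storage_cost(bucket):.2f}")
```

The gap narrows or reverses for data that is read often, which is exactly the retrieval-fee question the decision tree covers.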
Closing: how to not be surprised again
After you have stopped the bleeding, do three things the same day. Set an AWS Budget at your normal monthly spend with an alert at 80% of it; the alert email is ugly but it works. Enable Cost Anomaly Detection (it is free and ML-based); it would have caught my client's $5,600 spike within four hours instead of thirty-six. And turn on hourly granularity in Cost Explorer preferences so the next time you investigate, you have the data already.
The reason bills double overnight is almost never that AWS suddenly raised prices. It is that some piece of automation (a backup retention policy, an autoscaling group, a log shipper, a replication rule) quietly started doing what it was configured to do, against a dataset that grew past the threshold where it stopped being cheap. The drilldown above will find it in about an hour. Then you fix the root cause and set up a Cost Anomaly Detection monitor with an alert subscription so you don't have to find it again.
FAQ
How long does it take for AWS billing to catch up after I delete a resource?
Usage data in Cost Explorer typically lags by 8 to 24 hours. The actual billed amount on the invoice reconciles within 72 hours. If you deleted something yesterday and the daily cost has already dropped in Cost Explorer, you are good — you do not need to wait for the next invoice.
Can I get a refund for a surprise AWS bill?
Sometimes, yes. AWS Support will often grant a one-time goodwill credit for a clear configuration mistake, especially for new accounts or first-time spikes. Open a billing case, be specific about the root cause and the fix, and ask politely. Don't expect it twice.
Do Savings Plans or Reserved Instances help with a sudden spike?
Not in the short term. Both commit you to a baseline of usage for one or three years in exchange for a discount, so they're a structural optimization, not an emergency tool. If your spike is genuine sustained new usage (not a leak), then yes, buying a Compute Savings Plan that covers the new baseline can save 30 to 50%. But fix the leak first, then size the commitment to actual demand.
Why is my "EC2 — Other" line so much bigger than EC2 itself?
Because "EC2 — Other" is a bucket for everything EC2-adjacent that isn't a running instance: EBS volumes, EBS snapshots, NAT Gateways, Elastic IPs that aren't attached, data transfer. Re-grouping that line by Usage Type in Cost Explorer is always the first move — it almost always points straight at the problem.