Case Study: $240K/Year AWS Savings for a Healthcare SaaS


A healthcare SaaS company came to us with a simple problem: their AWS bill had grown from $12K/month to $38K/month in 18 months, while their user base had only doubled. Costs were growing faster than users, when they should scale sublinearly.

Their VP of Engineering put it bluntly: "We have 50,000 users and we're spending $38K/month. Our competitor has 200,000 users and spends less than us. What are we doing wrong?"

After a 2-week audit and 6 weeks of implementation, we brought their bill down to $18,200/month — a 52% reduction, saving $237,600/year. Here's every single thing we changed.

The Audit

We started by categorizing spend by AWS service:

Service               Monthly Cost   % of Total
EC2 (EKS nodes)       $16,400        43%
RDS (PostgreSQL)      $7,200         19%
ElastiCache (Redis)   $3,100         8%
S3 + CloudFront       $2,800         7%
NAT Gateway           $2,600         7%
Data Transfer         $2,400         6%
EBS Volumes           $1,800         5%
Other                 $1,700         5%
Total                 $38,000        100%

Every single line item had optimization potential. Let's go through them.

1. EC2/EKS: Right-Size + Spot + Karpenter ($16,400 → $6,800)

This was the biggest win. Their EKS cluster was running on m5.2xlarge On-Demand instances because "that's what the AWS Quick Start guide suggested."

Changes:

  • Replaced Cluster Autoscaler with Karpenter
  • Added 15 instance types to the allowed list
  • Moved 75% of workloads to Spot instances
  • Right-sized every pod based on 2 weeks of VPA data
  • Added HPA to all stateless services

We wrote about the Karpenter setup in detail in our Karpenter + Spot + Scale-to-Zero post.
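For flavor, here's a minimal sketch of the kind of NodePool this setup ends up with — the name, instance list, and EC2NodeClass reference are illustrative, not the client's actual config:

```shell
# Illustrative Karpenter (v1 API) NodePool: Spot-first with a diversified
# instance list so Spot capacity is almost always available.
# All names below are hypothetical; point nodeClassRef at your own EC2NodeClass.
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-spot
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.xlarge", "m5a.xlarge", "m6i.xlarge", "m6a.xlarge",
                   "c5.xlarge", "c6i.xlarge", "r5.xlarge", "r6i.xlarge"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
EOF
```

The more instance types you allow, the more pools Karpenter can draw Spot capacity from — that's what keeps interruption churn low.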

Savings: $9,600/month (58%)

2. RDS: Reserved Instances + Read Replicas ($7,200 → $3,600)

They were running a db.r6g.2xlarge PostgreSQL RDS instance — On-Demand, Multi-AZ. The database was at 15% CPU utilization on average.

Changes:

  • Downsized to db.r6g.xlarge (even after the downsize, CPU only hit 40% at peak)
  • Purchased a 1-year All Upfront Reserved Instance (42% discount)
  • Added a read replica for analytics queries that were hammering the primary
  • Moved nightly batch jobs to hit the replica instead of primary
```sql
-- Before: analytics queries on primary
SELECT date_trunc('day', created_at), count(*)
FROM patient_records
WHERE created_at > now() - interval '90 days'
GROUP BY 1;

-- After: same query routed to read replica via connection string
-- analytics_db_url = postgres://replica-endpoint:5432/healthdb
```
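The RI decision is easy to sanity-check with back-of-envelope math. The hourly rate and discount below are approximations for illustration, not a quote:

```shell
# Rough RI break-even math (rates illustrative):
# db.r6g.xlarge Multi-AZ on-demand ~ $1.07/hr, 1-yr All Upfront ~ 42% off
awk 'BEGIN {
  od = 1.07 * 730                 # on-demand $/month (~730 hrs/month)
  ri = od * (1 - 0.42)            # effective monthly cost after discount
  printf "on-demand: $%.0f/mo  reserved: $%.0f/mo  saved: $%.0f/mo\n", od, ri, od - ri
}'
```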

Savings: $3,600/month (50%)

3. ElastiCache: Right-Size + Reserved ($3,100 → $1,400)

Running cache.r6g.xlarge with 3% memory utilization. They were caching session data for 50K users — that fits in a cache.r6g.large with room to spare.

Changes:

  • Downsized to cache.r6g.large
  • Purchased 1-year Reserved Instance
  • Implemented TTL on all cache keys (they had 2M keys with no expiry)
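Adding TTLs was an application change, but the idea is simple: every session write carries an expiry. The endpoint and key names below are hypothetical:

```shell
# Write session data with an explicit 24h expiry instead of persisting forever
# (endpoint and key names are hypothetical)
redis-cli -h sessions.example.cache.amazonaws.com \
  SET "session:user:42" '{"token":"abc"}' EX 86400

# Spot-check sampled keys for missing TTLs (TTL returns -1 when no expiry is set)
redis-cli -h sessions.example.cache.amazonaws.com --scan --count 1000 \
  | head -100 \
  | while read -r k; do
      redis-cli -h sessions.example.cache.amazonaws.com TTL "$k"
    done | grep -c -- '^-1'
```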

Savings: $1,700/month (55%)

4. NAT Gateway: The Silent Budget Killer ($2,600 → $800)

This one surprised everyone. NAT Gateway charges $0.045/GB for data processing — and their pods were pulling Docker images through NAT on every deploy.

Changes:

  • Configured ECR VPC endpoints (no more NAT for image pulls)
  • Added S3 VPC endpoint (logs and backups were going through NAT)
  • Configured STS and CloudWatch VPC endpoints
  • Moved non-essential traffic to instances with public IPs
```shell
# VPC endpoints we added. Image pulls need both ECR endpoints (api + dkr)
# plus the S3 gateway endpoint, since image layers are served from S3.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxx \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --vpc-endpoint-type Interface

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxx \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --vpc-endpoint-type Interface

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxx \
  --service-name com.amazonaws.us-east-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-xxx
```

Savings: $1,800/month (69%)

NAT Gateway costs are one of the most overlooked line items in AWS bills. Every company we audit is overpaying for NAT.
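The $0.045/GB processing fee is easy to underestimate until you multiply it out. The traffic figure below is hypothetical, but representative of what image pulls plus log shipping can add up to:

```shell
# NAT Gateway data-processing fee at $0.045/GB (traffic figure hypothetical)
awk 'BEGIN {
  gb = 40000                       # GB/month routed through NAT
  printf "%d GB/mo through NAT = $%.0f/mo in processing fees\n", gb, gb * 0.045
}'
```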

5. S3 + CloudFront: Lifecycle Policies + Compression ($2,800 → $1,600)

They were storing every version of every file forever. Medical document uploads from 3 years ago were still in S3 Standard.

Changes:

  • S3 Intelligent-Tiering for all buckets (auto-moves cold data to cheaper tiers)
  • Lifecycle policy: move to Glacier after 1 year for compliance archives
  • Enabled CloudFront compression (Brotli) — reduced bandwidth 40%
  • Configured proper cache headers — CDN hit ratio went from 60% to 94%
```json
{
  "Rules": [
    {
      "ID": "ArchiveOldDocuments",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 365,
          "StorageClass": "GLACIER"
        }
      ]
    }
  ]
}
```
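Applying a lifecycle policy like that is a single CLI call (bucket name hypothetical):

```shell
# Apply the lifecycle configuration from a local JSON file (bucket name hypothetical)
aws s3api put-bucket-lifecycle-configuration \
  --bucket medical-documents-prod \
  --lifecycle-configuration file://lifecycle.json
```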

Savings: $1,200/month (43%)

6. Data Transfer: Keep Traffic Inside the VPC ($2,400 → $1,200)

Cross-AZ data transfer charges were eating them alive. Services in us-east-1a were talking to services in us-east-1c, paying $0.01/GB each way.

Changes:

  • Configured topology-aware routing in Kubernetes (prefer same-AZ)
  • Moved chatty services into the same AZ
  • Compressed inter-service payloads (gRPC with protobuf instead of JSON)
```yaml
# Topology-aware routing: prefer endpoints in the caller's own AZ
apiVersion: v1
kind: Service
metadata:
  name: user-service
  annotations:
    service.kubernetes.io/topology-mode: Auto
```
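Worth remembering that cross-AZ transfer bills both directions, so every chatty request pair pays twice. The traffic volume below is hypothetical:

```shell
# Cross-AZ transfer is billed in both directions at $0.01/GB each way
# (traffic volume is hypothetical)
awk 'BEGIN {
  gb = 60000                       # GB/month crossing AZ boundaries
  printf "%d GB/mo cross-AZ = $%.0f/mo\n", gb, gb * 0.02
}'
```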

Savings: $1,200/month (50%)

7. EBS Volumes: Delete Orphans + Change Types ($1,800 → $800)

22 unattached EBS volumes sitting there doing nothing. PersistentVolumes from deleted pods that nobody cleaned up. Classic.

Changes:

  • Deleted 22 orphaned EBS volumes (saved $400/month immediately)
  • Changed GP2 volumes to GP3 (20% cheaper, better performance)
  • Reduced snapshot frequency from hourly to daily for non-critical volumes
```shell
# Find orphaned (unattached) volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table
```
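The GP2-to-GP3 migration is a one-liner per volume and happens in place, with no detach and no downtime (volume ID hypothetical):

```shell
# Convert a gp2 volume to gp3 in place — no detach, no downtime
# (volume ID is hypothetical)
aws ec2 modify-volume \
  --volume-id vol-0123456789abcdef0 \
  --volume-type gp3
```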

Savings: $1,000/month (56%)

The Final Scorecard

Service         Before    After     Savings   %
EC2/EKS         $16,400   $6,800    $9,600    58%
RDS             $7,200    $3,600    $3,600    50%
ElastiCache     $3,100    $1,400    $1,700    55%
NAT Gateway     $2,600    $800      $1,800    69%
S3/CloudFront   $2,800    $1,600    $1,200    43%
Data Transfer   $2,400    $1,200    $1,200    50%
EBS             $1,800    $800      $1,000    56%
Other           $1,700    $2,000    -$300     -18%
Total           $38,000   $18,200   $19,800   52%

"Other" went up slightly because we added monitoring tools (Kubecost, custom exporters) that have a small compute cost. Worth every penny.

Annual savings: $237,600

The entire project — audit, implementation, testing, documentation — took 8 weeks and cost them a fraction of one month's savings.

The Most Important Change

The technical optimizations were important, but the cultural change mattered more. We installed our FinOps dashboard on day one of the project, so the team could see costs in real time from the start.

By week 3, engineers were coming to us with optimization ideas we hadn't thought of. One developer noticed their service was making 10x more S3 API calls than necessary due to a missing cache layer. Another found a cron job that was spinning up a large instance for 2 minutes every hour.

When you make costs visible, engineers optimize naturally. They just need the data.


AWS bill growing faster than your user base? That's normal — and fixable. Get a free infrastructure assessment and we'll show you exactly where the waste is.

Frequently Asked Questions

How long does an AWS cost optimization project like this typically take?

For a cluster/bill this size (~$38K/month, single region, ~40 microservices), expect a 2-week audit followed by 4-8 weeks of implementation. Most wins are safe to ship in the first 2-3 weeks — right-sizing, Reserved Instances, VPC endpoints, lifecycle policies. Karpenter migration and Spot adoption take another 2-4 weeks because they need proper testing and gradual rollout.

What's the first thing to look at when an AWS bill suddenly jumps?

Three suspects in order: (1) EC2/EKS compute (usually 40-60% of the bill — check for oversized instances, missing HPAs, no Spot), (2) Data transfer and NAT Gateway (often 10-20% of the bill, almost always fixable with VPC endpoints and topology-aware routing), (3) RDS or ElastiCache with Multi-AZ On-Demand (Reserved Instances alone can cut this 40-50%). Use AWS Cost Explorer grouped by service to triangulate fast.
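One quick way to get that per-service breakdown, assuming you have the AWS CLI with Cost Explorer access (dates illustrative):

```shell
# One month of spend grouped by service (dates illustrative)
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
```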

Can you really cut EKS costs by 50-70% without breaking production?

Yes, routinely. The standard recipe: (a) migrate from Cluster Autoscaler to Karpenter for better bin packing, (b) diversify instance types to 10+ families so Spot capacity is always available, (c) classify workloads into Spot-ready / Spot-fallback / On-Demand-only tiers, (d) right-size pods based on 2 weeks of VPA recommendation data, (e) scale dev/staging to zero outside business hours with KEDA cron triggers. Done carefully, zero production incidents — we've shipped this on clusters running 200+ microservices.
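The scale-to-zero piece in (e) can be as small as a KEDA cron trigger. Deployment name and schedule below are hypothetical:

```shell
# Illustrative KEDA ScaledObject: staging runs 9-19 on weekdays, zero otherwise
# (deployment name and schedule are hypothetical)
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: staging-business-hours
spec:
  scaleTargetRef:
    name: staging-api
  minReplicaCount: 0
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 9 * * 1-5
        end: 0 19 * * 1-5
        desiredReplicas: "3"
EOF
```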

Why is NAT Gateway so expensive, and are VPC endpoints really the fix?

NAT Gateway charges $0.045/GB for data processing on top of hourly fees. If your pods pull Docker images, write logs to S3, call STS, or hit CloudWatch APIs through NAT, you're paying the processing fee on every byte. ECR VPC endpoint, S3 gateway endpoint, STS and CloudWatch interface endpoints fix this — the traffic stays inside your VPC and bypasses NAT entirely. In our case this alone saved $1,800/month (69% reduction on data transfer costs).

Reserved Instances or Savings Plans — which should we use?

Compute Savings Plans for your On-Demand baseline (most flexible — applies to EC2, Fargate, and Lambda). Reserved Instances for RDS, ElastiCache, and Redshift, where Savings Plans don't apply. Go with 1-year No Upfront for the predictable baseline — that's 30-40% off with almost no lock-in risk. Save 3-year All Upfront commitments for workloads you're certain will still exist in three years (database primaries, control plane).

How do you handle Spot interruptions in production without customer impact?

Four rules: (1) diversify instance types (10+ families minimum) so AWS always has Spot capacity to give you, (2) spread across 3 AZs so a single-zone interruption event doesn't kill everything, (3) set terminationGracePeriodSeconds: 120 on all Spot workloads so they drain cleanly on the 2-minute warning, (4) keep stateful services (Kafka brokers, Redis primaries) on On-Demand only. With this setup, real-world interruption rates are under 5% and user-visible incidents are zero.