Scaling a Kubernetes cluster reactively never works for breaking-news traffic. By the time Cluster Autoscaler notices CPU spiking, the front page is already serving 503s. The cluster reacts in minutes; the traffic arrives in seconds. The only way out is to scale before the traffic hits — and for that you need a signal the metrics layer cannot give you.
News sites have a brutally bimodal traffic pattern: modest and predictable most of the time, then 20x–50x in minutes when a major event breaks. The standard industry response is permanent overprovisioning. It's expensive, and worse, it still fails: even with substantial idle capacity, a spike generates hot spots, saturated connections, and stateful services that can't scale as fast as stateless ones.
There's a timing asymmetry almost nobody exploits: when an editor marks a story as breaking news in the CMS, it doesn't go live instantly. It goes through legal review, final copy edits, editor-in-chief approval — typically 30 to 90 seconds. That window is enough time to pre-scale the infrastructure before the traffic exists.
In this post we walk through the complete pattern: a webhook from the CMS, a Lambda detector, KEDA with an external trigger, and a dedicated Karpenter NodePool that provisions in 30–60 seconds instead of the 3–5 minutes you get from Cluster Autoscaler.
The Problem with the Reactive Model
Before getting into the pattern, it's worth understanding why the classic approach — overprovision plus Cluster Autoscaler — breaks down. The table below shows typical metrics for an overprovisioned news-site cluster; the numbers are illustrative but reflect what we see repeatedly in audits:
| Metric | Typical Value |
|---|---|
| Average CPU utilization | 8–15% |
| Average memory utilization | 20–30% |
| Cluster Autoscaler reaction time | 3–5 minutes |
| New pod warm-up time | 20–60 seconds |
| Traffic multiplier during breaking news | 20x–50x |
| Time between publication and traffic spike | 10–30 seconds |
The last two rows kill the reactive model. Even if the base infrastructure could absorb a 5x spike, a 20x blows straight through it. And even if Cluster Autoscaler eventually provisions the nodes, they arrive 2–3 minutes late — when the spike has already broken everything. The right question isn't "how do we make the Autoscaler faster?" It's: how do we inform the cluster before the traffic exists?
Phase 1: CMS Webhook and Lambda Detector
The first link in the chain is a webhook on the editorial CMS. Most modern CMS platforms — WordPress VIP, Arc Publishing, Strapi, proprietary editorial systems — support state-change webhooks. We added one that fires whenever a story transitions to breaking_news status or receives a high-priority flag. The webhook points to a Lambda that validates the HMAC signature, applies minimal filters (deduplication, throttling), and writes a pre-scale event to DynamoDB — the source of truth that KEDA consumes downstream.
# breaking_news_detector/handler.py
import hashlib
import hmac
import json
import os
import time
import boto3
WEBHOOK_SECRET = os.environ["CMS_WEBHOOK_SECRET"]
TABLE_NAME = os.environ["PRESCALE_TABLE"]
MIN_INTERVAL_SECONDS = 60 # anti-duplicate window
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)
def verify_signature(body: bytes, signature: str) -> bool:
expected = hmac.new(
WEBHOOK_SECRET.encode(),
body,
hashlib.sha256,
).hexdigest()
return hmac.compare_digest(expected, signature)
def should_prescale(story: dict) -> bool:
if story.get("status") != "breaking_news":
return False
if story.get("priority", "normal") not in ("high", "critical"):
return False
return True
def recently_fired(story_id: str) -> bool:
resp = table.get_item(Key={"story_id": story_id})
item = resp.get("Item")
if not item:
return False
    # DynamoDB returns numbers as Decimal; cast before mixing with float arithmetic.
    return (time.time() - float(item["ts"])) < MIN_INTERVAL_SECONDS
def emit_prescale_event(story: dict) -> None:
table.put_item(
Item={
"story_id": story["id"],
"ts": int(time.time()),
"ttl": int(time.time()) + 900, # 15 min TTL
"priority": story["priority"],
"estimated_traffic_multiplier": story.get("multiplier", 20),
"source": "cms_editorial",
}
)
def lambda_handler(event, context):
body = event["body"].encode() if isinstance(event["body"], str) else event["body"]
    # Header casing varies by integration (REST proxy keeps the sender's case); normalize it.
    headers = {k.lower(): v for k, v in event.get("headers", {}).items()}
    signature = headers.get("x-cms-signature", "")
if not verify_signature(body, signature):
return {"statusCode": 401, "body": "invalid signature"}
story = json.loads(body)
if not should_prescale(story):
return {"statusCode": 200, "body": "no prescale needed"}
if recently_fired(story["id"]):
return {"statusCode": 200, "body": "already fired"}
emit_prescale_event(story)
return {"statusCode": 202, "body": "prescale event emitted"}
Key decisions:
- HMAC validation first, everything else second. The endpoint is public; an attacker who discovers the URL could fire fake pre-scale events every 10 seconds and light your bill on fire. Signature verification is non-negotiable.
- Deduplication with a short TTL. The same `story_id` can trigger multiple webhooks — title change, final edit, editor approval. We ignore events for the same story within a 60-second window.
- 15-minute TTL on DynamoDB items. Items self-delete; the table never grows unbounded.
- No scaling happens here. The Lambda doesn't talk to Kubernetes or Karpenter — it only emits the event. That keeps it simple, fast (<50ms p99), and easy to test.
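One setup detail the handler takes for granted: DynamoDB only honors that `ttl` attribute if Time to Live is enabled on the table and pointed at the attribute of that name. A minimal provisioning sketch with boto3, reusing the table and attribute names from the handler above (the on-demand billing mode is our assumption):

```python
# create_prescale_table.py: one-off setup; names match the Lambda handler above.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Pre-scale traffic is tiny and bursty, so on-demand billing avoids capacity planning.
dynamodb.create_table(
    TableName="prescale_events",
    AttributeDefinitions=[{"AttributeName": "story_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "story_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
dynamodb.get_waiter("table_exists").wait(TableName="prescale_events")

# Items only self-delete if TTL is enabled on the table and pointed at the "ttl" attribute.
dynamodb.update_time_to_live(
    TableName="prescale_events",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "ttl"},
)
```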
Phase 2: KEDA with an External Trigger
KEDA (Kubernetes Event-Driven Autoscaling) is what translates "there's an event in DynamoDB" into "scale these Deployments now." We configured a ScaledObject per critical service, each with a DynamoDB trigger pointing at the prescale_events table.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: public-cms-prescale
namespace: news-frontend
spec:
scaleTargetRef:
name: public-cms
pollingInterval: 5 # seconds
cooldownPeriod: 300 # 5 min before scaling back down
minReplicaCount: 6
maxReplicaCount: 120
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 300
periodSeconds: 15
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
triggers:
- type: aws-dynamodb
metadata:
tableName: prescale_events
awsRegion: us-east-1
keyConditionExpression: "source = :source"
expressionAttributeValues: '{":source": {"S": "cms_editorial"}}'
targetValue: "1"
activationTargetValue: "0"
identityOwner: operator
- type: prometheus
metadata:
serverAddress: http://prometheus.observability.svc:9090
metricName: http_requests_per_second
query: sum(rate(nginx_ingress_controller_requests{service="public-cms"}[1m]))
threshold: "500"
Two triggers by design:
- The DynamoDB trigger is the predictive signal. As soon as an item appears in the table, KEDA pushes the Deployment toward its target replica count. `pollingInterval: 5` means KEDA detects the event at most 5 seconds after the Lambda wrote it.
- The Prometheus trigger is the safety net. If the webhook fails — network issue, rotated signature, an editor who published without marking as breaking news — the classic RPS-based trigger keeps working. Never rely solely on the predictive signal.
The behavior.scaleUp configuration is critical. By default the HPA scales gradually; a Percent policy of 300% per 15s lets it add up to 300% of the current replica count (i.e., quadruple it) every 15 seconds. For breaking news you want aggression — every second of gradual scale-up is paid for in 503s. The scaleDown, by contrast, is conservative: 10% every 60 seconds with a 5-minute stabilization window. Breaking-news spikes have long tails (readers keep arriving 10–20 minutes after the story breaks), and pulling the infrastructure down too fast exposes you to a second spike when the topic resurfaces on social.
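To make that concrete, a quick back-of-envelope (plain Python, illustrative only) of how fast a 300%-per-15s policy can take public-cms from its 6-replica floor to the 120-replica ceiling, ignoring node provisioning and pod startup:

```python
# Back-of-envelope for the scaleUp policy: each 15s period the HPA may add up to
# 300% of the current replica count, i.e. multiply it by 4, capped at maxReplicaCount.
def seconds_to_reach(target: int, current: int = 6, pct: int = 300, period_s: int = 15) -> int:
    elapsed = 0
    while current < target:
        current = min(target, current + current * pct // 100)
        elapsed += period_s
    return elapsed

print(seconds_to_reach(120))  # 6 -> 24 -> 96 -> 120: 45 seconds of HPA decisions
```

Three HPA periods, roughly 45 seconds of scaling decisions; the real constraint is getting nodes for those pods, which is what Phase 3 addresses.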
For Heavy Jobs: ScaledJob
Some services aren't Deployments but jobs — for instance, regenerating the homepage's static assets or warming the search cache. For those we used ScaledJob:
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
name: homepage-cache-warmer
namespace: news-frontend
spec:
jobTargetRef:
parallelism: 8
completions: 8
backoffLimit: 2
template:
spec:
restartPolicy: Never
containers:
- name: warmer
image: registry.internal/cache-warmer:1.4.2
args: ["--target=homepage", "--variants=all"]
resources:
requests:
cpu: "500m"
memory: "512Mi"
pollingInterval: 5
maxReplicaCount: 8
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
triggers:
- type: aws-dynamodb
metadata:
tableName: prescale_events
awsRegion: us-east-1
keyConditionExpression: "source = :source"
expressionAttributeValues: '{":source": {"S": "cms_editorial"}}'
targetValue: "1"
This kicks off the warm-up Job with 8 pods running in parallel as soon as the event appears. By the time the story goes live, the cache is warm and the first requests hit the CDN, not the origin.
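The warmer image itself isn't part of this post, but its job is simple. A hedged sketch of what such a container might run; the origin URL, variant list, and argument handling are assumptions, not the real registry.internal/cache-warmer image:

```python
# cache_warmer.py: sketch of a warm-up container. The real image is not shown in
# this post; the origin URL and variant list below are purely illustrative.
import argparse
import concurrent.futures

import requests

ORIGIN = "https://origin.internal"  # hypothetical internal origin endpoint

VARIANTS = {
    "homepage": ["/", "/?edition=us", "/?edition=intl", "/amp/"],
}

def warm(path: str) -> int:
    # A plain GET is enough: the goal is to populate the page cache, not validate content.
    return requests.get(f"{ORIGIN}{path}", timeout=10).status_code

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--target", default="homepage")
    parser.add_argument("--variants", default="all")  # only "all" is handled in this sketch
    args = parser.parse_args()

    paths = VARIANTS[args.target]
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for path, status in zip(paths, pool.map(warm, paths)):
            print(f"{path} -> {status}")

if __name__ == "__main__":
    main()
```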
Phase 3: Karpenter with a Dedicated NodePool
KEDA scales the Deployments, but if the nodes don't exist the pods stay Pending. The key is a dedicated NodePool for breaking-news traffic, isolated from the cluster's general NodePool:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: breaking-news-burst
spec:
template:
metadata:
labels:
workload-tier: breaking-news
spec:
taints:
- key: workload-tier
value: breaking-news
effect: NoSchedule
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand", "spot"]
- key: node.kubernetes.io/instance-type
operator: In
values:
- c6i.2xlarge
- c6i.4xlarge
- c6a.2xlarge
- c6a.4xlarge
- c7i.2xlarge
- c7i.4xlarge
- m6i.2xlarge
- m6i.4xlarge
- m6a.2xlarge
- m6a.4xlarge
- key: topology.kubernetes.io/zone
operator: In
values: ["us-east-1a", "us-east-1b", "us-east-1c"]
nodeClassRef:
name: default
  limits:
    cpu: "1000"
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 180s
    # In the karpenter.sh/v1beta1 API, expireAfter lives under disruption.
    expireAfter: 2h
    budgets:
    - nodes: "10%"
Key decisions:
- Taints plus labels. The NodePool carries the taint `workload-tier=breaking-news:NoSchedule`. Only the critical Deployments (with the matching toleration and `nodeSelector`) land here. The rest of the cluster never touches this pool.
- CPU-dense families. `c6i`, `c6a`, `c7i`, `m6i`, `m6a` — all CPU-first. News sites under load are CPU-bound from HTML rendering, SSR, and TLS termination. We avoided burstable (`t` family, unpredictable) and memory-optimized (`r` family, unnecessary for this workload).
- `consolidationPolicy: WhenEmpty`, not `WhenUnderutilized`. During a spike you want nodes to stay up even if they're temporarily underutilized. `WhenUnderutilized` would consolidate aggressively mid-spike and generate churn at exactly the wrong moment.
- `expireAfter: 2h`. After a spike, Karpenter recycles nodes even if KEDA still holds pods up through the cooldown period. Prevents aging Spot nodes with pending interruptions from lingering indefinitely.
- `disruption.budgets: 10%`. No more than 10% of the pool's nodes can be under replacement simultaneously.
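For a sense of scale, the limits block caps the pool at roughly this many nodes per instance type (vCPU counts are the published EC2 figures; memory limits and daemonset overhead are ignored):

```python
# What limits.cpu: "1000" means in node terms for the larger allowed instance types
# (published vCPU counts; ignores the memory limit and daemonset overhead).
VCPUS = {"c6i.2xlarge": 8, "c6i.4xlarge": 16, "m6i.4xlarge": 16}
CPU_LIMIT = 1000

for instance, vcpus in VCPUS.items():
    print(f"{instance}: up to {CPU_LIMIT // vcpus} nodes")
```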
The critical Deployments carry this configuration to land on the right pool:
apiVersion: apps/v1
kind: Deployment
metadata:
name: public-cms
namespace: news-frontend
spec:
template:
spec:
tolerations:
- key: workload-tier
operator: Equal
value: breaking-news
effect: NoSchedule
nodeSelector:
workload-tier: breaking-news
terminationGracePeriodSeconds: 60
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: public-cms
containers:
- name: public-cms
image: registry.internal/public-cms:2.18.0
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 2
periodSeconds: 2
failureThreshold: 3
startupProbe:
httpGet:
path: /health/startup
port: 8080
failureThreshold: 15
periodSeconds: 2
resources:
requests:
cpu: "1500m"
memory: "1Gi"
limits:
cpu: "3000m"
memory: "2Gi"
The probes are tuned for fast warm-up: the startupProbe allows up to 30 seconds to come up (15 × 2s), and the readinessProbe checks every 2 seconds and marks the pod ready on its first success (successThreshold defaults to 1), so a healthy pod starts receiving traffic within a few seconds of starting. With Karpenter provisioning nodes in 30–60 seconds, the complete end-to-end flow from "Lambda receives webhook" to "new pods are serving traffic" fits comfortably within 60–90 seconds — well inside the editorial review window.
Phase 4: Observability for the Pattern
A predictive pattern you don't measure is a pattern you can't debug. We instrumented the entire flow with Prometheus to confirm that pre-scaling genuinely arrives before the traffic. The four key queries:
# 1. Time from webhook receipt to new pods ready
histogram_quantile(0.95,
sum(rate(prescale_webhook_to_ready_seconds_bucket[10m])) by (le, service)
)
# 2. Extra pods raised by the pattern vs baseline
sum(kube_deployment_status_replicas{deployment="public-cms"})
- on() group_left
sum(kube_deployment_spec_replicas_baseline{deployment="public-cms"})
# 3. Real RPS vs pre-scaled capacity (should stay below 70% at peak)
sum(rate(nginx_ingress_controller_requests{service="public-cms"}[1m]))
/
(sum(kube_deployment_status_replicas{deployment="public-cms"}) * 200)
# 4. Pending pods during a spike (should be 0 or near-zero)
sum(kube_pod_status_phase{phase="Pending", namespace="news-frontend"})
The first query matters most: if the p95 of prescale_webhook_to_ready_seconds climbs above 90 seconds, the pattern has broken its contract. Typical p95 with this configuration sits between 50 and 75 seconds; p99 rarely exceeds 100. We added an alert to catch regressions:
- alert: PrescalingMissingWindow
expr: |
histogram_quantile(0.95,
sum(rate(prescale_webhook_to_ready_seconds_bucket[10m])) by (le)
) > 90
for: 10m
labels:
severity: critical
annotations:
summary: "Pre-scaling is slower than editorial review window"
description: "p95 end-to-end prescaling time exceeded 90s for 10+ minutes. Breaking news traffic may hit unprescaled infra."
Reactive vs Predictive: Order of Magnitude
This is the comparison that matters. The numbers are illustrative for a news-site cluster under breaking-news load, not from any specific deployment:
| Metric | Reactive + overprovisioning | Predictive + KEDA + Karpenter |
|---|---|---|
| Average utilization outside spikes | 8–15% | 40–60% |
| Time to react to a spike | 3–5 min | 30–90 sec |
| Ability to absorb 20x spikes | Partial (hot spots) | Full |
| Required baseline nodes | High (worst-case sizing) | Low (actual utilization) |
| Relative monthly cost | 1x | 0.1x – 0.3x |
| Expected downtime during breaking news | Minutes of 503s | Zero |
The biggest cost reduction doesn't come from the pre-scaling itself — it comes from being able to lower the baseline without fear. When you know you can go from 6 pods to 120 in 60 seconds with guarantees, you stop running 80 pods all the time "just in case." For clusters living at 8–15% average utilization, an order-of-magnitude reduction in compute spend is realistic without compromising availability during spikes.
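A deliberately crude model of that effect (every number below is made up to illustrate the shape of the saving, not taken from any deployment):

```python
# Illustrative only: relative compute cost of "always sized for the spike" vs
# "baseline sized for normal traffic, burst on demand". All inputs are invented.
HOURS_PER_MONTH = 730
SPIKES_PER_MONTH = 20
SPIKE_HOURS = 1            # spike + cooldown + long tail
BASELINE_PODS = 6
SPIKE_PODS = 120
ALWAYS_ON_PODS = 80        # "just in case" sizing

reactive = ALWAYS_ON_PODS * HOURS_PER_MONTH
predictive = (BASELINE_PODS * HOURS_PER_MONTH
              + (SPIKE_PODS - BASELINE_PODS) * SPIKES_PER_MONTH * SPIKE_HOURS)
print(f"relative cost: {predictive / reactive:.2f}x")  # ~0.11x with these inputs
```

With those particular assumptions the ratio lands around 0.11x; more frequent spikes or a larger safety baseline push it toward the upper end of the 0.1x–0.3x range in the table.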
What We'd Do Differently
The pattern works, but it carries honest tradeoffs worth admitting:
The CMS coupling is an operational risk. If the editorial team migrates the CMS, if someone refactors internal story states, if the HMAC secret is rotated incorrectly — the pattern breaks silently. The Prometheus fallback trigger is mandatory, not optional. On the day the webhook stops firing, you want the system to keep working, just less optimally.
Editors can fire ghost events. An editor who marks a story as breaking news and then decides not to publish it costs you 30–60 seconds of compute spun up for nothing. At low volume that's irrelevant; if your newsroom flags breaking news 40 times a day, you start paying for noise. It's worth adding a second filter — for example, only trigger if the flag comes with a high traffic-multiplier estimate, or only for stories carrying a specific editorial tag.
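That second filter is a small change to the should_prescale function from Phase 1. A sketch; the threshold and tag names are placeholders for whatever your CMS exposes:

```python
# Hypothetical tightening of should_prescale from Phase 1: only fire when the editorial
# metadata suggests a real spike. The threshold and tag names are placeholders.
HIGH_TRAFFIC_MULTIPLIER = 10
PRESCALE_TAGS = {"live-coverage", "front-page-alert"}

def should_prescale(story: dict) -> bool:
    if story.get("status") != "breaking_news":
        return False
    if story.get("priority", "normal") not in ("high", "critical"):
        return False
    # Second filter: demand a high expected multiplier or an explicit editorial tag,
    # so routine "breaking" flags don't spin up compute for nothing.
    if story.get("multiplier", 0) >= HIGH_TRAFFIC_MULTIPLIER:
        return True
    return bool(PRESCALE_TAGS & set(story.get("tags", [])))
```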
CDN cold start is still a problem. Pre-scaling the origin is valuable, but if the CDN has a cold cache for the new story you'll still see a burst of requests to the origin during the first 10–20 seconds. Pairing the pre-scale with a proactive CDN warm-up — hitting the article URL from multiple regions as soon as the event fires, so CloudFront or Fastly has the response cached before users arrive — closes that gap significantly.
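Mechanically, the warm-up can be as simple as fanning out GETs for the new article URL from a few regions the moment the pre-scale event fires, so more than one POP has the response cached. A hedged sketch; the worker function name and region list are assumptions:

```python
# cdn_warmup.py: fan warm-up requests out from several regions so the story URL is
# cached at more than one POP before readers arrive. The worker function name and
# region list are hypothetical; each regional worker simply GETs the URLs it receives.
import json

import boto3

WARMUP_REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]
WARMUP_FUNCTION = "cdn-warmup-worker"  # hypothetical per-region Lambda

def warm_story(story_url: str) -> None:
    variants = [story_url, f"{story_url}?amp=1", f"{story_url}?edition=intl"]
    for region in WARMUP_REGIONS:
        boto3.client("lambda", region_name=region).invoke(
            FunctionName=WARMUP_FUNCTION,
            InvocationType="Event",  # fire and forget; warming is best-effort
            Payload=json.dumps({"urls": variants}).encode(),
        )
```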
Stateful services remain the bottleneck. This pattern works cleanly for stateless services — APIs, frontend, workers. For Postgres, Redis, Elasticsearch, there's no magic pre-scaling. If your database can't absorb a 20x, no CMS hook will save you. The pattern reduces the problem to "the DB and cache are the bottleneck," which is a cleaner and more tractable problem than "the entire stack is the bottleneck."
Measure your actual editorial window. We assumed 30–90 seconds, but in some newsrooms that number is 15 seconds and in others it's 3 minutes. Before committing to SLAs based on this pattern, measure how long the editorial flow actually takes in the specific CMS you're working with — ideally with a histogram broken down by time of day, because at 3 AM with one editor on duty, the window shrinks.
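Measuring it is cheap if the CMS already emits webhooks for both the breaking-news flag and the publish event. A sketch with prometheus_client; the metric name and bucket boundaries are our choice:

```python
# editorial_window.py: instrument how long "flagged as breaking" to "published"
# actually takes, labeled by hour of day. Metric name and buckets are our choice;
# feed it from the same CMS webhooks (one on the flag, one on publish).
from datetime import datetime, timezone

from prometheus_client import Histogram

EDITORIAL_WINDOW = Histogram(
    "editorial_review_window_seconds",
    "Seconds between the breaking-news flag and publication",
    labelnames=["hour_of_day"],
    buckets=[15, 30, 45, 60, 90, 120, 180, 300],
)

def observe_window(flagged_at: datetime, published_at: datetime) -> None:
    hour = str(published_at.astimezone(timezone.utc).hour)
    EDITORIAL_WINDOW.labels(hour_of_day=hour).observe((published_at - flagged_at).total_seconds())
```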
Does your infrastructure collapse when a story explodes? Are you overprovisioning the cluster because spikes feel impossible to predict? We've implemented this pattern in production and know where the edge cases hide. Let's talk — we'll show you how to adapt it to your editorial stack.