Building an AI Incident Responder That Actually Works


It was 3:47 AM on a Tuesday when our Slack lit up: "API latency spike — p99 over 8 seconds." The on-call engineer opened their laptop, rubbed their eyes, and started the familiar ritual: check Grafana, grep the logs, trace the request, check the deployment history, try to correlate events across 14 microservices.

By the time they found the root cause — a bad connection pool config deployed 6 hours earlier — it had been 2.5 hours. The incident was resolved, but the damage was done: SLA breach, unhappy customers, and an exhausted engineer.

That incident was the last straw. We decided to build something better.

The Problem: Incident Response is Mostly Data Gathering

Here's a dirty secret about incident response: 80% of the time is spent gathering context, not fixing things. The actual fix is usually simple — rollback a deploy, scale up a service, restart a pod. But finding what to fix requires correlating data across dozens of sources.

We mapped out what our engineers actually do during an incident:

  1. Check alerting dashboards (Grafana/Datadog)
  2. Read the last 15 minutes of logs from affected services
  3. Check recent deployments (kubectl rollout history)
  4. Look at resource utilization (CPU, memory, connections)
  5. Trace a sample failing request across services
  6. Check if similar incidents happened before
  7. Formulate a hypothesis and verify

Steps 1-6 are pure data gathering. An AI agent can do all of that faster and more consistently than a sleep-deprived human.

Architecture: The Three-Phase Agent

We built our incident responder as a Cloudflare Worker that connects to Claude via the Anthropic API. The agent works in three phases:

Phase 1: Context Collection

When an alert fires, a webhook triggers our agent. It immediately starts collecting context in parallel:

async function collectContext(alert) {
  const [logs, deploys, metrics, traces] = await Promise.all([
    fetchRecentLogs(alert.service, '15m'),
    fetchRecentDeploys(alert.namespace, '24h'),
    fetchMetricsSnapshot(alert.service),
    fetchFailingTraces(alert.service, 5),
  ]);

  return { alert, logs, deploys, metrics, traces };
}

We pull from:

  • Loki for logs (via LogQL API)
  • ArgoCD for deployment history
  • Prometheus for metrics (CPU, memory, error rates, latency percentiles)
  • Jaeger for distributed traces

All data is fetched in parallel. Total collection time: under 3 seconds.
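To make those fetchers concrete, here's a minimal sketch of what fetchRecentLogs can look like against Loki's query_range endpoint. The endpoint and response shape are Loki's; the LOKI_URL constant, the label selector, and the line cap are illustrative placeholders rather than our exact code.

// Sketch: pull the last N minutes of logs for a service from Loki.
// LOKI_URL and the {service="..."} label selector are placeholders.
async function fetchRecentLogs(service, window = '15m') {
  const minutes = parseInt(window, 10);                  // '15m' -> 15
  const end = new Date();
  const start = new Date(end.getTime() - minutes * 60 * 1000);

  const params = new URLSearchParams({
    query: `{service="${service}"}`,                     // LogQL stream selector
    start: start.toISOString(),                          // Loki accepts RFC3339 timestamps
    end: end.toISOString(),
    limit: '500',
  });

  const res = await fetch(`${LOKI_URL}/loki/api/v1/query_range?${params}`);
  const body = await res.json();

  // Flatten Loki's stream/values pairs into plain log lines
  return body.data.result.flatMap((s) => s.values.map(([, line]) => line));
}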

Phase 2: Analysis with Claude

We send the collected context to Claude Haiku with a carefully crafted system prompt:

import Anthropic from '@anthropic-ai/sdk';

// API key supplied via the Worker's secret binding
const anthropic = new Anthropic({ apiKey: env.ANTHROPIC_API_KEY });

const analysis = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 2048,
  system: `You are an SRE incident analyst. Given alert data, logs, 
    metrics, deployment history, and traces, identify the most likely 
    root cause and suggest remediation steps.
    
    Rules:
    - Be specific: name the exact service, pod, or config that's failing
    - Correlate timing: if a deploy happened before the alert, flag it
    - Suggest the safest fix first (rollback > restart > scale > config change)
    - Rate your confidence: HIGH / MEDIUM / LOW
    - If unsure, say so — never guess on production systems`,
  messages: [{ role: 'user', content: formatContext(context) }],
});

We chose Claude Haiku for speed — analysis completes in under 2 seconds. For complex incidents, we escalate to Sonnet.
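The escalation itself is just a second pass over the same context with a bigger model. A rough sketch of how that can be wired — the confidence parsing assumes the model echoes a "Confidence: HIGH/MEDIUM/LOW" line, and analyzeWith, SYSTEM_PROMPT, and the Sonnet model ID are illustrative names rather than our exact code:

// Sketch: retry the same context with Sonnet when Haiku reports LOW confidence.
// Helper names, the regex, and the Sonnet model ID are illustrative.
async function analyze(context) {
  let report = await analyzeWith('claude-haiku-4-5-20251001', context);

  const confidence = report.match(/confidence:\s*(HIGH|MEDIUM|LOW)/i)?.[1] ?? 'LOW';
  if (confidence.toUpperCase() === 'LOW') {
    report = await analyzeWith('claude-sonnet-4-5', context);   // escalate for harder incidents
  }
  return report;
}

async function analyzeWith(model, context) {
  const response = await anthropic.messages.create({
    model,
    max_tokens: 2048,
    system: SYSTEM_PROMPT,            // the same system prompt shown above
    messages: [{ role: 'user', content: formatContext(context) }],
  });
  return response.content[0].text;    // the analysis comes back as a single text block
}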

Phase 3: Action Recommendation

The agent posts to Slack with a structured incident brief:

🔴 INCIDENT DETECTED — api-gateway latency spike

📊 Summary:
- p99 latency jumped from 200ms to 8.2s at 03:41 UTC
- Error rate increased from 0.1% to 12.3%
- Affected: api-gateway, user-service (downstream)

🔍 Root Cause (HIGH confidence):
Deploy api-gateway v2.14.3 at 21:30 UTC changed connection 
pool max from 100 to 10 (likely typo in PR #847)

💡 Recommended Actions:
1. ROLLBACK api-gateway to v2.14.2 (safest)
   → kubectl rollout undo deployment/api-gateway -n production
2. OR hotfix: set POOL_MAX=100 in configmap
   → kubectl edit configmap api-gateway-config -n production

📋 Evidence:
- Connection pool exhaustion visible in logs (47 occurrences)
- Latency spike correlates exactly with deploy timestamp
- Pre-deploy metrics were healthy (p99: 180ms)

The on-call engineer can now act immediately instead of spending 2 hours gathering context.
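Under the hood the brief is just a formatted string. One simple way to deliver it is a standard Slack incoming webhook; a minimal sketch assuming that setup (SLACK_WEBHOOK_URL and formatBrief are placeholders for your own integration):

// Sketch: post the incident brief to a channel via a Slack incoming webhook.
// SLACK_WEBHOOK_URL and formatBrief are placeholders.
async function postBrief(analysis, context) {
  const brief = formatBrief(analysis, context);   // builds the text shown above

  await fetch(SLACK_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: brief }),        // incoming webhooks accept a simple { text } payload
  });
}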

Results: From 2.5 Hours to 8 Minutes

After 3 months in production:

  • MTTR dropped from 2.5 hours to an average of 18 minutes (8 minutes for agent-identified incidents)
  • 62% of incidents were auto-diagnosed correctly (HIGH confidence, verified by engineers)
  • False positive rate: 8% (agent said HIGH confidence but root cause was different)
  • On-call satisfaction went from 2.1 to 4.3 (out of 5) in team surveys

The biggest surprise? Engineers started trusting the agent within the first week. When they saw it correctly identify a cascading failure across 3 services in 4 seconds — something that would have taken them 30+ minutes — they were sold.

Lessons Learned

Start with read-only. Our agent doesn't execute any actions automatically. It only recommends. This was critical for building trust. We'll add auto-remediation for simple cases (like rollbacks) once we have 6 months of accuracy data.

Haiku is fast enough. We initially planned to use Sonnet for everything, but Haiku's speed (sub-2-second analysis) makes a real difference at 3 AM. We only escalate to Sonnet for multi-service incidents.

Context formatting matters more than prompt engineering. We spent more time formatting the data we send to Claude (clean log excerpts, pre-aggregated metrics, deployment diffs) than on the system prompt. Garbage in, garbage out.
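To give a flavor of what that formatting looks like, here's a rough sketch of a formatContext in that spirit. The section layout, the field names (alert.summary, d.version, and so on), and the 50-line log cap are illustrative, not our production values.

// Sketch: turn raw context into compact, labeled sections for the model.
// Field names and truncation limits are illustrative.
function formatContext({ alert, logs, deploys, metrics, traces }) {
  const uniqueLogs = [...new Set(logs)].slice(0, 50);   // dedupe and cap log excerpts

  return [
    `## Alert\n${alert.service}: ${alert.summary} (fired ${alert.firedAt})`,
    `## Metrics snapshot\n${JSON.stringify(metrics, null, 2)}`,
    `## Deploys (last 24h)\n` +
      deploys.map((d) => `- ${d.service} ${d.version} at ${d.timestamp}`).join('\n'),
    `## Log excerpts (last 15m, deduplicated)\n${uniqueLogs.join('\n')}`,
    `## Failing traces (sampled)\n${JSON.stringify(traces, null, 2)}`,
  ].join('\n\n');
}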

Track everything. We log every analysis alongside the actual root cause (determined post-incident). This gives us accuracy metrics and training data for improving our prompts.
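A minimal version of that record, assuming the analysis has already been parsed into a small structured object and that a Workers KV namespace is the store (the INCIDENT_LOG binding and the field names are assumptions, not our schema):

// Sketch: persist each analysis so it can be compared to the post-incident review later.
// env.INCIDENT_LOG is an assumed Workers KV binding; field names are illustrative.
async function recordAnalysis(env, alert, analysis) {
  const record = {
    incidentId: alert.id,
    service: alert.service,
    diagnosedAt: new Date().toISOString(),
    diagnosis: analysis.rootCause,
    confidence: analysis.confidence,        // HIGH / MEDIUM / LOW
    actualRootCause: null,                  // filled in during the post-incident review
    diagnosisCorrect: null,
  };
  await env.INCIDENT_LOG.put(`incident:${alert.id}`, JSON.stringify(record));
}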

What's Next

We're working on Phase 2: auto-remediation for high-confidence diagnoses. If the agent is 95%+ confident and the fix is a rollback, why wake up a human? We're building a confirmation flow where the agent proposes, a second Claude instance validates, and only then executes.
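A rough sketch of how we're thinking about that propose-then-validate flow. Every name here (the proposal fields, pageHuman, executeRollback, formatProposal, notifySlack) is illustrative, and nothing like this runs in production yet:

// Sketch of the planned propose -> validate -> execute flow (not yet in production).
// All helper names and proposal fields are illustrative.
async function maybeAutoRemediate(context, proposal) {
  // Gate 1: only HIGH-confidence rollbacks are even eligible
  if (proposal.confidence !== 'HIGH' || proposal.action !== 'rollback') {
    return pageHuman(proposal);
  }

  // Gate 2: a second Claude instance reviews the diagnosis independently
  const review = await anthropic.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 512,
    system:
      "You are reviewing another agent's incident diagnosis. Reply APPROVE only if " +
      "the evidence clearly supports the proposed rollback; otherwise reply REJECT.",
    messages: [{ role: 'user', content: formatProposal(context, proposal) }],
  });
  if (!review.content[0].text.trim().startsWith('APPROVE')) {
    return pageHuman(proposal);
  }

  // Gate 3: execute the rollback and tell the channel what happened
  await executeRollback(proposal.service, proposal.previousVersion);
  await notifySlack(`🤖 Auto-rolled back ${proposal.service}: ${proposal.summary}`);
}

The property we care about is that the executor never acts on a single model's say-so: the proposer and the validator see the same evidence but are prompted independently.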

The goal isn't to replace on-call engineers — it's to let them sleep through the incidents that don't need human creativity.


Want to build something similar? We've helped 3 teams implement AI-powered incident response. Get in touch — we'll share our playbook.

Frequently Asked Questions

Why Claude Haiku instead of Sonnet or Opus for incident response?

Speed wins at 3 AM. Haiku analyzes a full incident context in under 2 seconds. Sonnet takes 4-6 seconds, Opus 8-12. At 3 AM with a p99 spike, the engineer needs an answer NOW, not a more nuanced answer later. For the 5-10% of incidents where Haiku's confidence is low, we escalate to Sonnet automatically. This covers multi-service cascading failures where more reasoning matters.

Is auto-remediation actually safe for production?

Only behind strict confidence gates and for safe actions. Our current gate: agent must rate HIGH confidence, a second Claude instance must validate the diagnosis, and the action must be in the safe-list (rollback, restart pod, scale up — never config changes, never data mutations, never network policy edits). We start with read-only recommendations and only turn on auto-remediation after 6 months of accuracy data showing 95%+ HIGH-confidence correctness.

What data sources does the agent need access to?

Four minimum: (1) logs (we use Loki via LogQL API, but Datadog/CloudWatch/Elasticsearch work too), (2) metrics (Prometheus or equivalent — CPU, memory, error rates, latency percentiles), (3) deployment history (ArgoCD API, or whatever GitOps tool drives your rollouts), (4) distributed traces (Jaeger, Tempo, or Datadog APM). The agent fetches all four in parallel with Promise.all — total context collection is usually under 3 seconds.

How do you measure agent accuracy over time?

Log every analysis alongside the actual root cause (determined during the post-incident review). This gives you two metrics: diagnosis accuracy (was the HIGH-confidence call right?) and coverage (what fraction of incidents the agent even attempted to diagnose). We track these in the same Grafana dashboard as our infrastructure metrics so the trend is visible to the whole team.

What happens if the agent is wrong?

Because it's read-only during the trust-building phase, being wrong just means the engineer ignores the recommendation and debugs manually — zero cost. We track false positive rate (currently ~8%) and use those incidents to improve the context formatting and system prompt. The 'garbage in, garbage out' rule applies strongly — most accuracy wins come from better data formatting, not better prompts.

Can this replace on-call engineers?

No, and that's not the goal. The goal is to let on-call engineers sleep through incidents that don't need human creativity. Simple incidents — bad config, scaling issue, known pattern — the agent handles. Novel incidents, multi-system failures, anything requiring judgment calls — humans still drive. The agent is an amplifier, not a replacement. Most 3 AM pages should be boring rollbacks; the novel stuff should wait for a caffeinated human in the morning.