
How We Designed Nova's Investigation Engine — Lessons from SRE at Scale

9 min read
Rajesh RC
Founder

When pods crash at odd hours, you need an AI that investigates like an SRE, not a chatbot. You need something that checks the right things, in the right order, tells you what it found, and waits for your call before touching anything. Current tools either dump a wall of logs on you and say "good luck," or run opaque automations you can't see, can't trust, and can't explain in a postmortem.

We spent the last several months designing and building Nova's investigation engine. This post is about the approach we took, the mental models that shaped it, and the trade-offs we made along the way.

The problem with AI-assisted incident response

Most AI incident tools fall into two camps.

Camp 1: The log dumper. It connects to your monitoring stack, pulls a bunch of data, and presents it in a slightly nicer format. You still have to figure out what's wrong. It's a search engine with a chatbot UI.

Camp 2: The black box. It detects an anomaly, runs some automation, and tells you "fixed it." Maybe it did. Maybe it made it worse. You have no idea what it checked, what it ruled out, or why it chose that particular fix. Good luck writing the postmortem.

Neither camp works for production SRE. When you're the on-call engineer during an outage, you need to understand what's happening, control what happens next, and be able to explain it all afterward. You need a system that works with you, not instead of you.

Our approach: the operator loop

We designed Nova's investigation engine around a five-phase loop, inspired by three sources that have nothing to do with AI:

  • Google SRE's Incident Commander model — one coordinator makes decisions, specialists execute. The IC never runs commands directly.
  • Kubernetes operator reconciliation — a control loop that observes state, compares it to desired state, and takes action. Idempotent, resumable, crash-safe.
  • Medical differential diagnosis — form hypotheses, test the most likely ones first, use the least invasive tests, get informed consent before any procedure.

Here's the loop:

┌──────────────────────────────────────────────────┐
│                                                  │
│  INVESTIGATE ──► PLAN ──► APPROVE ──► EXECUTE    │
│       ▲                                  │       │
│       │                                  ▼       │
│       │◄──────────── OBSERVE ◄───────────┘       │
│                                                  │
└──────────────────────────────────────────────────┘

Investigate — gather evidence. Check pod status, pull logs, look at recent deployments, query metrics. Multiple checks run in parallel, like a team of specialists each looking at their area.

Plan — form hypotheses, rank them by confidence, and show the full plan to the SRE before anything else happens. What will be checked, what tools are needed, what's missing, what the trade-offs are.

Approve — the SRE reviews the plan and approves the remediation scope. One decision, not twenty.

Execute — apply the fix. Scale memory, roll back a deployment, adjust a configuration. Every action is logged.

Observe — verify the fix worked. Watch the rollout, check health metrics, confirm the issue is resolved. If things don't look right, loop back.

This isn't a novel architecture. It's how experienced SREs already work. We just encoded it into a system that can do the tedious parts (gathering evidence, correlating data, tracking hypotheses) while keeping the human in control of the decisions that matter.
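
To make the loop concrete, here is a minimal Python sketch of the five phases as a state machine. It is an illustration, not Nova's actual code: the `Phase` enum, the `advance` function, and the extra `DONE` terminal state are hypothetical names introduced for this example.

```python
from enum import Enum


class Phase(Enum):
    INVESTIGATE = "investigate"
    PLAN = "plan"
    APPROVE = "approve"
    EXECUTE = "execute"
    OBSERVE = "observe"
    DONE = "done"  # hypothetical terminal state, not one of the five phases


# Forward edges of the loop, matching the diagram above.
NEXT_PHASE = {
    Phase.INVESTIGATE: Phase.PLAN,
    Phase.PLAN: Phase.APPROVE,
    Phase.APPROVE: Phase.EXECUTE,
    Phase.EXECUTE: Phase.OBSERVE,
}


def advance(phase: Phase, fix_verified: bool = False) -> Phase:
    """Return the phase that follows `phase`.

    OBSERVE is the only branch point: a verified fix ends the loop,
    anything else sends the investigation back to INVESTIGATE.
    """
    if phase is Phase.DONE:
        raise ValueError("investigation is already finished")
    if phase is Phase.OBSERVE:
        return Phase.DONE if fix_verified else Phase.INVESTIGATE
    return NEXT_PHASE[phase]
```

Because the transition is a pure function of the current phase and the observation result, the current phase can be persisted and resumed after a crash, which is the property borrowed from operator reconciliation.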

Plan-first, not execute-first

Before Nova touches any tool, it shows you the plan.

This is the single most important design decision we made. Most AI agent frameworks default to "think, then act" — the model decides what to do and does it, and you see the result after the fact. That works fine for a chatbot answering questions. It's terrifying for a system that can run commands against your production infrastructure.

Nova's plan shows everything:

  • What steps will run and in what order
  • Which tools and integrations are needed for each step
  • Which integrations are missing and what the impact is
  • Which steps are read-only and which require approval
  • The confidence level of each hypothesis

Think of it like a doctor showing you the differential diagnosis before ordering tests. "Based on the symptoms, here are the three most likely causes. Here's what I'd like to check first, and here's why. This test is non-invasive. That one requires a procedure — I'll need your consent."

The plan is not just informational. It's the SRE's control surface. You can see exactly what Nova intends to do, understand the trade-offs (especially when some tools aren't available), and make an informed decision about whether to proceed.
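
As a rough illustration of what such a plan could carry, the sketch below models it as plain data. Every name here (`Plan`, `PlanStep`, `Hypothesis`, the tool labels, the example steps) is an assumption made for the example rather than Nova's real schema.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    name: str
    tools: list[str]        # integrations this step needs
    read_only: bool = True  # False means the step requires approval


@dataclass
class Hypothesis:
    summary: str
    confidence: float  # 0.0 to 1.0


@dataclass
class Plan:
    steps: list[PlanStep]  # kept in execution order
    hypotheses: list[Hypothesis]
    connected_tools: set[str] = field(default_factory=set)

    def missing_tools(self) -> dict[str, list[str]]:
        """Map each missing integration to the steps it affects."""
        missing: dict[str, list[str]] = {}
        for step in self.steps:
            for tool in step.tools:
                if tool not in self.connected_tools:
                    missing.setdefault(tool, []).append(step.name)
        return missing

    def steps_requiring_approval(self) -> list[str]:
        return [s.name for s in self.steps if not s.read_only]

    def ranked_hypotheses(self) -> list[Hypothesis]:
        """Most likely hypothesis first."""
        return sorted(self.hypotheses, key=lambda h: h.confidence, reverse=True)


# Hypothetical plan for an OOM-kill investigation.
plan = Plan(
    steps=[
        PlanStep("check pod status", tools=["kubernetes"]),
        PlanStep("memory trend analysis", tools=["datadog"]),
        PlanStep("raise memory limit", tools=["kubernetes"], read_only=False),
    ],
    hypotheses=[
        Hypothesis("memory leak", 0.3),
        Hypothesis("memory limit too low", 0.6),
    ],
    connected_tools={"kubernetes"},
)
```

Everything the SRE needs to decide is derivable from this one object: which integrations are missing and what they affect, which steps need sign-off, and which hypothesis is being tested first.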

Human-in-the-loop done right

The hardest design problem in AI-assisted operations isn't the AI part. It's the approval model. Get it wrong in either direction and the system is useless:

  • Too granular — "Approve every command" turns the SRE into a button-clicker. You're not saving time; you're adding friction. The SRE spends more time reading approval prompts than they would just running the commands themselves.

  • Too autonomous — "Fully automated remediation" sounds great until the AI scales your database to 64 cores at 3am because it misread a latency spike. Now you have a $40,000 cloud bill and a different kind of incident.

We chose a middle path: one-shot approval for the remediation scope. The SRE approves the what and the boundaries, not every individual command. "Yes, scale the memory limit on this deployment, up to 2Gi, with automatic rollback if health checks fail." That's one approval covering the entire remediation.

The system re-gates — asks for approval again — only when risk escalates beyond what you originally approved. If the blast radius changes, if the action is more destructive than planned, or if something unexpected comes up during execution, Nova stops and asks. Otherwise, it proceeds within the scope you approved.

This maps to how incident response actually works. The Incident Commander doesn't approve every kubectl command. They approve the remediation strategy and let the operations lead execute it, stepping in only when circumstances change.
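
One way to picture scope-based approval is as a check that runs before every action. The sketch below is an assumption about how such a check might look, not Nova's implementation; `ApprovedScope`, `Action`, `needs_reapproval`, the action names, and the deployment names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovedScope:
    deployment: str                  # the only deployment that may be touched
    max_memory_mib: int              # upper bound the SRE agreed to
    allowed_actions: frozenset[str]  # action kinds covered by the approval


@dataclass(frozen=True)
class Action:
    kind: str
    deployment: str
    memory_mib: int = 0


def needs_reapproval(action: Action, scope: ApprovedScope) -> bool:
    """True when the action falls outside what the SRE approved."""
    return (
        action.deployment != scope.deployment        # blast radius changed
        or action.kind not in scope.allowed_actions  # different kind of action
        or action.memory_mib > scope.max_memory_mib  # exceeds the approved bound
    )


# "Scale the memory limit on this deployment, up to 2Gi, with rollback."
scope = ApprovedScope(
    deployment="checkout-api",
    max_memory_mib=2048,
    allowed_actions=frozenset({"set_memory_limit", "rollback"}),
)
```

Actions inside the scope proceed without another prompt. Anything that returns `True` here is the point where the system would stop and ask again.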

Event-sourced audit trail

Every action in a Nova investigation is an immutable event. Not a log line — a structured, typed event with a timestamp, an actor, and an organization context. The investigation's history isn't a mutable state object that gets updated; it's an append-only sequence of events that can be replayed from any point.

This has three consequences that matter for SRE teams:

Postmortems generate themselves. The event log is the postmortem timeline. What was checked, what was found, what hypotheses were formed and refined, what was approved, what was executed, what the outcome was. Every fact in the postmortem traces back to a specific event. No more reconstructing the timeline from memory and Slack messages three days later.

Compliance exports are a projection. If you need to prove to an auditor that every production change was approved by a human, that every action was logged, that every decision has a trace — it's a query over the event log. Not a separate compliance system bolted on after the fact.

Shift handoffs are seamless. When you hand off an investigation at shift change, the next SRE doesn't get a summary of "here's where we are." They get the full investigation — every piece of evidence, every hypothesis, every decision, every action. They can pick up exactly where you left off, with full context. The investigation survives tab refreshes, network disconnections, and even server restarts.
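
The append-only model described in this section can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: the event kinds, the payload keys, and the `EventLog` class are invented for the example and do not describe Nova's actual event schema or storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str     # e.g. "evidence_collected", "plan_approved"
    actor: str    # who or what produced the event
    org_id: str   # organization context
    payload: dict
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class EventLog:
    """Append-only: events are added, never updated or removed."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def replay(self, upto: Optional[int] = None) -> dict:
        """Rebuild state from the first `upto` events (all events if None)."""
        state = {"evidence": [], "approved": False, "actions": []}
        for event in self._events[:upto]:
            if event.kind == "evidence_collected":
                state["evidence"].append(event.payload["finding"])
            elif event.kind == "plan_approved":
                state["approved"] = True
            elif event.kind == "action_executed":
                state["actions"].append(event.payload["action"])
        return state

    def timeline(self) -> list[str]:
        """Postmortem-style view: one line per event, in order."""
        return [
            f"{e.timestamp.isoformat()} {e.actor}: {e.kind}"
            for e in self._events
        ]


log = EventLog()
log.append(Event("evidence_collected", "nova", "org-1", {"finding": "OOMKilled"}))
log.append(Event("plan_approved", "sre:alice", "org-1", {}))
log.append(Event("action_executed", "nova", "org-1", {"action": "set memory limit to 2Gi"}))
```

Current state, the postmortem timeline, and a compliance export are all just different reads over the same list. A handoff or a restart amounts to calling `replay()` again.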

Multi-model consensus

For critical diagnoses, Nova doesn't trust a single model.

Large language models are powerful but unreliable in specific, hard-to-predict ways. A model might confidently assert that an OOM kill was caused by a memory leak when the actual cause was a resource limit that was never updated after a recent deploy. The model's reasoning sounds right. The confidence is high. But the conclusion is wrong.

For high-confidence hypotheses — the ones that would drive remediation decisions — Nova validates across multiple LLM providers. Think of it as a built-in second opinion. If Claude says "memory limit is too low" and GPT-4 agrees based on the same evidence, confidence goes up. If they disagree, confidence goes down, and Nova gathers more evidence before recommending action.

This isn't about using the "best" model. It's about recognizing that different models have different failure modes, and cross-checking reduces the chance that a single model's blind spot drives a wrong remediation. When you're about to change production infrastructure at 2am, a second opinion is worth the extra few seconds.
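
Here is a small sketch of that cross-check. It assumes each provider's answer has already been reduced to a short diagnosis label, and that agreement or disagreement moves confidence by a fixed step. The function name, the step size of 0.1, and the labels are illustrative assumptions, not how Nova actually scores consensus.

```python
from collections import Counter


def cross_check(
    base_confidence: float,
    verdicts: dict[str, str],
    step: float = 0.1,
) -> tuple[str, float, bool]:
    """Combine per-provider diagnoses into one adjusted confidence.

    `verdicts` maps a provider name to its diagnosis label.
    Returns (leading diagnosis, adjusted confidence, needs_more_evidence).
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")

    counts = Counter(verdicts.values())
    diagnosis, votes = counts.most_common(1)[0]
    unanimous = votes == len(verdicts)

    # Agreement raises confidence, disagreement lowers it.
    adjusted = base_confidence + step if unanimous else base_confidence - step
    adjusted = round(min(1.0, max(0.0, adjusted)), 2)

    # Disagreement is the signal to gather more evidence before acting.
    return diagnosis, adjusted, not unanimous


agree = cross_check(
    0.7,
    {"claude": "memory limit too low", "gpt-4": "memory limit too low"},
)
disagree = cross_check(
    0.7,
    {"claude": "memory limit too low", "gpt-4": "memory leak"},
)
```

A real system would need something richer than exact label matching to decide whether two models agree, but the shape of the decision is the same: agreement raises confidence, and disagreement lowers it and sends the investigation back for more evidence.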

Graceful degradation with transparency

Real investigations don't happen in ideal conditions. Maybe your team hasn't connected Datadog yet. Maybe the GitHub integration token expired last week. Maybe the Kubernetes API is slow because the cluster is under load from the very incident you're investigating.

Nova is designed to work with whatever is available and be transparent about what's missing.

If a monitoring tool isn't connected, Nova doesn't fail — it continues the investigation, shows you exactly which steps were skipped and why, and tells you how that affects its confidence. "Investigation confidence: 0.75 — capped because Datadog metrics are unavailable. Connect Datadog for memory trend analysis."

This matters because it turns tool gaps from invisible blind spots into explicit, actionable information. Instead of wondering "did the AI check the metrics?" you know it didn't, you know why, and you know what to do about it. That's the difference between trusting a system and hoping it works.
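
A confidence cap like the one in the Datadog example could be computed roughly as follows. This is a sketch under stated assumptions: the per-check cap values, the `Check` type, and the `assess` function are hypothetical and chosen only to reproduce the 0.75 figure from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    tool: str
    cap_if_skipped: float  # hypothetical ceiling on confidence when skipped


def assess(
    checks: list[Check],
    connected_tools: set[str],
    raw_confidence: float,
) -> tuple[float, list[str]]:
    """Skip checks whose tool is missing, cap confidence, and say why."""
    confidence = raw_confidence
    notes: list[str] = []
    for check in checks:
        if check.tool in connected_tools:
            continue  # check can run normally
        confidence = min(confidence, check.cap_if_skipped)
        notes.append(f"Skipped '{check.name}': {check.tool} is not connected")
    return confidence, notes


checks = [
    Check("pod status", "kubernetes", cap_if_skipped=0.5),
    Check("memory trend analysis", "datadog", cap_if_skipped=0.75),
]
confidence, notes = assess(checks, {"kubernetes"}, raw_confidence=0.9)
```

The important part is the second return value: every skipped step leaves a note, so the gap shows up in the plan and the audit trail instead of silently lowering a number.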

What's next

We're building this in the open because we think the problems — human-in-the-loop approval, event-sourced audit trails, multi-agent coordination, graceful degradation — are industry-wide problems that benefit from transparency about approaches and trade-offs.

If you're an SRE team dealing with investigation fatigue — the 2am pages, the repetitive diagnostics, the postmortems assembled from memory — we'd love to talk. We're at contact@astropulse.io, and Nova is available to try here.

The investigation engine is one piece of a larger platform engineering system. Nova also handles day-to-day infrastructure questions, deployment planning, and operational knowledge — but the investigation loop is where the trust model matters most, and where we think the industry has the most room to improve.