How We Designed Nova's Investigation Engine: Lessons from SRE at Scale
When something breaks in production at an odd hour, the person on call has to do three things at once: understand what is happening, decide what to do about it, and be able to explain all of it the next day. Most AI incident tools help with at most one of these. They either give you more data to read, or they take action you cannot see and cannot account for afterward.
We spent the last several months building Nova's investigation engine around that gap. This post is about how we designed it, the models we borrowed from, and the trade-offs we made along the way.
Two ways AI incident tools fall short
Most of them sit in one of two places. Some connect to your monitoring stack, pull the relevant data, and present it more cleanly than the raw dashboards do. That is genuinely useful, but you are still the one who has to work out what went wrong. Others detect an anomaly, run an automation, and report that the problem is handled. Sometimes it is. The trouble is that you cannot see what was checked, what was ruled out, or why that particular fix was chosen, which is exactly what you need in order to write an honest postmortem.
Neither is enough for production work. During an incident you need to understand what is happening, stay in control of what happens next, and be able to explain the whole sequence later. The system has to work with the engineer, not in place of them.
An operator loop, not a chat session
We built the engine around a loop that has little to do with AI and a lot to do with how reliable operations already work. Three models shaped it:
- Google SRE's Incident Commander model, where one coordinator makes decisions and specialists carry them out. The commander does not run commands directly.
- Kubernetes operator reconciliation, a control loop that observes the current state, compares it to the desired state, and acts. It is idempotent, resumable, and safe to interrupt.
- Medical differential diagnosis, where you form hypotheses, test the most likely ones first, prefer the least invasive checks, and get consent before any procedure.
The loop looks like this:
┌──────────────────────────────────────────────────┐
│                                                  │
│  INVESTIGATE ──► PLAN ──► APPROVE ──► EXECUTE    │
│       ▲                                  │       │
│       │                                  ▼       │
│       │◄─────────── OBSERVE ◄────────────┘       │
│                                                  │
└──────────────────────────────────────────────────┘
- Investigate gathers evidence: pod status, recent deploys, logs, metrics. Several checks run in parallel, and the engine decides what to look at next based on what it has already found, rather than following a fixed script.
- Plan turns that evidence into ranked hypotheses, each with a confidence level and the evidence that supports or contradicts it, and it stops once the evidence points clearly to a cause.
- Approve is where the engineer signs off on the remediation scope.
- Execute applies the change, and every action is recorded.
- Observe verifies the result and loops back if the fix did not hold.
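As a rough illustration of the control flow only, the five stages can be sketched as a small state machine. Everything here is hypothetical and is not Nova's actual code: `Stage`, `LoopResult`, and `run_loop` are names invented for this sketch, and the `approve` and `fix_holds` callbacks stand in for the engineer's decision and the health verification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Stage(Enum):
    INVESTIGATE = "investigate"
    PLAN = "plan"
    APPROVE = "approve"
    EXECUTE = "execute"
    OBSERVE = "observe"
    DONE = "done"


@dataclass
class LoopResult:
    resolved: bool
    passes: int
    trace: List[Stage] = field(default_factory=list)


def run_loop(approve: Callable[[int], bool],
             fix_holds: Callable[[int], bool],
             max_passes: int = 3) -> LoopResult:
    """Walk the stages in order; go back to INVESTIGATE if the fix did not hold."""
    trace: List[Stage] = []
    for attempt in range(1, max_passes + 1):
        trace += [Stage.INVESTIGATE, Stage.PLAN, Stage.APPROVE]
        if not approve(attempt):
            # Without approval nothing is executed.
            return LoopResult(False, attempt, trace)
        trace += [Stage.EXECUTE, Stage.OBSERVE]
        if fix_holds(attempt):
            trace.append(Stage.DONE)
            return LoopResult(True, attempt, trace)
        # OBSERVE saw the fix fail: the next iteration re-investigates.
    return LoopResult(False, max_passes, trace)
```

Calling `run_loop(lambda n: True, lambda n: n == 2)` models an incident where the first fix does not hold and the second pass resolves it. The `max_passes` bound is an assumption added so the sketch always terminates.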
None of this is a new architecture. It is close to how an experienced SRE already works. What we built is a system that does the tedious parts (gathering evidence, correlating it, and tracking which hypotheses still stand) while leaving the decisions that carry risk to a person.
Show the plan before touching anything
Before Nova makes any change to your systems, it shows you the plan. This is the single decision that shaped the rest of the design.
Most agent frameworks default to think-then-act: the model decides and does, and you see the result afterward. That is fine for answering a question. It is the wrong default for a system that can run commands against production. So the plan is explicit about what will run and in what order, which tools each step needs, which integrations are missing and what that costs you, which steps are read-only and which require approval, and how confident the engine is in each hypothesis.
The plan is not just a status update. It is the engineer's control surface. You can see what Nova intends to do, understand the trade-offs when some tools are unavailable, and decide whether to proceed.
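To make the contents of such a plan concrete, here is a minimal sketch of a plan as a data structure. It is an assumption about shape, not Nova's schema: `PlanStep`, `Hypothesis`, `Plan`, and the tool names used in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class PlanStep:
    order: int
    description: str
    tool: str        # integration the step depends on (hypothetical names)
    read_only: bool  # read-only steps need no approval


@dataclass(frozen=True)
class Hypothesis:
    cause: str
    confidence: float  # 0.0 to 1.0


@dataclass(frozen=True)
class Plan:
    steps: List[PlanStep]
    hypotheses: List[Hypothesis]

    def needs_approval(self) -> List[PlanStep]:
        """Steps that change something and therefore need sign-off."""
        return [s for s in self.steps if not s.read_only]

    def missing_tools(self, connected: Set[str]) -> Set[str]:
        """Integrations the plan wants but the team has not connected."""
        return {s.tool for s in self.steps if s.tool not in connected}

    def render(self, connected: Set[str]) -> str:
        """Human-readable plan: ordered steps first, then ranked hypotheses."""
        lines = []
        for s in sorted(self.steps, key=lambda s: s.order):
            mode = "read-only" if s.read_only else "NEEDS APPROVAL"
            gap = "" if s.tool in connected else " (skipped: tool not connected)"
            lines.append(f"{s.order}. [{mode}] {s.description} via {s.tool}{gap}")
        for h in sorted(self.hypotheses, key=lambda h: -h.confidence):
            lines.append(f"hypothesis: {h.cause} ({h.confidence:.0%})")
        return "\n".join(lines)
```

Rendering a plan against the set of connected tools, for example `plan.render({"kubernetes"})`, puts the gaps on the same screen as the steps, so the engineer sees what will be skipped before deciding whether to proceed.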
The approval model is the hard part
The difficult problem in AI-assisted operations is not the reasoning. It is the approval model, and it fails in both directions.
Approve every command and the engineer becomes a button-clicker, reading prompts instead of fixing the problem. That adds friction without adding safety. Hand the system full autonomy and a single misread signal can lead it to make a large, expensive change while no one is watching, which is its own kind of incident.
We chose a middle path: one approval for the remediation scope. The engineer approves the change and its boundaries, not each individual command. For example: scale the memory limit on this deployment, up to a set ceiling, and roll back automatically if health checks fail. That one approval covers the whole remediation. The engine asks again only when risk rises above what was approved: if the blast radius grows, if an action is more destructive than planned, or if something unexpected appears during execution. This is how incident response already works. The Incident Commander approves the strategy and lets the operations lead carry it out, stepping in when circumstances change.
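One way to picture scope-level approval is a check that compares each proposed action against the approved boundaries and returns the reasons, if any, to go back to the engineer. This is a sketch under assumed names (`ApprovedScope`, `Action`, `needs_reapproval`) and assumed fields. It covers only the blast-radius and destructiveness triggers, not the "something unexpected" case.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ApprovedScope:
    deployment: str
    max_memory_mib: int       # the approved ceiling
    max_pods_affected: int    # the approved blast radius
    allow_destructive: bool = False


@dataclass(frozen=True)
class Action:
    deployment: str
    memory_mib: int
    pods_affected: int
    destructive: bool = False


def needs_reapproval(scope: ApprovedScope, action: Action) -> List[str]:
    """Return why the action exceeds the approved scope; empty means proceed."""
    reasons: List[str] = []
    if action.deployment != scope.deployment:
        reasons.append("targets a deployment outside the approved scope")
    if action.memory_mib > scope.max_memory_mib:
        reasons.append("memory limit exceeds the approved ceiling")
    if action.pods_affected > scope.max_pods_affected:
        reasons.append("blast radius is larger than approved")
    if action.destructive and not scope.allow_destructive:
        reasons.append("action is more destructive than planned")
    return reasons
```

An action inside the boundaries returns an empty list and runs under the original approval. Anything else comes back with explicit reasons, which is what the engineer would be shown when asked again.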
An audit trail by construction
Every step in an investigation is an immutable, typed event with a timestamp, an actor, and an organization context. The investigation's history is not a state object that gets overwritten. It is an append-only sequence that can be replayed from any point. Three things follow from that.
The event log is the postmortem timeline. What was checked, what was found, which hypotheses were formed and revised, what was approved, what was executed, and how it turned out. Each fact traces back to a specific event, so the timeline does not have to be reassembled from memory and Slack three days later.
A compliance export becomes a query over that log rather than a separate system bolted on afterward. If you need to show that every production change was approved by a person and every action recorded, the record is already there.
Shift handoffs carry the full investigation, not a summary. The next engineer picks up every piece of evidence, every hypothesis, and every decision, and the investigation survives tab refreshes, disconnections, and server restarts.
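The append-only idea can be sketched in a few lines. This is an in-memory illustration under assumed names (`Event`, `EventLog`, `unapproved_executions`, the `"approved"` and `"executed"` event kinds, and the `human:` actor prefix are all hypothetical). A real system would persist the log, which is what lets an investigation survive restarts.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Any, List, Mapping, Tuple


@dataclass(frozen=True)
class Event:
    seq: int
    kind: str                    # e.g. "evidence", "approved", "executed"
    actor: str                   # who or what produced the event
    org: str                     # organization context
    at: datetime
    payload: Mapping[str, Any]   # read-only view of the event data


class EventLog:
    """Append-only: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, kind: str, actor: str, org: str,
               payload: Mapping[str, Any]) -> Event:
        event = Event(len(self._events), kind, actor, org,
                      datetime.now(timezone.utc),
                      MappingProxyType(dict(payload)))
        self._events.append(event)
        return event

    def replay(self, from_seq: int = 0) -> Tuple[Event, ...]:
        """Return events in order, starting at any point in the history."""
        return tuple(self._events[from_seq:])


def unapproved_executions(log: EventLog) -> List[Event]:
    """Compliance query: executions with no earlier approval by a person."""
    approved_plans = set()
    violations: List[Event] = []
    for e in log.replay():
        if e.kind == "approved" and e.actor.startswith("human:"):
            approved_plans.add(e.payload["plan_id"])
        elif e.kind == "executed" and e.payload["plan_id"] not in approved_plans:
            violations.append(e)
    return violations
```

The compliance check is an ordinary function over `replay()`, which is the sense in which an export becomes a query rather than a separate system.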
A second opinion on the calls that matter
Language models are capable and also unreliable in ways that are hard to predict. A model can state with high confidence that an out-of-memory kill came from a memory leak when the real cause was a resource limit that was never raised after a deploy. The reasoning reads well and the confidence is high, and the conclusion is still wrong.
For the hypotheses that would drive a remediation, Nova does not rely on a single model. It checks the conclusion against a second, independent model on the same evidence. When they agree, confidence rises. When they disagree, confidence falls, the result does not advance to a decision, and the engine gathers more evidence first. This is not about picking the best model. Different models fail in different places, and a second read lowers the odds that one model's blind spot drives a change to production.
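The agree-or-disagree rule can be sketched as a small function. The names (`Verdict`, `CrossCheck`, `cross_check`) and the `boost` and `penalty` values are assumptions chosen for illustration. The post does not say how much confidence moves or how two conclusions are compared, so this sketch uses an exact match on the cause string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    cause: str
    confidence: float  # 0.0 to 1.0, as reported by one model


@dataclass(frozen=True)
class CrossCheck:
    cause: str
    confidence: float
    advance: bool  # False means: gather more evidence before deciding


def cross_check(primary: Verdict, second: Verdict,
                boost: float = 0.1, penalty: float = 0.3) -> CrossCheck:
    """Compare two independent reads of the same evidence."""
    if primary.cause == second.cause:
        # Agreement: confidence rises and the hypothesis may advance.
        return CrossCheck(primary.cause,
                          min(1.0, primary.confidence + boost), True)
    # Disagreement: confidence falls and the result does not advance.
    return CrossCheck(primary.cause,
                      max(0.0, primary.confidence - penalty), False)
```

With the out-of-memory example above, a primary verdict of "memory leak" and a second verdict of "memory limit never raised" would come back with `advance=False`, sending the engine to collect more evidence instead of acting on the first answer.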
Working with what is actually connected
Real investigations do not happen in ideal conditions. A team may not have connected its metrics provider yet. A token may have expired last week. The cluster API may be slow because of the very incident under investigation.
Nova works with whatever is available and is explicit about what is not. If a source is missing, the investigation continues, shows which checks were skipped and why, and reflects the gap in its confidence rather than hiding it. A capped confidence with a note that a metrics provider is unavailable is more useful than a clean number that quietly ignored half the picture. It turns a blind spot into something the engineer can act on.
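A capped confidence can be sketched as follows. The function name `assess`, the status strings, and the fixed `cap_per_gap` value are assumptions made for this example. The post says only that gaps are reflected in confidence, not how the cap is computed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Assessment:
    confidence: float
    skipped: List[str]  # which sources were unavailable, and why


def assess(raw_confidence: float, sources: Dict[str, str],
           cap_per_gap: float = 0.15) -> Assessment:
    """Cap confidence for every source that could not be consulted.

    `sources` maps a source name to "ok" or to the reason it was unavailable.
    """
    skipped = [f"{name}: {status}"
               for name, status in sorted(sources.items())
               if status != "ok"]
    cap = max(0.0, 1.0 - cap_per_gap * len(skipped))
    return Assessment(min(raw_confidence, cap), skipped)
```

The investigation still produces a result when sources are missing, but the number is lowered and the `skipped` list says exactly what to reconnect to raise it.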
Where this is going
We are designing the investigation engine in the open because the hard parts (approval that a person controls, a record you can audit, coordinating many checks at once, and reasoning from incomplete data) are problems the whole field shares. Because every investigation is recorded, the engine also improves over time: causes that proved right or were ruled out on past incidents shape how it approaches the next one.
If your team lives through the late-night pages and the postmortems pieced together from memory, we would like to hear from you. You can reach us at contact@astropulse.io, and you can try Nova at astropulse.io.
The investigation loop is one part of a larger platform engineering system, and it is the part where the trust model matters most.