2 posts tagged with "Incident Response"

Incident investigation, response, and postmortems

The AI SRE Race Is Running the Wrong Way

· 7 min read
Rajesh RC
Founder

AI diagnosis flowing through a governed approval gate into production infrastructure

The thesis

Diagnosis is a commodity. Trust is the product.

The AI SRE race will not be won by the agent that diagnoses fastest. It will be won by the system that operators trust enough to grant write access.

A personal note on where AI for operations is actually heading.

Over the past year a new category filled up fast. Depending on how you count, there are now more than a dozen credible tools that call themselves AI SREs. I have watched the space closely, partly because we are building in it, and partly because the speed of convergence is genuinely interesting.

Here is what nearly all of them do. They connect to your telemetry, your code, and your incident tooling. They correlate logs, metrics, and traces. When an alert fires, they form hypotheses, test them against the evidence, and post a likely root cause into Slack, often in under a minute. This is real progress. A few years ago none of it worked. Today most of it does.
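To make that loop concrete, here is a minimal sketch of the "form hypotheses, test them against the evidence" step. It is an illustration only, not how any particular tool works: the `Evidence` and `Hypothesis` shapes, the signal labels, and the overlap-based score are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """One observation pulled from telemetry (hypothetical shape)."""
    source: str  # e.g. "logs", "metrics", "traces"
    signal: str  # a normalized label such as "error_rate_spike"


@dataclass
class Hypothesis:
    """A candidate root cause and the signals expected if it were true."""
    cause: str
    expected_signals: set = field(default_factory=set)


def score(hypothesis, evidence):
    """Fraction of the hypothesis's expected signals present in the evidence."""
    if not hypothesis.expected_signals:
        return 0.0
    observed = {e.signal for e in evidence}
    matched = hypothesis.expected_signals & observed
    return len(matched) / len(hypothesis.expected_signals)


def likely_root_cause(hypotheses, evidence):
    """Return (best hypothesis, its score), or None if there are no hypotheses."""
    if not hypotheses:
        return None
    best = max(hypotheses, key=lambda h: score(h, evidence))
    return best, score(best, evidence)


# Illustrative inputs only; real evidence would come from correlated telemetry.
evidence = [
    Evidence("metrics", "error_rate_spike"),
    Evidence("logs", "oom_killed"),
]
hypotheses = [
    Hypothesis("bad deploy", {"error_rate_spike", "recent_deploy"}),
    Hypothesis("memory leak", {"error_rate_spike", "oom_killed"}),
]
best, confidence = likely_root_cause(hypotheses, evidence)
```

With these inputs, the "memory leak" hypothesis wins because both of its expected signals were observed, while "bad deploy" matches only one of two. A real system would weigh evidence far more carefully, but the shape of the loop is the same.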

Phase one is real

Diagnosis is real progress. But it is just phase one.

How We Designed Nova's Investigation Engine: Lessons from SRE at Scale

· 8 min read
Rajesh RC
Founder

When something breaks in production at an odd hour, the person on call has to do three things at once: understand what is happening, decide what to do about it, and be able to explain all of it the next day. Most AI incident tools help with at most one of these. They either give you more data to read, or they take action you cannot see and cannot account for afterward.

We spent the last several months building Nova's investigation engine around that gap. This post is about how we designed it, the models we borrowed from, and the trade-offs we made along the way.