4 posts tagged with "Nova Indexed"

Content indexed by Nova for the search_docs tool. Add this tag to technical blog posts.

The AI SRE Race Is Running the Wrong Way

· 7 min read
Rajesh RC
Founder

Figure: AI diagnosis flowing through a governed approval gate into production infrastructure

The thesis

Diagnosis is a commodity. Trust is the product.

The AI SRE race will not be won by the agent that diagnoses fastest. It will be won by the system that operators trust enough to grant write access.

A personal note on where AI for operations is actually heading.

Over the past year a new category filled up fast. Depending on how you count, there are now more than a dozen credible tools that call themselves AI SREs. I have watched the space closely, partly because we are building in it, and partly because the speed of convergence is genuinely interesting.

Here is what nearly all of them do. They connect to your telemetry, your code, and your incident tooling. They correlate logs, metrics, and traces. When an alert fires, they form hypotheses, test them against the evidence, and post a likely root cause into Slack, often in under a minute. This is real progress. A few years ago none of it worked. Today most of it does.
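As a rough illustration of the "form hypotheses, test them against the evidence" loop described above, here is a minimal Python sketch. It is not how any particular AI SRE product works; the `Hypothesis` type, the signal names, and the overlap-based scoring rule are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    # Hypothetical structure: a candidate root cause and the signals
    # we would expect to observe if that cause were true.
    name: str
    expected_signals: frozenset


def rank_hypotheses(hypotheses, observed_signals):
    """Score each hypothesis by the fraction of its expected signals
    that actually appear in the observed evidence, highest first."""
    observed = set(observed_signals)
    scored = []
    for h in hypotheses:
        if not h.expected_signals:
            continue  # nothing to test against, so skip it
        matched = len(h.expected_signals & observed)
        scored.append((matched / len(h.expected_signals), h.name))
    # Sort by descending score, then by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# Illustrative signals only; real telemetry would be far noisier.
hypotheses = [
    Hypothesis("bad_deploy", frozenset({"error_rate_spike", "recent_deploy"})),
    Hypothesis(
        "db_saturation",
        frozenset({"error_rate_spike", "db_latency_high", "connection_pool_exhausted"}),
    ),
]
observed = ["error_rate_spike", "recent_deploy"]
ranking = rank_hypotheses(hypotheses, observed)
```

In this toy case `bad_deploy` ranks first because every signal it predicts is present, while `db_saturation` matches only one of its three. A real system would weigh evidence far more carefully, but the shape of the loop is the same.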

Phase one is real

Diagnosis is real progress. But it is just phase one.
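The thesis above, that write access has to be earned, can be made concrete with a small sketch of an approval gate. This is a hedged illustration only: the `Action` and `ApprovalGate` names, the `approved_by` parameter, and the audit-log format are hypothetical and do not describe Nova's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    writes: bool  # True if the action would change production state


@dataclass
class ApprovalGate:
    # Hypothetical audit trail: (outcome, description, approver) tuples.
    audit_log: list = field(default_factory=list)

    def submit(self, action, approved_by=None):
        """Allow read-only actions; allow writes only with a named approver.
        Every decision is recorded so it can be explained afterward."""
        if action.writes and approved_by is None:
            self.audit_log.append(("blocked", action.description, None))
            return False
        self.audit_log.append(("executed", action.description, approved_by))
        return True


gate = ApprovalGate()
read_ok = gate.submit(Action("fetch pod logs", writes=False))
write_blocked = gate.submit(Action("restart deployment", writes=True))
write_ok = gate.submit(Action("restart deployment", writes=True), approved_by="oncall")
```

Diagnosis passes through freely, but anything that mutates production is held until a human signs off, and both outcomes leave a record.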

How We Designed Nova's Investigation Engine: Lessons from SRE at Scale

· 8 min read
Rajesh RC
Founder

When something breaks in production at an odd hour, the person on call has to do three things at once: understand what is happening, decide what to do about it, and be able to explain all of it the next day. Most AI incident tools help with at most one of these. They either give you more data to read, or they take action you cannot see and cannot account for afterward.

We spent the last several months building Nova's investigation engine around that gap. This post is about how we designed it, the models we borrowed from, and the trade-offs we made along the way.

Bring Your Own Kubernetes Cluster

· 5 min read
Rajesh RC
Founder

Six months in as platform lead and you have a spreadsheet you haven't shown your manager. Eleven Kubernetes clusters. EKS for production. GKE for the ML team. Two on-prem clusters behind the firewall that predate your tenure. A handful of Kind clusters developers spun up locally. Each one has its own deployment pipeline, its own credential rotation process, its own way of answering "is this service healthy?"

Your team isn't building features anymore. You're maintaining eleven slightly different versions of the same tooling.

astroctl infra k8s register --name my-cluster

One AI, Every Interface

· 6 min read
Rajesh RC
Founder

Platform teams carry operational knowledge that does not transfer easily. The debugging instincts, the service interdependencies, the deployment quirks: they accumulate over years and live in a few people's heads. When those people are unavailable, the gap shows.

We built Nova to put that knowledge into a system you can query. This post covers what the architecture looks like and what we learned building AI that actually operates infrastructure.