One AI, Every Interface

· 6 min read
Rajesh RC
Founder

Platform teams carry operational knowledge that does not transfer easily. The debugging instincts, the service interdependencies, the deployment quirks: they accumulate over years and live in a few people's heads. When those people are unavailable, the gap shows.

We built Nova to put that knowledge into a system you can query. This post covers what the architecture looks like and what we learned building AI that actually operates infrastructure.

How investigation works

Most AI tools hand you a checklist. Nova runs an investigation. Here is what that looks like in the terminal:

$ astroctl nova
> My payments service is returning 5xx errors

Nova is investigating...

┌─ Hypothesis Formed
│ OOMKilled pods - memory limits at 256Mi while avg usage is 380Mi

├─ Evidence Collected
│ 3/5 pods restarted in last 10 min. Memory: avg 380Mi, peak 512Mi
│ Last deploy: 2h ago (payments-service:v2.14.3 → v2.15.0)
│ v2.15.0 added in-memory session cache - no corresponding limit bump

├─ Hypothesis Revised
│ Initial theory confirmed. The deploy introduced memory regression,
│ not a traffic spike. Limits need updating.

└─ Remediation Proposed (Risk: LOW)
Increase memory limits to 512Mi - matches observed peak usage
kubectl patch deploy payments-service -n prod \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"payments",
"resources":{"limits":{"memory":"512Mi"}}}]}}}}'

✓ Patch applied - rolling restart in progress

The pipeline follows the structure an experienced SRE would: classify, gather evidence, form hypotheses, test them, and propose a fix. The part that matters most is backtracking. When evidence contradicts a hypothesis, Nova drops it and moves on. We made the confidence scoring asymmetric: confirming evidence nudges the score up, and contradicting evidence pulls it down hard. Confirmation bias is the most dangerous failure mode in debugging, and we would rather Nova abandon a promising theory early than chase it through three more tool calls.
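To make the asymmetry concrete, here is a minimal sketch of one way such scoring could work. The step sizes, the abandonment threshold, and the `Hypothesis` class are all assumptions for illustration; the post does not publish Nova's actual values or data model.

```python
from dataclasses import dataclass

# Hypothetical step sizes in percentage points; not Nova's real values.
CONFIRM_STEP = 10      # confirming evidence nudges the score up
CONTRADICT_STEP = 40   # contradicting evidence pulls it down hard
ABANDON_BELOW = 20     # below this, the hypothesis is dropped


@dataclass
class Hypothesis:
    summary: str
    confidence: int = 50  # start undecided

    def update(self, supports: bool) -> None:
        """Apply one piece of evidence with an asymmetric step size."""
        if supports:
            self.confidence = min(100, self.confidence + CONFIRM_STEP)
        else:
            self.confidence = max(0, self.confidence - CONTRADICT_STEP)

    @property
    def abandoned(self) -> bool:
        return self.confidence < ABANDON_BELOW


h = Hypothesis("traffic spike is exhausting memory")
h.update(supports=True)    # 50 -> 60
h.update(supports=False)   # 60 -> 20
h.update(supports=False)   # 20 -> 0
print(h.confidence, h.abandoned)  # prints "0 True"
```

With these numbers, one contradiction costs as much as four confirmations earn, so a theory that keeps running into contrary evidence is dropped within a couple of tool calls.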

Nova starts from what it knows about common failure modes, such as OOM kills, cert expiry, DNS, deployment regressions, and queue backlogs, as priors rather than a fixed script. They seed the first hypotheses, and the evidence decides where the investigation goes from there. That is the kind of institutional knowledge that usually walks out the door when an experienced engineer leaves.

Skills

We hardcoded integrations at first: a Kubernetes module, a Slack module, a GitHub module. The first five were fine. By the fifteenth, we were spending more time maintaining connectors than building the AI.

So we rebuilt around a skill abstraction. Each skill is self-contained: it declares what it can do, what permissions it needs, and the risk level of each action. Nova discovers the right skills at runtime. You say "post an RCA to the incidents channel" and it resolves without routing logic.
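A self-describing skill could look roughly like the sketch below. Every name here (`Risk`, `Action`, `Skill`, `Registry`, the `post_message` action, and the permission string) is hypothetical and chosen for illustration; Nova's real skill schema is not shown in this post.

```python
from dataclasses import dataclass, field
from enum import Enum


class Risk(Enum):
    READ = "read"
    WRITE = "write"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class Action:
    name: str
    risk: Risk
    permissions: tuple[str, ...] = ()


@dataclass
class Skill:
    name: str
    actions: dict[str, Action] = field(default_factory=dict)

    def declare(self, action: Action) -> None:
        self.actions[action.name] = action


class Registry:
    """Looks skills up by declared action, so callers need no routing table."""

    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def find(self, action_name: str) -> tuple[Skill, Action]:
        for skill in self._skills.values():
            if action_name in skill.actions:
                return skill, skill.actions[action_name]
        raise LookupError(f"no skill declares {action_name!r}")


slack = Skill("slack")
slack.declare(Action("post_message", Risk.WRITE, permissions=("chat:write",)))

registry = Registry()
registry.register(slack)
skill, action = registry.find("post_message")
print(skill.name, action.risk.value)  # prints "slack write"
```

Because each skill carries its own metadata, adding a new one means registering it, not editing the engine.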

Skills compose. "Check the failing pods, estimate the cost impact, and post a summary to Slack" is three skills chained in one request, and Nova works out the execution order from context. New skills ship without touching the core engine, and you can connect custom tooling through MCP.
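One plausible way to derive an execution order, assuming each step can name the steps whose output it needs, is a topological sort. The step names below are hypothetical, and how Nova infers the dependencies from the request is not covered here.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs (hypothetical names).
plan = {
    "check_failing_pods": set(),
    "estimate_cost_impact": {"check_failing_pods"},
    "post_slack_summary": {"check_failing_pods", "estimate_cost_impact"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(plan).static_order())
print(order)
# prints "['check_failing_pods', 'estimate_cost_impact', 'post_slack_summary']"
```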

The authorization model is deterministic. Reads execute immediately. Writes that are not destructive stream a confirmation. Anything dangerous, such as deployments, deletions, and config changes, blocks until you approve it. This is not in the prompt. It is hard-coded middleware the model cannot bypass.
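The three tiers can be sketched as a plain function that sits between the model and the tool. This is an illustration under assumptions, not Nova's middleware: the `authorize` function, the `ApprovalRequired` exception, and the reading of "stream a confirmation" as "notify, then run" are all ours.

```python
from enum import Enum
from typing import Callable


class Risk(Enum):
    READ = "read"
    WRITE = "write"
    DANGEROUS = "dangerous"


class ApprovalRequired(Exception):
    """Raised when a dangerous action has not been approved by a human."""


def authorize(risk: Risk, run: Callable[[], str], *, approved: bool = False,
              notify: Callable[[str], None] = print) -> str:
    """Gate a tool call by its declared risk; the model never calls run() directly."""
    if risk is Risk.READ:
        return run()                      # reads execute immediately
    if risk is Risk.WRITE:
        notify("confirmation: executing non-destructive write")
        return run()                      # writes run, with a visible confirmation
    if not approved:
        raise ApprovalRequired("dangerous action blocked until approved")
    return run()                          # dangerous actions need explicit approval


print(authorize(Risk.READ, lambda: "pods listed"))  # prints "pods listed"
```

The decision depends only on the declared risk and the approval flag, never on model output, which is what makes the behavior deterministic.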

One constraint we committed to early: the model never sees credentials. Tokens are injected at execution time, outside the context window. If a credential enters the context, it can leak through outputs, logs, or prompt injection, so we made that structurally impossible.
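A minimal sketch of execution-time injection, assuming the model emits only a credential reference and an executor resolves it. `ToolCall`, `execute`, the `slack/bot` reference, and the token value are hypothetical stand-ins; a real system would use a secret manager and a real transport.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    """What the model emits: tool name, arguments, and a credential reference."""
    tool: str
    args: dict
    credential_ref: str  # a name such as "slack/bot", never the token itself


def execute(call: ToolCall, secrets: dict[str, str],
            send: Callable[[str, dict, dict], None]) -> dict:
    """Resolve the token outside the model's context and attach it to the request."""
    token = secrets[call.credential_ref]
    send(call.tool, call.args, {"Authorization": f"Bearer {token}"})
    # Only non-secret fields go back into the conversation.
    return {"tool": call.tool, "status": "ok"}


sent = []  # stand-in transport that records what would go over the wire
call = ToolCall("slack.post_message", {"channel": "#incidents"}, "slack/bot")
result = execute(call, {"slack/bot": "example-token"},
                 lambda tool, args, headers: sent.append(headers))
print(result)  # prints "{'tool': 'slack.post_message', 'status': 'ok'}"
```

The token appears in the outgoing request headers but in neither the tool call the model wrote nor the result it reads back.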

The interface problem

We built Nova Cloud first and assumed the CLI would be an afterthought. Then we watched how incidents actually play out.

An alert fires in Slack. Someone opens a terminal. They check a dashboard. They go back to Slack to update the team. Four steps across three tools, no shared state. The multi-interface story is not a feature on top of Nova. It is the point of it.

Nova runs on one backend. Conversations, connected skills, and investigation state persist across browser and terminal, and the approval and notification flows attach to that same state. Start in the browser, pick it up in the terminal, and route results into the channels your team already watches.

The CLI matters most for session continuity. You are deep in a debug session, get pulled into a meeting, and come back an hour later. astroctl nova --continue picks up where you left off, with the full investigation state, the evidence collected, and the hypotheses tested. The terminal renders investigation blocks natively, supports slash commands with tab completion, and handles structured paste for YAML. It is an operator tool, not a stripped-down web app.

Slack-connected workflows close the loop: approvals, notifications, and team visibility land in the channels people already use.

What still isn't great

When an incident matches a familiar failure mode, Nova has strong priors and moves quickly. For genuinely novel issues it has less to draw on and reasons more from first principles, which is slower and less reliable. We are getting better at it, and it is still the weakest link.

Multi-step plans sometimes over-commit. Nova will propose a five-step remediation when one command would do. The plan critic catches the worst of them, but plans that are technically correct and unnecessarily complex still get through more often than we would like.

Context compaction, summarizing a long conversation to fit the context window, sometimes drops details that matter. We guard against the worst case by failing explicitly rather than truncating silently, but the trade-off between context length and information density is unsolved. It is an industry-wide problem, and it affects investigation quality on long sessions.
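Failing explicitly rather than truncating silently can be sketched as a guard around the summarizer. The `compact` function, the notion of "pinned" facts, and the character budget (a stand-in for a token budget) are assumptions for illustration, not Nova's compaction logic.

```python
from typing import Callable


class CompactionError(Exception):
    """Raised when compaction would lose pinned facts or exceed the budget."""


def compact(messages: list[str], pinned: list[str], budget: int,
            summarize: Callable[[list[str]], str]) -> str:
    """Summarize the conversation, refusing to return a lossy or oversized result."""
    summary = summarize(messages)
    missing = [fact for fact in pinned if fact not in summary]
    if missing:
        raise CompactionError(f"summary dropped pinned facts: {missing}")
    if len(summary) > budget:
        # No silent truncation: the caller decides what to do next.
        raise CompactionError(f"summary is {len(summary)} chars, budget is {budget}")
    return summary


history = ["pods restarting", "limit is 256Mi", "peak usage 512Mi"]
pinned = ["256Mi", "512Mi"]
ok = compact(history, pinned, budget=80, summarize=lambda msgs: "; ".join(msgs))
print(ok)  # prints "pods restarting; limit is 256Mi; peak usage 512Mi"
```

A summarizer that dropped "512Mi" would raise `CompactionError` here, surfacing the loss instead of quietly degrading the investigation.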

Deployment

There is no single right deployment model for infrastructure AI.

  • Nova Cloud: fully managed. Sign in, connect your infrastructure, and start asking questions.
  • Nova Terminal: astroctl nova from your shell. Same backend, operator-native interface.
  • Nova Direct: self-managed with Docker Compose. Your data, your models, air-gapped if you need it.
  • Nova Connect: Nova as a hosted remote MCP server for Claude Code, Cursor, VS Code, Claude Desktop, and similar OAuth-capable clients.

The deeper engineering story, the problems we solved building this at scale, is in The Hardest Problems in Building Production AI Agents.

Try it: Open Nova for the browser, curl -fsSL https://astropulse.io/install.sh | bash for the CLI, or Nova Connect for your editor.