One AI, Every Interface
Platform teams carry operational knowledge that doesn't transfer easily. The debugging instincts, the service interdependencies, the deployment quirks — they accumulate over years and live in a small number of people's heads. When those people are unavailable, the gap shows.
We built Nova to encode that operational knowledge into a queryable system. This post covers what the architecture looks like and what we learned building AI that actually operates infrastructure.
How investigation works
Most AI tools will give you a checklist. Nova runs an actual investigation. Here's what that looks like in the terminal:
$ astroctl nova
> My payments service is returning 5xx errors
Nova is investigating...
┌─ Hypothesis Formed
│ OOMKilled pods — memory limits at 256Mi while avg usage is 380Mi
│
├─ Evidence Collected
│ 3/5 pods restarted in last 10 min. Memory: avg 380Mi, peak 512Mi
│ Last deploy: 2h ago (payments-service:v2.14.3 → v2.15.0)
│ v2.15.0 added in-memory session cache — no corresponding limit bump
│
├─ Hypothesis Revised
│ Initial theory confirmed. The deploy introduced memory regression,
│ not a traffic spike. Limits need updating.
│
└─ Remediation Proposed (Risk: LOW)
  Increase memory limits to 512Mi — matches observed peak usage
  kubectl patch deploy payments-service -n prod \
    -p '{"spec":{"template":{"spec":{"containers":[{"name":"payments",
    "resources":{"limits":{"memory":"512Mi"}}}]}}}}'
  ✓ Patch applied — rolling restart in progress
The investigation pipeline follows the same structure an experienced SRE would: classify, gather evidence, hypothesize, test, and remediate. But the part that actually matters is backtracking. When evidence contradicts a hypothesis, Nova drops it and tries the next path. We tuned the confidence scoring to be asymmetric — confirming evidence nudges the score up; contradicting evidence tanks it. Confirmation bias is the most dangerous failure mode in debugging, and we'd rather Nova abandon a promising theory early than chase it through three more tool calls.
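To make the asymmetry concrete, here is a minimal sketch of that scoring loop. It is not Nova's actual code: the weights, the threshold, and the names Hypothesis and investigate are all hypothetical, chosen only to show confirming evidence adding a little and contradicting evidence cutting the score sharply.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Hypothetical numbers: the post says the update is asymmetric,
# not what the real weights are.
CONFIRM_STEP = 0.10       # confirming evidence nudges the score up
CONTRADICT_FACTOR = 0.4   # contradicting evidence multiplies it down
ABANDON_BELOW = 0.25      # under this, drop the hypothesis and backtrack


@dataclass
class Hypothesis:
    name: str
    confidence: float = 0.5

    def update(self, supports: bool) -> None:
        if supports:
            self.confidence = min(1.0, self.confidence + CONFIRM_STEP)
        else:
            self.confidence *= CONTRADICT_FACTOR

    @property
    def abandoned(self) -> bool:
        return self.confidence < ABANDON_BELOW


def investigate(
    hypotheses: list[Hypothesis],
    gather: Callable[[Hypothesis], Iterable[bool]],
) -> Optional[Hypothesis]:
    """Try hypotheses in order and backtrack as soon as one is abandoned."""
    for h in hypotheses:
        for supports in gather(h):
            h.update(supports)
            if h.abandoned:
                break  # stop spending tool calls on this path
        else:
            return h  # all evidence gathered without abandoning
    return None


# Each list stands in for the results of successive tool calls.
evidence = {
    "traffic spike": [True, False, True],
    "memory regression from deploy": [True, True, True],
}
candidates = [Hypothesis(name) for name in evidence]
winner = investigate(candidates, lambda h: evidence[h.name])
print(winner.name, round(winner.confidence, 2))
```

With these made-up weights, a single contradiction is enough to sink the traffic-spike theory before its third tool call runs, which is the behavior the asymmetry is meant to buy.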
The investigations are backed by playbooks for common failure modes — OOM, cert expiry, DNS, deployment regressions, queue backlogs. Not rigid scripts, more like decision trees that encode the debugging instincts your best SRE carries around. The kind of institutional knowledge that usually walks out the door when someone leaves.
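As a rough illustration of "decision tree, not script", an OOM playbook could be modeled like this. The node types, the walk function, and the fact keys are hypothetical; the numbers come from the terminal session above.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    conclusion: str


@dataclass
class Check:
    question: str
    test: Callable[[dict], bool]
    if_yes: "Node"
    if_no: "Node"


Node = Union[Leaf, Check]


def walk(node: Node, facts: dict) -> tuple[str, list[tuple[str, bool]]]:
    """Follow the tree and return the conclusion plus the path taken."""
    path = []
    while isinstance(node, Check):
        answer = node.test(facts)
        path.append((node.question, answer))
        node = node.if_yes if answer else node.if_no
    return node.conclusion, path


# Hypothetical OOM playbook with two checks.
oom_playbook = Check(
    question="Were pods OOMKilled?",
    test=lambda f: f["oom_killed"],
    if_yes=Check(
        question="Is average memory above the limit after a recent deploy?",
        test=lambda f: f["deployed_recently"] and f["avg_mem_mi"] > f["limit_mi"],
        if_yes=Leaf("memory regression from deploy: raise limits"),
        if_no=Leaf("check for a traffic spike"),
    ),
    if_no=Leaf("not an OOM problem: try the next playbook"),
)

facts = {"oom_killed": True, "deployed_recently": True,
         "avg_mem_mi": 380, "limit_mi": 256}
conclusion, path = walk(oom_playbook, facts)
print(conclusion)
```

Returning the path alongside the conclusion keeps the reasoning inspectable, which matters when a human has to approve the remediation.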
Skills
We hardcoded integrations at first. A Kubernetes module. A Slack module. A GitHub module. The first five were fine. By integration fifteen, we were spending more time maintaining connectors than building the actual AI.
So we rebuilt the whole thing around a skill abstraction. Each skill is a self-contained unit that declares what it can do, what permissions it needs, and what risk level each action carries. Nova discovers the right skills at runtime. You say "post an RCA to the incidents channel" and Nova resolves it to the right skill without any hand-written routing logic.
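A skill declaration along those lines might look like the sketch below. Everything here is an assumption for illustration: the Skill, Action, and SkillRegistry names, the permission strings, and especially the keyword matching, which is a crude stand-in for whatever Nova actually does to pick skills at runtime.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Risk(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


@dataclass(frozen=True)
class Action:
    name: str
    keywords: frozenset  # hypothetical: used only by the toy matcher below
    risk: Risk
    run: Callable[..., str]


@dataclass
class Skill:
    name: str
    permissions: list[str]
    actions: list[Action] = field(default_factory=list)


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def discover(self, request: str) -> list[tuple[Skill, Action]]:
        """Toy discovery: match request words against action keywords."""
        words = set(request.lower().split())
        return [
            (skill, action)
            for skill in self._skills.values()
            for action in skill.actions
            if words & action.keywords
        ]


registry = SkillRegistry()
registry.register(Skill(
    name="slack",
    permissions=["chat:write"],  # illustrative permission name
    actions=[Action("post_message", frozenset({"post", "channel"}),
                    Risk.WRITE, lambda text: f"posted: {text}")],
))
registry.register(Skill(
    name="kubernetes",
    permissions=["pods:list"],  # illustrative permission name
    actions=[Action("list_pods", frozenset({"pods", "failing"}),
                    Risk.READ, lambda: "3/5 pods restarted")],
))

for skill, action in registry.discover("post an RCA to the incidents channel"):
    print(skill.name, action.name, action.risk.value)
```

The point of the shape is that the core engine only knows about the registry. Adding a skill means registering another declaration, not editing a router.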
Skills compose naturally. "Check the failing pods, estimate the cost impact, and post a summary to Slack" is three skills chained in one request — Nova figures out the execution order from context. New skills ship without touching the core engine, and if you have custom tooling, you can connect it through MCP.
The authorization model is deterministic. Reads execute immediately. Writes that aren't destructive stream a confirmation. Anything dangerous — deployments, deletions, config changes — blocks until you explicitly approve. This isn't in the prompt. It's hard-coded middleware that the model can't bypass.
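The three tiers can be sketched as a plain function that sits between the model's tool call and the actual execution. This is a simplified illustration, not Nova's middleware: the names are hypothetical, and it reads "stream a confirmation" as "execute and emit a notice", which is one interpretation of the behavior described above.

```python
from enum import Enum
from typing import Callable


class Risk(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


class ApprovalRequired(Exception):
    """Raised when a dangerous action has no explicit human approval."""


def authorize_and_run(
    risk: Risk,
    action: Callable[[], str],
    *,
    approved: bool = False,
    notify: Callable[[str], None] = print,
) -> str:
    # The branch taken depends only on the declared risk level and the
    # approval flag, never on anything the model wrote.
    if risk is Risk.READ:
        return action()
    if risk is Risk.WRITE:
        notify("confirmation: running non-destructive write")
        return action()
    if not approved:
        raise ApprovalRequired("explicit approval needed before execution")
    return action()


print(authorize_and_run(Risk.READ, lambda: "3/5 pods restarted"))
try:
    authorize_and_run(Risk.DESTRUCTIVE, lambda: "deployment deleted")
except ApprovalRequired as err:
    print("blocked:", err)
```

Because the check is ordinary code, a prompt injection can change what the model asks for but not which branch runs.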
One constraint we committed to early: the LLM never sees credentials. Tokens get injected at execution time, outside the context window. If a credential enters the context, it can leak through outputs, logs, or prompt injection. We made it structurally impossible.
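One way to get that property is to let the model work only with opaque references and substitute the real values at the last moment. The sketch below assumes a SecretRef handle, an in-memory vault dict, and a scrub step on tool output. All of these are hypothetical names for illustration, and the token is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SecretRef:
    """Opaque handle: the model only ever sees the name, never the value."""
    name: str

    def __str__(self) -> str:
        return f"<secret:{self.name}>"


def resolve(headers: dict, vault: dict[str, str]) -> dict[str, str]:
    """Swap references for real values. Runs outside the context window."""
    return {
        key: vault[value.name] if isinstance(value, SecretRef) else value
        for key, value in headers.items()
    }


def scrub(text: str, vault: dict[str, str]) -> str:
    """Replace any secret value that a tool echoed back in its output."""
    for name, value in vault.items():
        text = text.replace(value, f"<secret:{name}>")
    return text


def execute(call: dict, vault: dict[str, str],
            send: Callable[[str, dict], str]) -> str:
    raw = send(call["url"], resolve(call["headers"], vault))
    return scrub(raw, vault)  # only the scrubbed text returns to the model


vault = {"SLACK_TOKEN": "placeholder-token-value"}
call = {"url": "https://example.invalid/post",
        "headers": {"Authorization": SecretRef("SLACK_TOKEN")}}

# A fake transport that carelessly echoes its headers back.
leaky_send = lambda url, headers: f"sent to {url} with {headers}"

print(call["headers"]["Authorization"])  # what the model sees
print(execute(call, vault, leaky_send))  # what comes back into context
```

The scrub step is defense in depth. The main guarantee comes from resolve running after the model has finished producing the call.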
The interface problem
We built Nova Cloud first and assumed the CLI would be an afterthought. Then we watched how incidents actually play out.
An alert fires in Slack. Someone opens a terminal. They check a dashboard. They go back to Slack to update the team. Four context switches across three tools, zero shared state. We realized the multi-interface story isn't a feature. It's the whole point.
Nova runs on one backend. Conversations, connected skills, and investigation state persist across browser and terminal, and the related approval and notification flows are wired to that same state. Start in the browser, pick it up from the terminal, and route results into the channels your team already watches.
The CLI matters most for session continuity. You're deep in a debug session, get pulled into a meeting, come back an hour later — astroctl nova --continue picks up where you left off. Full investigation state, evidence collected, hypotheses tested. The terminal renders investigation blocks natively, supports slash commands with tab completion, handles structured paste for YAML. It's not a stripped-down web app. It's an operator tool.
Slack-connected workflows are part of the surrounding operator loop: approvals, notifications, and team visibility in the channels people already use.
What still isn't great
Investigation playbooks work well for known failure modes. For genuinely novel issues — things the playbooks don't cover — Nova falls back to general reasoning, which is slower and less reliable. We're getting better at this but it's still the weakest link.
Multi-step plans sometimes over-commit. Nova will propose a five-step remediation when a single command would do. The plan critic catches the worst of these, but "technically correct but unnecessarily complex" plans still get through more than we'd like.
And context compaction — summarizing long conversations to fit the context window — sometimes loses details that matter. We've built guardrails around this (explicit failure rather than silent truncation), but the fundamental tradeoff between context length and information density is unsolved. It's an industry-wide problem, not just ours, but it affects investigation quality on long-running sessions.
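The "explicit failure rather than silent truncation" guardrail can be illustrated with a small sketch. It assumes a caller-supplied summarize function, a list of pinned facts that must survive, and word count as a stand-in for token count. None of these reflect Nova's real implementation.

```python
from typing import Callable


class CompactionError(Exception):
    """Raised instead of silently dropping context."""


def size(messages: list[str]) -> int:
    # Word count as a rough stand-in for a real tokenizer.
    return sum(len(m.split()) for m in messages)


def compact(
    messages: list[str],
    budget: int,
    summarize: Callable[[list[str]], str],
    pinned: list[str],
    keep_recent: int = 1,
) -> list[str]:
    """Summarize older messages, failing loudly if pinned facts are lost."""
    if size(messages) <= budget:
        return list(messages)
    cut = max(len(messages) - keep_recent, 0)
    older, recent = messages[:cut], messages[cut:]
    compacted = [summarize(older)] + recent
    missing = [f for f in pinned if not any(f in m for m in compacted)]
    if missing:
        raise CompactionError(f"summary dropped pinned facts: {missing}")
    if size(compacted) > budget:
        raise CompactionError("still over budget after compaction")
    return compacted


history = [
    "payments returns 5xx errors since the last deploy two hours ago",
    "three of five pods restarted and were OOMKilled with limit 256Mi",
    "deploy v2.15.0 added an in-memory session cache",
    "what should the new memory limit be",
]
pinned = ["256Mi", "v2.15.0"]

good = lambda older: "pods OOMKilled at limit 256Mi after deploy v2.15.0"
lossy = lambda older: "pods are restarting"

print(compact(history, 25, good, pinned))
try:
    compact(history, 25, lossy, pinned)
except CompactionError as err:
    print("refused:", err)
```

A substring check on pinned facts is deliberately crude. It catches the case where a limit or version number vanishes from the summary, but it does nothing about subtler losses, which is the unsolved part.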
Deployment
There's no single right deployment model for infrastructure AI:
- Nova Cloud — Fully managed. Sign in, connect your infrastructure, start asking questions.
- Nova Terminal — astroctl nova from your shell. Same backend, operator-native interface.
- Nova Direct — Self-managed via Docker Compose. Your data, your models, air-gapped if you need it.
- Nova Connect — Nova as a hosted remote MCP server for Claude Code, Cursor, VS Code, Claude Desktop, and similar OAuth-capable clients.
The deeper engineering story — the problems we solved building this at scale — is in The Hardest Problems in Building Production AI Agents.
Try it: open Nova in the browser, run curl -fsSL https://astropulse.io/install.sh | bash to install the CLI, or use Nova Connect from your editor.