The Hardest Problems in Building Production AI Agents
Every AI agent demo looks the same. The model calls a tool, gets a result, responds. Ship it. Then you try to run it against real infrastructure — and the demo falls apart in ways nobody warned you about.
We've spent over a year building Nova, an AI agent that operates real infrastructure for real teams. Not a chatbot that wraps API calls, but a system that investigates incidents, executes remediations, and composes across dozens of integrations. This post is about what we learned — the problems that made us rebuild entire subsystems, and the patterns that survived.
Why we built this
Platform engineering doesn't scale linearly. A four-person platform team supporting 200 engineers means every "quick question" — why are the pods crashing, which Terraform module handles that, who owns this service — pulls someone away from actual work. The answers exist. They're in runbooks, git history, Datadog dashboards, and the muscle memory of two people who've debugged this system for years.
The problem is that knowledge is inaccessible when it matters most. At 3am during an incident, you don't have time to cross-reference three wikis and page the one person who remembers how the deployment pipeline was wired. Nova is our attempt to make that knowledge accessible without requiring a human to assemble it under pressure every time.
The hard problems
Tool selection doesn't scale the way you think
The first thing every agent framework gets wrong is tool loading. You define 20 tools, stuff their schemas into the system prompt, and it works. Scale to 200 tools and you've burned 80k tokens on tool descriptions alone — most of your context window gone before the user says anything.
The obvious answer is "only load relevant tools." The non-obvious part is that relevance changes mid-conversation. A user starts asking about 5xx errors — you load Kubernetes tools. Ten turns later, the investigation points to a bad deploy, and now you need GitHub tools that weren't loaded. You can't predict the full tool set upfront because the model's reasoning path isn't linear.
We use a hybrid retrieval approach: semantic similarity against tool descriptions plus keyword matching against the conversation history, fused into a ranked list. The query gets rewritten into a standalone form first (stripping conversational filler and resolving pronouns), then both retrieval signals are combined with reciprocal rank fusion.
The hardest part is the confidence threshold. Set it too high and you miss tools the model needs — it ends up saying "I can't help with that" when it absolutely could if it had the right tool loaded. Set it too low and you're back to context bloat with irrelevant tools competing for attention. We spent months tuning this and the honest answer is it's still not perfect. The failure mode we optimize against is false negatives — missing a tool is worse than including an extra one.
Context engineering is the real bottleneck
Every agent tutorial shows a clean three-turn conversation. In production, sessions run long. A debugging investigation might involve 15-20 tool calls, each returning kilobytes of JSON. Kubernetes pod descriptions, log snippets, Datadog metrics, deployment history. After a few rounds, you've consumed most of your context window on tool outputs.
The naive approach is to just use bigger context windows. This helps but doesn't solve the problem — attention quality degrades over length even when the window fits. We use a multi-tier compaction strategy: recent messages stay verbatim, older turns get summarized while preserving key facts (resource names, error codes, metrics values), and tool outputs get aggressively trimmed to only the fields that matter.
The design principle we committed to: explicit failure over silent information loss. When compaction drops something that might matter, the system surfaces it — "earlier in this conversation you discussed X, but I no longer have the full details." The worst outcome isn't missing information. It's the model confidently answering from stale or incomplete context because something was silently dropped. We'd rather Nova say "I'm not sure, let me re-check" than confabulate from a lossy summary.
This is still our biggest unsolved problem. We've rebuilt it three times and it's still the weakest part of the product. It's an industry-wide problem — nobody has cracked long-session context management in a way that doesn't lose information or tank latency.
Multi-provider abstraction is a minefield
We support Anthropic, OpenAI, Google, and self-hosted models through Ollama. Every provider has a different tool calling format. Anthropic uses a tool_use content block; OpenAI uses function in the message with a separate tool_calls array; Google has its own function declaration format. Even when the schemas look similar, the edge cases diverge.
Some real examples: Claude handles optional parameters gracefully — if you don't mark a field as required, it might omit it. GPT-4 tends to send explicit null values for optional fields, which breaks tools that check for key existence instead of value truthiness. Token counting differs per provider — Anthropic's tokenizer isn't the same as OpenAI's tiktoken, so your token budget math is wrong if you assume a universal count. Caching strategies are completely different — Anthropic has ephemeral prompt caching; OpenAI doesn't have an equivalent; you need to build your own layer.
We built a normalization layer that translates between provider formats, handles token budget estimation per provider, and standardizes error responses. It's not glamorous engineering but it prevents a whole class of "works on Claude, breaks on GPT-4" bugs. And it lets users choose their model without us rewriting tool definitions for each provider.
The LLM/tool boundary is an untrusted API
This is the problem nobody talks about at conferences because it's boring. The model generates parameters for a tool call, the tool executes and returns a result. In a demo, this always works. In production:
- The model passes
"5"as a string when the tool expects an integer - Tool output includes a bearer token in a response header — now it's in the context window, potentially leakable through model outputs or logs
- The model calls the same failing tool four times in a row, burning through rate limits and token budget
- A Kubernetes API returns a 206 partial response and the model treats it as complete
- The model invents tool parameters that don't exist, especially for tools with complex schemas
We treat every boundary crossing — model to tool, tool to model — the same way you'd treat an untrusted external API. Input validation on tool parameters. Output sanitization to scrub credentials and PII from tool responses before they enter context. Circuit breakers for repeated failures. Budget tracking per tool call. These aren't interesting engineering problems individually, but collectively they're the difference between a demo and a product.
Evaluation is how you build trust
"It just does stuff" isn't acceptable when the agent can kubectl delete in production. We needed a way to verify that Nova's actions are safe, efficient, and actually helpful — without requiring a human to review every action.
The plan critic is the most impactful. Nova tends to over-commit — proposing five-step remediations when a single command would do. The critic catches "technically correct but unnecessarily complex" plans and forces simplification. It's not perfect (some verbose plans still get through) but it dramatically reduced the "why did it do all that?" complaints.
Trajectory eval runs asynchronously after each session. It scores the full sequence of actions and feeds patterns back into the system. Over time, this creates a self-improving loop — failure modes that get flagged in trajectory eval get caught earlier by the plan critic in future interactions.
Skills, not integrations
We hardcoded integrations at first. A Kubernetes module. A Slack module. A GitHub module. The first five were manageable. By integration fifteen, we were spending more time maintaining connectors than building the actual AI. Every new integration needed auth handling, error handling, output formatting, rate limiting — the same boilerplate every time, but different enough that you couldn't just template it.
So we rebuilt around a skill abstraction. A skill is a declarative definition: what the skill can do, what permissions each action needs, and what risk level each action carries. The execution layer underneath handles how to connect — REST, GraphQL, CLI, or MCP. This separation is the key insight: the model reasons about what to do, the skill defines the contract, and the executor handles the plumbing.
Skills compose naturally. "Check the failing pods, estimate the cost impact, and post a summary to Slack" is three skills chained in one request. Nova figures out the execution order from context — no explicit orchestration needed. New skills ship without touching the core engine, and teams can bring custom tooling through MCP.
The authorization model is deterministic middleware, not a system prompt instruction. Read operations execute immediately. Non-destructive writes stream a confirmation. Anything dangerous — deployments, deletions, scaling operations — blocks until explicit user approval. The model cannot talk its way past the safety layer because the safety layer doesn't consult the model.
One constraint we committed to early: the LLM never sees credentials. Tokens are injected at execution time by the runtime, completely outside the context window. The model sees a tool call like "post to Slack channel #incidents" but never the bearer token. If credentials enter the context, they can leak through model outputs, logging pipelines, or prompt injection attacks. We made it structurally impossible by keeping credential resolution in a separate layer that the model has no access to.
Guardrails sit on top of this. Each action category carries a policy: allow, require_approval, or block. Policies are set at the org level and can be overridden per user — an operator can approve destructive actions that are blocked for everyone else. Every execution, approval, and block is written to an append-only audit log. When a new skill with write operations is first connected, it defaults to require_approval until an admin explicitly widens the policy. The model cannot escalate its own permissions — that path doesn't exist.
Three interfaces, one engine
We built the web interface first and assumed terminal and Slack would be afterthoughts. Then we watched how incidents actually play out. An alert fires in Slack. Someone opens a terminal. They check a dashboard. They go back to Slack to update the team. Four tools, zero shared state. The multi-interface story isn't a feature — it's the whole point.
All three interfaces hit the same backend — same skill access, same authorization, same conversation state. Start an investigation in the browser, continue from the terminal with astroctl nova --continue, get results posted to Slack. The terminal isn't a stripped-down web app; it renders investigation blocks natively, supports slash commands with tab completion, and handles structured paste for YAML manifests. Slack isn't a "lite" version — it's the full engine embedded in your team's communication flow, with interactive approval buttons in-thread.
The hardest part was the rendering layer. Each surface has different constraints — the browser can render rich interactive components, the terminal needs ANSI-based block rendering, and notification or approval surfaces have their own layout rules. We built a block protocol that the engine outputs, and each client translates blocks into its native format. Same data, different renderers.
What we got wrong
We over-engineered the routing layer. Early on, we built a classifier to route different query types (debugging vs. deployment vs. Q&A) to different execution paths with different tool sets and system prompts. It added a lot of complexity. Then we realized the LLM itself is better at deciding what to do next than our hand-tuned router. We ripped out the classifier and went with a single unified flow where the model picks its own path. Simpler, and actually more reliable.
We underestimated context compaction. We thought a basic summarization pass would be enough. It wasn't even close. A single Kubernetes investigation can generate 30k+ tokens of tool output in five turns. Naive summarization drops the specific pod names, error codes, and timestamps that the model needs to reason correctly. We've rebuilt this system three times and it's still not where we want it.
We assumed external APIs would be reliable. They're not. Datadog rate-limits aggressively during incidents (exactly when you need it most). AWS APIs return partial results that look complete. GitHub's API occasionally changes response schemas without versioning the endpoint. Early on, a single flaky API call would derail an entire investigation because we didn't have circuit breakers or fallback strategies. Now we treat every external call as potentially failing and plan accordingly.
We spent too long on prompt engineering instead of systems engineering. Our first instinct for every problem was to fix the prompt. Bad tool selection? Better system prompt. Verbose responses? Add "be concise" to the instructions. This works in demos but doesn't scale. The problems that actually matter — tool scaling, context management, authorization, evaluation — are all systems problems. The prompt is important, but it's maybe 20% of what makes a production agent work.
Where we are now
Nova now runs across browser and terminal, with Slack-connected approval and notification flows around the same operational model. Investigation playbooks cover common failure modes — OOM, cert expiry, DNS, deployment regressions, queue backlogs — and the general reasoning path handles much of the rest. The plan critic catches many over-engineered proposals, trajectory eval improves quality over time, and the authorization gateway is designed so dangerous actions require explicit approval.
The problems that remain are the hard ones. Context compaction is still lossy on long sessions. Novel failure modes fall back to general reasoning, which is slower. Multi-step plans still occasionally over-commit. We're working on all of these — some are engineering problems we can solve directly, some are research problems where we expect the underlying models to get substantially better. The architecture is in place. These are improvements we can ship without rethinking the foundation.
If you're building production AI agents and running into similar problems, we'd love to compare notes. The patterns we've found — hybrid retrieval for tool selection, explicit failure over silent loss, deterministic authorization middleware, multi-stage evaluation — aren't novel individually. Making them work together at production scale, reliably, is where the real engineering happens.
Try Nova →. The deeper story on how Nova works across interfaces is in One AI, Every Interface. If these problems sound interesting, we're hiring.