The Hardest Problems in Building Production AI Agents

May 24, 2026 · 25 min read

Founder

Every AI agent demo looks the same. The model calls a tool, gets a result, and responds. Then you run it against real infrastructure, and the demo falls apart in ways the tutorials never mention.

We have spent over a year building Nova, an AI agent that operates real infrastructure for real teams. It is not a chatbot that wraps API calls. It investigates incidents, executes remediations, and composes across dozens of integrations. This post is about what we learned: the problems that made us rebuild entire subsystems, and the patterns that survived.

Why we built this

Platform engineering does not scale linearly. A four-person platform team supporting 200 engineers means every quick question, such as why the pods are crashing, which Terraform module handles that, or who owns this service, pulls someone away from real work. The answers exist. They are in runbooks, git history, Datadog dashboards, and the muscle memory of two people who have debugged this system for years.

That knowledge is hardest to reach when it matters most. In the middle of an incident, you do not have time to cross-reference three wikis and page the one person who remembers how the deployment pipeline was wired. Nova is our attempt to make that knowledge accessible without a human having to assemble it under pressure every time.

Nova · Request Pipeline

1. Ingest: Artifacts · History
2. Route: Model Selection
3. Context: Compaction · Memory
4. Tools: Search · Ranking
5. Stream: Execute · Respond

LLM Layer: Multi-Provider Routing · Provider-Specific Caching · Token Budget Management
Intelligence Layer: Dynamic Tool Selection · Context Compaction · Conversation Memory
Evaluation Layer: Pre-Execution Plan Review · Post-Execution Scoring · Response Quality Monitoring
Skills Execution: Slack · GitHub · AWS · Kubernetes · Datadog · Terraform · + more via MCP

The hard problems

Tool selection doesn't scale the way you think

The first thing every agent framework gets wrong is tool loading. You define 20 tools, put their schemas in the system prompt, and it works. Scale to 200 tools and you have spent 80k tokens (roughly 400 per tool) on tool descriptions alone, so most of your context window is gone before the user says anything.

The obvious answer is to load only the relevant tools. The harder part is that relevance changes mid-conversation. A user asks about 5xx errors, so you load Kubernetes tools. Ten turns later the investigation points to a bad deploy, and now you need GitHub tools that were never loaded. You cannot predict the full tool set upfront, because the model's reasoning path is not linear.

We use a hybrid retrieval approach: semantic similarity against tool descriptions plus keyword matching against the conversation history, fused into a ranked list. The query gets rewritten into a standalone form first (stripping conversational filler and resolving pronouns), then both retrieval signals are combined with reciprocal rank fusion.
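The fusion step described above can be sketched in a few lines. This is a minimal illustration of reciprocal rank fusion, not Nova's implementation: the tool names are made up, and k=60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of tool names into one ranking.

    Each tool scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Tools missing from a list contribute nothing
    for that list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by name for determinism.
    return sorted(scores, key=lambda tool: (-scores[tool], tool))


# Hypothetical outputs of the two retrievers for one rewritten query.
semantic = ["k8s_get_pods", "k8s_get_logs", "github_list_deploys"]
keyword = ["github_list_deploys", "k8s_get_pods", "datadog_query"]

fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

A tool that ranks reasonably well in both lists beats one that ranks first in only one of them, which is why this kind of fusion is a good fit when neither signal is reliable alone.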

Tool Selection · Dynamic Loading

Query (user intent) → Rewrite (standalone form) → Hybrid Retrieval (semantic + keyword) → Rank + Filter (confidence gate) → Load (into context)

The hardest part is the confidence threshold. Set it too high and you miss tools the model needs, so it says it cannot help when it could have with the right tool loaded. Set it too low and you are back to context bloat, with irrelevant tools competing for attention. We spent months tuning this, and the honest answer is that it is still not perfect. We bias the gate toward recall, because missing a tool the model needs is worse than loading one it ignores.

Context engineering is the real bottleneck

Every agent tutorial shows a clean three-turn conversation. In production, sessions run long. A debugging investigation might involve 15 to 20 tool calls, each returning kilobytes of JSON: pod descriptions, log snippets, Datadog metrics, deployment history. After a few rounds, most of the context window is gone to tool output.

The naive fix is a bigger context window. That helps, but it does not solve the problem, because attention quality degrades over length even when everything fits. We use a multi-tier compaction strategy instead: recent messages stay verbatim, older turns are summarized while preserving the key facts (resource names, error codes, metric values), and tool outputs are trimmed to only the fields that matter.

Context Pipeline · Multi-Tier Compaction

Raw Output (full tool responses) → Field Pruning (drop noise, keep signal) → Turn Summary (compress old turns) → Dense Context (fits budget)

The principle we committed to is explicit failure over silent information loss. When compaction drops something that might matter, the system says so: "earlier in this conversation you discussed X, but I no longer have the full details." The worst outcome is not missing information. It is the model answering confidently from stale or incomplete context because something was dropped without anyone noticing. We would rather Nova say it needs to recheck than confabulate from a lossy summary.
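As a rough sketch of how tiered compaction and explicit loss reporting can fit together: recent turns stay verbatim, older turns collapse into a summary, tool outputs are pruned to an allowlist, and every lossy step leaves a notice. The `Turn` type, the `KEEP_FIELDS` allowlist, and the placeholder `summarize` function are all assumptions made for illustration; a real summarizer would call a model and preserve resource names, error codes, and metric values.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str        # "user", "assistant", "tool", or "system"
    content: object  # str for messages, dict for tool output


# Hypothetical allowlist of tool-output fields worth keeping.
KEEP_FIELDS = {"name", "status", "error_code", "timestamp"}


def prune_fields(output: dict) -> dict:
    """Keep only allowlisted top-level fields of a tool response."""
    return {k: v for k, v in output.items() if k in KEEP_FIELDS}


def summarize(turns) -> str:
    """Placeholder: stands in for a model-backed summarizer."""
    return f"[summary of {len(turns)} earlier turns]"


def compact(turns, keep_recent=4):
    """Return (context, notices).

    The last `keep_recent` turns are kept, older turns are replaced by one
    summary turn, and each lossy step appends a human-readable notice so
    the loss is never silent.
    """
    cut = max(len(turns) - keep_recent, 0)
    older, recent = turns[:cut], turns[cut:]
    context, notices = [], []
    if older:
        context.append(Turn("system", summarize(older)))
        notices.append(f"{len(older)} earlier turns were summarized; details may be missing.")
    for turn in recent:
        if turn.role == "tool" and isinstance(turn.content, dict):
            pruned = prune_fields(turn.content)
            dropped = sorted(set(turn.content) - set(pruned))
            if dropped:
                notices.append("Dropped tool fields: " + ", ".join(dropped))
            context.append(Turn("tool", pruned))
        else:
            context.append(turn)
    return context, notices


history = [
    Turn("user", "why is the api pod crashing?"),
    Turn("assistant", "checking pod status"),
    Turn("tool", {"name": "api-7f9", "status": "CrashLoopBackOff", "managedFields": "..."}),
]
context, notices = compact(history, keep_recent=2)
```

The notices can be surfaced to the model or the user, so a later answer can say it needs to recheck instead of guessing from the summary.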

This is still our biggest unsolved problem. We have rebuilt it three times and it is still the weakest part of the product. It is an industry-wide problem: no one has solved long-session context management in a way that neither loses information nor tanks latency.

Multi-provider abstraction is a minefield

We support Anthropic, OpenAI, Google, and self-hosted models through Ollama. Every provider has a different tool calling format. Anthropic returns tool calls as tool_use content blocks; OpenAI returns a tool_calls array on the assistant message, where each entry wraps a function object whose arguments arrive as a JSON string; Google has its own function declaration format. Even when the schemas look similar, the edge cases diverge.

A few real examples. Claude handles optional parameters gracefully: if you do not mark a field as required, it may omit it. OpenAI's models tend to send an explicit null for optional fields, which breaks tools that check for key existence rather than value truthiness. Token counting differs per provider: Anthropic's tokenizer and OpenAI's tiktoken produce different counts for the same text, so your budget math is wrong if you assume one universal count. Caching differs too: one provider offers ephemeral prompt caching, another has no equivalent, and you end up building your own layer.

We built a normalization layer that translates between provider formats, estimates token budgets per provider, and standardizes error responses. It is unglamorous, and it prevents a whole class of "works on one provider, breaks on another" bugs. It also lets users choose their model without us rewriting tool definitions for each one.
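A minimal sketch of what such a normalization layer might look like for two of the providers. The message shapes follow the publicly documented Anthropic `tool_use` block and OpenAI chat-completions `tool_calls` formats as we understand them; the `ToolCall` type and the function names are hypothetical, not Nova's actual interfaces.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Provider-neutral form of one tool call."""
    id: str
    name: str
    arguments: dict


def from_anthropic(message: dict) -> list:
    """Collect `tool_use` blocks from the message's `content` list."""
    return [
        ToolCall(block["id"], block["name"], block["input"])
        for block in message.get("content", [])
        if isinstance(block, dict) and block.get("type") == "tool_use"
    ]


def from_openai(message: dict) -> list:
    """Collect entries of `tool_calls`; `function.arguments` is a JSON string."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append(ToolCall(call["id"], fn["name"], json.loads(fn["arguments"] or "{}")))
    return calls


def drop_nulls(call: ToolCall, required: set) -> ToolCall:
    """Treat an explicit null on an optional field like an omitted key."""
    args = {k: v for k, v in call.arguments.items() if v is not None or k in required}
    return ToolCall(call.id, call.name, args)


anthropic_msg = {"content": [
    {"type": "text", "text": "Checking pods."},
    {"type": "tool_use", "id": "toolu_1", "name": "get_pods", "input": {"namespace": "prod"}},
]}
openai_msg = {"tool_calls": [{
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_pods", "arguments": '{"namespace": "prod", "label": null}'},
}]}

a = from_anthropic(anthropic_msg)[0]
o = drop_nulls(from_openai(openai_msg)[0], required={"namespace"})
```

After normalization both calls carry the same name and arguments, so a tool that checks for key existence behaves the same regardless of which provider produced the call.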

The LLM/tool boundary is an untrusted API

This is the problem nobody talks about at conferences because it's boring. The model generates parameters for a tool call; the tool executes and returns a result. In a demo, this always works. In production:

The model passes "5" as a string when the tool expects an integer
Tool output includes a bearer token in a response header, so now it is in the context window and potentially leakable through model outputs or logs
The model calls the same failing tool four times in a row, burning through rate limits and token budget
A Kubernetes API returns a 206 partial response and the model treats it as complete
The model invents tool parameters that don't exist, especially for tools with complex schemas

We treat every boundary crossing, model to tool and tool to model, the way you would treat an untrusted external API. Input validation on tool parameters. Output sanitization to scrub credentials and PII from tool responses before they enter context. Circuit breakers for repeated failures. Budget tracking per call. None of these is an interesting problem on its own, but together they are the difference between a demo and a product.
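Three of these guards can be sketched briefly. This is an illustration under simplifying assumptions: the flat `{name: type}` schema format is hypothetical, the single bearer-token pattern is nowhere near a complete scrubber, and the breaker thresholds are arbitrary.

```python
import re
import time


def coerce_args(args: dict, schema: dict) -> dict:
    """Validate arguments against a flat {name: type} schema.

    Rejects parameters the tool does not define and coerces numeric
    strings such as "5" to int when the schema asks for an int.
    """
    unknown = set(args) - set(schema)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    clean = {}
    for name, value in args.items():
        expected = schema[name]
        if expected is int and isinstance(value, str):
            value = int(value)  # raises ValueError on non-numeric input
        if not isinstance(value, expected):
            raise ValueError(f"{name}: expected {expected.__name__}")
        clean[name] = value
    return clean


# Illustrative pattern only; real scrubbing needs many more rules.
BEARER = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")


def scrub(text: str) -> str:
    """Redact bearer tokens before tool output enters the context."""
    return BEARER.sub("Bearer [REDACTED]", text)


class CircuitBreaker:
    """Refuses calls after `max_failures` consecutive failures until
    `cooldown` seconds have passed."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            self.failures, self.opened_at = 0, None  # cooldown over, close again
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

In use, the runtime would call `breaker.allow()` before each tool invocation, `coerce_args` on the model's parameters, and `scrub` on the response, then `breaker.record(...)` with the outcome.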

Evaluation is how you build trust

"It just does stuff" is not acceptable when the agent can run kubectl delete in production. We needed a way to verify that Nova's actions are safe, efficient, and genuinely helpful, without a human reviewing every one.

Evaluation · Multi-Stage Pipeline

Plan Critic (blocking): Reviews proposed plans before the user sees them. Checks for unnecessary complexity, safety issues, and logical errors. Blocks plans that don't pass.

Trajectory Eval (async): Post-execution analysis of the full tool-call sequence. Did it take unnecessary steps? Did it recover from errors well? Were the right tools used?

Response Quality (async): Ongoing quality monitoring. Was the answer actually helpful? Was it concise? Did it hallucinate infrastructure state?

The plan critic is the most impactful. Nova tends to over-commit, proposing five-step remediations when a single command would do. The critic catches plans that are technically correct but unnecessarily complex and forces them to simplify. It is not perfect, and some verbose plans still get through, but it sharply reduced the "why did it do all that?" complaints.

Trajectory eval runs asynchronously after each session. It scores the full sequence of actions and feeds the patterns back into the system. Over time this creates a self-improving loop: failure modes flagged in trajectory eval get caught earlier by the plan critic on later sessions.

Skills, not integrations

We hardcoded integrations at first: a Kubernetes module, a Slack module, a GitHub module. The first five were manageable. By the fifteenth, we were spending more time maintaining connectors than building the AI. Every integration needed auth handling, error handling, output formatting, and rate limiting, the same boilerplate each time, but different enough that you could not template it.

So we rebuilt around a skill abstraction. A skill is a declarative definition: what it can do, what permissions each action needs, and what risk level each action carries. The execution layer underneath handles how to connect, whether over REST, GraphQL, CLI, or MCP. That separation is the point: the model reasons about what to do, the skill defines the contract, and the executor handles the plumbing.
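To make the separation concrete, here is one way a declarative skill definition could be modeled. The `Skill`, `Action`, and `Risk` types, the permission strings, and the action names are all hypothetical; the post does not show Nova's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


@dataclass(frozen=True)
class Action:
    name: str
    permission: str  # permission the caller must hold
    risk: Risk       # drives how the authorization layer treats the call


@dataclass(frozen=True)
class Skill:
    name: str
    transport: str   # "rest", "graphql", "cli", or "mcp"; only the executor reads this
    actions: tuple

    def action(self, name: str) -> Action:
        """Look up an action by name; unknown actions are an error."""
        for candidate in self.actions:
            if candidate.name == name:
                return candidate
        raise KeyError(f"{self.name} has no action {name!r}")


kubernetes = Skill(
    name="kubernetes",
    transport="rest",
    actions=(
        Action("get_pods", "k8s:read", Risk.READ),
        Action("annotate_pod", "k8s:write", Risk.WRITE),
        Action("scale_deployment", "k8s:write", Risk.DESTRUCTIVE),
    ),
)
```

The model only ever needs the action names and descriptions. The permission and risk fields are consumed by the authorization layer, and the transport by the executor.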

Skills Execution Flow

AI Model (decides action) → Skill Contract (what + permissions) → Auth Gateway (risk classification) → Executor (REST, CLI, MCP) → Service (Slack, K8s, AWS...)

Skills compose naturally. "Check the failing pods, estimate the cost impact, and post a summary to Slack" is three skills chained in one request, and Nova works out the execution order from context with no explicit orchestration. New skills ship without touching the core engine, and teams can bring custom tooling through MCP.

The authorization model is deterministic middleware, not a system-prompt instruction. Reads execute immediately. Non-destructive writes stream a confirmation. Anything dangerous, such as deployments, deletions, and scaling operations, blocks until the user approves it. The model cannot talk its way past the safety layer, because the safety layer never consults the model.

One constraint we committed to early is that the model never sees credentials. Tokens are injected at execution time by the runtime, completely outside the context window. The model sees a tool call like "post to Slack channel #incidents" but never the bearer token. If credentials enter the context, they can leak through model outputs, logging pipelines, or prompt injection. We made that structurally impossible by keeping credential resolution in a separate layer the model cannot reach.
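The idea of resolving credentials outside the context window can be sketched as follows. The secret name `SLACK_BOT_TOKEN`, the `send` callable, and the response shape are assumptions made for illustration; the point is only that the token is read at execution time and never appears in the tool call or in what is returned to the model.

```python
import os


def execute(tool_call: dict, send, secrets=os.environ) -> dict:
    """Run a model-produced tool call with credentials injected by the runtime.

    `tool_call` is what the model produced and contains no secrets.
    `send` is the transport (hypothetical signature: name, arguments, headers).
    """
    token = secrets["SLACK_BOT_TOKEN"]  # hypothetical secret name
    headers = {"Authorization": f"Bearer {token}"}
    response = send(tool_call["name"], tool_call["arguments"], headers)
    # Return only the payload the model should see, never request metadata.
    return {"ok": response["ok"], "body": response["body"]}


# A fake transport standing in for a real HTTP client.
seen_headers = {}


def fake_send(name, arguments, headers):
    seen_headers.update(headers)
    return {"ok": True, "body": "posted", "request_headers": headers}


call = {"name": "slack_post_message", "arguments": {"channel": "#incidents", "text": "Deploy rolled back."}}
result = execute(call, fake_send, secrets={"SLACK_BOT_TOKEN": "xoxb-test"})
```

Because the result is rebuilt from an explicit set of fields, a transport that echoes request headers still cannot push the token back into the context.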

Guardrails sit on top of this. Each action category carries a policy of allow, require_approval, or block. Policies are set at the org level and can be overridden per user, so an operator can approve destructive actions that stay blocked for everyone else. Every execution, approval, and block is written to an append-only audit log. A new skill with write operations defaults to require_approval until an admin widens the policy. The model cannot escalate its own permissions, because that path does not exist.
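A sketch of how such a policy check might be resolved deterministically. The three policy values come from the text above; the category names, the dictionary layout for org policies and user overrides, and the in-memory audit list are assumptions for illustration.

```python
from enum import Enum


class Policy(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    BLOCK = "block"


audit_log = []  # stand-in for an append-only store


def resolve(category, user, org_policy, user_overrides):
    """A per-user override wins over the org policy.

    Anything not listed falls back to require_approval.
    """
    override = user_overrides.get(user, {}).get(category)
    if override is not None:
        return override
    return org_policy.get(category, Policy.REQUIRE_APPROVAL)


def authorize(category, user, approved, org_policy, user_overrides):
    """Decide whether an action may run, and record the decision.

    `approved` is the human's explicit approval; the model is never consulted.
    """
    policy = resolve(category, user, org_policy, user_overrides)
    allowed = policy is Policy.ALLOW or (policy is Policy.REQUIRE_APPROVAL and approved)
    audit_log.append({"user": user, "category": category, "policy": policy.value, "allowed": allowed})
    return allowed


org = {"read": Policy.ALLOW, "destructive": Policy.BLOCK}
overrides = {"operator": {"destructive": Policy.REQUIRE_APPROVAL}}

decisions = [
    authorize("read", "dev", False, org, overrides),             # reads run immediately
    authorize("destructive", "dev", True, org, overrides),       # blocked for everyone else
    authorize("destructive", "operator", False, org, overrides), # operator has not approved yet
    authorize("destructive", "operator", True, org, overrides),  # operator approved
    authorize("write", "dev", True, org, overrides),             # unlisted category needs approval
]
```

Since the function takes no input from the model beyond the action category, there is no prompt that can change its outcome.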

Three interfaces, one engine

We built the web interface first and assumed terminal and Slack would be afterthoughts. Then we watched how incidents actually play out. An alert fires in Slack. Someone opens a terminal. They check a dashboard. They go back to Slack to update the team. Four steps across three tools, no shared state. The multi-interface story is not a feature on top of Nova. It is the point of it.

User Interfaces · Same Engine

Nova Cloud: Skills + approvals · Plan Approval UI · Full Investigation View
Nova Terminal: Session continuity · Rich block rendering · Tab autocomplete
Slack: @nova mentions · Thread-based context · In-thread approvals
Nova Engine: Same AI · Same Skills · Same Authorization

All three interfaces hit the same backend, with the same skill access, authorization, and conversation state. Start an investigation in the browser, continue from the terminal with astroctl nova --continue, and get the results posted to Slack. The terminal is an operator tool, not a stripped-down web app: it renders investigation blocks natively, supports slash commands with tab completion, and handles structured paste for YAML manifests. Slack is not a lite version either. It is the full engine embedded in your team's communication flow, with approval buttons in-thread.

The hardest part was the rendering layer. Each surface has different constraints: the browser can render rich interactive components, the terminal needs ANSI block rendering, and notification or approval surfaces have their own layout rules. The engine outputs a single block protocol, and each client translates those blocks into its native format. Same data, different renderers.
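The "same data, different renderers" idea can be illustrated with a toy block format. The block types (`heading`, `code`, `text`) and their fields are invented for this sketch and are not Nova's actual protocol.

```python
import html


def render_ansi(block: dict) -> str:
    """Terminal renderer: bold headings, indented code, plain text."""
    if block["type"] == "heading":
        return f"\033[1m{block['text']}\033[0m"
    if block["type"] == "code":
        return "\n".join("    " + line for line in block["text"].splitlines())
    return block["text"]


def render_html(block: dict) -> str:
    """Browser renderer: the same blocks as escaped HTML elements."""
    text = html.escape(block["text"])
    if block["type"] == "heading":
        return f"<h3>{text}</h3>"
    if block["type"] == "code":
        return f"<pre><code>{text}</code></pre>"
    return f"<p>{text}</p>"


# One engine output, rendered for two surfaces.
blocks = [
    {"type": "heading", "text": "Pod status"},
    {"type": "code", "text": "api-7f9  CrashLoopBackOff"},
    {"type": "text", "text": "Restarts > 5 in the last hour."},
]

terminal_view = "\n".join(render_ansi(b) for b in blocks)
browser_view = "".join(render_html(b) for b in blocks)
```

Adding a new surface then means writing one more renderer, with no change to what the engine emits.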

What we got wrong

We over-engineered the routing layer. Early on we built a classifier to route query types (debugging, deployment, Q&A) to different execution paths with their own tool sets and prompts. It added a lot of complexity. Then we found that the model itself is better at deciding what to do next than our hand-tuned router. We removed the classifier and went with one unified flow where the model picks its own path. Simpler, and more reliable.

We underestimated context compaction. We thought a basic summarization pass would be enough. It was not close. A single Kubernetes investigation can generate more than 30k tokens of tool output in five turns, and naive summarization drops the pod names, error codes, and timestamps the model needs to reason correctly. We have rebuilt it three times and it is still not where we want it.

We assumed external APIs would be reliable. They are not. Datadog rate-limits aggressively during incidents, which is exactly when you need it. AWS APIs return partial results that look complete. GitHub occasionally changes a response schema without versioning the endpoint. Early on, one flaky call could derail an entire investigation because we had no circuit breakers or fallbacks. Now we treat every external call as something that can fail and plan for it.

We spent too long on prompt engineering instead of systems engineering. Our first instinct for every problem was to fix the prompt. Bad tool selection? Better system prompt. Verbose responses? Add "be concise." That works in demos and does not scale. The problems that actually matter, tool scaling, context management, authorization, and evaluation, are systems problems. The prompt matters, but it is maybe 20 percent of what makes a production agent work.

Where we are now

Nova now runs across browser and terminal, with Slack-connected approval and notification flows around the same operational model. It draws on what it knows about common failure modes, such as OOM kills, cert expiry, DNS, deployment regressions, and queue backlogs, as priors that seed the investigation, and reasons from first principles for the rest. The plan critic catches many over-engineered proposals, trajectory eval improves quality over time, and the authorization gateway is built so that dangerous actions require explicit approval.

The problems that remain are the hard ones. Context compaction is still lossy on long sessions. Novel failure modes fall back to general reasoning, which is slower. Multi-step plans still over-commit now and then. We are working on all of these. Some are engineering problems we can solve directly, and some are research problems where we expect the models themselves to keep improving. The architecture is in place, so these are improvements we can ship without rethinking the foundation.

If you are building production AI agents and running into the same problems, we would like to compare notes. The patterns that worked for us, hybrid retrieval for tool selection, explicit failure over silent loss, deterministic authorization middleware, and multi-stage evaluation, are not novel on their own. Making them work together, reliably, at production scale is where the real engineering is.

Try Nova. The deeper story on how Nova works across interfaces is in One AI, Every Interface. If these problems sound interesting, we are hiring.
