When pods crash at odd hours, you need an AI that investigates like an SRE, not a chatbot. You need something that checks the right things, in the right order, tells you what it found, and waits for your call before touching anything. Current tools either dump a wall of logs on you and say "good luck," or run opaque automations you can't see, can't trust, and can't explain in a postmortem.
We spent the last several months designing and building Nova's investigation engine. This post is about the approach we took, the mental models that shaped it, and the trade-offs we made along the way.
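The "check in order, report, then wait for approval" loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Nova's implementation: the `Finding` type, the three check functions, and the `approve` and `execute` callbacks are hypothetical names introduced here, and the checks return canned results instead of querying a cluster.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Finding:
    check: str   # name of the check that produced this finding
    detail: str  # human-readable summary of what was observed


# Hypothetical checks with canned results; real ones would query the cluster.
def check_recent_events() -> Optional[Finding]:
    return Finding("events", "OOMKilled reported for container 'api'")


def check_resource_limits() -> Optional[Finding]:
    return Finding("limits", "memory limit is 128Mi")


def check_recent_deploys() -> Optional[Finding]:
    return None  # nothing notable to report


def investigate(checks: List[Callable[[], Optional[Finding]]]) -> List[Finding]:
    """Run the checks in the given order and collect whatever they report."""
    findings = []
    for check in checks:
        result = check()
        if result is not None:
            findings.append(result)
    return findings


def remediate(
    findings: List[Finding],
    proposed_action: str,
    approve: Callable[[List[Finding], str], bool],
    execute: Callable[[str], None],
) -> bool:
    """Present findings and a proposed action; act only if a human approves."""
    if not approve(findings, proposed_action):
        return False  # nothing is touched without approval
    execute(proposed_action)
    return True
```

The point of the shape is that `execute` is only reachable through `approve`: the investigation step is read-only, and the findings it produces are the same record a postmortem would cite.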
Six months in as platform lead and you have a spreadsheet you haven't shown your manager. Eleven Kubernetes clusters. EKS for production. GKE for the ML team. Two on-prem clusters behind the firewall that predate your tenure. A handful of Kind clusters developers spun up locally. Each one has its own deployment pipeline, its own credential rotation process, its own way of answering "is this service healthy?"
Your team isn't building features anymore. You're maintaining eleven slightly different versions of the same tooling.
Every AI agent demo looks the same. The model calls a tool, gets a result, responds. Ship it. Then you try to run it against real infrastructure — and the demo falls apart in ways nobody warned you about.
We've spent over a year building Nova, an AI agent that operates real infrastructure for real teams. Not a chatbot that wraps API calls, but a system that investigates incidents, executes remediations, and composes across dozens of integrations. This post is about what we learned — the problems that made us rebuild entire subsystems, and the patterns that survived.
Platform teams carry operational knowledge that doesn't transfer easily. The debugging instincts, the service interdependencies, the deployment quirks — they accumulate over years and live in a small number of people's heads. When those people are unavailable, the gap shows.
We built Nova to encode that operational knowledge into a queryable system. This post covers what the architecture looks like and what we learned building AI that actually operates infrastructure.
Enterprises are discovering they can run powerful AI models on their own infrastructure—but building production AI infrastructure is significantly harder than application deployment.
This post breaks down the six interconnected systems required, why specialized small models outperform foundation models for enterprise use cases, how emerging AI agents are changing the economics, and the engineering trade-offs at every layer.
📖 About This Guide
This is a comprehensive technical deep-dive. We explore the complete AI infrastructure landscape—from why enterprises build their own platforms to the six pillars required and the open-source technologies available.
You want the simplicity of "push code, get a live URL"—the developer experience Vercel pioneered—but with full control over your deployment, infrastructure, and compliance. This guide shows you how to build that experience on your own AWS infrastructure using AstroPulse and open-source tools: kpack, cert-manager, external-dns, and nginx-ingress.
You'll build a production-grade platform that delivers Git-push deployments with automatic TLS certificates, preview URLs, and complete observability—all running on infrastructure you own and control. Unlike hosted PaaS platforms, you'll be building on Kubernetes with full deployment control. That means you can run any workload: microservices (with or without public endpoints), stateful databases, WebSockets, long-running background jobs, AI/ML model training and serving, or traditional web applications in any language. You get the simple developer experience with complete architectural control.
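To make the moving parts concrete, here is a sketch of the kind of Ingress object such a platform might generate for a preview URL. The `cert-manager.io/cluster-issuer` annotation and the `networking.k8s.io/v1` Ingress fields are standard; everything else is an assumption made for illustration and not taken from this guide: the function names, the `letsencrypt-prod` issuer name, the `nginx` ingress class name, the hostname scheme, and the default port.

```python
import re


def preview_host(app: str, branch: str, base_domain: str) -> str:
    """Build a DNS-safe preview hostname from an app name and a Git branch."""
    # Lowercase the branch and collapse anything that is not a-z or 0-9 into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    return f"{app}-{slug}.{base_domain}"


def preview_ingress(app: str, branch: str, base_domain: str, service_port: int = 8080) -> dict:
    """Return an Ingress manifest (as a dict) for one preview deployment."""
    host = preview_host(app, branch, base_domain)
    name = host.split(".")[0]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            # Asks cert-manager to issue a certificate for the TLS block below.
            # The issuer name is an assumption; use whatever ClusterIssuer exists.
            "annotations": {"cert-manager.io/cluster-issuer": "letsencrypt-prod"},
        },
        "spec": {
            "ingressClassName": "nginx",  # assumed class name for nginx-ingress
            "tls": [{"hosts": [host], "secretName": f"{name}-tls"}],
            "rules": [{
                "host": host,  # external-dns can typically pick up this host
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": app, "port": {"number": service_port}}},
                }]},
            }],
        },
    }
```

For example, `preview_host("shop", "feature/login", "example.com")` yields `shop-feature-login.example.com`. The sketch does not enforce the 63-character DNS label limit, so long branch names would need truncation in practice.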
How operations work: The infrastructure industry is moving toward an agentic era in which AI agents autonomously handle complex workflows (MCP, A2A, multi-agent orchestration). We're heading toward infrastructure that self-configures, self-heals, and self-optimizes. We're not there yet, but Nova brings you AI-assisted operations today, with a human in the loop.
Day 1 (this guide): You build the platform.
Day 2 (ongoing): Nova analyzes issues, diagnoses problems, and recommends fixes; you approve. As AI matures, more becomes autonomous.
📖 About This Guide
This is a comprehensive, production-ready blueprint. We cover everything from architecture to production deployment with complete working examples, security, compliance, and troubleshooting.
⚡ Want the fast track? Jump to our automated setup script (platform deploys in 30 minutes)
Platform engineering represents the natural evolution of DevOps and SRE principles, but it faces a fundamental challenge: how do you scale platform expertise across an entire organization without requiring every developer to become a cloud expert?
This is where Nova comes in — your AI platform engineer that makes infrastructure accessible to everyone through natural conversation.
The Platform Engineering Evolution
DevOps broke down silos → SRE brought engineering rigor to operations → Platform Engineering created self-service infrastructure → Nova makes platform engineering conversational and accessible to everyone.
The Platform Engineering Challenge: Scale vs. Expertise
Platform engineering promised to solve the "you build it, you run it" scaling problem by creating Internal Developer Platforms (IDPs). But even the best platforms face fundamental limitations:
👥 Expert Bottlenecks
Platform teams become the new constraint—everyone depends on their expertise
📚 Documentation Decay
Complex systems require constant documentation that quickly becomes outdated
🧠 Context Loss
Critical operational knowledge lives in people's heads, not in systems
⚙️ Cognitive Load
Developers still need to understand infrastructure concepts to use platforms effectively
The Core Issue
We've built self-service platforms, but we haven't solved the underlying problem of democratizing platform engineering expertise.
Nova is an AI platform engineer that helps you manage infrastructure through natural conversation. Ask questions, get answers, generate configurations, and troubleshoot issues — all through simple chat.
What makes Nova different:
Works with your existing tools (Slack, GitHub, AWS, Terraform, Kubernetes, and more)
Available however you want to work — browser, self-hosted, or in your editor
Nova's power comes from its extensibility. Connect the tools you already use:
Available Skills:
Cloud Providers — AWS, Google Cloud, Azure cost calculations and resource management
Communication — Slack integration for team collaboration
Development — GitHub for code search, issues, and PRs
Infrastructure — Terraform and Helm configuration generation
Kubernetes — Cluster management and troubleshooting
And more — Add any MCP server for custom integrations
Bring Your Own Tools
Nova Direct includes the built-in MCP Marketplace for custom integrations. Nova Connect works through standard MCP-compatible clients, so teams can bring Nova into existing editor and CLI workflows without running the server locally.
The evolution from DevOps → SRE → Platform Engineering → AI-Assisted Platform Engineering represents more than technological progress — it's about democratizing expertise that has historically been scarce and expensive.
Traditional Model
Small teams of platform experts serve entire organizations
Welcome to our technical deep-dive podcast where we explore how Astro Platform is transforming cloud infrastructure management. Join us as we discuss the challenges, solutions, and future of cloud computing with industry experts.
As highlighted in our discussion, Gartner predicts that by 2025 more than 95% of new digital workloads will be deployed on cloud-native platforms, up from 30% in 2021. This dramatic shift underscores the urgency for efficient cloud management solutions.