How We Designed Nova's Investigation Engine — Lessons from SRE at Scale

· 9 min read
Rajesh RC
Founder

When pods crash at odd hours, you need an AI that investigates like an SRE, not a chatbot. You need something that checks the right things, in the right order, tells you what it found, and waits for your call before touching anything. Current tools either dump a wall of logs on you and say "good luck," or run opaque automations you can't see, can't trust, and can't explain in a postmortem.

We spent the last several months designing and building Nova's investigation engine. This post is about the approach we took, the mental models that shaped it, and the trade-offs we made along the way.

Bring Your Own Kubernetes Cluster

· 5 min read
Rajesh RC
Founder

Six months in as platform lead and you have a spreadsheet you haven't shown your manager. Eleven Kubernetes clusters. EKS for production. GKE for the ML team. Two on-prem clusters behind the firewall that predate your tenure. A handful of Kind clusters developers spun up locally. Each one has its own deployment pipeline, its own credentials rotation process, its own way of answering "is this service healthy?"

Your team isn't building features anymore. You're maintaining eleven slightly different versions of the same tooling.

astroctl infra k8s register --name my-cluster

The Hardest Problems in Building Production AI Agents

· 25 min read
Rajesh RC
Founder

Every AI agent demo looks the same. The model calls a tool, gets a result, responds. Ship it. Then you try to run it against real infrastructure — and the demo falls apart in ways nobody warned you about.
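
The demo loop described above (the model calls a tool, gets a result, responds) fits in a few lines, which is part of why it looks deceptively finished. Below is a minimal sketch, assuming a stubbed model and a single made-up tool. `fake_model`, `get_pod_status`, and the pod name are hypothetical illustrations, not Nova's implementation.

```python
from typing import Callable, Dict, Optional

# Hypothetical tool registry: tool name -> function.
# A real tool would query a cluster or cloud API instead of returning a canned string.
TOOLS: Dict[str, Callable[[str], str]] = {
    "get_pod_status": lambda pod: f"{pod}: CrashLoopBackOff",
}


def fake_model(prompt: str, tool_result: Optional[str] = None) -> dict:
    """Stand-in for an LLM call: first requests one tool, then answers."""
    if tool_result is None:
        return {"tool": "get_pod_status", "arg": "checkout-7d9f"}
    return {"answer": f"Observed {tool_result}"}


def run_demo_agent(prompt: str) -> str:
    """One pass of the demo loop: call a tool, get a result, respond."""
    step = fake_model(prompt)                      # model picks a tool
    result = TOOLS[step["tool"]](step["arg"])      # tool runs
    return fake_model(prompt, result)["answer"]    # model responds


print(run_demo_agent("Why is checkout crashing?"))
```

Note what the sketch leaves out: there is no retry, no timeout, no permission check, and no handling of a tool that returns something unexpected. Those are the kinds of gaps that surface against real infrastructure.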

We've spent over a year building Nova, an AI agent that operates real infrastructure for real teams. Not a chatbot that wraps API calls, but a system that investigates incidents, executes remediations, and composes across dozens of integrations. This post is about what we learned — the problems that made us rebuild entire subsystems, and the patterns that survived.

One AI, Every Interface

· 6 min read
Rajesh RC
Founder

Platform teams carry operational knowledge that doesn't transfer easily. The debugging instincts, the service interdependencies, the deployment quirks — they accumulate over years and live in a small number of people's heads. When those people are unavailable, the gap shows.

We built Nova to encode that operational knowledge into a queryable system. This post covers what the architecture looks like and what we learned building AI that actually operates infrastructure.

Building AI Infrastructure: The Case for Specialized Models and AI Agents

· 52 min read
Rajesh RC
Founder

TL;DR - What You'll Learn

Building Enterprise AI Infrastructure: The Six Pillars, Specialized Models, and Emerging AI Agents

This deep-dive explores:

  • The six pillars required (Data Infrastructure, GPU Infrastructure, Training Pipeline, Model Serving, Supporting Services, Security & Governance)
  • Why specialized small models outperform foundation models for enterprises (85% better on domain tasks, 13-33x cheaper, data sovereignty)
  • How emerging AI agents are changing economics (5-10 person platform teams → 1-2 engineers + AI agents)
  • The open-source stack (KServe, vLLM, SGLang, TensorRT-LLM, MLflow, Kubeflow, DeepSpeed, Temporal)
  • Why current tools are fragmented and operationally complex
  • The vision: Self-hosted infrastructure with managed-platform simplicity—powered by specialized models for business logic + AI agents for operations

Introduction

Enterprises are discovering they can run powerful AI models on their own infrastructure—but building production AI infrastructure is significantly harder than application deployment.

This post breaks down the six interconnected systems required, why specialized small models outperform foundation models for enterprise use cases, how emerging AI agents are changing the economics, and the engineering trade-offs at every layer.

The Vision: Making AI infrastructure as simple as git push
📖 About This Guide

This is a comprehensive technical deep-dive. We explore the complete AI infrastructure landscape—from why enterprises build their own platforms to the six pillars required and the open-source technologies available.

  • 🎯 Looking for specific topics? Use the navigation guide below to jump to what you need
  • 📚 Want to understand the full picture? Read through—it's structured as a comprehensive exploration of AI infrastructure challenges and solutions

From Git Push to Production: Your Own Self-Hosted Platform

· 57 min read
Rajesh RC
Founder

TL;DR: What You'll Build

In this guide, you'll build your own Vercel-like platform on Kubernetes in ~30 minutes:

  • You'll deploy an EKS cluster with kpack (auto-builds), cert-manager (TLS), external-dns (DNS), and nginx-ingress
  • You'll configure automatic Git push → build → live HTTPS deployment (just like Vercel)
  • You'll run any workload: web apps, APIs, databases, microservices, background jobs—any language
  • You'll add security scanning, compliance controls, and observability for production
  • You'll use Nova to debug, troubleshoot, and operate your platform

Perfect for building internal developer platforms, launching SaaS products, or meeting enterprise compliance requirements.

Introduction

You want the simplicity of "push code, get a live URL"—the developer experience Vercel pioneered—but with full control over your deployment, infrastructure, and compliance. This guide shows you how to build that experience on your own AWS infrastructure using AstroPulse and open-source tools: kpack, cert-manager, external-dns, and nginx-ingress.

AstroPulse PaaS Flow Architecture

You'll build a production-grade platform that delivers Git-push deployments with automatic TLS certificates, preview URLs, and complete observability—all running on infrastructure you own and control. Unlike hosted PaaS platforms, you'll be building on Kubernetes with full deployment control. That means you can run any workload: microservices (with or without public endpoints), stateful databases, WebSockets, long-running background jobs, AI/ML model training and serving, or traditional web applications in any language. You get the simple developer experience with complete architectural control.

How operations work: The infrastructure industry is moving toward an agentic era in which AI agents autonomously handle complex workflows (MCP, A2A, multi-agent orchestration). We're heading toward infrastructure that self-configures, self-heals, and self-optimizes. We're not there yet, but Nova brings you AI-assisted operations today with a human in the loop:

  • Day 1 (this guide): You build the platform.
  • Day 2 (ongoing): Nova analyzes issues, diagnoses problems, and recommends fixes; you approve.

As AI matures, more becomes autonomous.
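
The human-in-the-loop pattern described above (the agent recommends, you approve) can be sketched as a simple approval gate. This is an illustrative sketch only: `Recommendation`, `review`, and the sample fixes are hypothetical names, not Nova's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Recommendation:
    summary: str               # what the agent found
    action: str                # proposed fix, described in words
    apply: Callable[[], str]   # runs only after a human approves


def review(recs: List[Recommendation],
           approve: Callable[[Recommendation], bool]) -> List[str]:
    """Apply only the recommendations a human approves; return an audit log."""
    log = []
    for rec in recs:
        if approve(rec):
            log.append(f"applied: {rec.action} -> {rec.apply()}")
        else:
            log.append(f"skipped: {rec.action}")
    return log


recs = [
    Recommendation("pod OOMKilled", "raise memory limit", lambda: "ok"),
    Recommendation("node not ready", "drain node", lambda: "ok"),
]

# In practice `approve` would prompt a person; here it is a fixed policy for the example.
for line in review(recs, lambda rec: rec.action == "raise memory limit"):
    print(line)
```

Keeping the fix behind a callable that only runs after approval means nothing touches the platform without a recorded decision, and the returned log can feed a postmortem.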

📖 About This Guide

This is a comprehensive, production-ready blueprint. We cover everything from architecture to production deployment with complete working examples, security, compliance, and troubleshooting.

  • Want the fast track? Jump to our automated setup script (platform deploys in 30 minutes)
  • 🎯 Looking for specific topics? Use the navigation guide below to jump to what you need
  • 📚 Want to understand every detail? Read through—it's structured as a comprehensive step-by-step walkthrough

Meet Nova - Your AI Platform Engineer

· 7 min read
Rajesh RC
Founder

Platform engineering represents the natural evolution of DevOps and SRE principles, but it faces a fundamental challenge: how do you scale platform expertise across an entire organization without requiring every developer to become a cloud expert?

This is where Nova comes in — your AI platform engineer that makes infrastructure accessible to everyone through natural conversation.

The Platform Engineering Evolution

DevOps broke down silos → SRE brought engineering rigor to operations → Platform Engineering created self-service infrastructure → Nova makes platform engineering conversational and accessible to everyone.

The Platform Engineering Challenge: Scale vs. Expertise

Platform engineering promised to solve the "you build it, you run it" scaling problem by creating Internal Developer Platforms (IDPs). But even the best platforms face fundamental limitations:

  • 👥 Expert Bottlenecks: Platform teams become the new constraint; everyone depends on their expertise
  • 📚 Documentation Decay: Complex systems require constant documentation that quickly becomes outdated
  • 🧠 Context Loss: Critical operational knowledge stays tribal, living in people rather than systems
  • ⚙️ Cognitive Load: Developers still need to understand infrastructure concepts to use platforms effectively

The Core Issue

We've built self-service platforms, but we haven't solved the underlying problem of democratizing platform engineering expertise.

Enter Nova: Your AI Platform Engineer

Nova is an AI platform engineer that helps you manage infrastructure through natural conversation. Ask questions, get answers, generate configurations, and troubleshoot issues — all through simple chat.

What makes Nova different:

  • Works with your existing tools (Slack, GitHub, AWS, Terraform, Kubernetes, and more)
  • Available however you want to work — browser, self-hosted, or in your editor
  • Extensible through Skills and MCP integrations

What Can Nova Do?

  • 💰 Cloud Cost Estimation: Calculate costs across AWS, Google Cloud, and Azure before you deploy
  • ⚙️ Infrastructure Code Generation: Generate Terraform, Helm charts, and Kubernetes manifests through conversation
  • 🔍 Troubleshooting: Debug infrastructure issues with AI-powered analysis and recommendations
  • 🔗 Tool Integration: Connect Slack, GitHub, and your existing tools for seamless workflows

Example conversations:

  • "How much would 10 t3.medium instances cost per month in us-west-2?"
  • "Generate Terraform for an EKS cluster with autoscaling"
  • "Why is my pod crashing in the production namespace?"
  • "Create a Helm chart for a Node.js app with Redis"
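
To make the first question concrete, the arithmetic behind a monthly estimate is instance count × hourly rate × hours per month. Below is a rough sketch, assuming an on-demand Linux rate of $0.0416/hour for t3.medium and a 730-hour month. Both figures are assumptions for illustration, so check current AWS pricing before relying on them. It does not represent how Nova computes costs.

```python
HOURS_PER_MONTH = 730  # common billing approximation: 8760 hours / 12 months

# Assumed on-demand hourly rate in USD; verify against current AWS pricing.
ASSUMED_T3_MEDIUM_RATE = 0.0416


def monthly_cost(instance_count: int, hourly_rate_usd: float,
                 hours: int = HOURS_PER_MONTH) -> float:
    """Estimate monthly compute cost in USD, rounded to cents."""
    return round(instance_count * hourly_rate_usd * hours, 2)


print(monthly_cost(10, ASSUMED_T3_MEDIUM_RATE))
```

The estimate covers compute only. Storage, data transfer, and any reserved or savings-plan discounts would change the total.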

Three Ways to Use Nova

Nova is available however you prefer to work:

Nova Cloud

Zero setup — just start chatting.

Open Nova and start asking questions. We handle everything.

  • Full access to all Skills (Slack, GitHub, AWS, Terraform, and more)
  • Conversation history and cross-device sync
  • No installation required

Nova Direct

Self-hosted with complete control.

Run Nova on your infrastructure via Docker. Your data and runtime stay in your environment.

  • Browser-based local interface
  • MCP Marketplace to add integrations
  • Add any MCP servers you need
  • Air-gapped environment support

Nova Connect

Nova in your favorite editor.

Use Nova through the hosted remote MCP endpoint in tools like Claude Code, Cursor, VS Code, Claude Desktop, and other OAuth-capable MCP clients.

  • Hosted remote MCP for editor and CLI clients
  • Managed sign-in through the hosted OAuth flow
  • Same Nova capabilities in your development workflow

Extensible by Design

Nova's power comes from its extensibility. Connect the tools you already use:

Available Skills:

  • Cloud Providers — AWS, Google Cloud, Azure cost calculations and resource management
  • Communication — Slack integration for team collaboration
  • Development — GitHub for code search, issues, and PRs
  • Infrastructure — Terraform and Helm configuration generation
  • Kubernetes — Cluster management and troubleshooting
  • And more — Add any MCP server for custom integrations
Bring Your Own Tools

Nova Direct includes the built-in MCP Marketplace for custom integrations. Nova Connect works through standard MCP-compatible clients, so teams can bring Nova into existing editor and CLI workflows without running the server locally.

The Vision: Platform Engineering for Everyone

The evolution from DevOps → SRE → Platform Engineering → AI-Assisted Platform Engineering represents more than technological progress — it's about democratizing expertise that has historically been scarce and expensive.

Traditional Model

  • Small teams of platform experts serve entire organizations
  • Knowledge bottlenecks create deployment delays
  • Scaling requires hiring more specialists
  • Critical knowledge lives in people, not systems

Nova Model

  • Platform expertise available to every team member
  • Knowledge scales instantly without hiring
  • Best practices consistently applied
  • Operational knowledge captured and improved

Podcast: Deep Dive into Cloud Infrastructure Management

· 2 min read
Rajesh RC
Founder

Welcome to our technical deep-dive podcast where we explore how Astro Platform is transforming cloud infrastructure management. Join us as we discuss the challenges, solutions, and future of cloud computing with industry experts.

Watch the video version:

Episode Highlights

The Cloud Evolution

According to Gartner's prediction highlighted in our discussion, by 2025, over 95% of new digital workloads will be deployed on cloud-native platforms, up from 30% in 2021. This dramatic shift underscores the urgency for efficient cloud management solutions.

Key Challenges Addressed

  • Multiple tool management across different cloud providers
  • Complex Kubernetes orchestration
  • Resource optimization and cost management
  • Security and compliance maintenance

Astro Platform Solutions

  1. Unified Management Interface

    • Single dashboard for all cloud providers
    • Simplified Kubernetes cluster management
    • Integrated monitoring and logging
  2. Cost Optimization with AI

    • Resource usage analysis
    • Automated scaling recommendations
    • Potential savings of up to $320,000 annually for mid-sized enterprises
  3. Security and Future-Proofing

    • Proactive security measures
    • Adaptable architecture for emerging technologies
    • Support for serverless computing integration

Get Started

Ready to simplify your cloud infrastructure? Check out our platform documentation to begin your journey.

Additional Resources

For those interested in exploring topics discussed in this episode:

Contact Us

For more information or to speak directly with our team, please contact us.