
The AstroPulse Journey

· 4 min read
Rajesh RC
Founder

When I started AstroPulse, the problem was easy to name and hard to live with: teams moving to the cloud were drowning in tools. Every provider had its own consoles, its own primitives, its own way to deploy an app and stand up a cluster. The work that mattered, shipping software, kept getting buried under the work of operating it.

I wanted one place to deploy an application and run a cluster, on any cloud, without learning five different platforms first. That was the beginning. Everything since has been built on top of that one idea, one layer at a time.

The AI SRE Race Is Running the Wrong Way

· 7 min read
Rajesh RC
Founder

AI diagnosis flowing through a governed approval gate into production infrastructure

The thesis

Diagnosis is a commodity. Trust is the product.

The AI SRE race will not be won by the agent that diagnoses fastest. It will be won by the system that operators trust enough to grant write access.

A personal note on where AI for operations is actually heading.

Over the past year a new category filled up fast. Depending on how you count, there are now more than a dozen credible tools that call themselves AI SREs. I have watched the space closely, partly because we are building in it, and partly because the speed of convergence is genuinely interesting.

Here is what nearly all of them do. They connect to your telemetry, your code, and your incident tooling. They correlate logs, metrics, and traces. When an alert fires, they form hypotheses, test them against the evidence, and post a likely root cause into Slack, often in under a minute. This is real progress. A few years ago none of it worked. Today most of it does.
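
As a rough illustration of the hypothesize-and-test loop described above, here is a minimal Python sketch. It is not taken from any particular tool: the names `Hypothesis`, `score`, and `rank_hypotheses`, the signal strings, and the simple overlap-based scoring are all assumptions made for the example. Real systems weigh evidence far more carefully.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    cause: str
    # Hypothetical: signals we would expect to see if this cause were real.
    expected_signals: frozenset


def score(hypothesis: Hypothesis, observed: set) -> float:
    """Fraction of the hypothesis's expected signals present in the telemetry."""
    if not hypothesis.expected_signals:
        return 0.0
    matched = hypothesis.expected_signals & observed
    return len(matched) / len(hypothesis.expected_signals)


def rank_hypotheses(hypotheses, observed):
    """Return (hypothesis, score) pairs, best-supported first."""
    scored = [(h, score(h, observed)) for h in hypotheses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative signal names only; these are not real alert identifiers.
observed_signals = {"pod_oom_killed", "memory_usage_spike", "http_5xx_increase"}
candidates = [
    Hypothesis("memory leak in new release",
               frozenset({"pod_oom_killed", "memory_usage_spike"})),
    Hypothesis("database connection exhaustion",
               frozenset({"db_conn_pool_full", "http_5xx_increase"})),
    Hypothesis("expired TLS certificate",
               frozenset({"tls_handshake_errors"})),
]
ranked = rank_hypotheses(candidates, observed_signals)
```

The top-ranked entry is what such a tool would post as the likely root cause. Nothing in this loop touches production, which is part of why diagnosis is the easier half of the problem.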

Phase one is real

Diagnosis is real progress. But it is just phase one.

How We Designed Nova's Investigation Engine: Lessons from SRE at Scale

· 8 min read
Rajesh RC
Founder

When something breaks in production at an odd hour, the person on call has to do three things at once: understand what is happening, decide what to do about it, and be able to explain all of it the next day. Most AI incident tools help with at most one of these. They either give you more data to read, or they take action you cannot see and cannot account for afterward.

We spent the last several months building Nova's investigation engine around that gap. This post is about how we designed it, the models we borrowed from, and the trade-offs we made along the way.

Bring Your Own Kubernetes Cluster

· 5 min read
Rajesh RC
Founder

Six months in as platform lead and you have a spreadsheet you haven't shown your manager. Eleven Kubernetes clusters. EKS for production. GKE for the ML team. Two on-prem clusters behind the firewall that predate your tenure. A handful of Kind clusters developers spun up locally. Each one has its own deployment pipeline, its own credential rotation process, its own way of answering "is this service healthy?"

Your team isn't building features anymore. You're maintaining eleven slightly different versions of the same tooling.

astroctl infra k8s register --name my-cluster

The Hardest Problems in Building Production AI Agents

· 25 min read
Rajesh RC
Founder

Every AI agent demo looks the same. The model calls a tool, gets a result, and responds. Then you run it against real infrastructure, and the demo falls apart in ways the tutorials never mention.

We have spent over a year building Nova, an AI agent that operates real infrastructure for real teams. It is not a chatbot that wraps API calls. It investigates incidents, executes remediations, and composes across dozens of integrations. This post is about what we learned: the problems that made us rebuild entire subsystems, and the patterns that survived.

One AI, Every Interface

· 6 min read
Rajesh RC
Founder

Platform teams carry operational knowledge that does not transfer easily. The debugging instincts, the service interdependencies, the deployment quirks: they accumulate over years and live in a few people's heads. When those people are unavailable, the gap shows.

We built Nova to put that knowledge into a system you can query. This post covers what the architecture looks like and what we learned building AI that actually operates infrastructure.

Building AI Infrastructure: The Case for Specialized Models and AI Agents

· 52 min read
Rajesh RC
Founder
TL;DR - What You'll Learn

Building Enterprise AI Infrastructure: The Six Pillars, Specialized Models, and Emerging AI Agents

This deep-dive explores:

  • The six pillars required (Data Infrastructure, GPU Infrastructure, Training Pipeline, Model Serving, Supporting Services, Security & Governance)
  • Why specialized small models outperform foundation models for enterprises (85% better on domain tasks, 13-33x cheaper, data sovereignty)
  • How emerging AI agents are changing economics (5-10 person platform teams → 1-2 engineers + AI agents)
  • The open-source stack (KServe, vLLM, SGLang, TensorRT-LLM, MLflow, Kubeflow, DeepSpeed, Temporal)
  • Why current tools are fragmented and operationally complex
  • The vision: Self-hosted infrastructure with managed-platform simplicity—powered by specialized models for business logic + AI agents for operations

Introduction

Enterprises are discovering they can run powerful AI models on their own infrastructure—but building production AI infrastructure is significantly harder than application deployment.

This post breaks down the six interconnected systems required, why specialized small models outperform foundation models for enterprise use cases, how emerging AI agents are changing the economics, and the engineering trade-offs at every layer.

The Vision: Making AI infrastructure as simple as git push
📖 About This Guide

This is a comprehensive technical deep-dive. We explore the complete AI infrastructure landscape—from why enterprises build their own platforms to the six pillars required and the open-source technologies available.

  • 🎯 Looking for specific topics? Use the navigation guide below to jump to what you need
  • 📚 Want to understand the full picture? Read through—it's structured as a comprehensive exploration of AI infrastructure challenges and solutions

From Git Push to Production: Your Own Self-Hosted Platform

· 57 min read
Rajesh RC
Founder
TL;DR: What You'll Build

In this guide, you'll build your own Vercel-like platform on Kubernetes in ~30 minutes:

  • You'll deploy an EKS cluster with kpack (auto-builds), cert-manager (TLS), external-dns (DNS), and nginx-ingress
  • You'll configure automatic Git push → build → live HTTPS deployment (just like Vercel)
  • You'll run any workload: web apps, APIs, databases, microservices, background jobs—any language
  • You'll add security scanning, compliance controls, and observability for production
  • You'll use Nova to debug, troubleshoot, and operate your platform

Perfect for building internal developer platforms, launching SaaS products, or meeting enterprise compliance requirements.

Introduction

You want the simplicity of "push code, get a live URL"—the developer experience Vercel pioneered—but with full control over your deployment, infrastructure, and compliance. This guide shows you how to build that experience on your own AWS infrastructure using AstroPulse and open-source tools: kpack, cert-manager, external-dns, and nginx-ingress.

AstroPulse PaaS Flow Architecture

You'll build a production-grade platform that delivers Git-push deployments with automatic TLS certificates, preview URLs, and complete observability—all running on infrastructure you own and control. Unlike hosted PaaS platforms, you'll be building on Kubernetes with full deployment control. That means you can run any workload: microservices (with or without public endpoints), stateful databases, WebSockets, long-running background jobs, AI/ML model training and serving, or traditional web applications in any language. You get the simple developer experience with complete architectural control.

How operations work: The infrastructure industry is moving toward an agentic era in which AI agents autonomously handle complex workflows, built on protocols such as MCP (Model Context Protocol) and A2A (Agent2Agent) and on multi-agent orchestration. We're heading toward infrastructure that self-configures, self-heals, and self-optimizes. We're not there yet, but Nova brings you AI-assisted operations today, with a human in the loop. Day 1 (this guide): you build the platform. Day 2 (ongoing): Nova analyzes issues, diagnoses problems, and recommends fixes; you approve them. As AI matures, more of this becomes autonomous.
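
To make the human-in-the-loop idea concrete, here is a minimal Python sketch of an approval gate: a recommendation is applied only if an approver says yes, and every decision is logged. This is an illustration under stated assumptions, not Nova's implementation. `Recommendation`, `ApprovalGate`, and the example fixes are hypothetical names invented for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Recommendation:
    diagnosis: str
    proposed_fix: str


@dataclass
class ApprovalGate:
    # Hypothetical approver callback; in practice this would be a human decision.
    approver: Callable[[Recommendation], bool]
    audit_log: List[str] = field(default_factory=list)

    def submit(self, rec: Recommendation,
               apply_fix: Callable[[Recommendation], None]) -> bool:
        """Record the decision, and run the fix only when it was approved."""
        approved = self.approver(rec)
        verdict = "APPROVED" if approved else "REJECTED"
        self.audit_log.append(f"{verdict}: {rec.proposed_fix}")
        if approved:
            apply_fix(rec)
        return approved


# Stand-in for a human: approve restarts, reject anything else.
applied = []
gate = ApprovalGate(approver=lambda rec: "restart" in rec.proposed_fix)
gate.submit(Recommendation("pod crash-looping", "restart deployment web"), applied.append)
gate.submit(Recommendation("disk nearly full", "delete old volumes"), applied.append)
```

Only the approved restart reaches `applied`. The rejected recommendation still appears in `audit_log`, so the decision can be explained afterward.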

📖 About This Guide

This is a comprehensive, production-ready blueprint. We cover everything from architecture to production deployment with complete working examples, security, compliance, and troubleshooting.

  • Want the fast track? Jump to our automated setup script (platform deploys in 30 minutes)
  • 🎯 Looking for specific topics? Use the navigation guide below to jump to what you need
  • 📚 Want to understand every detail? Read through—it's structured as a comprehensive step-by-step walkthrough