How We Designed Nova's Investigation Engine — Lessons from SRE at Scale
When pods crash at odd hours, you need an AI that investigates like an SRE, not like a chatbot. You need something that checks the right things in the right order, tells you what it found, and waits for your call before touching anything. Most current tools either dump a wall of logs on you and say "good luck," or run opaque automations you can't see, can't trust, and can't explain in a postmortem.
We spent the last several months designing and building Nova's investigation engine. This post is about the approach we took, the mental models that shaped it, and the trade-offs we made along the way.