How We Designed Nova's Investigation Engine: Lessons from SRE at Scale
When something breaks in production at an odd hour, the person on call has to do three things at once: understand what is happening, decide what to do about it, and be able to explain all of it the next day. Most AI incident tools help with at most one of these. They either give you more data to read or take actions you cannot see and cannot account for afterward.
We spent the last several months building Nova's investigation engine around that gap. This post is about how we designed it, the models we borrowed from, and the trade-offs we made along the way.