The Vanilla Production Monitor

What We Built

The Vanilla Production Monitor uses a locally-running instance of the upstream agent platform to watch our production deployment. The vanilla instance SSHes into the production droplet as a read-only observer, uses a monitoring API key to hit admin telemetry endpoints, and sends Slack alerts when something looks wrong. It consists of four modular skills (container health, API health, upstream drift, usage anomalies), each scheduled independently.
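To make "four modular skills, each scheduled independently" concrete, here is a minimal sketch of that shape. It is not the monitor's actual code: the Skill class, the run_due function, the intervals, and the placeholder checks are all hypothetical, and the real skills run inside the agent platform rather than in a loop like this.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Skill:
    name: str
    interval_s: int                      # assumed schedule, not taken from the real config
    check: Callable[[], Optional[str]]   # returns an alert message, or None if healthy


def run_due(skills: List[Skill], last_run: Dict[str, float], now: float) -> List[str]:
    """Run every skill whose interval has elapsed and collect alert messages."""
    alerts = []
    for skill in skills:
        # A skill that has never run is always due.
        if now - last_run.get(skill.name, float("-inf")) >= skill.interval_s:
            last_run[skill.name] = now
            try:
                message = skill.check()
            except Exception as exc:
                # One failing skill should not block the others.
                message = f"check raised {exc!r}"
            if message:
                alerts.append(f"[{skill.name}] {message}")
    return alerts


# Placeholder checks standing in for the SSH and telemetry calls.
SKILLS = [
    Skill("container_health", 60, lambda: None),
    Skill("api_health", 60, lambda: None),
    Skill("upstream_drift", 86400, lambda: None),
    Skill("usage_anomalies", 3600, lambda: None),
]
```

Each skill carries its own interval, so a slow daily drift check never delays a per-minute health check, and the returned strings are what would be forwarded to Slack.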

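It works. The design is clean and the skills are composable. But it rests on two assumptions that might not hold: that the machine running the monitor stays online, and that the monitor’s picture of what’s normal stays in sync with our fork.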

What Makes It Fragile

The monitoring instance runs on a local machine. If that machine sleeps or loses network, monitoring stops silently. We’ve noted this before in the context of Amber Monitor (which has the same problem), but it’s worth naming specifically here: both an offline monitor and a healthy production environment produce no alerts, so from Slack the two are indistinguishable.
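One common way to tell "quiet because healthy" from "quiet because offline" is a heartbeat, sometimes called a dead man's switch: the monitor reports in on a schedule, and something running elsewhere alerts when the reports stop. The sketch below shows only the staleness check. The function name, the status strings, and the ten-minute threshold are assumptions for illustration, and where the heartbeat is stored is left open.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def monitor_status(
    last_beat: Optional[datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=10),  # assumed threshold
) -> str:
    """Classify the monitor from the time of its most recent heartbeat."""
    if last_beat is None:
        return "never-reported"
    if now - last_beat > max_silence:
        return "silent"   # the monitor itself may be asleep or offline
    return "alive"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(monitor_status(now - timedelta(minutes=3), now))
print(monitor_status(now - timedelta(hours=2), now))
```

The check only helps if it runs somewhere other than the local machine; otherwise it goes to sleep along with the monitor it is supposed to watch.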

The approach also creates a version coupling we haven’t fully thought through. The vanilla instance reads DEVIATIONS.md to understand what it’s looking at. If our fork diverges significantly from upstream, the monitor’s understanding of what’s normal may drift too.
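One possible guard against this coupling, sketched under assumptions: record a fingerprint of DEVIATIONS.md at the point someone last reviewed the monitor's baseline, and flag when the file no longer matches. The function names and the idea of a stored "reviewed" fingerprint are hypothetical. This only detects that the file changed, not whether the change matters to any skill.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the file contents, ignoring trailing whitespace differences."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def baseline_is_current(deviations_text: str, reviewed_fingerprint: str) -> bool:
    """True if DEVIATIONS.md still matches the version the baseline was reviewed against."""
    return fingerprint(deviations_text) == reviewed_fingerprint


# Hypothetical contents, for illustration only.
reviewed = fingerprint("auth: custom middleware\nbilling: added\n")
print(baseline_is_current("auth: custom middleware\nbilling: added", reviewed))
print(baseline_is_current("auth: custom middleware\nbilling: added\nqueue: replaced", reviewed))
```

A mismatch would be a prompt to re-review what the monitor treats as normal, not an alert about production itself.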

The Question

Is “use an AI agent to watch production” the right model for a platform that serves multiple users? It’s a great pattern for bootstrapping and for a small team with high technical context. But a purpose-built monitoring service (Grafana, Better Uptime, or something in between) would give us dashboards, alerting histories, and on-call routing, none of which the agent-based monitor provides today.

The vanilla monitor is a thoughtful stopgap. The question is whether we have a plan for what comes after it, and what the trigger should be for moving on.