Minecraft Monitoring — game-server observability with Discord control

Context

Minecraft server administration usually means SSH and a screen session: type "list" into a console, eyeball raw TPS, restart by hand when it crashes. That leaves no metrics history, no alerting, and no way for the Discord community to check server status themselves. The goal was to treat a game server like any other production service: observable, alertable, and controllable from the place the community already lives.

Architecture

The server runs as a StatefulSet on k3s with three integration layers: a JMX exporter that exposes TPS, heap usage, GC pauses, and player count to Prometheus; the RCON protocol as a TCP control channel; and a discord.py bot with 10 slash commands. The bot runs as a sidecar in the same pod, so RCON calls hit localhost:25575 with zero network hops, while Prometheus queries travel the Istio mesh with mTLS enforced.
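
To make the RCON control channel concrete, here is a minimal sketch of the packet framing used by the Source RCON protocol, which Minecraft's RCON follows: a little-endian length prefix, a request id, a packet type, and a null-terminated body. The helper names are hypothetical and are not taken from the bot's code; a real client would also handle login, partial reads, and timeouts.

```python
import struct

# Standard Source RCON packet types.
TYPE_LOGIN = 3
TYPE_COMMAND = 2


def encode_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Frame one RCON packet: length, request id, type, body, two null bytes."""
    payload = (
        struct.pack("<ii", request_id, packet_type)
        + body.encode("utf-8")
        + b"\x00\x00"
    )
    # The length prefix counts every byte that follows it.
    return struct.pack("<i", len(payload)) + payload


def decode_packet(data: bytes) -> tuple[int, int, str]:
    """Parse one complete RCON packet into (request_id, packet_type, body)."""
    (length,) = struct.unpack_from("<i", data, 0)
    request_id, packet_type = struct.unpack_from("<ii", data, 4)
    # Body sits between the 12-byte header and the two trailing null bytes.
    body = data[12 : 4 + length - 2].decode("utf-8")
    return request_id, packet_type, body
```

Because the bot talks to localhost:25575, a write command would amount to one login packet, one command packet, and a close.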

Reads go through Prometheus (no server load); writes go through RCON in the same pod.
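
The read path can be sketched with the Prometheus HTTP API's instant-query endpoint (/api/v1/query). This is an illustration under assumptions: the base URL and the metric name minecraft_tps are hypothetical, and the actual HTTP call is left out so the helpers stay testable without a network.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def first_sample(response_body: str) -> Optional[float]:
    """Return the first value of an instant-query vector result, or None if empty."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])
```

A read command such as /tps would fetch the URL, pass the response body to first_sample, and render the number in an embed, never touching the game server itself.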

Prometheus alert rules fire on two thresholds — TPS below 18 (lag) and heap above 400 MB (memory pressure). Either one posts a Discord embed with the last 5 minutes of metrics, deduplicated within a 5-minute window so a sustained degradation cannot storm the channel.
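
The two thresholds and the dedup window can be illustrated in a few lines of Python. This is a sketch only: in the setup described here the thresholds live in Prometheus alert rules, and the alert names LowTPS and HighHeap, the class name, and the injectable clock are all hypothetical.

```python
import time
from typing import Callable, Dict, List

TPS_MIN = 18.0       # alert when TPS drops below this
HEAP_MAX_MB = 400.0  # alert when heap usage rises above this


def firing_alerts(tps: float, heap_mb: float) -> List[str]:
    """Return the names of the alerts whose threshold is currently breached."""
    alerts = []
    if tps < TPS_MIN:
        alerts.append("LowTPS")
    if heap_mb > HEAP_MAX_MB:
        alerts.append("HighHeap")
    return alerts


class AlertDeduplicator:
    """Suppress repeat notifications for the same alert inside a fixed window."""

    def __init__(
        self,
        window_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self._clock = clock
        self._last_sent: Dict[str, float] = {}

    def should_send(self, alert_name: str) -> bool:
        """Return True and record the send time if the alert is outside its window."""
        now = self._clock()
        last = self._last_sent.get(alert_name)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_sent[alert_name] = now
        return True
```

Keying the window by alert name means a lag alert does not mask a memory alert that starts a minute later; each gets at most one embed per 5 minutes.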

Key decisions & tradeoffs

Split reads from writes. Read commands (/status, /players, /tps) query Prometheus and render Discord embeds with zero server load; write commands (/restart, /whitelist, /ban) open an RCON session, execute, and close within 200 ms.
Role checks before every write. No RCON socket opens until the Discord role check passes.
Lock RCON down with the mesh. RCON has no real auth beyond a shared password, so an Istio authorization policy allows only the bot's service account to reach port 25575.
Slower scrape for game metrics. The JMX exporter adds ~30 MB of heap — noticeable on a 2 GB node — so game-server metrics scrape at 30s instead of 15s.
ZGC on Java 25. Kept the last GC pause at 45 ms, comfortably under the ~100 ms mark where players notice rubberbanding; pause percentiles in Grafana let me correlate player complaints with GC events.
Plan around Discord's command cache. Slash-command metadata sits behind a 1-hour global cache, so bot deployments are scheduled around that window.
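
The "role checks before every write" rule can be sketched as a gate that refuses to call the RCON sender until the role check passes. The role name, exception, and function names below are hypothetical, and the sender is passed in as a callable so the ordering is easy to verify; this is not the bot's actual code.

```python
from typing import Callable, Iterable

ADMIN_ROLES = frozenset({"Server Admin"})  # hypothetical Discord role name


class NotAuthorized(Exception):
    """Raised when the caller lacks a role permitted to run write commands."""


def run_write_command(
    member_roles: Iterable[str],
    command: str,
    send_rcon: Callable[[str], str],
) -> str:
    """Check roles first; only then invoke the RCON sender."""
    if ADMIN_ROLES.isdisjoint(member_roles):
        # No socket is opened on this path: send_rcon is never called.
        raise NotAuthorized(f"role check failed for command: {command}")
    return send_rcon(command)
```

In a discord.py handler the role names would presumably come from the invoking member's roles, and send_rcon would open the localhost:25575 session, run the command, and close it.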

Measured outcomes

Uptime: 99.8% (30-day window)
Command latency: ~180 ms (incl. Discord round-trip)
Last GC pause: 45 ms (ZGC, Java 25)
Alert storms: 0 (5-min dedup window)
