Case study
Minecraft Monitoring
A PaperMC game server treated like a production service: JMX metrics, RCON control from a 10-command Discord bot, and automated lag alerts — 99.8% uptime over a 30-day window.
- PaperMC 26.1.2
- Java 25
- JMX Exporter
- RCON
- Discord.py
- Prometheus + Grafana
Context
Minecraft server administration usually means SSH and a screen session: type list into a console, eyeball raw TPS, restart by hand when it crashes. No metrics history, no alerting, and no way for the Discord community to check server status themselves. The goal was to treat a game server like any other production service — observable, alertable, and controllable from the place the community already lives.
Architecture
The server runs as a StatefulSet on k3s with three integration layers: a JMX exporter that exposes TPS, heap usage, GC pauses, and player count to Prometheus; the RCON protocol as a TCP control channel; and a Discord.py bot with 10 slash commands. The bot runs as a sidecar in the same pod, so RCON calls hit localhost:25575 with zero network hops, while Prometheus queries travel the Istio mesh with mTLS enforced.
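To make the RCON control channel concrete, here is a minimal sketch of the packet framing the bot would have to speak. It assumes Minecraft's RCON follows the Source RCON wire format (little-endian length, request id, type, then a null-terminated body plus one padding null) and that bodies are ASCII. The function names are hypothetical and are not taken from the project.

```python
import struct

# Packet type codes from the Source RCON protocol (assumed to apply here).
SERVERDATA_AUTH = 3
SERVERDATA_EXECCOMMAND = 2


def encode_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Frame one RCON packet: length prefix, id, type, body, two null bytes."""
    payload = (
        struct.pack("<ii", request_id, packet_type)
        + body.encode("ascii")
        + b"\x00\x00"  # body terminator plus the empty trailing string
    )
    # The length field counts everything after itself.
    return struct.pack("<i", len(payload)) + payload


def decode_packet(data: bytes) -> tuple[int, int, str]:
    """Parse one complete RCON packet into (request_id, packet_type, body)."""
    (length,) = struct.unpack_from("<i", data, 0)
    request_id, packet_type = struct.unpack_from("<ii", data, 4)
    # Body starts after the three 4-byte integers and stops before the two nulls.
    body = data[12 : 4 + length - 2].decode("ascii")
    return request_id, packet_type, body
```

A session would send a `SERVERDATA_AUTH` packet carrying the shared password first, then one `SERVERDATA_EXECCOMMAND` packet per command, all over the TCP connection to localhost:25575.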
Prometheus alert rules fire on two thresholds — TPS below 18 (lag) and heap above 400 MB (memory pressure). Either one posts a Discord embed with the last 5 minutes of metrics, deduplicated within a 5-minute window so a sustained degradation cannot storm the channel.
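The two thresholds and the 5-minute deduplication window can be sketched as follows. The source does not say whether deduplication happens in Alertmanager or in the bot, so this sketch assumes it lives in the bot. The alert names `lag` and `memory_pressure` and the class name are hypothetical.

```python
import time
from typing import Optional

TPS_THRESHOLD = 18.0        # alert when TPS drops below this
HEAP_THRESHOLD_MB = 400.0   # alert when heap usage rises above this
DEDUP_WINDOW_S = 5 * 60     # suppress repeats of the same alert for 5 minutes


def breached(tps: float, heap_mb: float) -> list[str]:
    """Return the names of the thresholds the current sample violates."""
    alerts = []
    if tps < TPS_THRESHOLD:
        alerts.append("lag")
    if heap_mb > HEAP_THRESHOLD_MB:
        alerts.append("memory_pressure")
    return alerts


class AlertDeduplicator:
    """Allow each alert name through at most once per window."""

    def __init__(self, window_s: float = DEDUP_WINDOW_S) -> None:
        self.window_s = window_s
        self._last_sent: dict[str, float] = {}

    def should_post(self, alert_name: str, now: Optional[float] = None) -> bool:
        # Accepting `now` as a parameter keeps the logic testable without sleeping.
        now = time.monotonic() if now is None else now
        last = self._last_sent.get(alert_name)
        if last is not None and now - last < self.window_s:
            return False  # still inside the window: drop the duplicate
        self._last_sent[alert_name] = now
        return True
```

Tracking the last-sent time per alert name means a sustained TPS drop posts once per window, while a heap alert arriving in the same window still gets through.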
Key decisions & tradeoffs
- Split reads from writes. Read commands (/status, /players, /tps) query Prometheus and render Discord embeds with zero server load; write commands (/restart, /whitelist, /ban) open an RCON session, execute, and close within 200 ms.
- Role checks before every write. No RCON socket opens until the Discord role check passes.
- Lock RCON down with the mesh. RCON has no real auth beyond a shared password, so an Istio authorization policy allows only the bot's service account to reach port 25575.
- Slower scrape for game metrics. The JMX exporter adds ~30 MB of heap — noticeable on a 2 GB node — so game-server metrics scrape at 30s instead of 15s.
- ZGC on Java 25. Kept the last GC pause at 45 ms, comfortably under the ~100 ms mark where players notice rubberbanding; pause percentiles in Grafana let me correlate player complaints with GC events.
- Plan around Discord's command cache. Slash-command metadata sits behind a 1-hour global cache, so bot deployments are scheduled around that window.
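The read/write split and the role check that gates every write can be sketched as a small dispatcher. This is an illustration rather than the bot's actual code: it covers only the six commands named above, the role name `server-admin` is hypothetical, and the two callables stand in for the Prometheus query and the RCON session.

```python
from typing import Callable, Iterable

READ_COMMANDS = {"status", "players", "tps"}
WRITE_COMMANDS = {"restart", "whitelist", "ban"}
ADMIN_ROLE = "server-admin"  # hypothetical role name


class PermissionDenied(Exception):
    """Raised when a user without the admin role attempts a write command."""


def dispatch(
    command: str,
    user_roles: Iterable[str],
    query_metrics: Callable[[str], str],
    run_rcon: Callable[[str], str],
) -> str:
    """Route reads to the metrics backend and gated writes to RCON."""
    if command in READ_COMMANDS:
        # Reads never touch the game server.
        return query_metrics(command)
    if command in WRITE_COMMANDS:
        # The role check runs first, so run_rcon (and its socket) is never
        # reached for an unauthorized user.
        if ADMIN_ROLE not in set(user_roles):
            raise PermissionDenied(f"/{command} requires the {ADMIN_ROLE} role")
        return run_rcon(command)
    raise ValueError(f"unknown command: {command}")
```

Keeping the check inside the dispatcher, ahead of the `run_rcon` call, means a new write command cannot skip it by accident.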
Measured outcomes
- Uptime: 99.8% (30-day window)
- Command latency: ~180 ms (incl. Discord round-trip)
- Last GC pause: 45 ms (ZGC, Java 25)
- Alert storms: 0 (5-min dedup window)
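For scale, the uptime figure can be converted into a downtime budget. The helper below is plain arithmetic on the reported numbers, not a measurement from the project, and the function name is hypothetical.

```python
def downtime_minutes(uptime_pct: float, window_days: int) -> float:
    """Minutes of downtime implied by an uptime percentage over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - uptime_pct) / 100.0


# 99.8% over 30 days leaves roughly 86 minutes of downtime.
budget = downtime_minutes(99.8, 30)
```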