Case study

Minecraft Monitoring

A PaperMC game server treated like a production service: JMX metrics, RCON control from a 10-command Discord bot, and automated lag alerts — 99.8% uptime over a 30-day window.

devops · discord · May 2026

  • PaperMC 26.1.2
  • Java 25
  • JMX Exporter
  • RCON
  • Discord.py
  • Prometheus + Grafana

Context

Minecraft server administration usually means SSH and a screen session: type list into a console, eyeball raw TPS, restart by hand when it crashes. No metrics history, no alerting, and no way for the Discord community to check server status themselves. The goal was to treat a game server like any other production service — observable, alertable, and controllable from the place the community already lives.

Architecture

The server runs as a StatefulSet on k3s with three integration layers: a JMX exporter that exposes TPS, heap usage, GC pauses, and player count to Prometheus; the RCON protocol as a TCP control channel; and a Discord.py bot with 10 slash commands. The bot runs as a sidecar in the same pod, so RCON calls hit localhost:25575 with zero network hops, while Prometheus queries travel the Istio mesh with mTLS enforced.

[Figure: Minecraft monitoring architecture. The Discord.py bot (10 slash commands, role checks) reads from Prometheus (TPS, heap, players; JMX scraped every 30s) and writes commands over RCON (localhost:25575) to PaperMC 26.1.2 on Java 25 (k3s StatefulSet, Istio mTLS, ZGC).]
Reads go through Prometheus (no server load); writes go through RCON in the same pod.
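The write path is small enough to sketch directly. The block below is a minimal illustration of RCON framing as defined by the Source RCON protocol, which Minecraft's RCON follows; it is not the bot's actual code. The helper names (encode_packet, decode_packet, rcon_exec) and the request ids are made up for the example, and it assumes UTF-8 bodies and a response that fits in a single packet.

```python
import socket
import struct

# Packet types from the Source RCON protocol.
SERVERDATA_AUTH = 3
SERVERDATA_EXECCOMMAND = 2
SERVERDATA_RESPONSE_VALUE = 0


def encode_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Frame one packet: length, request id, type, body, two null bytes."""
    payload = (
        struct.pack("<ii", request_id, packet_type)
        + body.encode("utf-8")
        + b"\x00\x00"
    )
    # The little-endian length prefix counts everything after itself.
    return struct.pack("<i", len(payload)) + payload


def decode_packet(data: bytes) -> tuple:
    """Parse one complete packet into (request_id, packet_type, body)."""
    (length,) = struct.unpack_from("<i", data, 0)
    request_id, packet_type = struct.unpack_from("<ii", data, 4)
    # Body sits between the 12-byte header and the two trailing nulls.
    body = data[12 : 4 + length - 2].decode("utf-8")
    return request_id, packet_type, body


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("RCON connection closed early")
        buf += chunk
    return buf


def _read_packet(sock: socket.socket) -> tuple:
    header = _recv_exact(sock, 4)
    (length,) = struct.unpack("<i", header)
    return decode_packet(header + _recv_exact(sock, length))


def rcon_exec(password: str, command: str,
              host: str = "localhost", port: int = 25575,
              timeout: float = 2.0) -> str:
    """Open a session, authenticate, run one command, and close."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_packet(1, SERVERDATA_AUTH, password))
        auth_id, _, _ = _read_packet(sock)
        if auth_id == -1:  # the server answers a bad password with id -1
            raise PermissionError("RCON authentication failed")
        sock.sendall(encode_packet(2, SERVERDATA_EXECCOMMAND, command))
        _, _, body = _read_packet(sock)  # assumes a single-packet reply
        return body
```

Opening and closing the socket per command matches the short-lived sessions described in this write-up; a long-lived connection would need reconnect handling that this sketch leaves out.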

Prometheus alert rules fire on two thresholds — TPS below 18 (lag) and heap above 400 MB (memory pressure). Either one posts a Discord embed with the last 5 minutes of metrics, deduplicated within a 5-minute window so a sustained degradation cannot storm the channel.
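The two thresholds and the dedup window can be sketched in a few lines. The write-up does not say whether deduplication happens in the alerting pipeline or in the bot, so the block below assumes a bot-side check purely for illustration. The names breached and AlertDeduplicator are hypothetical; only the numbers (18 TPS, 400 MB, 5 minutes) come from the text.

```python
import time
from typing import Dict, List, Optional

TPS_MIN = 18.0            # alert when TPS drops below 18 (lag)
HEAP_MAX_MB = 400.0       # alert when heap exceeds 400 MB (memory pressure)
DEDUP_WINDOW_S = 5 * 60   # suppress repeats of one alert for 5 minutes


def breached(tps: float, heap_mb: float) -> List[str]:
    """Return the names of the thresholds the current sample violates."""
    alerts = []
    if tps < TPS_MIN:
        alerts.append("lag")
    if heap_mb > HEAP_MAX_MB:
        alerts.append("memory_pressure")
    return alerts


class AlertDeduplicator:
    """Remembers when each alert was last posted and drops repeats in-window."""

    def __init__(self, window_s: float = DEDUP_WINDOW_S) -> None:
        self.window_s = window_s
        self._last_posted: Dict[str, float] = {}

    def should_post(self, alert_name: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        last = self._last_posted.get(alert_name)
        if last is not None and now - last < self.window_s:
            return False  # still inside the window: stay quiet
        self._last_posted[alert_name] = now
        return True
```

Each alert name is tracked separately, so a lag alert does not mask a memory-pressure alert that fires a minute later.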

Key decisions & tradeoffs

  • Split reads from writes. Read commands (/status, /players, /tps) query Prometheus and render Discord embeds with zero server load; write commands (/restart, /whitelist, /ban) open an RCON session, execute, and close within 200 ms.
  • Role checks before every write. No RCON socket opens until the Discord role check passes.
  • Lock RCON down with the mesh. RCON has no real auth beyond a shared password, so an Istio authorization policy allows only the bot's service account to reach port 25575.
  • Slower scrape for game metrics. The JMX exporter adds ~30 MB of heap — noticeable on a 2 GB node — so game-server metrics scrape at 30s instead of 15s.
  • ZGC on Java 25. The most recent GC pause measured 45 ms, comfortably under the ~100 ms mark where players notice rubberbanding; pause percentiles in Grafana let me correlate player complaints with GC events.
  • Plan around Discord's command cache. Slash-command metadata sits behind a 1-hour global cache, so bot deployments are scheduled around that window.
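The read/write split and the role gate from the list above can be sketched without any Discord dependency. This is an illustration under stated assumptions, not the bot's implementation: the role name "server-admin", the dispatch function, and the injected query_prometheus and run_rcon callables are all hypothetical. The response shape parsed by parse_instant_value is the standard Prometheus /api/v1/query instant-vector format.

```python
import json
from typing import Callable, Iterable

READ_COMMANDS = {"status", "players", "tps"}      # answered from Prometheus
WRITE_COMMANDS = {"restart", "whitelist", "ban"}  # sent over RCON
ADMIN_ROLES = {"server-admin"}                    # hypothetical role name


def parse_instant_value(response_text: str) -> float:
    """Extract the first sample from a Prometheus instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed")
    result = payload["data"]["result"]
    if not result:
        raise LookupError("query returned no samples")
    _timestamp, value = result[0]["value"]  # the value arrives as a string
    return float(value)


def dispatch(command: str,
             user_roles: Iterable[str],
             query_prometheus: Callable[[str], object],
             run_rcon: Callable[[str], object]) -> object:
    """Route reads to Prometheus and writes to RCON, checking roles first."""
    if command in READ_COMMANDS:
        return query_prometheus(command)  # never touches the game server
    if command in WRITE_COMMANDS:
        # The role check happens before run_rcon is called, so no RCON
        # socket is opened for an unauthorized user.
        if not ADMIN_ROLES.intersection(user_roles):
            raise PermissionError(f"/{command} requires an admin role")
        return run_rcon(command)
    raise ValueError(f"unknown command: {command}")
```

Passing the Prometheus and RCON clients in as callables keeps the gate testable: a fake run_rcon can record whether it was ever invoked for a user without the role.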

Measured outcomes

  • Uptime: 99.8% (30-day window)
  • Command latency: ~180 ms (incl. Discord round-trip)
  • Last GC pause: 45 ms (ZGC, Java 25)
  • Alert storms: 0 (5-min dedup window)