Case study
MeshWatch
Production-grade observability — metrics, logs, traces, and AI-assisted incident analysis — on a single k3s node for $5.12/month: 60% less than a comparable serverless setup, at 99.8% uptime.
- Kubernetes (k3s)
- Istio
- Prometheus + Grafana
- Loki
- Tempo
- Ollama Phi-3
- Flagger
Context
Managed observability platforms price by ingest volume and host count. For a homelab running five microservices with Prometheus scraping every 15 seconds, a managed bill crosses $12.80/month before the first alert even fires. The question behind MeshWatch: can one small k3s node deliver all three pillars of observability — metrics, logs, traces — plus an AI incident responder, without giving up reliability?
Architecture
Everything runs on a single k3s node (1 CPU, 2 GB RAM). Istio injects an Envoy sidecar into every service pod and enforces STRICT mutual TLS — zero plaintext traffic between any two pods, with certificates rotated automatically every 30 days. Telemetry flows through three parallel pipelines into Grafana, and Prometheus alerts feed a local LLM for root-cause analysis.
When an alert fires, the incident pipeline captures the last 5 minutes of the affected metrics and ships them as structured JSON to Ollama Phi-3 over an outbound-only Tailscale tunnel. The model returns a root-cause write-up with remediation steps in about 2 seconds, and no metrics data ever leaves the mesh. In practice, this pipeline caught an Istio sidecar memory leak that raw Prometheus thresholds missed.
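The alert-to-analysis hand-off described above could be sketched roughly as follows. This is a minimal illustration, not MeshWatch's actual code: the function names, the prompt wording, the `phi3` model tag, the 15-second step, and both base URLs are assumptions. It relies only on Prometheus's `/api/v1/query_range` endpoint and Ollama's `/api/generate` endpoint.

```python
import json
import time
import urllib.parse
import urllib.request

WINDOW_SECONDS = 5 * 60  # "last 5 minutes" from the text


def build_range_query(promql, now, step="15s"):
    """Build query_range parameters for the 5-minute window ending at `now`."""
    return {
        "query": promql,
        "start": now - WINDOW_SECONDS,
        "end": now,
        "step": step,
    }


def build_ollama_payload(alert_name, series, model="phi3"):
    """Wrap the alert context as structured JSON in a non-streaming generate request."""
    context = {"alert": alert_name, "window_seconds": WINDOW_SECONDS, "series": series}
    prompt = (
        "You are an incident responder. Given this alert context as JSON, "
        "state the most likely root cause and list remediation steps.\n"
        + json.dumps(context, sort_keys=True)
    )
    return {"model": model, "prompt": prompt, "stream": False}


def analyze(alert_name, promql, prom_url, ollama_url):
    """Fetch the metric window from Prometheus, then ask the model for a write-up."""
    params = urllib.parse.urlencode(build_range_query(promql, time.time()))
    with urllib.request.urlopen(f"{prom_url}/api/v1/query_range?{params}", timeout=10) as resp:
        series = json.load(resp)["data"]["result"]
    body = json.dumps(build_ollama_payload(alert_name, series)).encode()
    req = urllib.request.Request(
        f"{ollama_url}/api/generate",
        data=body,  # supplying a body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["response"]
```

Because every AI suggestion gets human sign-off, the text returned by `analyze` would be posted for review rather than acted on automatically.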
Key decisions & tradeoffs
- k3s over full Kubernetes. The stripped control plane runs in ~512 MB, leaving 1.5 GB for the entire observability stack on a 2 GB node.
- Loki over Elasticsearch. Chunk-compressed log storage costs about a tenth as much for the same ingest rate.
- Local Phi-3 instead of an LLM API. At 50 alerts a day, local inference costs $0 versus roughly $30/month on API pricing. The tradeoff: a small quantized model occasionally hallucinates remediation steps, so every AI suggestion gets human sign-off before action.
- Canary releases with Flagger. Traffic shifts 10% → 25% → 50% → 100% with Prometheus SLO checks at each stage; if the error rate passes 0.5%, traffic reverts to stable within 30 seconds — no human in the loop.
- One node, eyes open. No control-plane HA means a node reboot causes a ~90-second observability blind spot — acceptable for a homelab; production would get a second node. Each Envoy sidecar also costs ~50 MB, which caps the mesh at 15 pods before memory pressure.
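The canary policy in the Flagger bullet above reduces to a simple stepwise loop. The Python below only simulates that decision logic using the stated numbers (10/25/50/100% steps, 0.5% error-rate threshold). It is not Flagger's implementation or configuration, and `run_canary` and `error_rate_at` are hypothetical names.

```python
from typing import Callable, List, Tuple

TRAFFIC_STEPS = [10, 25, 50, 100]  # percent of traffic sent to the canary
MAX_ERROR_RATE = 0.005             # 0.5% threshold from the text


def run_canary(error_rate_at: Callable[[int], float]) -> Tuple[str, List[int]]:
    """Walk the traffic steps; roll back as soon as one step breaches the threshold.

    `error_rate_at(weight)` stands in for the Prometheus SLO check at that weight.
    Returns the outcome and the list of weights that were actually tried.
    """
    tried: List[int] = []
    for weight in TRAFFIC_STEPS:
        tried.append(weight)
        if error_rate_at(weight) > MAX_ERROR_RATE:
            return "rolled_back", tried  # traffic reverts to the stable version
    return "promoted", tried


# A healthy release passes every check and reaches 100%.
healthy = run_canary(lambda weight: 0.002)

# A release that degrades under load is stopped at the 50% step.
degrading = run_canary(lambda weight: 0.002 if weight < 50 else 0.02)
```

The check runs before each promotion, so a failing release never receives more traffic than the step at which it was caught.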
Measured outcomes
- Monthly cost: $5.12 (60% below serverless parity)
- Uptime: 99.8% (15 pods, 5 services)
- Avg response time: 45 ms (0.2% error rate)
- AI root cause: ~2 s (per incident, on-node)
- Annual savings: $92.16 (vs serverless baseline)
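The headline cost figures can be cross-checked from two numbers in the text, assuming the $12.80/month managed bill in the Context section is the same baseline the outcomes call "serverless". A small sketch of that arithmetic follows, using `Decimal` to avoid floating-point rounding; the variable names are illustrative.

```python
from decimal import Decimal

baseline_monthly = Decimal("12.80")   # managed baseline from the Context section
meshwatch_monthly = Decimal("5.12")   # single-node k3s cost

monthly_saving = baseline_monthly - meshwatch_monthly
reduction_pct = monthly_saving / baseline_monthly * 100
annual_saving = monthly_saving * 12

print(f"Monthly saving: ${monthly_saving}")
print(f"Reduction vs baseline: {reduction_pct}%")
print(f"Annual saving: ${annual_saving}")
```

Under that assumption the reduction works out to 60% and the annual saving to $92.16, matching the figures reported above.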