Case study

MeshWatch

Production-grade observability — metrics, logs, traces, and AI-assisted incident analysis — on a single k3s node for $5.12/month: 60% less than a comparable serverless setup, at 99.8% uptime.

kubernetes · service mesh · June 2026

  • Kubernetes (k3s)
  • Istio
  • Prometheus + Grafana
  • Loki
  • Tempo
  • Ollama Phi-3
  • Flagger

Context

Managed observability platforms price by ingest volume and host count. For a homelab running five microservices with Prometheus scraping every 15 seconds, a managed bill crosses $12.80/month before the first alert even fires. The question behind MeshWatch: can one small k3s node deliver all three pillars of observability — metrics, logs, traces — plus an AI incident responder, without giving up reliability?

Architecture

Everything runs on a single k3s node (1 CPU, 2 GB RAM). Istio injects an Envoy sidecar into every service pod and enforces STRICT mutual TLS — zero plaintext traffic between any two pods, with certificates rotated automatically every 30 days. Telemetry flows through three parallel pipelines into Grafana, and Prometheus alerts feed a local LLM for root-cause analysis.
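The mesh-wide STRICT setting described above is normally expressed as an Istio PeerAuthentication resource. The case study does not show its manifest, so the following is a minimal sketch that builds one in Python and prints it as JSON. It assumes the default root namespace `istio-system` and the conventional resource name `default`; the helper name `strict_mtls_policy` is hypothetical.

```python
import json


def strict_mtls_policy(namespace: str = "istio-system") -> dict:
    """Build a PeerAuthentication manifest that requires mutual TLS.

    A policy with no workload selector, placed in Istio's root namespace
    (assumed here to be istio-system), applies to the whole mesh.
    """
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output can be applied directly.
    print(json.dumps(strict_mtls_policy(), indent=2))
```

Saved as a script, the output could be piped into `kubectl apply -f -`. Certificate lifetime is configured separately and is not covered by this sketch.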

[Figure: MeshWatch architecture. 5 service pods (Envoy sidecars, Istio STRICT mTLS) feed three pipelines: Prometheus (metrics, 15s), Loki with Promtail (logs), and Tempo (traces, OTel). All three feed Grafana (20-panel dashboards); alerts go to Ollama Phi-3 (local LLM) for root-cause analysis via Tailscale.]
Three telemetry pipelines on one node; alerts get an AI root-cause pass locally.

When an alert fires, the incident pipeline captures the last 5 minutes of affected metrics and ships structured JSON to Ollama Phi-3 over an outbound-only Tailscale tunnel. The model returns a root-cause write-up with remediation steps in about 2 seconds — and no metrics data ever leaves the mesh. This caught an Istio sidecar memory leak that raw Prometheus thresholds missed.
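The alert-to-LLM handoff could look roughly like the sketch below. It only builds the two request payloads, one for Prometheus' `/api/v1/query_range` endpoint and one for Ollama's `/api/generate` endpoint, and makes no network calls. The function names, the prompt wording, the `phi3` model tag, the example PromQL query, and the sample values are all assumptions for illustration; the case study does not publish its actual pipeline code.

```python
import json
import time

WINDOW_SECONDS = 5 * 60   # "last 5 minutes" from the text
STEP_SECONDS = 15         # matches the 15-second scrape interval


def range_query_params(promql: str, now: float) -> dict:
    """Query parameters for Prometheus' /api/v1/query_range endpoint."""
    return {
        "query": promql,
        "start": now - WINDOW_SECONDS,
        "end": now,
        "step": STEP_SECONDS,
    }


def build_ollama_request(alert_name: str, series: dict, model: str = "phi3") -> dict:
    """Body for Ollama's POST /api/generate endpoint, non-streaming."""
    context = json.dumps({"alert": alert_name, "metrics": series}, sort_keys=True)
    prompt = (
        "You are an SRE assistant. Given this alert and the last 5 minutes "
        "of metrics as JSON, state the most likely root cause and list "
        "remediation steps.\n" + context
    )
    return {"model": model, "prompt": prompt, "stream": False}


if __name__ == "__main__":
    now = time.time()
    # Illustrative query: working-set memory of the Envoy sidecar container.
    params = range_query_params(
        'container_memory_working_set_bytes{container="istio-proxy"}', now
    )
    # Hypothetical samples standing in for a real Prometheus response.
    series = {"istio_proxy_memory_bytes": [[now - 15, 64_000_000], [now, 67_000_000]]}
    body = build_ollama_request("SidecarMemoryGrowth", series)
    print(json.dumps(params))
    print(json.dumps(body)[:120])
```

Because a small model can hallucinate, the response from a call like this would still go through the human sign-off step described under the key decisions.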

Key decisions & tradeoffs

  • k3s over full Kubernetes. The stripped control plane runs in ~512 MB, leaving 1.5 GB for the entire observability stack on a 2 GB node.
  • Loki over Elasticsearch. Chunk-compressed log storage costs about a tenth as much for the same ingest rate.
  • Local Phi-3 instead of an LLM API. At 50 alerts a day, local inference adds $0 in per-request fees versus roughly $30/month on API pricing. The tradeoff: a small quantized model occasionally hallucinates remediation steps, so every AI suggestion gets human sign-off before action.
  • Canary releases with Flagger. Traffic shifts 10% → 25% → 50% → 100% with Prometheus SLO checks at each stage; if the error rate exceeds 0.5%, traffic reverts to the stable version within 30 seconds, with no human in the loop.
  • One node, eyes open. No control-plane HA means a node reboot causes a ~90-second observability blind spot — acceptable for a homelab; production would get a second node. Each Envoy sidecar also costs ~50 MB, which caps the mesh at 15 pods before memory pressure.
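The canary progression in the Flagger bullet can be illustrated with a small simulation. This is not Flagger's API or configuration format; it is a sketch of the decision logic only, using the stage weights and the 0.5% threshold from the text. The `error_rate_at` callback is a hypothetical stand-in for the Prometheus SLO query.

```python
from typing import Callable, List, Tuple

STAGES = [10, 25, 50, 100]   # traffic percentages from the text
MAX_ERROR_RATE = 0.005       # 0.5% threshold from the text


def run_canary(error_rate_at: Callable[[int], float]) -> Tuple[str, List[int]]:
    """Walk the canary through each stage, rolling back on the first breach.

    error_rate_at(weight) returns the canary's error rate (0.0 to 1.0)
    while it receives `weight` percent of traffic.
    """
    history: List[int] = []
    for weight in STAGES:
        history.append(weight)
        if error_rate_at(weight) > MAX_ERROR_RATE:
            history.append(0)  # all traffic back to the stable version
            return "rolled_back", history
    return "promoted", history


if __name__ == "__main__":
    # Healthy release: stays under the threshold at every stage.
    print(run_canary(lambda weight: 0.002))
    # Regression that only shows up once the canary takes half the traffic.
    print(run_canary(lambda weight: 0.002 if weight < 50 else 0.012))
```

The first call ends as promoted after all four stages; the second rolls back at the 50% stage, so the canary never takes full traffic.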

Measured outcomes

  • Monthly cost: $5.12 (60% below serverless parity)
  • Uptime: 99.8% (15 pods, 5 services)
  • Avg response time: 45 ms (0.2% error rate)
  • AI root cause: ~2 s (per incident, on-node)
  • Annual savings: $92.16 (vs serverless baseline)
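The cost and memory figures in this case study can be cross-checked with a few lines of arithmetic. The sketch below assumes the $12.80/month figure from the Context section is the serverless baseline, and treats 2 GB as 2048 MB; the function names are hypothetical.

```python
SERVERLESS_MONTHLY = 12.80   # baseline, assumed from the Context section
MESHWATCH_MONTHLY = 5.12

NODE_RAM_MB = 2048           # 2 GB node
CONTROL_PLANE_MB = 512       # approximate k3s footprint from the text
SIDECAR_MB = 50              # approximate per-Envoy cost from the text
MAX_PODS = 15


def percent_saved(baseline: float, actual: float) -> float:
    """Relative saving versus the baseline, in percent."""
    return round((baseline - actual) / baseline * 100, 2)


def annual_savings(baseline: float, actual: float) -> float:
    """Monthly difference multiplied out to twelve months."""
    return round((baseline - actual) * 12, 2)


def memory_left_after_sidecars(pods: int) -> int:
    """RAM (MB) left for workloads and the observability stack."""
    return NODE_RAM_MB - CONTROL_PLANE_MB - pods * SIDECAR_MB


if __name__ == "__main__":
    print(percent_saved(SERVERLESS_MONTHLY, MESHWATCH_MONTHLY))
    print(annual_savings(SERVERLESS_MONTHLY, MESHWATCH_MONTHLY))
    print(memory_left_after_sidecars(MAX_PODS))
```

Under these assumptions the saving is 60%, the annual difference is $92.16, and 15 sidecars leave about 786 MB for the services and telemetry components, which is consistent with the stated 15-pod ceiling.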