MeshWatch — service mesh observability at $5/month

Context

Managed observability platforms price by ingest volume and host count. For a homelab running five microservices with Prometheus scraping every 15 seconds, a managed bill crosses $12.80/month before the first alert even fires. The question behind MeshWatch: can one small k3s node deliver all three pillars of observability — metrics, logs, traces — plus an AI incident responder, without giving up reliability?

Architecture

Everything runs on a single k3s node (1 CPU, 2 GB RAM). Istio injects an Envoy sidecar into every service pod and enforces STRICT mutual TLS — zero plaintext traffic between any two pods, with certificates rotated automatically every 30 days. Telemetry flows through three parallel pipelines into Grafana, and Prometheus alerts feed a local LLM for root-cause analysis.

Three telemetry pipelines on one node; alerts get an AI root-cause pass locally.

When an alert fires, the incident pipeline captures the last 5 minutes of the affected metrics and ships them as structured JSON to Phi-3, served by Ollama, over an outbound-only Tailscale tunnel. The model returns a root-cause write-up with remediation steps in about 2 seconds. Because inference runs locally rather than through a third-party API, no metrics data leaves the private network. This pipeline caught an Istio sidecar memory leak that raw Prometheus thresholds missed.
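A minimal sketch of how that alert-to-model handoff might look, assuming Prometheus's HTTP `query_range` API and Ollama's `/api/generate` endpoint. The alert fields, the PromQL expression, the 15-second step (borrowed from the scrape interval mentioned in Context), the `phi3` model tag, the example timestamp, and the prompt wording are illustrative assumptions, not MeshWatch's actual code. The functions only build the request parameters and body, so nothing is sent anywhere.

```python
import json
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)   # lookback window described in the write-up
STEP_SECONDS = 15               # assumed: matches the 15 s scrape interval


def range_query_params(promql: str, fired_at: datetime) -> dict:
    """Parameters for Prometheus GET /api/v1/query_range over the 5 minutes before the alert."""
    start = fired_at - WINDOW
    return {
        "query": promql,
        "start": start.timestamp(),
        "end": fired_at.timestamp(),
        "step": f"{STEP_SECONDS}s",
    }


def ollama_request(alert: dict, samples: list) -> dict:
    """Body for Ollama POST /api/generate; the prompt wording is hypothetical."""
    context = json.dumps({"alert": alert, "samples": samples}, sort_keys=True)
    return {
        "model": "phi3",
        "prompt": "Identify the likely root cause and propose remediation steps "
                  "for this alert. Metrics (JSON): " + context,
        "stream": False,  # ask for one complete response instead of a token stream
    }


# Hypothetical alert and a single made-up sample, for illustration only.
fired_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
params = range_query_params(
    'container_memory_working_set_bytes{container="istio-proxy"}', fired_at
)
body = ollama_request(
    {"alertname": "SidecarMemoryHigh", "severity": "warning"},
    samples=[[params["start"], "52428800"]],
)
print(params["end"] - params["start"])  # 300.0
```

Keeping the payload construction separate from the network calls makes it easy to check what would be sent to the model before any human reviews its answer.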

Key decisions & tradeoffs

k3s over full Kubernetes. The stripped control plane runs in ~512 MB, leaving 1.5 GB for the entire observability stack on a 2 GB node.
Loki over Elasticsearch. Chunk-compressed log storage costs about a tenth as much for the same ingest rate.
Local Phi-3 instead of an LLM API. At 50 alerts a day that is $0 vs roughly $30/month on API pricing. The tradeoff: a small quantized model occasionally hallucinates remediation steps, so every AI suggestion gets human sign-off before action.
Canary releases with Flagger. Traffic shifts 10% → 25% → 50% → 100% with Prometheus SLO checks at each stage. If the error rate exceeds 0.5%, traffic reverts to the stable version within 30 seconds, with no human in the loop.
One node, eyes open. No control-plane HA means a node reboot causes a ~90-second observability blind spot. That is acceptable for a homelab; production would get a second node. Each Envoy sidecar also costs ~50 MB, which caps the mesh at about 15 pods: at that point the sidecars alone use roughly 750 MB, half of the 1.5 GB left after the control plane.
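The canary progression described above can be sketched as a small simulation. This is not Flagger's API or configuration format; `run_canary` and `error_rate_at` are hypothetical names, and the callback stands in for the Prometheus SLO query made at each stage. Only the traffic steps and the 0.5% threshold come from the write-up.

```python
from typing import Callable

STEPS = (10, 25, 50, 100)   # traffic percentages from the write-up
MAX_ERROR_RATE = 0.005      # 0.5% rollback threshold


def run_canary(error_rate_at: Callable[[int], float]) -> tuple[str, list[int]]:
    """Walk the canary through each traffic step, rolling back on an SLO breach.

    error_rate_at(weight) is a stand-in for a Prometheus SLO query made while
    `weight` percent of traffic is routed to the canary.
    """
    history = []
    for weight in STEPS:
        history.append(weight)
        if error_rate_at(weight) > MAX_ERROR_RATE:
            history.append(0)  # send all traffic back to the stable version
            return "rolled_back", history
    return "promoted", history


healthy = run_canary(lambda weight: 0.002)
failing = run_canary(lambda weight: 0.002 if weight < 50 else 0.012)
print(healthy)  # ('promoted', [10, 25, 50, 100])
print(failing)  # ('rolled_back', [10, 25, 50, 0])
```

In the failing run the error rate only degrades under heavier load, so the breach is caught at the 50% step and the release never reaches full traffic.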

Measured outcomes

Monthly cost: $5.12 (60% below the serverless baseline)
Uptime: 99.8% (15 pods, 5 services)
Avg response time: 45 ms (0.2% error rate)
AI root cause: ~2 s per incident, on-node
Annual savings: $92.16 vs the serverless baseline
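The cost figures can be cross-checked with a few lines of arithmetic. This sketch assumes the baseline behind the "60% below" and annual savings figures is the $12.80/month bill quoted in Context; the write-up does not state that equivalence explicitly, but the numbers line up under it.

```python
from decimal import Decimal

meshwatch_monthly = Decimal("5.12")  # measured monthly cost
baseline_monthly = Decimal("12.80")  # assumed baseline: the figure quoted in Context

monthly_savings = baseline_monthly - meshwatch_monthly
percent_below = monthly_savings / baseline_monthly * 100
annual_savings = monthly_savings * 12

print(monthly_savings)  # 7.68
print(percent_below)    # 60.0
print(annual_savings)   # 92.16
```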
