Case study
Monitoring Stack
A monitoring platform where Git is the single source of truth: ArgoCD reconciles Prometheus, Loki, TLS certificates, and secrets — with zero manual kubectl apply in production.
- ArgoCD
- kube-prometheus-stack
- Loki + Promtail
- External Secrets Operator
- cert-manager
- Azure Key Vault
Context
Manual Kubernetes deployments drift. A kubectl apply here, a helm upgrade there, and within a week nobody is sure what is actually running. The goal: a monitoring platform where every change is a reviewed Git commit and the cluster self-heals to the declared state.
Architecture
ArgoCD sits at the top, watching a repo structured with the App-of-Apps pattern: a root Application points at a directory of child Applications, each deploying one component into its own namespace — kube-prometheus-stack (v85.2.1 Helm chart) for Prometheus, Alertmanager, and Grafana; Loki with Promtail for logs; External Secrets Operator; and cert-manager. Namespace isolation means a misconfigured Loki deploy cannot block a Prometheus rolling update.
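To make the App-of-Apps layout concrete, here is a minimal sketch that builds the root and child Application manifests as plain Python dicts and prints them as JSON, which Kubernetes accepts alongside YAML. The repo URL, directory paths, and namespace names are placeholders of mine, not the real ones, and the sketch assumes each component directory holds a wrapper chart or rendered manifests.

```python
import json

# Placeholder values for illustration; the real repo URL and paths are not public.
REPO_URL = "https://git.example.internal/homelab/monitoring.git"


def application(name: str, path: str, dest_namespace: str, repo_url: str = REPO_URL) -> dict:
    """Build an Argo CD Application manifest with automated sync and self-heal."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {"repoURL": repo_url, "targetRevision": "HEAD", "path": path},
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": dest_namespace,
            },
            "syncPolicy": {
                # selfHeal reverts manual drift; prune removes resources deleted from Git.
                "automated": {"prune": True, "selfHeal": True},
                "syncOptions": ["CreateNamespace=true"],
            },
        },
    }


# Root app: syncs the directory that holds the child Application manifests.
root = application("root", "apps", "argocd")

# Child apps: one component per namespace (names and paths are assumptions).
children = [
    application("kube-prometheus-stack", "components/kube-prometheus-stack", "monitoring"),
    application("loki", "components/loki", "logging"),
    application("external-secrets", "components/external-secrets", "external-secrets"),
    application("cert-manager", "components/cert-manager", "cert-manager"),
]

if __name__ == "__main__":
    print(json.dumps([root, *children], indent=2))
```

Because every child targets its own namespace and syncs independently, a failed sync in one Application leaves the others untouched, which is the isolation property described above.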
Key decisions & tradeoffs
- App of Apps, with a known weak point. Per-namespace child apps isolate blast radius, but the root Application is a single point of failure — if ArgoCD cannot reach the repo, nothing reconciles. Mitigation: a local ArgoCD backup and a documented manual recovery path.
- Secrets never touch Git. An ExternalSecret CRD references Azure Key Vault, and the operator syncs it into a native Kubernetes Secret on a 1-hour refresh, with no app code changes. Tradeoff: a rotation can be stale for up to 60 minutes, so critical rotations get a manual refresh.
- TLS as a controller, not a chore. cert-manager provisions Let's Encrypt certificates via HTTP-01 and renews 30 days before expiry: zero manual cert operations across 5 ingress routes, and expired certs (the top cause of cascade failures in my earlier homelab) became structurally impossible. Lesson learned: Let's Encrypt rate limits (50 certs per domain per week) bite during testing, so iteration happens on the staging issuer.
- Dashboards as code. The 20-panel Grafana dashboard lives as JSON in a ConfigMap that ArgoCD reconciles — an edit is a Git commit, not a click in the Grafana UI.
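The secrets and TLS decisions above can be sketched as two manifests, again built as Python dicts and printed as JSON. This is an illustration under assumptions rather than the deployed config: the resource names, the store name, the Key Vault key, the email address, and the nginx ingress class are all hypothetical, and the ExternalSecret API version depends on which operator release is installed.

```python
import json


def external_secret(name: str, namespace: str, store: str, vault_key: str) -> dict:
    """ExternalSecret that copies one Key Vault entry into a Kubernetes Secret."""
    return {
        # Assumption: v1beta1; newer operator releases serve external-secrets.io/v1.
        "apiVersion": "external-secrets.io/v1beta1",
        "kind": "ExternalSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "refreshInterval": "1h",  # the 1-hour refresh, hence up to 60 min of staleness
            "secretStoreRef": {"kind": "ClusterSecretStore", "name": store},
            "target": {"name": name, "creationPolicy": "Owner"},
            "data": [{"secretKey": "password", "remoteRef": {"key": vault_key}}],
        },
    }


def staging_issuer(email: str) -> dict:
    """ClusterIssuer pointing at the Let's Encrypt staging endpoint with an HTTP-01 solver."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": "letsencrypt-staging"},
        "spec": {
            "acme": {
                "server": "https://acme-staging-v02.api.letsencrypt.org/directory",
                "email": email,
                "privateKeySecretRef": {"name": "letsencrypt-staging-account-key"},
                # Assumption: an nginx ingress controller answers the HTTP-01 challenge.
                "solvers": [{"http01": {"ingress": {"ingressClassName": "nginx"}}}],
            }
        },
    }


if __name__ == "__main__":
    manifests = [
        external_secret("grafana-admin", "monitoring", "azure-key-vault", "grafana-admin-password"),
        staging_issuer("ops@example.com"),
    ]
    print(json.dumps(manifests, indent=2))
```

Only the reference to the vault entry lives in Git; the secret value itself is fetched by the operator at sync time. Iterating against the staging issuer keeps test runs from counting toward the production rate limits.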
Measured outcomes
- Services monitored: 5 (across 4 namespaces)
- Alert rules: 7 (TPS to node pressure)
- Grafana panels: 20 (primary dashboard)
- Log retention: 7 days (~1.2 GB indexed)
- Manual applies: 0 (everything is GitOps)
Links
- Live homelab dashboard — served by this stack
- MeshWatch case study — the observability layer this platform deploys
This platform runs on the private homelab, so there is no public repo — ask me for a walkthrough.