Case study

Monitoring Stack

A monitoring platform where Git is the single source of truth: ArgoCD reconciles Prometheus, Loki, TLS certificates, and secrets — with zero manual kubectl apply in production.

devops · gitops · May 2026

  • ArgoCD
  • kube-prometheus-stack
  • Loki + Promtail
  • External Secrets Operator
  • cert-manager
  • Azure Key Vault

Context

Manual Kubernetes deployments drift. A kubectl apply here, a helm upgrade there, and within a week nobody is sure what is actually running. The goal: a monitoring platform where every change is a reviewed Git commit and the cluster self-heals to the declared state.

Architecture

ArgoCD sits at the top, watching a repo structured with the App-of-Apps pattern: a root Application points at a directory of child Applications, each deploying one component into its own namespace — kube-prometheus-stack (v85.2.1 Helm chart) for Prometheus, Alertmanager, and Grafana; Loki with Promtail for logs; External Secrets Operator; and cert-manager. Namespace isolation means a misconfigured Loki deploy cannot block a Prometheus rolling update.

[Figure: monitoring stack architecture. The Git repository (App-of-Apps, single source of truth) is synced by ArgoCD, which reconciles four child Applications, one namespace each: kube-prometheus (Prometheus, Grafana), Loki + Promtail (logs, 7d retention), External Secrets (fed from Azure Key Vault), and cert-manager (fed from Let's Encrypt).]
One root Application, four children — each in its own namespace, all reconciled from Git.
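
As a rough illustration of the App-of-Apps layout described above, the sketch below builds one root Application and four child Applications as plain Python dicts and prints them as JSON (which is also valid YAML). The repository URL, directory paths, and namespace names are hypothetical placeholders, not values from the real repo. For simplicity every child uses a Git path source; a Helm-based child such as kube-prometheus-stack would more likely use a chart source instead.

```python
import json

# Hypothetical values: the repo URL, directory layout, and namespace names
# are illustrative assumptions, not taken from the actual repository.
REPO_URL = "https://example.com/org/monitoring-gitops.git"
CHILDREN = {
    "kube-prometheus-stack": "monitoring",
    "loki": "logging",
    "external-secrets": "external-secrets",
    "cert-manager": "cert-manager",
}


def application(name: str, path: str, dest_namespace: str) -> dict:
    """Build an Argo CD Application manifest as a plain dict."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": REPO_URL,
                "targetRevision": "HEAD",
                "path": path,
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": dest_namespace,
            },
            "syncPolicy": {
                # prune deletes resources that were removed from Git;
                # selfHeal reverts changes made directly on the cluster.
                "automated": {"prune": True, "selfHeal": True},
                "syncOptions": ["CreateNamespace=true"],
            },
        },
    }


# The root Application points at a directory holding the child Application
# manifests; each child deploys one component into its own namespace.
root = application("root", "apps", "argocd")
children = [
    application(name, f"components/{name}", namespace)
    for name, namespace in CHILDREN.items()
]

if __name__ == "__main__":
    print(json.dumps([root, *children], indent=2))
```

With automated sync enabled on every Application, a merged commit is the only step needed to roll out a change, and a manual edit on the cluster is reverted on the next reconcile.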

Key decisions & tradeoffs

  • App of Apps, with a known weak point. Per-namespace child apps isolate blast radius, but the root Application is a single point of failure — if ArgoCD cannot reach the repo, nothing reconciles. Mitigation: a local ArgoCD backup and a documented manual recovery path.
  • Secrets never touch Git. An ExternalSecret resource references a secret in Azure Key Vault, and the operator syncs it into a native Kubernetes Secret on a 1-hour refresh interval, with no application code changes. Tradeoff: after a rotation, the in-cluster copy can be stale for up to 60 minutes, so critical rotations get a manual refresh.
  • TLS as a controller, not a chore. cert-manager provisions Let's Encrypt certificates via HTTP-01 and renews them 30 days before expiry: zero manual cert operations across 5 ingress routes. Expired certs, the top cause of cascade failures in my earlier homelab, became structurally impossible. Lesson learned: Let's Encrypt rate limits (50 certificates per registered domain per week) bite during testing, so iteration happens on the staging issuer.
  • Dashboards as code. The 20-panel Grafana dashboard lives as JSON in a ConfigMap that ArgoCD reconciles — an edit is a Git commit, not a click in the Grafana UI.
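
The timing behind the secrets and TLS decisions above can be sketched with two small manifests and a little date arithmetic. This is a minimal sketch under stated assumptions: the resource names, namespace, store name, issuer name, and hostname are hypothetical, the certificate lifetime is assumed to be 90 days, and the ExternalSecret apiVersion depends on the installed operator version.

```python
from datetime import datetime, timedelta, timezone

# Values stated in the text: 1-hour secret refresh, renewal 30 days before expiry.
REFRESH_INTERVAL = timedelta(hours=1)
RENEW_BEFORE = timedelta(days=30)
# Assumption: a 90-day certificate lifetime.
CERT_LIFETIME = timedelta(days=90)


def go_duration(delta: timedelta) -> str:
    """Format a whole-hour timedelta as a Go-style duration such as '720h'."""
    return f"{int(delta.total_seconds() // 3600)}h"


# Hypothetical ExternalSecret: all names and keys are placeholders.
# The apiVersion may differ on older operator releases.
external_secret = {
    "apiVersion": "external-secrets.io/v1",
    "kind": "ExternalSecret",
    "metadata": {"name": "grafana-admin", "namespace": "monitoring"},
    "spec": {
        "refreshInterval": go_duration(REFRESH_INTERVAL),
        "secretStoreRef": {"name": "azure-key-vault", "kind": "ClusterSecretStore"},
        "target": {"name": "grafana-admin"},
        "data": [
            {
                "secretKey": "admin-password",
                "remoteRef": {"key": "grafana-admin-password"},
            }
        ],
    },
}

# Hypothetical Certificate pointing at a staging issuer for test iterations.
certificate = {
    "apiVersion": "cert-manager.io/v1",
    "kind": "Certificate",
    "metadata": {"name": "grafana-tls", "namespace": "monitoring"},
    "spec": {
        "secretName": "grafana-tls",
        "dnsNames": ["grafana.example.com"],
        "renewBefore": go_duration(RENEW_BEFORE),
        "issuerRef": {"name": "letsencrypt-staging", "kind": "ClusterIssuer"},
    },
}


def worst_case_staleness(refresh: timedelta = REFRESH_INTERVAL) -> timedelta:
    """A rotation that lands just after a sync waits one full interval."""
    return refresh


def renewal_time(
    not_before: datetime,
    lifetime: timedelta = CERT_LIFETIME,
    renew_before: timedelta = RENEW_BEFORE,
) -> datetime:
    """Earliest moment at which renewal is attempted for a certificate."""
    return not_before + lifetime - renew_before


if __name__ == "__main__":
    issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
    print("worst-case secret staleness:", worst_case_staleness())
    print("renewal attempted from:", renewal_time(issued).isoformat())
```

Under these assumptions, a renewal that fails on the first attempt still has the full 30-day window for retries before the certificate actually expires.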

Measured outcomes

  • Services monitored: 5 (across 4 namespaces)
  • Alert rules: 7 (TPS to node pressure)
  • Grafana panels: 20 (primary dashboard)
  • Log retention: 7 days (~1.2 GB indexed)
  • Manual applies: 0 (everything is GitOps)