Monitoring Stack — GitOps monitoring with ArgoCD

Context

Manual Kubernetes deployments drift. A kubectl apply here, a helm upgrade there, and within a week nobody is sure what is actually running. The goal: a monitoring platform where every change is a reviewed Git commit and the cluster self-heals to the declared state.

Architecture

ArgoCD sits at the top, watching a repo structured with the App-of-Apps pattern: a root Application points at a directory of child Applications, each deploying one component into its own namespace — kube-prometheus-stack (v85.2.1 Helm chart) for Prometheus, Alertmanager, and Grafana; Loki with Promtail for logs; External Secrets Operator; and cert-manager. Namespace isolation means a misconfigured Loki deploy cannot block a Prometheus rolling update.

One root Application, four children — each in its own namespace, all reconciled from Git.
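As a rough illustration of the App-of-Apps layout described above, the sketch below builds the root Application and its four children as plain dicts and prints them as JSON. The repo URL, directory paths, and namespace names are hypothetical placeholders, since the write-up does not name them; only the Application field layout follows the Argo CD CRD.

```python
import json

# Hypothetical values: the write-up does not name the repo, paths, or namespaces.
REPO_URL = "https://git.example.internal/homelab/monitoring.git"
CHILDREN = {
    "kube-prometheus-stack": "monitoring",
    "loki": "logging",
    "external-secrets": "external-secrets",
    "cert-manager": "cert-manager",
}


def application(name: str, path: str, dest_namespace: str) -> dict:
    """Build an Argo CD Application manifest as a plain dict."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {"repoURL": REPO_URL, "targetRevision": "HEAD", "path": path},
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": dest_namespace,
            },
            "syncPolicy": {
                # selfHeal reverts manual drift; prune deletes resources removed from Git.
                "automated": {"prune": True, "selfHeal": True},
                "syncOptions": ["CreateNamespace=true"],
            },
        },
    }


def app_of_apps() -> list[dict]:
    """Return the root Application first, then one child per component."""
    # The root points at the directory that holds the child Application manifests.
    root = application("root", "apps", "argocd")
    children = [application(n, f"components/{n}", ns) for n, ns in CHILDREN.items()]
    return [root] + children


if __name__ == "__main__":
    for manifest in app_of_apps():
        print(json.dumps(manifest, indent=2))
```

In a setup like this, only the root manifest would need to be applied by hand once; the child manifests would live in the directory the root watches, and Argo CD would create and reconcile them from there.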

Key decisions & tradeoffs

App of Apps, with a known weak point. Per-namespace child apps isolate blast radius, but the root Application is a single point of failure — if ArgoCD cannot reach the repo, nothing reconciles. Mitigation: a local ArgoCD backup and a documented manual recovery path.
Secrets never touch Git. An ExternalSecret resource references a secret in Azure Key Vault, and the External Secrets Operator syncs its value into a native Kubernetes Secret on a 1-hour refresh interval, with no app code changes. Tradeoff: after a rotation, the in-cluster copy can be stale for up to 60 minutes, so critical rotations get a manual refresh.
TLS as a controller, not a chore. cert-manager provisions Let's Encrypt certificates via HTTP-01 and renews them 30 days before expiry: zero manual cert operations across 5 ingress routes, and expired certs (the top cause of cascade failures in my earlier homelab) became structurally impossible. Lesson learned: Let's Encrypt production rate limits (50 certificates per registered domain per week) bite during testing, so iteration happens on the staging issuer.
Dashboards as code. The 20-panel Grafana dashboard lives as JSON in a ConfigMap that ArgoCD reconciles — an edit is a Git commit, not a click in the Grafana UI.
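The secrets, TLS, and dashboard decisions above each come down to one small declarative manifest. The sketch below builds an example of each as a plain dict. Every name in it (the store name, vault key, email address, ingress class, and dashboard title) is a hypothetical placeholder rather than the platform's actual configuration, and the `grafana_dashboard` label assumes the Grafana sidecar's default discovery label in kube-prometheus-stack.

```python
import json

# All names below (store, vault key, email, ingress class, dashboard title) are
# hypothetical placeholders; the write-up does not specify them.


def external_secret(name: str, namespace: str, vault_key: str) -> dict:
    """ExternalSecret that the operator syncs into a Secret of the same name."""
    return {
        # Older operator releases serve v1beta1, newer ones v1; adjust as needed.
        "apiVersion": "external-secrets.io/v1beta1",
        "kind": "ExternalSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "refreshInterval": "1h",  # the 1-hour refresh: up to 60 minutes stale
            "secretStoreRef": {"kind": "ClusterSecretStore", "name": "azure-key-vault"},
            "target": {"name": name},
            "data": [{"secretKey": "password", "remoteRef": {"key": vault_key}}],
        },
    }


def staging_issuer(email: str) -> dict:
    """ClusterIssuer for the Let's Encrypt staging endpoint, solved via HTTP-01."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": "letsencrypt-staging"},
        "spec": {
            "acme": {
                "server": "https://acme-staging-v02.api.letsencrypt.org/directory",
                "email": email,
                "privateKeySecretRef": {"name": "letsencrypt-staging-account-key"},
                # Assumed ingress class; use whichever controller serves the routes.
                "solvers": [{"http01": {"ingress": {"ingressClassName": "nginx"}}}],
            }
        },
    }


def dashboard_configmap(name: str, namespace: str, dashboard: dict) -> dict:
    """ConfigMap wrapping dashboard JSON for the Grafana sidecar to load."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumed: the chart's default sidecar label for dashboard discovery.
            "labels": {"grafana_dashboard": "1"},
        },
        "data": {f"{name}.json": json.dumps(dashboard)},
    }


if __name__ == "__main__":
    manifests = [
        external_secret("grafana-admin", "monitoring", "grafana-admin-password"),
        staging_issuer("ops@example.com"),
        dashboard_configmap(
            "platform-overview", "monitoring", {"title": "Platform overview", "panels": []}
        ),
    ]
    print(json.dumps(manifests, indent=2))
```

Pointing a test ingress at the staging issuer keeps trial runs away from the production rate limits; switching to production would then mean a second issuer whose `server` URL is the production ACME directory.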

Measured outcomes

Services monitored: 5 (across 4 namespaces)
Alert rules: 7 (from TPS to node pressure)
Grafana panels: 20 (primary dashboard)
Log retention: 7 days (~1.2 GB indexed)
Manual applies: 0 (everything is GitOps)

Links

This platform runs on my private homelab, so there is no public repo; ask me for a walkthrough.