Case study
Azure Functions
Two Python functions decouple alert generation from Discord delivery through Azure Service Bus — Pydantic-validated events and 5-minute dedup windows cut alert noise by ~80% during incidents.
- Azure Functions v2.0
- Python
- Service Bus
- Timer Trigger
- Pydantic
- aiohttp
Context
When Prometheus fires an alert, something has to turn that event into a message a human will actually read. Baking notification logic into Alertmanager templates means every notification change is a ConfigMap edit and an ArgoCD sync. I wanted a decoupled serverless layer that consumes incident events from a queue, validates them, and routes them to Discord, deployable and scalable on its own.
Architecture
Two functions on the Azure Functions v2.0 Python runtime, two triggers. A health checker runs on a 15-minute timer, polls each service's health endpoint, and emits an incident event to Service Bus on failure. An incident processor subscribes to the incident-events topic, validates, deduplicates, and posts a Discord embed. Service Bus is the decoupling boundary: Alertmanager and the health checker produce, the processor is the sole consumer — adding Slack or PagerDuty later means writing a new consumer, not touching the producers.
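A minimal sketch of what the event contract at the Service Bus boundary might look like, assuming Pydantic. The field names, the set of allowed severities, and the helper `parse_event` are illustrative assumptions rather than the deployed code; only the required fields (service, severity, timestamp, metric value) and the lowercase "warning" come from this write-up. The sketch sticks to features that behave the same in Pydantic v1 and v2.

```python
import json
from datetime import datetime
from typing import Literal, Optional, Tuple

from pydantic import BaseModel, ValidationError


class IncidentEvent(BaseModel):
    # Field names are assumptions; the write-up only names the concepts.
    service: str
    alert_rule: str
    # Literal matching is case-sensitive, so "WARNING" is rejected.
    # The exact set of severities is an assumption.
    severity: Literal["info", "warning", "critical"]
    timestamp: datetime
    metric_value: float


def parse_event(raw: str) -> Tuple[Optional[IncidentEvent], Optional[str]]:
    """Return (event, None) when valid, else (None, reason).

    A caller would dead-letter the message with `reason` attached
    instead of dropping it silently.
    """
    try:
        return IncidentEvent(**json.loads(raw)), None
    except (json.JSONDecodeError, ValidationError, TypeError) as exc:
        # TypeError covers JSON that is valid but not an object (e.g. a list).
        return None, str(exc)
```

Returning a reason string rather than raising keeps the dead-letter decision in the caller, which is where the Service Bus message handle would live.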
Key decisions & tradeoffs
- Validate at the boundary. Every message passes a Pydantic model that enforces required fields (service, severity, timestamp, metric value) and dead-letters malformed payloads instead of dropping them silently. This caught a producer bug sending severity as "WARNING" instead of "warning".
- Deduplicate on (service, alert-rule). A sustained degradation can fire the same alert every 30 seconds; the 5-minute window acknowledges duplicates without forwarding them. During a 2-hour Istio sidecar memory leak that meant ~80% less noise and, since Service Bus bills per operation, a smaller bill too.
- Async webhooks. The Discord POST runs on aiohttp so the function never blocks on the round-trip; failures retry three times with exponential backoff, then dead-letter for manual inspection.
- Work with cold starts, not against them. Cold start adds 1–3 s after idle, which is fine for a 15-minute health check but painful on the alert path, so a heartbeat event every 5 minutes keeps the processor warm.
- Pin your models. Pydantic v2's migration broke field aliases that worked in v1; the version is pinned in requirements.txt.
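The dedup rule above can be sketched with the standard library alone. This is one possible shape, not the deployed code: the class name `DedupWindow`, the in-memory dict, and the injectable clock are assumptions, and a real function app would likely need shared storage, since in-process state does not survive restarts or scale-out.

```python
import time
from typing import Callable, Dict, Tuple


class DedupWindow:
    """Suppress repeats of the same (service, alert_rule) key inside a fixed window."""

    def __init__(
        self,
        window_seconds: float = 300.0,  # the 5-minute window from the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self._clock = clock
        # Assumption: in-memory state; stands in for whatever store is really used.
        self._last_forwarded: Dict[Tuple[str, str], float] = {}

    def should_forward(self, service: str, alert_rule: str) -> bool:
        now = self._clock()
        key = (service, alert_rule)
        last = self._last_forwarded.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate: acknowledge the message, skip Discord
        self._last_forwarded[key] = now
        return True


# Usage: one alert every 30 s for 10 minutes, driven by a fake clock.
# The service and rule names are hypothetical.
now = [0.0]
window = DedupWindow(clock=lambda: now[0])
forwarded = 0
for t in range(0, 600, 30):
    now[0] = float(t)
    forwarded += window.should_forward("checkout", "HighMemory")
print(forwarded)  # 2 (of 20 alerts: one at t=0, one at t=300)
```

The window here is anchored to the last forwarded alert, so a continuously firing rule still produces one message per window rather than going silent.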
Measured outcomes
- Functions deployed: 2 (timer + Service Bus trigger)
- Health checks: 96/day (per service, 15-min interval)
- Alert noise cut: ~80% (5-min dedup, during incidents)
- Webhook retries: 3 (exponential backoff, then DLQ)
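The retry policy listed above might be sketched as follows. The function name `deliver_with_retry`, the 1-second base delay, and the reading of "three retries" as one initial attempt plus three more are all assumptions. The `send` callable is a placeholder: in the processor it would wrap the aiohttp POST to the Discord webhook and raise on a failed response.

```python
import asyncio
from typing import Awaitable, Callable


async def deliver_with_retry(
    send: Callable[[], Awaitable[None]],
    retries: int = 3,
    base_delay: float = 1.0,  # assumption: the write-up does not state the delay
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Attempt delivery once, then retry up to `retries` times.

    Waits base_delay * 1, 2, 4, ... between attempts. Returns True on
    success and False when the caller should dead-letter the message.
    """
    for attempt in range(retries + 1):
        try:
            await send()
            return True
        except Exception:
            # Broad catch keeps the sketch short; real code would narrow this
            # to network and HTTP errors.
            if attempt == retries:
                return False
            await sleep(base_delay * 2 ** attempt)
    return False
```

Passing `sleep` as a parameter lets a test substitute a recorder for `asyncio.sleep`, so the backoff schedule can be checked without waiting.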
Links
The function app runs in a private Azure subscription, so there is no public repo or endpoint — happy to walk through the code.