Operations & Observability
Operational excellence hinges on reliable deployment workflows, comprehensive telemetry, and repeatable recovery procedures. This chapter documents the tooling, metrics, logging, and runbooks that support Cirrus CDN in production.
Deployment Workflows
Local Development
- `just up` – Builds and launches the Docker Compose stack.
- `just down` / `just down-no-volumes` – Tears down containers with or without volume removal.
- `just pytest` – Executes backend tests (`uv run pytest -q`).
- `just quicktest` – Runs a filtered suite excluding long-running ACME/DNS tests.
- `just fmt` – Formats Python codebase (`autoflake`, `isort`, `black`).
Production Deployment
- `just deploy` – Invokes Ansible playbooks (`ansible/`), using `INVENTORY` and `PLAYBOOK` environment variables loaded via `dotenv`.
- Playbooks should orchestrate infrastructure provisioning, secret injection, and configuration templating consistent with this white paper.
Runtime Supervision
- Process Supervision – Use a process supervisor or orchestrator (systemd, Kubernetes) to supervise API, worker, beat, OpenResty, and NSD processes. Docker Compose is suited to local development only.
- Health Checks – Monitor:
  - API `/healthz` (ensures Redis availability).
  - OpenResty `http://<node>:9145/healthz` (used by Celery health checks).
  - Celery worker liveness (e.g., ping tasks).
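The HTTP health checks above can be exercised with a small probe. The following is a minimal sketch using only the Python standard library; the function names `probe` and `probe_all` are illustrative and not part of Cirrus CDN, and the sketch assumes that a healthy endpoint answers with HTTP 200.

```python
from urllib.error import URLError
from urllib.request import urlopen


def probe(url, timeout=2.0, opener=urlopen):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def probe_all(targets, opener=urlopen):
    """Map each target name to its probe result."""
    return {name: probe(url, opener=opener) for name, url in targets.items()}
```

A caller might pass something like `probe_all({"openresty": "http://<node>:9145/healthz"})` with the node address filled in. The `opener` parameter exists only so the probe can be tested without network access.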
Metrics & Dashboards
- Prometheus (`prometheus/prometheus.dev.yml`) scrapes OpenResty metrics every 5 seconds at `127.0.0.1:9145`.
- Key Metrics:
  - `nginx_http_requests_total{host,status}` – Request rates by host/status.
  - `nginx_http_request_duration_seconds` – Latency histogram.
  - `nginx_cache_status_total{status}` – Cache behavior (HIT/MISS/STALE/BYPASS).
  - `nginx_upstream_errors_total`, `nginx_upstream_timeouts_total` – Backend health.
  - `nginx_upstream_response_seconds` – Upstream RTT.
  - `nginx_ssl_handshake_errors_total{phase}` – TLS handshake issues.
- Grafana – Ship Grafana dashboards (provisioning under `grafana/`) to visualize cache efficiency, origin health, and request trends. Enable auth per organizational policy.
- Celery Metrics – Currently absent; recommended enhancements include emitting task durations via StatsD/Prometheus exporters (see Chapter 11).
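To illustrate how the cache metric translates into a cache efficiency figure, the sketch below parses Prometheus text exposition output and derives a hit ratio from `nginx_cache_status_total`. It is a simplified parser written for this example, not Cirrus CDN code: the helper names are hypothetical, label values containing `}` are not handled, and counting every status (including BYPASS) in the denominator is an assumption.

```python
import re
from collections import defaultdict

# Matches sample lines such as: nginx_cache_status_total{status="HIT"} 42
SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def sum_by_label(text, metric, label):
    """Sum a metric's samples, grouped by the value of one label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # HELP/TYPE comment lines
        match = SAMPLE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)


def cache_hit_ratio(text):
    """Lifetime hit ratio across all cache statuses, or None without data."""
    counts = sum_by_label(text, "nginx_cache_status_total", "status")
    total = sum(counts.values())
    return counts.get("HIT", 0.0) / total if total else None
```

Because the counters are cumulative, this yields a lifetime ratio; a dashboard would normally compute the ratio over a time window instead.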
Logging
- API & Workers – Log to stdout/stderr. Capture `acme_*` events, errors from Redis, and zone rebuilds.
- OpenResty Access Logs – Stored at `/data/access-logs/access.log`; rotated by `logrotate` container. Ensure retention policies align with privacy regulations.
- OpenResty Error Logs – Default location `/usr/local/openresty/nginx/logs/error.log`; monitor for Redis connectivity, backend failures, and SSL errors.
- Celery Logs – `cirrus-worker` and `cirrus-beat` containers log to stdout; integrate with log aggregation (ELK, Loki) in production.
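Since the API and workers log to stdout and are expected to feed a log aggregator, one-line JSON records are often easier to index than free text. The sketch below shows one possible setup with the standard `logging` module. It is an assumption about how such logging could be configured, not a description of the existing Cirrus CDN logger; the `fields` convention and the logger name in the usage note are hypothetical.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the record.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def configure_logging(level=logging.INFO, stream=None):
    """Replace the root handlers with a single JSON handler (stdout by default)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

After calling `configure_logging()`, a call such as `logging.getLogger("cirrus.acme").warning("acme_fail", extra={"fields": {"domain": "example.com"}})` would emit a single JSON line that an aggregator can filter on the `event` key.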
Alerting Recommendations
| Signal | Alert Condition | Response |
|---|---|---|
| Cache hit ratio drop | Sudden drop below threshold (e.g., less than 40%) | Inspect upstream availability, caching rules. |
| Upstream errors | Spike in nginx_upstream_errors_total | Check origin health, network latency. |
| ACME failure | acme_fail log or cdn:acme:{domain} status failed | Investigate DNS alignment, acme-dns reachability. |
| Node deactivation | Health task reports down state | Validate node; optionally disable or remove via /api/v1/nodes. |
| Prometheus scrape failure | Missing metrics from a node | Confirm OpenResty health endpoint availability. |
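As an illustration of the cache hit ratio condition in the table, the sketch below evaluates the ratio over the window between two counter snapshots, since Prometheus counters are cumulative. The function names, the `min_requests` guard, and the snapshot shape (a mapping from cache status to counter value) are assumptions made for this example; in practice the same condition would more likely be expressed as a Prometheus alerting rule.

```python
def counter_delta(previous, current):
    """Per-key increase between two counter snapshots, tolerating resets."""
    delta = {}
    for key, value in current.items():
        before = previous.get(key, 0.0)
        # A counter that went backwards indicates a restart; count from zero.
        delta[key] = value - before if value >= before else value
    return delta


def hit_ratio_alert(previous, current, threshold=0.40, min_requests=100):
    """Return an alert message if the windowed hit ratio is below threshold."""
    window = counter_delta(previous, current)
    total = sum(window.values())
    if total < min_requests:
        return None  # too little traffic in the window to judge
    ratio = window.get("HIT", 0.0) / total
    if ratio < threshold:
        return f"cache hit ratio {ratio:.0%} below {threshold:.0%}"
    return None
```

The `min_requests` guard avoids alerting on quiet periods, where a handful of misses would otherwise dominate the ratio.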
Backup & Recovery
- Redis Persistence – AOF (`--appendonly yes`) ensures data durability. Implement backups (RDB snapshots, managed service backups) and test restore procedures.
- Certificates – Since certificates reside in Redis, backups capture them automatically. Plan for secure storage and rotation.
- DNS Zone State – Recomputed from Redis; no additional backup required if Redis is intact.
- Configuration – Infrastructure-as-code (Ansible, Dockerfiles) should be version-controlled; document manual adjustments.
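One way to implement the RDB snapshot backups mentioned above is to copy each snapshot into a backup directory under a timestamped name and prune older copies. The sketch below assumes a snapshot file has already been produced (for example by Redis `BGSAVE`); the function name, the paths passed to it, and the retention count are hypothetical, and secure off-host storage is out of scope here.

```python
import shutil
import time
from pathlib import Path


def archive_snapshot(snapshot, backup_dir, keep=7, now=None):
    """Copy a snapshot into backup_dir under a UTC-timestamped name, then prune."""
    snapshot, backup_dir = Path(snapshot), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(now))
    target = backup_dir / f"{snapshot.stem}-{stamp}{snapshot.suffix}"
    shutil.copy2(snapshot, target)
    # Timestamped names sort chronologically, so the oldest copies come first.
    backups = sorted(backup_dir.glob(f"{snapshot.stem}-*{snapshot.suffix}"))
    excess = max(len(backups) - keep, 0)
    for old in backups[:excess]:
        old.unlink()
    return target
```

A restore drill would then load one of the archived files into a scratch Redis instance and confirm that the expected keys are present.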
Troubleshooting Playbooks
- Domain Returns 404 – Check `cdn:dom:{domain}` exists; inspect OpenResty logs for `router: no conf for host`. If config exists, ensure `origins` array is populated.
- TLS Handshake Failure – Ensure `cdn:cert:{domain}` contains valid PEM entries; review `ssl_loader` logs for `parse`/`set` failures.
- ACME Issuance Stuck – Inspect `cdn:acme:lock:{domain}` and `cdn:acme:task:{domain}`; if lingering beyond TTL, clear keys and requeue; verify `_acme-challenge` CNAME.
- Node Missing from DNS – Confirm `active` flag is `"1"` in `cdn:node:{id}`; review health check logs for failures.
- Metrics Missing – Validate `NGX_METRICS_ALLOW` includes Prometheus source; ensure port 9145 reachable; inspect OpenResty error log for Lua errors.
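The first three playbooks start with the same Redis lookups, which can be bundled into one helper. The sketch below is a hypothetical diagnostic, not part of Cirrus CDN: it accepts any client object exposing `exists(key)` and `ttl(key)` (as a redis-py client does), and it assumes the ACME lock and task keys are normally written with an expiry, so a key without one is treated as stale.

```python
def diagnose_domain(client, domain):
    """Return human-readable findings about a domain's state in Redis."""
    findings = []
    if not client.exists(f"cdn:dom:{domain}"):
        findings.append("no domain config: expect 404 from the router")
    if not client.exists(f"cdn:cert:{domain}"):
        findings.append("no certificate stored: expect TLS handshake failure")
    for key in (f"cdn:acme:lock:{domain}", f"cdn:acme:task:{domain}"):
        # Redis TTL returns -2 for a missing key and -1 for a key without expiry.
        ttl = client.ttl(key)
        if ttl == -1:
            findings.append(f"{key} has no expiry: likely stale, clear and requeue")
        elif ttl >= 0:
            findings.append(f"{key} present, expires in {ttl}s")
    return findings
```

The helper only reports what it sees; clearing keys and requeueing issuance remain deliberate operator actions.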
Testing Strategy
- Automated tests under `tests/` cover API behavior, ACME flows, and DNS health scenarios. Run `just fresh-test` before major releases to ensure a clean environment.
- Integration tests rely on the full Docker stack (`just up` + `pytest`), exercising acme-dns and OpenResty interactions.
Robust observability and operational hygiene keep Cirrus CDN resilient. Chapter 11 concludes the white paper with reference appendices for quick lookup.