DNS & Traffic Engineering
Cirrus CDN integrates an internal DNS authority that dynamically reflects control plane state. This chapter details how the CNAME service assembles zone data, monitors node health, and communicates with secondary name servers to steer clients toward healthy cache nodes.
CNAME Service Overview
src/cirrus/cname/service.py defines CnameService, which orchestrates:
- Loading the base configuration via get_cname_settings() (cname/settings.py).
- Spawning a hidden master DNS server (HiddenMasterServer in cname/dns_server.py).
- Subscribing to the Redis pub/sub channel cdn:cname:dirty.
- Rebuilding the DNS zone when configuration changes or health events occur.
- Publishing NOTIFY messages to downstream secondaries (e.g., NSD).
The service starts during FastAPI application startup and maintains an internal background task to react to dirty events.
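A minimal sketch of that background loop, assuming redis-py's asyncio client and an illustrative rebuild_zone() callback (the real service wires this through CnameService and HiddenMasterServer):

```python
import redis.asyncio as redis

DIRTY_CHANNEL = "cdn:cname:dirty"

async def watch_dirty_events(redis_url: str, rebuild_zone) -> None:
    """Background task: rebuild the zone whenever a dirty event arrives."""
    client = redis.from_url(redis_url)
    pubsub = client.pubsub()
    await pubsub.subscribe(DIRTY_CHANNEL)
    try:
        async for message in pubsub.listen():
            if message["type"] != "message":
                continue  # skip subscribe confirmations
            try:
                await rebuild_zone()  # recompute records, bump serial, send NOTIFY
            except Exception:
                # A failed rebuild should not kill the loop; wait for the next event.
                continue
    finally:
        await pubsub.unsubscribe(DIRTY_CHANNEL)
        await client.aclose()
```

At startup the service would spawn this with asyncio.create_task(...) and cancel it on shutdown.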
Configuration & Settings
cname/settings.py reads environment variables to populate CnameSettings:
- CNAME_BASE_DOMAIN: Apex domain (cdn.local.test in docker-compose.yml).
- CNAME_REPLICAS_PER_SITE: Number of nodes assigned per site (default 2).
- CNAME_DEFAULT_TTL: TTL for generated records (default 60 seconds).
- DNS_SOA_* variables: SOA timing parameters used for negative caching and refresh.
- DNS_NS1_A, DNS_NS1_AAAA: Glue records for the authoritative nameserver.
- DNS_MASTER_BIND_ADDR, DNS_MASTER_PORT: Hidden master listening address/port (default 0.0.0.0:10053).
- CNAME_DNS_SLAVES: Comma-separated list of slave endpoints (IPv4/IPv6).
Helper NodeHealthSettings controls health check behavior, mapping environment variables such as NODE_HEALTH_PORT, NODE_HEALTH_INTERVAL_SECS, NODE_HEALTH_FAILS_TO_DOWN, and NODE_HEALTH_SUCCS_TO_UP.
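As a rough illustration, these variables could map onto typed settings classes as below. This is a sketch assuming pydantic-settings; the SOA/NS fields are omitted, and the health-check defaults other than the port are illustrative, not the project's actual values.

```python
from pydantic_settings import BaseSettings

class CnameSettings(BaseSettings):
    """Illustrative subset of the variables listed above; pydantic-settings
    matches field names to environment variables case-insensitively."""
    cname_base_domain: str = "cdn.local.test"
    cname_replicas_per_site: int = 2
    cname_default_ttl: int = 60                 # seconds
    dns_master_bind_addr: str = "0.0.0.0"
    dns_master_port: int = 10053
    cname_dns_slaves: str = ""                  # comma-separated IPv4/IPv6 endpoints

    def slave_endpoints(self) -> list[str]:
        return [s.strip() for s in self.cname_dns_slaves.split(",") if s.strip()]


class NodeHealthSettings(BaseSettings):
    node_health_port: int = 9145                # OpenResty metrics/health port
    node_health_interval_secs: int = 10         # illustrative default
    node_health_fails_to_down: int = 3          # illustrative default
    node_health_succs_to_up: int = 2            # illustrative default
```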
Zone Construction
Rendezvous Hashing
cname/hashing.py implements rendezvous hashing (rendezvous_topk) to produce stable node assignments per domain. This algorithm:
- Seeds the hash with the access FQDN.
- Scores each node ID and selects the top replicas_per_site.
- Minimizes shuffling when nodes are added or removed.
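A minimal sketch of rendezvous (highest-random-weight) hashing consistent with this description; the function name matches the one above, but the parameters and hash construction are assumptions:

```python
import hashlib

def rendezvous_topk(access_fqdn: str, node_ids: list[str], k: int) -> list[str]:
    """Return the k node IDs with the highest hash score for this FQDN."""
    def score(node_id: str) -> int:
        # Seed the hash with the access FQDN, then mix in the node ID.
        digest = hashlib.sha256(f"{access_fqdn}|{node_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # Sorting by score keeps assignments stable: removing one node only
    # affects the domains that had that node in their top-k set.
    return sorted(node_ids, key=score, reverse=True)[:k]
```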
ZoneBuilder
cname/zone.py contains ZoneBuilder:
- Filters nodes to those that are active (node.can_serve()).
- Generates A/AAAA records pointing to node IPs for each domain (converted to an IDNA-safe access FQDN).
- Adds SOA and NS records for the base zone.
- Computes a deterministic signature of record content; increments the serial only when the signature changes, to stay RFC-compliant.
- Produces a ZoneSnapshot with records, serial, and generated_at timestamp.
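The serial rule is the subtle part: the serial must only move forward, and only when zone content actually changes. A minimal sketch, assuming the snapshot fields named above and a simple monotonic serial (the real ZoneBuilder may use a different scheme):

```python
import hashlib
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class ZoneSnapshot:
    records: tuple[str, ...]   # canonical presentation-format record lines
    serial: int
    generated_at: float

def zone_signature(records: list[str]) -> str:
    """Deterministic digest of record content, independent of input order."""
    canonical = "\n".join(sorted(records)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def next_snapshot(records: list[str], previous: ZoneSnapshot | None) -> ZoneSnapshot:
    """Bump the serial only when the record set actually changed."""
    sig = zone_signature(records)
    if previous is not None and zone_signature(list(previous.records)) == sig:
        return previous  # nothing changed: keep the old serial, no NOTIFY needed
    serial = previous.serial + 1 if previous else 1
    return ZoneSnapshot(records=tuple(sorted(records)), serial=serial,
                        generated_at=time.time())
```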
Access FQDNs
compute_access_fqdn(domain, settings) converts user domains into service endpoints:
<site-domain>.<base_fqdn>
For example, www.example.com becomes www.example.com.cdn.local.test. Validation ensures each label stays within the 63-character limit and the full FQDN within 255 characters.
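A sketch of that conversion using Python's built-in IDNA codec; note the real compute_access_fqdn takes the settings object, whereas this illustration takes the base domain directly:

```python
def compute_access_fqdn(domain: str, base_domain: str) -> str:
    """Map a customer domain to its service endpoint under the base zone."""
    fqdn = f"{domain.rstrip('.')}.{base_domain.strip('.')}".lower()
    # IDNA-encode to get an ASCII-safe name for internationalized domains.
    ascii_fqdn = fqdn.encode("idna").decode("ascii")
    if any(len(label) > 63 for label in ascii_fqdn.split(".")):
        raise ValueError(f"label too long in {ascii_fqdn!r}")
    if len(ascii_fqdn) > 255:
        raise ValueError(f"FQDN too long: {ascii_fqdn!r}")
    return ascii_fqdn

# compute_access_fqdn("www.example.com", "cdn.local.test")
# -> "www.example.com.cdn.local.test"
```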
Hidden Master Server
cname/dns_server.py implements a lightweight authoritative server with both UDP and TCP listeners:
- Security – Only allows queries/AXFR from configured slave IPs (normalized using ipaddress). Unauthorized clients receive REFUSED.
- UDP Handling – Responds to standard queries, supports NXDOMAIN vs NOERROR semantics, and attaches the SOA for negative answers.
- TCP AXFR – Streams full zone transfers by replaying records sequentially, ensuring compatibility with NSD.
- NOTIFY – For each zone rebuild, the service sends DNS NOTIFY messages to every slave so they can initiate an AXFR (see the sketch below). Retries are performed with exponential backoff inside a thread executor to avoid blocking the event loop.
The ngx-style LRU cache used on the TLS path is not used here; DNS relies on in-memory snapshots and locking to guarantee consistency across rebuilds.
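As an illustration of the NOTIFY path, here is a sketch using dnspython; whether the project actually uses dnspython is an assumption, as are the retry counts. In the async service this function would be dispatched via loop.run_in_executor so network timeouts never block the event loop.

```python
import time

import dns.flags
import dns.message
import dns.opcode
import dns.query
import dns.rdatatype

def send_notify(zone: str, slave_ip: str, port: int = 53,
                attempts: int = 3, base_delay: float = 0.5) -> bool:
    """Send a DNS NOTIFY for `zone` to one slave, retrying with exponential backoff."""
    notify = dns.message.make_query(zone, dns.rdatatype.SOA)
    notify.set_opcode(dns.opcode.NOTIFY)
    notify.flags = dns.flags.AA  # NOTIFY is authoritative and not recursive
    for attempt in range(attempts):
        try:
            dns.query.udp(notify, slave_ip, port=port, timeout=2.0)
            return True
        except Exception:
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return False
```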
Health-Driven Routing
cname/health.py runs via Celery (see Chapter 4):
- Polls cdn:nodes for registered nodes.
- Performs HTTP GETs against each node's /healthz endpoint on the configured port (default 9145, which maps to OpenResty's metrics server).
- Adjusts health_fails/health_succs counters, toggling the active flag when thresholds are crossed.
- Publishes cdn:cname:dirty when active status changes to force DNS updates.
- Returns structured results for logging or debugging.
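A condensed sketch of the per-node threshold logic, assuming the requests library and a plain dict for node state (the real task reads and writes these fields in Redis):

```python
import requests

def check_node(node: dict, fails_to_down: int, succs_to_up: int,
               port: int = 9145, timeout: float = 2.0) -> bool:
    """Update a node's health counters in place; return True if `active` flipped."""
    try:
        resp = requests.get(f"http://{node['ip']}:{port}/healthz", timeout=timeout)
        healthy = resp.status_code == 200
    except requests.RequestException:
        healthy = False

    if healthy:
        node["health_succs"] = node.get("health_succs", 0) + 1
        node["health_fails"] = 0
    else:
        node["health_fails"] = node.get("health_fails", 0) + 1
        node["health_succs"] = 0

    was_active = node.get("active", True)
    if was_active and node["health_fails"] >= fails_to_down:
        node["active"] = False
    elif not was_active and node["health_succs"] >= succs_to_up:
        node["active"] = True
    return node.get("active", True) != was_active

# When any node flips, the task publishes to cdn:cname:dirty so the
# CNAME service rebuilds the zone with only healthy endpoints.
```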
This health loop decouples control-plane operations from data-plane stability and ensures DNS only advertises healthy endpoints.
Integration with External DNS
docker-compose.yml provisions an nsd container to act as a publicly reachable slave:
- Receives NOTIFY messages on port 10054.
- Serves the zone to clients (or other recursive resolvers).
- Cirrus can scale to additional secondaries by extending CNAME_DNS_SLAVES.
In production, NSD instances would typically run in multiple regions behind anycast IPs to reduce latency.
Failure Modes & Recovery
- Redis Outages – Without configuration, the DNS service fails to start; the API logs cname_settings_invalid. At runtime, if the service cannot rebuild zones (e.g., due to data fetch errors), it logs the failure and continues listening for subsequent events.
- Node Flaps – Health checks rely on consecutive failure/success thresholds to avoid thrash, and rendezvous hashing limits record churn.
- Slave Disconnects – NOTIFY failures are logged per slave; operators must ensure network reachability.
- Configuration Mistakes – Invalid base domains or mis-specified slave entries produce startup errors and prevent the API from running, surfacing misconfiguration early.
Operational Considerations
- Ensure CNAME_BASE_DOMAIN is delegated to NSD instances (or equivalent) in production.
- Monitor cname_hidden_master_listening logs for bind addresses and cname_notify_failed for slave communication issues.
- Scrutinize the Redis keyspace for stray entries; stale domains are removed automatically when deleted via the API, and manual Redis inspection is straightforward thanks to descriptive key names.
DNS is the linchpin that links control plane declarations to end-user experience. The next chapter examines how OpenResty interprets the same domain and node state during request handling.