DNS & Traffic Engineering

Cirrus CDN integrates an internal DNS authority that dynamically reflects control plane state. This chapter details how the CNAME service assembles zone data, monitors node health, and communicates with secondary name servers to steer clients toward healthy cache nodes.

CNAME Service Overview

src/cirrus/cname/service.py defines CnameService, which orchestrates:

  • Loading the base configuration via get_cname_settings() (cname/settings.py).
  • Spawning a hidden master DNS server (HiddenMasterServer in cname/dns_server.py).
  • Subscribing to Redis pub/sub channel cdn:cname:dirty.
  • Rebuilding the DNS zone when configuration changes or health events occur.
  • Publishing NOTIFY messages to downstream secondaries (e.g., NSD).

The service starts during FastAPI application startup and maintains an internal background task to react to dirty events.
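
A minimal sketch of that dirty-event loop, assuming a redis-py asyncio client and a hypothetical rebuild_zone coroutine (the real CnameService wires this logic into its background task and adds error handling):

    import asyncio

    import redis.asyncio as redis

    DIRTY_CHANNEL = "cdn:cname:dirty"

    async def watch_dirty_events(redis_url: str, rebuild_zone) -> None:
        """Subscribe to the dirty channel and trigger a zone rebuild for each event."""
        client = redis.from_url(redis_url)
        pubsub = client.pubsub()
        await pubsub.subscribe(DIRTY_CHANNEL)
        try:
            async for message in pubsub.listen():
                if message["type"] != "message":
                    continue  # skip subscribe/unsubscribe confirmations
                await rebuild_zone()  # rebuild the snapshot and NOTIFY secondaries
        finally:
            await pubsub.unsubscribe(DIRTY_CHANNEL)
            await client.close()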

Configuration & Settings

cname/settings.py reads environment variables to populate CnameSettings:

  • CNAME_BASE_DOMAIN: Apex domain (cdn.local.test in docker-compose.yml).
  • CNAME_REPLICAS_PER_SITE: Number of nodes assigned per site (default 2).
  • CNAME_DEFAULT_TTL: TTL for generated records (default 60 seconds).
  • DNS_SOA_* variables: SOA timing parameters used for negative caching and refresh.
  • DNS_NS1_A, DNS_NS1_AAAA: Glue records for the authoritative nameserver.
  • DNS_MASTER_BIND_ADDR, DNS_MASTER_PORT: Hidden master listening address/port (default 0.0.0.0:10053).
  • CNAME_DNS_SLAVES: Comma-separated list of slave endpoints (IPv4/IPv6).

Helper NodeHealthSettings controls health check behavior, mapping environment variables such as NODE_HEALTH_PORT, NODE_HEALTH_INTERVAL_SECS, NODE_HEALTH_FAILS_TO_DOWN, and NODE_HEALTH_SUCCS_TO_UP.
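
As an illustration of the mapping (not the actual CnameSettings class, which may validate more fields), a condensed loader for a subset of these variables could look like this; the defaults mirror the values listed above:

    import os
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CnameSettingsSketch:
        """Illustrative subset of the settings listed above."""
        base_domain: str
        replicas_per_site: int
        default_ttl: int
        master_bind_addr: str
        master_port: int
        slaves: tuple[str, ...]

    def load_cname_settings() -> CnameSettingsSketch:
        """Populate the sketch from the environment variables described above."""
        return CnameSettingsSketch(
            base_domain=os.environ["CNAME_BASE_DOMAIN"],
            replicas_per_site=int(os.environ.get("CNAME_REPLICAS_PER_SITE", "2")),
            default_ttl=int(os.environ.get("CNAME_DEFAULT_TTL", "60")),
            master_bind_addr=os.environ.get("DNS_MASTER_BIND_ADDR", "0.0.0.0"),
            master_port=int(os.environ.get("DNS_MASTER_PORT", "10053")),
            slaves=tuple(
                s.strip()
                for s in os.environ.get("CNAME_DNS_SLAVES", "").split(",")
                if s.strip()
            ),
        )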

Zone Construction

Rendezvous Hashing

cname/hashing.py implements rendezvous hashing (rendezvous_topk) to produce stable node assignments per domain. This algorithm:

  • Seeds the hash with the access FQDN.
  • Scores each node ID and selects the top replicas_per_site.
  • Minimizes shuffling when nodes are added or removed.
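
The idea fits in a few lines; the real rendezvous_topk may use a different hash function or tie-breaking scheme:

    import hashlib

    def rendezvous_topk(key: str, node_ids: list[str], k: int) -> list[str]:
        """Pick the k nodes with the highest per-(key, node) hash score."""
        def score(node_id: str) -> int:
            digest = hashlib.sha256(f"{key}|{node_id}".encode("utf-8")).digest()
            return int.from_bytes(digest[:8], "big")

        # Each (key, node) pair is scored independently, so removing one node only
        # remaps the keys that had ranked it in their top k.
        return sorted(node_ids, key=score, reverse=True)[:k]

Called with the access FQDN as the key and the configured replicas-per-site as k, this returns the same nodes on every rebuild until the node set itself changes.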

ZoneBuilder

cname/zone.py contains ZoneBuilder:

  • Filters nodes to those that are active (node.can_serve()).
  • Generates A/AAAA records pointing to node IPs for each domain (converted to an IDNA-safe access FQDN).
  • Adds SOA and NS records for the base zone.
  • Computes a deterministic signature of record content; increments serials only when the signature changes to stay RFC-compliant.
  • Produces a ZoneSnapshot with records, serial, and generated_at timestamp.
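
The signature-and-serial step can be summarized as follows; the concrete record encoding and serial scheme in zone.py may differ:

    import hashlib
    import json
    import time

    def zone_signature(records: list[tuple[str, str, int, str]]) -> str:
        """Hash the sorted (name, type, ttl, value) tuples into a stable signature."""
        canonical = json.dumps(sorted(records), separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def next_serial(prev_serial: int, prev_signature: str, signature: str) -> int:
        """Bump the SOA serial only when zone content actually changed."""
        if signature == prev_signature:
            return prev_serial
        # A Unix-timestamp serial stays monotonic across restarts and fits in 32 bits.
        return max(prev_serial + 1, int(time.time()))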

Access FQDNs

compute_access_fqdn(domain, settings) converts user domains into service endpoints:

<site-domain>.<base_fqdn>

For example, www.example.com becomes www.example.com.cdn.local.test. Validation ensures each label is at most 63 characters and the full FQDN stays within the 255-character DNS limit.
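
A sketch of the conversion using Python's built-in idna codec (the real compute_access_fqdn takes a settings object and may normalize and validate differently):

    def compute_access_fqdn_sketch(domain: str, base_domain: str) -> str:
        """Join <site-domain>.<base_fqdn>, IDNA-encode it, and enforce DNS length limits."""
        fqdn = f"{domain.rstrip('.')}.{base_domain.rstrip('.')}"
        # The idna codec rejects empty labels and labels longer than 63 octets.
        encoded = fqdn.encode("idna").decode("ascii")
        if len(encoded) > 255:
            raise ValueError(f"access FQDN exceeds 255 characters: {encoded}")
        return encoded

With the docker-compose base domain, compute_access_fqdn_sketch("www.example.com", "cdn.local.test") returns www.example.com.cdn.local.test.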

Hidden Master Server

cname/dns_server.py implements a lightweight authoritative server with both UDP and TCP listeners:

  • Security – Only allows queries/AXFR from configured slave IPs (normalized using ipaddress). Unauthorized clients receive REFUSED.
  • UDP Handling – Responds to standard queries, supports NXDOMAIN vs NOERROR semantics, attaches SOA for negative answers.
  • TCP AXFR – Streams full zone transfers by replaying records sequentially, ensuring compatibility with NSD.
  • NOTIFY – For each zone rebuild, the service sends DNS NOTIFY messages to every slave so they can initiate an AXFR. Retries are performed with exponential backoff inside a thread executor to avoid blocking the event loop.
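
The access-control and response paths can be illustrated with dnspython; the actual dns_server.py assembles packets itself and may differ in detail:

    import ipaddress

    import dns.message
    import dns.rcode

    # Hypothetical ACL; in practice this is derived from CNAME_DNS_SLAVES.
    ALLOWED_SLAVES = {ipaddress.ip_address("203.0.113.10")}

    def handle_udp_packet(wire: bytes, client_ip: str, zone: dict) -> bytes:
        """Answer one UDP query, refusing clients that are not configured slaves."""
        query = dns.message.from_wire(wire)
        response = dns.message.make_response(query)

        if ipaddress.ip_address(client_ip) not in ALLOWED_SLAVES:
            response.set_rcode(dns.rcode.REFUSED)
            return response.to_wire()

        qname = str(query.question[0].name).lower()
        rrsets = zone.get(qname)  # simplified snapshot: owner name -> list of RRsets
        if rrsets:
            response.answer.extend(rrsets)          # NOERROR with data
        else:
            response.set_rcode(dns.rcode.NXDOMAIN)  # the real server also attaches the SOA
        return response.to_wire()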

Unlike the TLS layer, which uses an nginx-style LRU cache, the DNS server does not cache: it relies on in-memory zone snapshots and locking to keep responses consistent across rebuilds.
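
The NOTIFY retry behavior described in the list above could be sketched like this, again with dnspython and with retry parameters chosen purely for illustration:

    import asyncio
    import random
    import time

    import dns.message
    import dns.opcode
    import dns.query
    import dns.rdatatype

    def send_notify(zone_name: str, slave_ip: str, port: int = 53, attempts: int = 3) -> None:
        """Send DNS NOTIFY for the zone to one slave, retrying with exponential backoff."""
        notify = dns.message.make_query(zone_name, dns.rdatatype.SOA)
        notify.set_opcode(dns.opcode.NOTIFY)
        delay = 0.5
        for attempt in range(attempts):
            try:
                dns.query.udp(notify, slave_ip, timeout=2.0, port=port)
                return
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay + random.random() * 0.1)  # jittered backoff
                delay *= 2

    async def notify_all_slaves(zone_name: str, slaves: list[tuple[str, int]]) -> None:
        """Run the blocking sends in a thread executor so the event loop stays responsive."""
        loop = asyncio.get_running_loop()
        await asyncio.gather(
            *(loop.run_in_executor(None, send_notify, zone_name, ip, port) for ip, port in slaves),
            return_exceptions=True,  # surface per-slave failures in logs rather than aborting
        )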

Health-Driven Routing

cname/health.py runs via Celery (see Chapter 4):

  • Polls cdn:nodes for registered nodes.
  • Performs HTTP GETs against each node's /healthz endpoint on the configured port (default 9145, which maps to OpenResty's metrics server).
  • Adjusts health_fails / health_succs counters, toggling the active flag when thresholds are crossed.
  • Publishes cdn:cname:dirty when active status changes to force DNS updates.
  • Returns structured results for logging or debugging.

This health loop decouples control-plane operations from data-plane stability and ensures DNS only advertises healthy endpoints.
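
A condensed version of the threshold logic, keeping counters in a plain dict and using illustrative thresholds (the real task persists state in Redis and reads thresholds from NodeHealthSettings):

    import httpx

    def check_node(node: dict, fails_to_down: int = 3, succs_to_up: int = 2) -> bool:
        """Probe one node's /healthz endpoint; return True when its active flag flipped."""
        url = f"http://{node['ip']}:{node.get('health_port', 9145)}/healthz"
        try:
            healthy = httpx.get(url, timeout=2.0).status_code == 200
        except httpx.HTTPError:
            healthy = False

        if healthy:
            node["health_succs"] = node.get("health_succs", 0) + 1
            node["health_fails"] = 0
        else:
            node["health_fails"] = node.get("health_fails", 0) + 1
            node["health_succs"] = 0

        active = node.get("active", True)
        if active and node["health_fails"] >= fails_to_down:
            active = False
        elif not active and node["health_succs"] >= succs_to_up:
            active = True

        changed = active != node.get("active", True)
        node["active"] = active
        return changed  # True => publish cdn:cname:dirty to trigger a zone rebuild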

Integration with External DNS

docker-compose.yml provisions an nsd container to act as a publicly reachable slave:

  • Receives NOTIFY messages on port 10054.
  • Serves the zone to clients (or other recursive resolvers).
  • Cirrus can scale to additional secondaries by extending CNAME_DNS_SLAVES.

In production, NSD instances would typically run in multiple regions and advertise anycast IPs to reduce resolution latency.

Failure Modes & Recovery

  • Redis Outages – If settings cannot be loaded at startup, the DNS service does not start and the API logs cname_settings_invalid. At runtime, if a zone rebuild fails (e.g., because node or domain data cannot be fetched from Redis), the error is logged and the service keeps listening for subsequent dirty events.
  • Node Flaps – Health checks rely on consecutive failure/success thresholds to avoid thrash. Rendezvous hashing limits record churn.
  • Slave Disconnects – NOTIFY failures are logged per slave; operators must ensure network reachability.
  • Configuration Mistakes – Invalid base domains or mis-specified slave entries produce startup errors and prevent the API from running, surfacing misconfiguration early.

Operational Considerations

  • Ensure CNAME_BASE_DOMAIN is delegated to NSD instances (or equivalent) in production.
  • Monitor cname_hidden_master_listening logs for bind addresses and cname_notify_failed for slave communication issues.
  • Periodically inspect the Redis keyspace for stray entries; domains deleted via the API are cleaned up automatically, and descriptive key names make manual inspection straightforward.

DNS is the lynchpin that links control plane declarations to end-user experience. The next chapter examines how OpenResty interprets the same domain and node state during request handling.