# Routing, gateway, and DNS: how Ployz serves traffic
Routing and deploys in Ployz follow a single baseline rule: traffic only sees committed, routable facts. The gateway and DNS are projections rebuilt from durable state, not authoritative stores of their own. This page explains the deploy truth model, how an apply flows through to a final commit, and what “routable” means in practice.
## Deploy truth model

One owning authority accepts durable deploy writes. All deploy state belongs to that authority, even when workloads run across multiple regions or machines. Regions affect placement decisions — draining and disabled regions do not receive new placements — but they do not create additional write authorities.
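The placement rule above can be sketched as a pure filter. `Region` and its `state` values are hypothetical names for illustration, not Ployz's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    name: str
    state: str  # "active", "draining", or "disabled" (illustrative states)

def placement_eligible(regions: list) -> list:
    # Draining and disabled regions receive no new placements,
    # but they never become a second write authority.
    return [r for r in regions if r.state == "active"]
```

The key point is that region state only narrows where new work may land; writes still flow through the single owning authority.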
The following table describes every kind of state Ployz tracks for deploys and routing, how it is stored, and what it is used for.
| Data | Kind | Notes |
|---|---|---|
| Deploy commits | Durable intent | Immutable events appended to cp_deploy_commits_<authority>. Each commit is a point of no return for the facts it contains. |
| Deploy status | Durable status | Mutable lifecycle record in cp_deploy_status_<authority>. Transitions through applying, committed, failed, and FailedAfterCheckpoint. |
| Deploy phase records | Durable status | Per-phase execution state, work, policies, and commit linkage in cp_deploy_phases_<authority>. |
| Branch lineage | Durable intent | Committed service source lineage folded from deploy commits. Explains which source revision a target service came from. |
| Volume movement evidence | Durable intent | Committed volume source/target and verified transfer proof folded from deploy commits. Explains which deploy and phase moved a volume, which machines were involved, and which verified transfer snapshot made the ownership change safe. |
| Instance records | Durable status | Runtime lifecycle in cp_instances_<authority>. |
| Routing events | Projection | Ordered facts in routing_events_<authority>. Rebuildable from stored intent. The gateway and DNS consume these to build their projections. |
| Placement probes | Live facts | NATS request/reply. No responder means the machine is unavailable right now. Not stored. |
| ZFS transfer progress | Live facts | Foreground operation evidence while a transfer is running. Only verified success folded into a deploy commit becomes durable movement evidence. |
| Deploy lock | Live coordination | Lease in cp_locks_<authority>. Prevents concurrent deploys to the same namespace. |
## The apply flow

A deploy apply is foreground work with nine explicit steps. Each step has a defined scope and a defined failure behavior.
### Preview manifest against current stored intent

The orchestrator reads current deploy commits, status, and instance records to understand what is already running. This produces a plan: which instances need to start, which need to stop, which volumes need to move.
### Acquire the namespace deploy lease

The orchestrator acquires a deploy lock in cp_locks_&lt;authority&gt; for the target namespace. This prevents a second concurrent apply from starting. If the lock is already held, the apply fails immediately with a structured error.
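The fail-fast lease semantics can be sketched with an in-memory stand-in. `LeaseStore` and `LockHeldError` are hypothetical names; the real lock lives in cp_locks_&lt;authority&gt;:

```python
class LockHeldError(Exception):
    """Structured failure: the namespace deploy lease is already held."""

class LeaseStore:
    """Hypothetical in-memory stand-in for the cp_locks_<authority> bucket."""

    def __init__(self):
        self._locks = {}

    def acquire(self, namespace: str, holder: str) -> None:
        if namespace in self._locks:
            # A second concurrent apply fails immediately; there is no queueing.
            raise LockHeldError(
                f"deploy lock for {namespace!r} held by {self._locks[namespace]!r}"
            )
        self._locks[namespace] = holder

    def release(self, namespace: str) -> None:
        self._locks.pop(namespace, None)
```

The design choice worth noting: acquisition never blocks or retries, so a stuck deploy surfaces as an explicit error to the second caller rather than a silent queue.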
### Probe eligible machines for live capacity

For each candidate machine, the orchestrator sends a NATS request/reply to the machine’s command subject. The response includes current capacity. No responder or a timeout marks the machine as unavailable for this deploy.
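The probe outcome maps to availability as a simple classification. The `error` values and response shape below are illustrative, standing in for the two absence signals a NATS request can produce (no responders, or a timeout):

```python
def classify_probe(response, error):
    """Map one request/reply outcome to availability for this deploy."""
    if error in ("no_responders", "timeout"):
        return None  # machine is unavailable right now; nothing is stored
    return response["capacity"]  # live capacity fact used for placement only
```

Because probes are live facts, a `None` result only excludes the machine from this apply; it leaves no durable record behind.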
### Write applying deploy status and pending phase records

The deploy status is written as applying. Per-phase records are written with pending state. These are the first durable writes of the apply.
### Execute phase-owned work

For each phase: stop moved-volume writers on the source machine, perform any blocking ZFS moves, then start candidate containers on target machines and wait for readiness.
### Append checkpoint commits for intermediate phases

For checkpoint phases, append an immutable deploy commit containing the facts owned by that phase. Link the phase record to the commit ID. After a checkpoint commit, later failure is reported as FailedAfterCheckpoint — the checkpointed facts remain durable.
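The status distinction can be stated as a one-line rule, a sketch assuming the two status names from the table above:

```python
def failure_status(checkpoint_commits: int) -> str:
    # Any checkpoint commit makes its facts durable, so a later failure
    # must be reported distinctly from a clean pre-commit failure.
    return "FailedAfterCheckpoint" if checkpoint_commits > 0 else "failed"
```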
### Append the final deploy commit

Append the final immutable deploy commit for all remaining facts. Link end-of-deploy phase records to the final deploy ID. This is the point of no return. Before this commit, failure aborts with no lasting state change. After it, the new version is live.
### Publish derived routing events

Publish ordered routing events to routing_events_&lt;authority&gt;. The gateway and DNS consume these events to update their projections.
### Drain and remove old instances

Old instances are drained and stopped. Cleanup failures are recorded as explicit recoverable status — they do not erase the fact that the new version is live and do not revert the committed deploy.
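The nine steps above can be sketched as one skeleton function. Every `orch` method here is a hypothetical seam named after the step it performs, not Ployz's real API:

```python
def apply_deploy(orch, namespace, manifest):
    """Skeleton of the nine-step apply flow; all orchestrator
    methods are illustrative stand-ins."""
    plan = orch.preview(manifest)                  # 1. plan against stored intent
    orch.acquire_lease(namespace)                  # 2. fails fast if already held
    try:
        machines = orch.probe(plan.candidates)     # 3. live capacity; no responder = unavailable
        orch.write_status(namespace, "applying")   # 4. first durable writes
        for phase in plan.phases:
            orch.execute(phase, machines)          # 5. stop writers, move volumes, start containers
            if phase.checkpoint:
                orch.append_commit(phase)          # 6. durable point of no return for these facts
        orch.append_final_commit(plan)             # 7. the deploy's point of no return
        orch.publish_routing_events(plan)          # 8. gateway/DNS projections update
        orch.drain_old_instances(plan)             # 9. recoverable cleanup; never reverts the commit
    finally:
        orch.release_lease(namespace)
```

The shape makes the failure boundaries visible: anything before step 7 can abort cleanly, anything after it is cleanup around an already-live version, and the lease is released either way.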
## Routing projection

The gateway and DNS are stateless projections. They do not hold authoritative routing state; they rebuild it from durable records.
On startup:
- Load stored routing state from the store (machines, revisions, releases, instances).
- Then consume ordered routing events from the stream, applying each event to update the in-memory projection.
- Begin serving.
While running, the gateway and DNS continue consuming routing events as they arrive. Each event is applied in order: upserts replace the existing record for a contract identity; removals drop the matching record.
If freshness becomes uncertain — for example, if the consumer falls behind or loses its position in the stream — the projection discards its local view entirely and rebuilds from stored state, then resumes consuming events. This is a deliberate safety property: serving a stale projection silently is worse than the brief interruption of a rebuild.
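The fold and rebuild described above reduce to a small amount of logic. The event shape (`kind`, `identity`, `record`) is illustrative:

```python
def apply_event(projection: dict, event: dict) -> None:
    """Fold one ordered routing event into the in-memory view.
    Keys are contract identities."""
    key = event["identity"]
    if event["kind"] == "upsert":
        projection[key] = event["record"]  # replaces any existing record
    elif event["kind"] == "remove":
        projection.pop(key, None)          # drops the matching record

def rebuild(stored_state: dict, events) -> dict:
    """Startup path, reused whenever freshness becomes uncertain:
    discard the local view, reload stored state, replay events in order."""
    projection = dict(stored_state)
    for event in events:
        apply_event(projection, event)
    return projection
```

Because `rebuild` starts from stored state rather than the discarded local view, a lost stream position costs a brief rebuild but can never leave a silently stale record in place.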
## What “routable” means

An instance is routable when all of the following conditions hold:
- Ready — the instance has completed startup and passed its readiness check
- Not draining — the instance is not in the process of being stopped for a migration or deploy
- No errors — no unrecovered error state recorded for the instance
- Has overlay IP — the instance has a reachable overlay network address
- Matches current slot, machine, and revision — the instance belongs to the current committed release, not a superseded one
An instance that fails any condition is excluded from the gateway’s upstream pool and from DNS records. There is no grace period or optimistic inclusion. The commit that made a new revision live also makes old instances eligible for removal, but those old instances remain in the projection until they are explicitly drained and their routing events are published.
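The five conditions combine as a strict conjunction. A minimal sketch, with illustrative field names rather than Ployz's actual instance record schema:

```python
def is_routable(inst: dict, current_release: str) -> bool:
    # All five conditions must hold; there is no grace period
    # and no optimistic inclusion.
    return (
        inst["ready"]
        and not inst["draining"]
        and inst["error"] is None
        and inst["overlay_ip"] is not None
        and inst["release"] == current_release
    )
```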
## Remote commands via NATS request/reply

Participant actions during an apply — starting a candidate container, stopping a writer before a volume move, checking capacity — use NATS request/reply on per-machine subjects. Each command targets an explicit machine. There are no broadcast commands and no session state.
The subject pattern is:

`ployz.v1.<installation>.<authority>.rpc.node.<machine_id>.<command>`

No responder or a timeout fails the foreground operation. The orchestrator does not retry silently — it returns a structured RpcFailure to the caller. The caller or operator decides whether to retry the whole operation.
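Interpolating the pattern is straightforward; the concrete values below ("prod", "auth1", and so on) are made up for illustration:

```python
def node_subject(installation: str, authority: str,
                 machine_id: str, command: str) -> str:
    # Fills the per-machine command subject pattern from the text above.
    return f"ployz.v1.{installation}.{authority}.rpc.node.{machine_id}.{command}"
```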