How Ployz models a cluster

A Ployz cluster is a flat mesh of peer machines. There is no master node. No machine holds state that others lack, and no machine’s removal breaks coordination. You can remove any node — including the one you are currently connected to — without a quorum ceremony or a controller migration. This peer model is what makes machine remove safe by construction.

When a machine joins the cluster it receives:

  • A WireGuard identity (public key and overlay IPv6 address).
  • A NATS leaf node connection to the cluster’s control-plane store.
  • A subnet for workload container networking.
  • A machine ID, region, and optional availability zone.
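The join-time facts above can be modeled as one small record per machine. This is a sketch with hypothetical field names, not Ployz's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MachineRecord:
    # Hypothetical field names illustrating what a machine receives on join.
    machine_id: str
    wg_public_key: str            # WireGuard identity
    overlay_ipv6: str             # overlay address on the mesh
    container_subnet: str         # subnet for workload container networking
    region: str
    availability_zone: Optional[str] = None  # optional topology label

m = MachineRecord(
    machine_id="m-01",
    wg_public_key="base64-encoded-pubkey",
    overlay_ipv6="fd00::1",
    container_subnet="fd00:1::/64",
    region="eu-central",
)
```

Every field is a durable membership fact; nothing here depends on any other machine being reachable, which is what lets any node be removed without breaking the rest.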

From that point forward, the machine is a full participant. Coordination, locking, and state visibility work through NATS on a peer-oriented model. No machine is elected leader for the cluster as a whole; authority is scoped to deploy namespaces, not to the cluster globally.

NATS is not a message bus bolted on for convenience. It is the native substrate for everything the control plane needs to do:

  • Durable facts. Deploy commits, machine membership records, routing events, and instance status are stored in NATS JetStream streams and KV buckets. These survive daemon restarts.
  • Coordination. Deploy leases, participant locks, and quorum decisions happen through NATS. A command that cannot acquire a lock fails loudly; it does not queue or retry silently.
  • Request/reply commands. Small participant actions — start a container, probe readiness, confirm a volume transfer — use NATS request/reply on per-machine subjects. A missing responder or a timeout is an immediate foreground failure.
  • Ordered routing events. The gateway and DNS service consume a NATS stream of routing events and rebuild their view from it. If freshness becomes uncertain, they reload rather than serving stale projections.

Ployz separates state into three categories that are never mixed:

  • Intent: what an operator explicitly asked the cluster to do. Examples: deploy commits, machine membership records, instance status, routing events.
  • Status: durable lifecycle facts emitted by operations. Examples: deploy phase records, volume movement evidence, branch lineage.
  • Observation: live reachability, health, and capacity checked at decision time. Examples: placement probes, WireGuard handshake state, participant readiness.

Intent and Status live in NATS JetStream — they are durable and survive restarts. Observation is always checked live, at the moment a decision needs to be made. The cluster does not rewrite Intent from stale Observations.
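The separation can be sketched in a few lines, with invented names: Intent is read from a durable record, while Observation comes from a live probe at the moment of decision and is never written back as Intent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    # Durable: what the operator asked for (would live in JetStream).
    service: str
    replicas: int

def probe_reachable(machine):
    # Stand-in for a live check (WireGuard handshake, readiness probe).
    return machine != "m-02"

def place(intent, machines):
    # Observation is gathered fresh at decision time, never cached as intent.
    live = [m for m in machines if probe_reachable(m)]
    if len(live) < intent.replicas:
        raise RuntimeError("not enough reachable machines")
    return live[: intent.replicas]

print(place(Intent("web", 2), ["m-01", "m-02", "m-03"]))
```

If the probe result were persisted and reused later, a stale Observation could silently rewrite what the operator asked for — exactly the mixing the three-category split forbids.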

Workloads are grouped into namespaces. A namespace is the unit of deploy authority: one owning authority accepts durable deploy writes for a namespace, and routing events belong to that authority.

Machines are members of the cluster itself, not of any particular namespace. A single machine can run workloads from multiple namespaces. Placement decisions — which machines receive which workloads — happen at deploy time, based on live machine capacity and region role.

Every machine has a region and an optional availability zone. These are operator-assigned topology labels used to guide placement decisions. Regions have one of four roles:

  • home_data: receives new placements; preferred for stateful workloads.
  • compute: receives new placements; preferred for stateless workloads.
  • draining: no new placements; existing workloads drain off.
  • disabled: no new placements; excluded from all placement decisions.

Region roles affect where workloads land. They do not create separate write authorities. Deploy commits, instance records, and routing events always belong to the namespace’s owning authority regardless of which region a machine is in.
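The role table above implies a simple placement filter. This sketch uses assumed names; only home_data and compute regions accept new placements, with the role matched to the workload kind preferred:

```python
ACCEPTS_NEW = {"home_data", "compute"}

def placement_candidates(machines, region_roles, stateful):
    """machines: list of (machine_id, region); region_roles: region -> role."""
    eligible = [(m, r) for m, r in machines if region_roles.get(r) in ACCEPTS_NEW]
    preferred_role = "home_data" if stateful else "compute"
    # Prefer the role matched to the workload kind; fall back to any eligible region.
    preferred = [m for m, r in eligible if region_roles[r] == preferred_role]
    return preferred or [m for m, _ in eligible]

roles = {"eu": "home_data", "us": "compute", "ap": "draining"}
machines = [("m-01", "eu"), ("m-02", "us"), ("m-03", "ap")]
print(placement_candidates(machines, roles, stateful=True))  # draining "ap" excluded
```

Note that the filter only narrows where a workload lands; it says nothing about write authority, which stays with the namespace's owning authority regardless of region.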

Ployz targets clusters in the 1–200 node range. This is not an arbitrary limit — it is the range in which an operator can understand the whole system, explain every workload’s placement, and reason about a migration or branch operation end-to-end.

ployzctl dev runs the full cluster model locally. All primitives — branch, migrate, rollback — work identically to a multi-node cluster.

Up to 200 nodes joined into one WireGuard mesh. One model, one set of primitives, no operational bifurcation between “dev” and “production”.

Kubernetes is the right tool for 10,000-node fleets. Ployz is the right tool when you want an operator — human or agent — who can hold the entire cluster model in working memory and make decisions with complete information.