Cluster Mode
BoilStream supports horizontal scaling through stable node identity, S3-backed cluster state, and Quack Multi-Raft for strongly consistent, cluster-wide control-plane writes (the cp-metadata catalog).
Production path
For multi-pod deployments, run cluster mode with the cp-metadata (Quack Multi-Raft) catalog enabled. S3 remains the discovery, backup, and disaster-recovery tier for bulk data; Quack Multi-Raft is the live consistency layer for control-plane metadata and any opt-in cluster-wide-write tables. See Multi-Raft for the consistency model and SQL interface.
Architecture
Cluster mode has two coordination layers:
| Layer | Purpose | Storage |
|---|---|---|
| S3 cluster state | Leader discovery, broker heartbeats, shared secrets, cold restore artifacts; bulk analytical data via DuckLake hot→cold tiering | cluster_state/ in the primary S3 backend |
Quack Multi-Raft (cp-metadata) | Strongly consistent catalog/user/topic/tenant metadata (and opt-in cluster-wide-write tables) across nodes | Per-node DuckDB catalog + NuRaft quorum over mTLS; consensus owned by quack-raft |
Bulk analytical data flows through the hot nodes to object storage (DuckLake hot/cold tiering), not through Raft. Quack Multi-Raft protects the control-plane state that must be coherent before nodes serve catalog and authorization decisions. See Multi-Raft.
Leadership And Consistency
Cluster mode has two leadership surfaces:
- S3-backed cluster leadership selects the server that owns cluster coordination and admin routing.
- The Quack
cp-metadataMulti-Raft group elects a Raft leader that orders catalog, topic, tenant, and grant mutations through quorum. Writes from any node are forwarded to the current leader by quack-raft.
In production multi-pod clusters, the cp-metadata Multi-Raft group is the source of truth for metadata convergence. S3 leadership alone is not the consistency contract for catalog visibility.
Configuration
cluster_mode:
enabled: true
node_id_path: "./node_id" # Persistent ID (over restarts)
# advertised_host = in-cluster DNS used for pod-to-pod gRPC on
# internal_api_port. In Kubernetes this is the headless-Service
# per-pod DNS (e.g. boilstream-0.boilstream-headless.<ns>.svc.cluster.local).
advertised_host: "node1.internal.example.com"
# public_host = externally-resolvable hostname. Goes into vended
# DuckLake credentials and admin redirects. Leave unset for
# standalone deployments — it falls back to advertised_host.
public_host: "boilstream-0.app.example.com"
# public_tcp_port = the per-pod gateway TCP listener that exposes
# PGWire without TLS-passthrough SNI gymnastics. The Helm chart
# sets this to 15432 + pod_index automatically. Standalone /
# bare-VM deployments leave it unset and clients use pgwire.port.
# public_tcp_port: 15432
internal_api_port: 8444 # Inter-node communication (PGWire SCRAM RPC, user-context RPC, bootstrap-token RPC)
# Strongly consistent, cluster-wide metadata via Quack Multi-Raft (cp-metadata).
# Replaces the removed metadata_raft_* keys. Consensus is owned by quack-raft.
cp_metadata_enabled: true
cp_metadata_port: 8445
# Voter peers in quack peers= form: node_id:role=quack:host:port
cp_metadata_peers:
- "n0:voter=quack:boilstream-0.boilstream-headless.default.svc.cluster.local:8445"
- "n1:voter=quack:boilstream-1.boilstream-headless.default.svc.cluster.local:8445"
- "n2:voter=quack:boilstream-2.boilstream-headless.default.svc.cluster.local:8445"
# Shared token for the cp-group quack endpoint (CREATE SECRET + serve token)
cp_metadata_token: "..."
leader_heartbeat_interval_secs: 30
leader_stale_threshold_secs: 120
broker_heartbeat_interval_secs: 30
broker_stale_threshold_secs: 120Node Roles
Leader Node
- Handles all catalog management operations (create/delete DuckLakes, user management)
- Coordinates cluster state
- Performs catalog backups to S3
- Proposes
cp-metadataMulti-Raft mutations for replicated metadata paths - Processes admin API requests
Broker Nodes
- Handle data ingestion (FlightRPC, Kafka, HTTP/2)
- Serve read queries via PostgreSQL interface
- Redirect admin requests to leader
- Apply committed
cp-metadataMulti-Raft entries locally and can serve linearizable metadata reads - Can be promoted to leader if current leader fails
Leader-RPC fan-out for stateful auth
Some auth paths can't be served from a worker's local DuckDB because the underlying state (vended credentials, in-flight bootstrap tokens) lives only on the leader. Workers fan out to the leader over the internal mTLS API on internal_api_port:
| Worker need | Leader RPC | What it returns |
|---|---|---|
| Validate a PGWire SCRAM handshake | GetScramHash(username) | The vended SCRAM-SHA-256 hash |
| Resolve a PGWire username → tenant context | GetPgwireUserContext(username) | (internal_user_id, user_id, tenant_id) |
| Issue a bootstrap token | GenerateBootstrapToken(...) | A signed token bound to the cluster secret |
This means any pod can terminate a PGWire / FlightSQL / admin connection without rejecting authenticated clients during a leader transition. Failover is transparent on existing TCP sessions.
Broker States
| State | Description |
|---|---|
active | Normal operation |
draining | Finishing work, rejecting new connections (for rolling updates) |
retiring | Preparing for shutdown |
shutdown | Node shutting down |
Operations
View Cluster Status
boilstream-admin cluster statusRolling Updates
# 1. Drain node
boilstream-admin cluster broker update node-123 --state draining
# 2. Wait for connections to finish, then stop/update/restart
# 3. Re-activate
boilstream-admin cluster broker update node-123 --state activeLeader Stepdown
# Graceful stepdown (automatic leader election)
boilstream-admin cluster leader stepdown
# Stepdown with preferred successor
boilstream-admin cluster leader stepdown --preferred node-456High Availability
- Automatic failover: If the cluster leader becomes stale, brokers compete for leadership; the
cp-metadataMulti-Raft group continues to preserve committed metadata through quorum - No single point of failure: Any node can become leader
- Consistent metadata: Quack Multi-Raft commits control-plane metadata through quorum and applies it on every healthy node
- Quorum availability: A 3-voter group remains responsive with any 2 healthy voters
- Durable bulk data: S3 (DuckLake) holds analytical data via hot→cold tiering, independent of Raft
- Graceful degradation: Ingestion continues even during leader transitions
Cluster Identity
Nodes belong to the same cluster when they share the same primary storage backend configuration. The cluster state is stored in S3 at:
s3://{storage.bucket}/{storage.prefix}cluster_state/Configure in config.yaml:
storage:
primary_backend:
type: s3
bucket: "my-cluster-bucket"
prefix: "prod/" # Cluster state at: prod/cluster_state/
region: "us-east-1"
access_key: "..."
secret_key: "..."All nodes must use identical storage backend settings to join the same cluster.
S3 State Structure
s3://{bucket}/{prefix}cluster_state/
├── leader.json # Current leader info
├── cluster_secret.json # Shared auth secret (auto-generated)
├── brokers/
│ ├── node-abc.json # Broker heartbeats
│ ├── node-def.json
│ └── node-ghi.json
└── catalogs/
└── backups/ # DuckLake catalog backupsThe cp-metadata Multi-Raft group keeps its own NuRaft log and snapshots locally per node (peer-to-peer reseed via quack-raft); it does not use an S3 snapshot pointer. The former openraft metadata_raft/ S3 layout no longer exists.
Circuit Breaker
Admin operations are protected by a circuit breaker that blocks changes if:
- Backup system is unhealthy
- S3 is unreachable
- Leader election is in progress
This prevents data loss during infrastructure issues.
Kubernetes
For multi-pod deployments on Kubernetes, use the official Helm chart — it wires up StatefulSet, per-pod Services, Envoy Gateway SNI routing, cert-manager TLS, and cluster-mode mTLS. See Kubernetes Deployment.
Next Steps
- Kubernetes Deployment - Helm chart for multi-pod clusters
- Multi-Raft - Cluster-wide quorum writes, SQL interface, cp-metadata
- Reliability Gates - Sanitizer, deterministic simulation, chaos, and workload gates
- boilstream-admin CLI - Cluster management commands
- Multi-Tenancy - Tenant isolation in clusters
- Architecture - Overall system architecture