# Cluster Mode

BoilStream supports horizontal scaling through S3-based leader election and distributed catalog management.

> **Beta:** Cluster mode is in beta. The API and behavior may change in future releases.
## Architecture

### Leader Election

Leader election uses S3 conditional writes for distributed consensus:
- Nodes compete to write a leader claim to S3
- The winner becomes leader and manages catalog operations
- Other nodes become brokers and redirect admin requests to the leader
- Heartbeats maintain leadership; stale leaders are replaced
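The claim step above can be sketched as follows. This is an illustrative model, not BoilStream's implementation: `FakeS3` is an in-memory stand-in for S3's conditional `PutObject` (`If-None-Match: "*"`), and the key name mirrors the layout shown later under "S3 State Structure".

```python
import json
import time

class FakeS3:
    """In-memory stand-in for S3 conditional writes (PutObject with If-None-Match: "*")."""
    def __init__(self):
        self.objects = {}

    def put_if_absent(self, key, body):
        # Mirrors S3 conditional-write semantics: the write succeeds only if
        # no object exists at the key; otherwise S3 would return HTTP 412.
        if key in self.objects:
            return False
        self.objects[key] = body
        return True

LEADER_KEY = "cluster_state/leader.json"

def try_claim_leadership(s3, node_id):
    """Every node attempts the same conditional write; exactly one wins."""
    claim = json.dumps({"node_id": node_id, "claimed_at": time.time()})
    return s3.put_if_absent(LEADER_KEY, claim)

s3 = FakeS3()
# Three nodes race; only the first conditional write succeeds.
winners = [n for n in ("node-a", "node-b", "node-c") if try_claim_leadership(s3, n)]
```

Losing nodes see the failed conditional write, read the existing `leader.json`, and fall back to the broker role.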
## Configuration

```yaml
cluster_mode:
  enabled: true
  node_id_path: "./node_id"  # Persistent node ID (survives restarts)

  # advertised_host = in-cluster DNS used for pod-to-pod gRPC on
  # internal_api_port. In Kubernetes this is the headless-Service
  # per-pod DNS (e.g. boilstream-0.boilstream-headless.<ns>.svc.cluster.local).
  advertised_host: "node1.internal.example.com"

  # public_host = externally-resolvable hostname. Goes into vended
  # DuckLake credentials and admin redirects. Leave unset for
  # standalone deployments — it falls back to advertised_host.
  public_host: "boilstream-0.app.example.com"

  # public_tcp_port = the per-pod gateway TCP listener that exposes
  # PGWire without TLS-passthrough SNI gymnastics. The Helm chart
  # sets this to 15432 + pod_index automatically. Standalone /
  # bare-VM deployments leave it unset and clients use pgwire.port.
  # public_tcp_port: 15432

  internal_api_port: 8444  # Inter-node communication (PGWire SCRAM RPC, user-context RPC, bootstrap-token RPC)

  leader_heartbeat_interval_secs: 30
  leader_stale_threshold_secs: 120
  broker_heartbeat_interval_secs: 30
  broker_stale_threshold_secs: 120
```

## Node Roles
### Leader Node
- Handles all catalog management operations (create/delete DuckLakes, user management)
- Coordinates cluster state
- Performs catalog backups to S3
- Processes admin API requests
### Broker Nodes
- Handle data ingestion (FlightRPC, Kafka, HTTP/2)
- Serve read queries via PostgreSQL interface
- Redirect admin requests to leader
- Can be promoted to leader if current leader fails
## Leader-RPC Fan-Out for Stateful Auth
Some auth paths can't be served from a worker's local DuckDB because the underlying state (vended credentials, in-flight bootstrap tokens) lives only on the leader. Workers fan out to the leader over the internal mTLS API on `internal_api_port`:
| Worker need | Leader RPC | What it returns |
|---|---|---|
| Validate a PGWire SCRAM handshake | `GetScramHash(username)` | The vended SCRAM-SHA-256 hash |
| Resolve a PGWire username → tenant context | `GetPgwireUserContext(username)` | `(internal_user_id, user_id, tenant_id)` |
| Issue a bootstrap token | `GenerateBootstrapToken(...)` | A signed token bound to the cluster secret |
This means any pod can terminate a PGWire / FlightSQL / admin connection without rejecting authenticated clients during a leader transition. Failover is transparent on existing TCP sessions.
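The worker-side fan-out can be sketched as below. This is a simplified model under stated assumptions: `LeaderClient` stands in for the real mTLS gRPC channel on `internal_api_port`, and the method names simply echo the RPC names in the table; the actual client API is not documented here.

```python
class LeaderClient:
    """Stand-in for the worker's mTLS gRPC channel to the leader."""
    def __init__(self, scram_hashes, user_contexts):
        self._scram = scram_hashes
        self._ctx = user_contexts

    def get_scram_hash(self, username):           # GetScramHash RPC
        return self._scram.get(username)

    def get_pgwire_user_context(self, username):  # GetPgwireUserContext RPC
        return self._ctx.get(username)

class Worker:
    def __init__(self, leader):
        self.leader = leader

    def authenticate_pgwire(self, username):
        """Validate a PGWire handshake with no local vended-credential state."""
        if self.leader.get_scram_hash(username) is None:
            return None  # unknown user: reject the handshake
        # On success, resolve the username to its tenant context.
        return self.leader.get_pgwire_user_context(username)

leader = LeaderClient(
    scram_hashes={"alice": "SCRAM-SHA-256$...$stored-key"},
    user_contexts={"alice": ("iu-1", "u-1", "tenant-a")},
)
ctx = Worker(leader).authenticate_pgwire("alice")
```

A real worker would also retry against the new leader address after a leadership change, which is what keeps existing TCP sessions transparent through failover.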
## Broker States

| State | Description |
|---|---|
| `active` | Normal operation |
| `draining` | Finishing work, rejecting new connections (for rolling updates) |
| `retiring` | Preparing for shutdown |
| `shutdown` | Node shutting down |
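One way to reason about these states is as a small transition graph. The legal transitions below are an assumption inferred from the table and the rolling-update flow (draining brokers can be re-activated; shutdown is terminal); the actual rules enforced by BoilStream may differ.

```python
# ASSUMED transition graph, inferred from the state descriptions above.
TRANSITIONS = {
    "active":   {"draining", "retiring", "shutdown"},
    "draining": {"active", "retiring", "shutdown"},  # re-activation after a rolling update
    "retiring": {"shutdown"},
    "shutdown": set(),                               # terminal state
}

def can_transition(current, target):
    """Check whether a broker state change is allowed under the assumed graph."""
    return target in TRANSITIONS.get(current, set())
```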
## Operations

### View Cluster Status

```shell
boilstream-admin cluster status
```

### Rolling Updates
```shell
# 1. Drain node
boilstream-admin cluster broker update node-123 --state draining

# 2. Wait for connections to finish, then stop/update/restart

# 3. Re-activate
boilstream-admin cluster broker update node-123 --state active
```

### Leader Stepdown
```shell
# Graceful stepdown (automatic leader election)
boilstream-admin cluster leader stepdown

# Stepdown with preferred successor
boilstream-admin cluster leader stepdown --preferred node-456
```

## High Availability
- Automatic failover: If leader becomes stale, brokers compete for leadership
- No single point of failure: Any node can become leader
- Consistent state: S3 provides durable cluster state storage
- Graceful degradation: Ingestion continues even during leader transitions
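The failover trigger can be sketched from the heartbeat settings in the Configuration section. A minimal check, assuming staleness is simply elapsed time since the last heartbeat recorded in `leader.json` (the exact comparison BoilStream uses is not specified here):

```python
# Values from the Configuration section.
LEADER_HEARTBEAT_INTERVAL_SECS = 30
LEADER_STALE_THRESHOLD_SECS = 120

def leader_is_stale(last_heartbeat_epoch, now_epoch,
                    threshold=LEADER_STALE_THRESHOLD_SECS):
    """A leader whose heartbeat is older than the threshold is replaceable."""
    return (now_epoch - last_heartbeat_epoch) > threshold
```

With a 30-second heartbeat and a 120-second threshold, a leader can miss several consecutive heartbeats (transient S3 or network hiccups) before brokers start competing to replace it.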
## Cluster Identity
Nodes belong to the same cluster when they share the same primary storage backend configuration. The cluster state is stored in S3 at:

```
s3://{storage.bucket}/{storage.prefix}cluster_state/
```

Configure in `config.yaml`:

```yaml
storage:
  primary_backend:
    type: s3
    bucket: "my-cluster-bucket"
    prefix: "prod/"  # Cluster state at: prod/cluster_state/
    region: "us-east-1"
    access_key: "..."
    secret_key: "..."
```

All nodes must use identical storage backend settings to join the same cluster.
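The path derivation is simple string concatenation, as this sketch shows for a config shaped like the YAML above (the dict layout mirrors the config file; it is illustrative, not BoilStream's internal representation):

```python
def cluster_state_uri(storage):
    """Build the cluster-state location from the storage backend config."""
    backend = storage["primary_backend"]
    # prefix is concatenated as-is, so it should end with "/" when non-empty.
    return f"s3://{backend['bucket']}/{backend.get('prefix', '')}cluster_state/"

storage = {"primary_backend": {"type": "s3",
                               "bucket": "my-cluster-bucket",
                               "prefix": "prod/"}}
uri = cluster_state_uri(storage)
# → "s3://my-cluster-bucket/prod/cluster_state/"
```

Because the URI is derived purely from `bucket` and `prefix`, two nodes with any difference in those settings resolve to different cluster-state locations and therefore form separate clusters.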
## S3 State Structure

```
s3://{bucket}/{prefix}cluster_state/
├── leader.json           # Current leader info
├── cluster_secret.json   # Shared auth secret (auto-generated)
├── brokers/
│   ├── node-abc.json     # Broker heartbeats
│   ├── node-def.json
│   └── node-ghi.json
└── catalogs/
    └── backups/          # DuckLake catalog backups
```

## Circuit Breaker
Admin operations are protected by a circuit breaker that blocks changes if:
- Backup system is unhealthy
- S3 is unreachable
- Leader election is in progress
This prevents data loss during infrastructure issues.
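The gate described above reduces to a conjunction of health checks, as in this minimal sketch (the function and parameter names are illustrative, not BoilStream's API):

```python
def admin_ops_allowed(backup_healthy, s3_reachable, election_in_progress):
    """Allow catalog/user mutations only when the cluster can safely persist them.

    Mirrors the three blocking conditions listed above: an unhealthy backup
    system, unreachable S3, or an in-progress leader election each trips
    the breaker.
    """
    return backup_healthy and s3_reachable and not election_in_progress
```

When the breaker is open, admin requests fail fast rather than risk writing catalog changes that could not be backed up or durably stored.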
## Kubernetes
For multi-pod deployments on Kubernetes, use the official Helm chart — it wires up StatefulSet, per-pod Services, Envoy Gateway SNI routing, cert-manager TLS, and cluster-mode mTLS. See Kubernetes Deployment.
## Next Steps
- Kubernetes Deployment - Helm chart for multi-pod clusters
- boilstream-admin CLI - Cluster management commands
- Multi-Tenancy - Tenant isolation in clusters
- Architecture - Overall system architecture