Skip to content

Cluster Mode

BoilStream supports horizontal scaling through stable node identity, S3-backed cluster state, and Quack Multi-Raft for strongly consistent, cluster-wide control-plane writes (the cp-metadata catalog).

Production path

For multi-pod deployments, run cluster mode with the cp-metadata (Quack Multi-Raft) catalog enabled. S3 remains the discovery, backup, and disaster-recovery tier for bulk data; Quack Multi-Raft is the live consistency layer for control-plane metadata and any opt-in cluster-wide-write tables. See Multi-Raft for the consistency model and SQL interface.

Architecture

Cluster mode has two coordination layers:

LayerPurposeStorage
S3 cluster stateLeader discovery, broker heartbeats, shared secrets, cold restore artifacts; bulk analytical data via DuckLake hot→cold tieringcluster_state/ in the primary S3 backend
Quack Multi-Raft (cp-metadata)Strongly consistent catalog/user/topic/tenant metadata (and opt-in cluster-wide-write tables) across nodesPer-node DuckDB catalog + NuRaft quorum over mTLS; consensus owned by quack-raft

Bulk analytical data flows through the hot nodes to object storage (DuckLake hot/cold tiering), not through Raft. Quack Multi-Raft protects the control-plane state that must be coherent before nodes serve catalog and authorization decisions. See Multi-Raft.

Leadership And Consistency

Cluster mode has two leadership surfaces:

  1. S3-backed cluster leadership selects the server that owns cluster coordination and admin routing.
  2. The Quack cp-metadata Multi-Raft group elects a Raft leader that orders catalog, topic, tenant, and grant mutations through quorum. Writes from any node are forwarded to the current leader by quack-raft.

In production multi-pod clusters, the cp-metadata Multi-Raft group is the source of truth for metadata convergence. S3 leadership alone is not the consistency contract for catalog visibility.

Configuration

yaml
cluster_mode:
  enabled: true
  node_id_path: "./node_id"              # Persistent ID (over restarts)

  # advertised_host = in-cluster DNS used for pod-to-pod gRPC on
  # internal_api_port. In Kubernetes this is the headless-Service
  # per-pod DNS (e.g. boilstream-0.boilstream-headless.<ns>.svc.cluster.local).
  advertised_host: "node1.internal.example.com"

  # public_host = externally-resolvable hostname. Goes into vended
  # DuckLake credentials and admin redirects. Leave unset for
  # standalone deployments — it falls back to advertised_host.
  public_host: "boilstream-0.app.example.com"

  # public_tcp_port = the per-pod gateway TCP listener that exposes
  # PGWire without TLS-passthrough SNI gymnastics. The Helm chart
  # sets this to 15432 + pod_index automatically. Standalone /
  # bare-VM deployments leave it unset and clients use pgwire.port.
  # public_tcp_port: 15432

  internal_api_port: 8444                # Inter-node communication (PGWire SCRAM RPC, user-context RPC, bootstrap-token RPC)

  # Strongly consistent, cluster-wide metadata via Quack Multi-Raft (cp-metadata).
  # Replaces the removed metadata_raft_* keys. Consensus is owned by quack-raft.
  cp_metadata_enabled: true
  cp_metadata_port: 8445
  # Voter peers in quack peers= form: node_id:role=quack:host:port
  cp_metadata_peers:
    - "n0:voter=quack:boilstream-0.boilstream-headless.default.svc.cluster.local:8445"
    - "n1:voter=quack:boilstream-1.boilstream-headless.default.svc.cluster.local:8445"
    - "n2:voter=quack:boilstream-2.boilstream-headless.default.svc.cluster.local:8445"
  # Shared token for the cp-group quack endpoint (CREATE SECRET + serve token)
  cp_metadata_token: "..."

  leader_heartbeat_interval_secs: 30
  leader_stale_threshold_secs: 120
  broker_heartbeat_interval_secs: 30
  broker_stale_threshold_secs: 120

Node Roles

Leader Node

  • Handles all catalog management operations (create/delete DuckLakes, user management)
  • Coordinates cluster state
  • Performs catalog backups to S3
  • Proposes cp-metadata Multi-Raft mutations for replicated metadata paths
  • Processes admin API requests

Broker Nodes

  • Handle data ingestion (FlightRPC, Kafka, HTTP/2)
  • Serve read queries via PostgreSQL interface
  • Redirect admin requests to leader
  • Apply committed cp-metadata Multi-Raft entries locally and can serve linearizable metadata reads
  • Can be promoted to leader if current leader fails

Leader-RPC fan-out for stateful auth

Some auth paths can't be served from a worker's local DuckDB because the underlying state (vended credentials, in-flight bootstrap tokens) lives only on the leader. Workers fan out to the leader over the internal mTLS API on internal_api_port:

Worker needLeader RPCWhat it returns
Validate a PGWire SCRAM handshakeGetScramHash(username)The vended SCRAM-SHA-256 hash
Resolve a PGWire username → tenant contextGetPgwireUserContext(username)(internal_user_id, user_id, tenant_id)
Issue a bootstrap tokenGenerateBootstrapToken(...)A signed token bound to the cluster secret

This means any pod can terminate a PGWire / FlightSQL / admin connection without rejecting authenticated clients during a leader transition. Failover is transparent on existing TCP sessions.

Broker States

StateDescription
activeNormal operation
drainingFinishing work, rejecting new connections (for rolling updates)
retiringPreparing for shutdown
shutdownNode shutting down

Operations

View Cluster Status

bash
boilstream-admin cluster status

Rolling Updates

bash
# 1. Drain node
boilstream-admin cluster broker update node-123 --state draining

# 2. Wait for connections to finish, then stop/update/restart

# 3. Re-activate
boilstream-admin cluster broker update node-123 --state active

Leader Stepdown

bash
# Graceful stepdown (automatic leader election)
boilstream-admin cluster leader stepdown

# Stepdown with preferred successor
boilstream-admin cluster leader stepdown --preferred node-456

High Availability

  • Automatic failover: If the cluster leader becomes stale, brokers compete for leadership; the cp-metadata Multi-Raft group continues to preserve committed metadata through quorum
  • No single point of failure: Any node can become leader
  • Consistent metadata: Quack Multi-Raft commits control-plane metadata through quorum and applies it on every healthy node
  • Quorum availability: A 3-voter group remains responsive with any 2 healthy voters
  • Durable bulk data: S3 (DuckLake) holds analytical data via hot→cold tiering, independent of Raft
  • Graceful degradation: Ingestion continues even during leader transitions

Cluster Identity

Nodes belong to the same cluster when they share the same primary storage backend configuration. The cluster state is stored in S3 at:

s3://{storage.bucket}/{storage.prefix}cluster_state/

Configure in config.yaml:

yaml
storage:
  primary_backend:
    type: s3
    bucket: "my-cluster-bucket"
    prefix: "prod/"                    # Cluster state at: prod/cluster_state/
    region: "us-east-1"
    access_key: "..."
    secret_key: "..."

All nodes must use identical storage backend settings to join the same cluster.

S3 State Structure

s3://{bucket}/{prefix}cluster_state/
├── leader.json           # Current leader info
├── cluster_secret.json   # Shared auth secret (auto-generated)
├── brokers/
│   ├── node-abc.json     # Broker heartbeats
│   ├── node-def.json
│   └── node-ghi.json
└── catalogs/
    └── backups/          # DuckLake catalog backups

The cp-metadata Multi-Raft group keeps its own NuRaft log and snapshots locally per node (peer-to-peer reseed via quack-raft); it does not use an S3 snapshot pointer. The former openraft metadata_raft/ S3 layout no longer exists.

Circuit Breaker

Admin operations are protected by a circuit breaker that blocks changes if:

  • Backup system is unhealthy
  • S3 is unreachable
  • Leader election is in progress

This prevents data loss during infrastructure issues.

Kubernetes

For multi-pod deployments on Kubernetes, use the official Helm chart — it wires up StatefulSet, per-pod Services, Envoy Gateway SNI routing, cert-manager TLS, and cluster-mode mTLS. See Kubernetes Deployment.

Next Steps