Multi-Raft
Multi-Raft is BoilStream's strongly consistent, cluster-wide write layer. It is provided by the DuckDB Quack fork (NuRaft-based consensus over mTLS HTTP/2) and replaces the earlier metadata_raft (openraft) module, which no longer exists.
It is built for the part of a data platform that must be correct before it is fast: the DuckLake catalog, user/catalog grants, topic registration, tenant membership, and any other table that needs a single agreed-upon write history across every node.
What goes through Multi-Raft, and what does not
Multi-Raft carries cluster-wide consistent (quorum) writes: primarily the control-plane metadata (DuckLake catalog, grants, topics, tenants) and, optionally, any application table you explicitly place in a Raft group. It is not the bulk ingestion path: high-throughput streaming data flows through the hot nodes to S3 (DuckLake hot/cold tiering), not through Raft. Use Multi-Raft for correctness-critical state, not for firehose volume.
What It Solves
In a multi-node BoilStream cluster, every pod runs its own DuckDB process. Without a consensus layer, a follower can briefly see stale or locally divergent metadata during failover, rolling upgrades, or fast catalog changes.
Multi-Raft gives the cluster a single ordered write history per Raft group:
- A write is proposed to the group's Raft leader (routed automatically from any node).
- The leader commits it through quorum.
- Every healthy node applies the same committed entry into its local DuckDB catalog.
- Reads can use consistency 'linearizable' when they need fresh, leader-confirmed state.
If a node cannot apply a committed entry, it fails loudly rather than serving divergent state.
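The write path above can be modeled in a few lines. This is a toy sketch, not BoilStream or Quack code: the `Node` and `Group` classes, the `apply_broken` test hook, and the placeholder entry strings are all hypothetical, and real Raft replication, elections, and leader routing are reduced to a single in-memory log.

```python
from dataclasses import dataclass, field


class DegradedError(RuntimeError):
    """Raised when a node refuses to serve because it could not apply an entry."""


@dataclass
class Node:
    name: str
    reachable: bool = True
    apply_broken: bool = False      # hypothetical hook: simulate a local apply failure
    degraded: bool = False
    catalog: list = field(default_factory=list)  # stand-in for the local DuckDB catalog

    def catch_up(self, log: list) -> None:
        """Apply every committed entry this node has not applied yet, in log order."""
        for entry in log[len(self.catalog):]:
            if self.apply_broken:
                self.degraded = True  # latch: never serve divergent state
                return
            self.catalog.append(entry)

    def read(self) -> list:
        if self.degraded:
            raise DegradedError(f"{self.name} is degraded; refusing to serve")
        return list(self.catalog)


@dataclass
class Group:
    nodes: list
    log: list = field(default_factory=list)  # the single ordered write history

    def propose(self, entry: str) -> int:
        voters = [n for n in self.nodes if n.reachable and not n.degraded]
        if len(voters) < len(self.nodes) // 2 + 1:
            raise RuntimeError("no quorum; write rejected")
        self.log.append(entry)  # treated as committed once a majority is available
        for node in voters:
            node.catch_up(self.log)
        return len(self.log) - 1


nodes = [Node("n0"), Node("n1"), Node("n2")]
group = Group(nodes)
group.propose("entry-1")
nodes[2].reachable = False
group.propose("entry-2")   # still commits with 2 of 3 voters
nodes[2].reachable = True
group.propose("entry-3")   # n2 first applies the entry it missed, then the new one
```

After these three proposals every node holds the same three entries in the same order. Setting `apply_broken` on a node and proposing again latches that node as degraded, and its `read()` then raises instead of returning a catalog that lags the log.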
The Single-Shared-Catalog Model
Multi-Raft uses a single-shared-catalog, full-replication model:
- Every node holds all data. Every node applies every committed entry from every group into the one shared DuckDB catalog on that node.
- Groups (write "lanes") are commit parallelism, not data partitioning. Sharding a table across N groups parallelizes consensus, not storage. Reads never need a cross-shard union — all rows are present on every node.
- Resharding is a metadata epoch flip, not a data copy. ALTER TABLE ... SET GROUPS n re-reconciles ownership onto a new group set with a higher shard epoch. No physical data moves between catalogs.
A direct consequence: throughput scales with consensus parallelism up to a point, then caps at the single node's DuckDB apply speed, because every node applies everything.
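One way to picture this cap is a two-stage model: consensus runs in parallel across lanes, then every node applies the combined stream. The sketch below assumes consensus throughput grows linearly with lane count, which is a simplification, and the rates in the usage lines are arbitrary units chosen for illustration, not measurements.

```python
def modeled_rows_per_sec(lanes: int, per_lane_commit_rate: float,
                         node_apply_rate: float) -> float:
    """Rows/sec under a simple model: parallel consensus feeding one shared apply stage.

    Assumption: each lane commits at the same rate and lanes do not interfere.
    """
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    consensus_rate = lanes * per_lane_commit_rate
    # Every node applies every committed entry, so apply speed is a hard ceiling.
    return min(consensus_rate, node_apply_rate)


# Illustrative units only: 10 per lane, apply ceiling of 30.
one_lane = modeled_rows_per_sec(1, 10, 30)
three_lanes = modeled_rows_per_sec(3, 10, 30)
eight_lanes = modeled_rows_per_sec(8, 10, 30)
```

In this model, going from one lane to three triples throughput, while going from three to eight changes nothing because the apply stage is already saturated.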
Control-Plane Metadata (cp-metadata)
The primary built-in use of Multi-Raft is the control-plane metadata catalog (cp-metadata). When enabled, the node serves and ATTACHes the metadata-control-plane quack cluster group, and all metadata writes route through it. Quack-raft owns consensus and leader forwarding — there is no boilstream-side Raft node, snapshot store, or leader-routing layer (this is the structural change from the old openraft metadata_raft module).
What the control plane keeps coherent across nodes:
| State | Purpose |
|---|---|
| DuckLake catalog rows | Which catalogs exist and who owns them |
| Catalog grants | Which users can access which catalogs |
| Topic metadata | Streaming topic/table registration |
| Tenant membership | Tenant/user relationships needed for authorization |
SQL Interface for Multi-Raft Tables
Beyond the control plane, you can put application tables under Multi-Raft when they need cluster-wide consistent writes. The interface is plain DuckDB SQL against a Quack-attached catalog.
Attach (the safe default)
ATTACH 'quack:host:port' AS cp (TYPE quack, TOKEN '...');
For linearizable reads and the mandatory cluster mTLS, attach with the full contract:
ATTACH 'quack:127.0.0.1:5001' AS cp (
  TYPE quack,
  consistency 'linearizable',
  cert_file 'certs/client.crt',
  key_file 'certs/client.key',
  client_ca_file 'certs/ca.crt'
);
With a plain ATTACH, tables live in the tenant/catalog default group and you get ordinary atomic multi-table DuckDB transactions inside that catalog. This is the recommended model for most workloads.
Enable explicit multi-group SQL
Per-table groups and hash sharding are an advanced, conservative-by-default surface. Enable it explicitly:
SELECT quack_multiraft_sql_set_enabled(true);
Pin a table to one named Raft group
Use this for table-level isolation — give a hot or correctness-critical table its own consensus lane:
CREATE TABLE cp.app.orders (
  id BIGINT,
  value VARCHAR
) WITH (quack_group = 'orders-group');
Hash-shard a table across groups
Use this for high-write tables that benefit from parallel Raft groups. Rows are routed by quack_shard_key:
CREATE TABLE cp.app.events (
  id BIGINT,
  tenant_id VARCHAR,
  payload VARCHAR
) WITH (
  quack_shard_key = 'tenant_id',
  quack_initial_groups = 8
);
Reshard later (metadata epoch flip, no data copy)
ALTER TABLE cp.app.events SET GROUPS 16;
Return a table to the default group
ALTER TABLE cp.app.orders SET GROUP DEFAULT;
Inspect placement and contract
SELECT * FROM quack_table_groups('cp');
SELECT * FROM quack_group_status('cp');
SELECT * FROM quack_multiraft_sql_contract_status();
Rule of thumb
Use quack_group for table-level isolation; use quack_shard_key + quack_initial_groups for high-write tables that need parallel Raft groups. Normal clients should not hand-pick raft group_id values — the root catalog/router owns placement.
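The hash-sharding and resharding behavior described in this section can be illustrated with a small model. It is a sketch under stated assumptions: the `ShardMap` class is hypothetical, and SHA-256 modulo the group count is a stand-in, since the hash function Quack actually uses for quack_shard_key is not specified here. The point is only that the shard map chooses a commit lane, while the rows themselves stay in one shared table.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardMap:
    """Hypothetical routing metadata: which lane commits a row, not where it is stored."""
    epoch: int
    groups: int

    def lane_for(self, shard_key: str) -> int:
        # Assumption: a stable hash of the shard key, reduced modulo the group count.
        digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.groups

    def reshard(self, new_groups: int) -> "ShardMap":
        # Resharding only produces new metadata with a higher epoch.
        return ShardMap(epoch=self.epoch + 1, groups=new_groups)


# Full replication: every node stores every row, regardless of lane.
table = [{"id": i, "tenant_id": f"tenant-{i % 5}"} for i in range(20)]
rows_before = list(table)

v1 = ShardMap(epoch=1, groups=8)
lanes_v1 = {row["tenant_id"]: v1.lane_for(row["tenant_id"]) for row in table}

v2 = v1.reshard(16)   # models ALTER TABLE ... SET GROUPS 16
lanes_v2 = {row["tenant_id"]: v2.lane_for(row["tenant_id"]) for row in table}
```

All rows with the same tenant_id route to the same lane within an epoch, and after `reshard` the table contents are untouched; only the lane assignment and the epoch number change.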
Performance
In a 3-node local Multi-Raft benchmark with mTLS, on-disk Raft logs, and separate OS processes per node, Quack reached about 100k inserted rows/sec using 4 Raft write lanes and 1024-row batched INSERTs. Single-row INSERTs are much slower — around 1k rows/sec at 4 lanes — so batching is the primary throughput lever. Multi-Raft lanes improve throughput by parallelizing consensus, while DuckDB apply on each node becomes the next bottleneck.
Read these numbers correctly
- These are benchmark numbers, not a universal guarantee.
- Host was Apple arm64, 12 cores, local 3-node cluster.
- Rows/sec depends heavily on row shape, disk, CPU, network, batch size, and durability settings.
- This is the single-shared-catalog model: every node applies all committed writes, so throughput eventually caps at per-node DuckDB apply speed.
- Best practice is multi-row INSERTs of 256–1024 rows per statement, not high-concurrency single-row writes.
Note on terminology: the ~100k figure is rows/sec in batched INSERTs, not 100k independent transactions/sec, and not "cluster-wide quorum write throughput" — it is per-group quorum plus shared apply. Throughput does not scale linearly with the number of shards/groups.
A Kubernetes benchmark of batched throughput against the 3-node Hetzner cluster is pending; the numbers above are from a single host running the three nodes as separate OS processes.
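Since batching is the main throughput lever, a client-side helper that groups rows into multi-row INSERT statements may be useful. The sketch below only builds statement text and parameter lists; it does not execute anything. The helper names are hypothetical, the `?` placeholder style is an assumption about the client driver in use, and the table and column names are taken from the cp.app.events example earlier on this page.

```python
from typing import Iterator, List, Sequence, Tuple


def batched(rows: Sequence[tuple], size: int) -> Iterator[Sequence[tuple]]:
    """Yield consecutive slices of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def multi_row_insert(table: str, columns: Sequence[str],
                     batch: Sequence[tuple]) -> Tuple[str, List[object]]:
    """Build one parameterized INSERT covering every row in the batch."""
    row_placeholder = "(" + ", ".join("?" for _ in columns) + ")"
    values = ", ".join(row_placeholder for _ in batch)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values}"
    params = [value for row in batch for value in row]  # flattened in row order
    return sql, params


rows = [(i, f"tenant-{i % 5}", "payload") for i in range(2500)]
statements = [
    multi_row_insert("cp.app.events", ["id", "tenant_id", "payload"], batch)
    for batch in batched(rows, 1024)
]
```

With 2,500 rows and a batch size of 1024, this produces three statements (1024, 1024, and 452 rows) instead of 2,500 single-row writes.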
Configuration
Enable the control-plane metadata catalog under cluster_mode:
cluster_mode:
  enabled: true
  # Quack-native control-plane metadata catalog (replaces the old metadata_raft_* keys)
  cp_metadata_enabled: true
  cp_metadata_port: 8445
  # Voter peers in quack peers= form: node_id:role=quack:host:port
  cp_metadata_peers:
    - "n0:voter=quack:boilstream-0.boilstream-headless.default.svc.cluster.local:8445"
    - "n1:voter=quack:boilstream-1.boilstream-headless.default.svc.cluster.local:8445"
    - "n2:voter=quack:boilstream-2.boilstream-headless.default.svc.cluster.local:8445"
  # Shared token for the cp-group quack endpoint (CREATE SECRET + serve token)
  cp_metadata_token: "..."
mTLS is mandatory for cluster traffic. Each node certificate must be signed by the cluster CA and carry DNS:<node_id>, URI:<node_id>, and extendedKeyUsage = serverAuth, clientAuth (every node is both a Raft server and client). CN-only identity is rejected.
In Kubernetes, use stable StatefulSet identities and persistent local storage; the Helm chart wires the peer list, ports, and mTLS for you.
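When peer lists are written by hand rather than generated by the Helm chart, a quick pre-flight check of the node_id:role=quack:host:port form can catch typos early. The parser below is a hypothetical helper, not part of boilstream-admin. It assumes hostnames or IPv4 addresses (bracketed IPv6 is not handled) and does not validate role names, since only voter appears in the configuration above.

```python
from typing import List, NamedTuple


class Peer(NamedTuple):
    node_id: str
    role: str
    host: str
    port: int


def parse_peer(spec: str) -> Peer:
    """Parse one peer entry of the form node_id:role=quack:host:port."""
    identity, eq, endpoint = spec.partition("=")
    node_id, colon, role = identity.partition(":")
    if not (eq and colon and node_id and role):
        raise ValueError(f"expected node_id:role=quack:host:port, got {spec!r}")
    scheme, _, hostport = endpoint.partition(":")
    host, _, port = hostport.rpartition(":")
    if scheme != "quack" or not host or not port.isdigit():
        raise ValueError(f"bad endpoint in {spec!r}")
    return Peer(node_id, role, host, int(port))


def parse_peers(specs: List[str]) -> List[Peer]:
    """Parse a peer list and reject duplicate node ids."""
    peers = [parse_peer(s) for s in specs]
    ids = [p.node_id for p in peers]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate node_id in peer list")
    return peers


peers = parse_peers([
    "n0:voter=quack:boilstream-0.boilstream-headless.default.svc.cluster.local:8445",
    "n1:voter=quack:boilstream-1.boilstream-headless.default.svc.cluster.local:8445",
    "n2:voter=quack:boilstream-2.boilstream-headless.default.svc.cluster.local:8445",
])
```

For the three entries from the configuration above, this yields node ids n0, n1, and n2, each with role voter and port 8445.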
Operations
Check cp-metadata health from any node:
boilstream-admin cp-metadata status
This reports the rows from Quack's quack_raft_status() plus a derived health verdict:
| Field | Meaning |
|---|---|
| enabled | This node has the cp catalog attached (it is a cp voter and the group is reachable) |
| ok | Initialized, a leader is known, and linearizable reads are safe |
| leader_id | Current cp-group Raft leader |
| is_leader | Whether the queried node is the leader |
| linearizable_read_safe | Whether a linearizable read can be served safely now |
Local multi-node dev cluster tooling:
# Validate a set of per-node configs
boilstream-admin cp-metadata validate-config node0.yaml node1.yaml node2.yaml
# Generate deterministic local OS-process configs for N nodes
boilstream-admin cp-metadata generate-local-configs --nodes 3
Consistency And High Availability
- Quorum availability: a 3-voter group stays available with any 2 healthy voters.
- Automatic failover: on leader loss, the group re-elects and committed writes survive through quorum; writes route to the new leader transparently.
- Linearizable reads: per group, via leader leases with a ReadIndex fallback; on leadership loss the node will not serve stale linearizable reads.
- Fail-closed apply: a node that cannot apply a committed entry latches degraded and stops serving divergent state.
- Bulk data is separate: ingestion continues through the hot-node → S3 path independent of Raft leadership transitions.
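The quorum arithmetic behind the first bullet is short enough to write out. This sketch assumes a standard majority quorum, where a group needs more than half of its voters to commit; the function names are illustrative.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    if voters < 1:
        raise ValueError("a group needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority remains."""
    return voters - quorum_size(voters)


# Voter count -> (quorum size, failures tolerated)
summary = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

Three voters need two for quorum and tolerate one failure. Four voters need three and still tolerate only one, which is one reason odd-sized voter sets are the usual choice.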
Limits
Multi-Raft is for cluster-wide consistent writes, not for bulk data volume:
- Use it for metadata, coordination, and correctness-critical tables.
- Do not route firehose ingestion through it — that is the hot/cold tiering path to S3.
- Scale write throughput with multi-row INSERT batching first, then a modest number of lanes (~4); more lanes give diminishing returns because every node applies everything.
- Keep voter sets small, normally three voters.