Skip to content

Multi-Raft

Multi-Raft is BoilStream's strongly consistent, cluster-wide write layer. It is provided by the DuckDB Quack fork (NuRaft-based consensus over mTLS HTTP/2) and replaces the earlier metadata_raft (openraft) module, which no longer exists.

It is built for the part of a data platform that must be correct before it is fast: the DuckLake catalog, user/catalog grants, topic registration, tenant membership, and any other table that needs a single agreed-upon write history across every node.

What goes through Multi-Raft, and what does not

Multi-Raft carries cluster-wide consistent (quorum) writes — primarily the control-plane metadata (DuckLake catalog, grants, topics, tenants), and optionally any application table you explicitly opt into a Raft group. It is not the bulk ingestion path: high-throughput streaming data flows through the hot nodes to S3 (DuckLake hot/cold tiering), not through Raft. Use Multi-Raft for correctness-critical state, not for firehose volume.

What It Solves

In a multi-node BoilStream cluster, every pod runs its own DuckDB process. Without a consensus layer, a follower can briefly see stale or locally divergent metadata during failover, rolling upgrades, or fast catalog changes.

Multi-Raft gives the cluster a single ordered write history per Raft group:

  1. A write is proposed to the group's Raft leader (routed automatically from any node).
  2. The leader commits it through quorum.
  3. Every healthy node applies the same committed entry into its local DuckDB catalog.
  4. Reads can use consistency 'linearizable' when they need fresh, leader-confirmed state.

If a node cannot apply a committed entry, it fails loudly rather than serving divergent state.

The Single-Shared-Catalog Model

Multi-Raft uses a single-shared-catalog, full-replication model:

  • Every node holds all data. Every node applies every committed entry from every group into the one shared DuckDB catalog on that node.
  • Groups (write "lanes") are commit parallelism, not data partitioning. Sharding a table across N groups parallelizes consensus, not storage. Reads never need a cross-shard union — all rows are present on every node.
  • Resharding is a metadata epoch flip, not a data copy. ALTER TABLE ... SET GROUPS n re-reconciles ownership onto a new group set with a higher shard epoch. No physical data moves between catalogs.

A direct consequence: throughput scales with consensus parallelism up to a point, then caps at the single node's DuckDB apply speed, because every node applies everything.

Control-Plane Metadata (cp-metadata)

The primary built-in use of Multi-Raft is the control-plane metadata catalog (cp-metadata). When enabled, the node serves and ATTACHes the metadata-control-plane quack cluster group, and all metadata writes route through it. Quack-raft owns consensus and leader forwarding — there is no boilstream-side Raft node, snapshot store, or leader-routing layer (this is the structural change from the old openraft metadata_raft module).

What the control plane keeps coherent across nodes:

StatePurpose
DuckLake catalog rowsWhich catalogs exist and who owns them
Catalog grantsWhich users can access which catalogs
Topic metadataStreaming topic/table registration
Tenant membershipTenant/user relationships needed for authorization

SQL Interface for Multi-Raft Tables

Beyond the control plane, you can put application tables under Multi-Raft when they need cluster-wide consistent writes. The interface is plain DuckDB SQL against a Quack-attached catalog.

Attach (the safe default)

sql
ATTACH 'quack:host:port' AS cp (TYPE quack, TOKEN '...');

For linearizable reads and the mandatory cluster mTLS, attach with the full contract:

sql
ATTACH 'quack:127.0.0.1:5001' AS cp (
  TYPE quack,
  consistency 'linearizable',
  cert_file 'certs/client.crt',
  key_file 'certs/client.key',
  client_ca_file 'certs/ca.crt'
);

With a plain ATTACH, tables live in the tenant/catalog default group and you get ordinary atomic multi-table DuckDB transactions inside that catalog. This is the recommended model for most workloads.

Enable explicit multi-group SQL

Per-table groups and hash sharding are an advanced, conservative-by-default surface. Enable it explicitly:

sql
SELECT quack_multiraft_sql_set_enabled(true);

Pin a table to one named Raft group

Use this for table-level isolation — give a hot or correctness-critical table its own consensus lane:

sql
CREATE TABLE cp.app.orders (
  id BIGINT,
  value VARCHAR
) WITH (quack_group = 'orders-group');

Hash-shard a table across groups

Use this for high-write tables that benefit from parallel Raft groups. Rows are routed by quack_shard_key:

sql
CREATE TABLE cp.app.events (
  id BIGINT,
  tenant_id VARCHAR,
  payload VARCHAR
) WITH (
  quack_shard_key = 'tenant_id',
  quack_initial_groups = 8
);

Reshard later (metadata epoch flip, no data copy)

sql
ALTER TABLE cp.app.events SET GROUPS 16;

Return a table to the default group

sql
ALTER TABLE cp.app.orders SET GROUP DEFAULT;

Inspect placement and contract

sql
SELECT * FROM quack_table_groups('cp');
SELECT * FROM quack_group_status('cp');
SELECT * FROM quack_multiraft_sql_contract_status();

Rule of thumb

Use quack_group for table-level isolation; use quack_shard_key + quack_initial_groups for high-write tables that need parallel Raft groups. Normal clients should not hand-pick raft group_id values — the root catalog/router owns placement.

Performance

In a 3-node local Multi-Raft benchmark with mTLS, on-disk Raft logs, and separate OS processes per node, Quack reached about 100k inserted rows/sec using 4 Raft write lanes and 1024-row batched INSERTs. Single-row INSERTs are much slower — around 1k rows/sec at 4 lanes — so batching is the primary throughput lever. Multi-Raft lanes improve throughput by parallelizing consensus, while DuckDB apply on each node becomes the next bottleneck.

Read these numbers correctly

  • These are benchmark numbers, not a universal guarantee.
  • Host was Apple arm64, 12 cores, local 3-node cluster.
  • Rows/sec depends heavily on row shape, disk, CPU, network, batch size, and durability settings.
  • This is the single-shared-catalog model: every node applies all committed writes, so throughput eventually caps at per-node DuckDB apply speed.
  • Best practice is multi-row INSERTs of 256–1024 rows per statement, not high-concurrency single-row writes.

Note on terminology: the ~100k figure is rows/sec in batched INSERTs, not 100k independent transactions/sec, and not "cluster-wide quorum write throughput" — it is per-group quorum plus shared apply. Throughput does not scale linearly with the number of shards/groups.

A Kubernetes benchmark of batched throughput against the 3-node Hetzner cluster is pending; the numbers above are from a single host running the three nodes as separate OS processes.

Configuration

Enable the control-plane metadata catalog under cluster_mode:

yaml
cluster_mode:
  enabled: true
  # Quack-native control-plane metadata catalog (replaces the old metadata_raft_* keys)
  cp_metadata_enabled: true
  cp_metadata_port: 8445
  # Voter peers in quack peers= form: node_id:role=quack:host:port
  cp_metadata_peers:
    - "n0:voter=quack:boilstream-0.boilstream-headless.default.svc.cluster.local:8445"
    - "n1:voter=quack:boilstream-1.boilstream-headless.default.svc.cluster.local:8445"
    - "n2:voter=quack:boilstream-2.boilstream-headless.default.svc.cluster.local:8445"
  # Shared token for the cp-group quack endpoint (CREATE SECRET + serve token)
  cp_metadata_token: "..."

mTLS is mandatory for cluster traffic. Each node certificate must be signed by the cluster CA and carry DNS:<node_id>, URI:<node_id>, and extendedKeyUsage = serverAuth, clientAuth (every node is both a Raft server and client). CN-only identity is rejected.

In Kubernetes, use stable StatefulSet identities and persistent local storage; the Helm chart wires the peer list, ports, and mTLS for you.

Operations

Check cp-metadata health from any node:

bash
boilstream-admin cp-metadata status

This reports the quack quack_raft_status() rows plus a derived health verdict:

FieldMeaning
enabledThis node has the cp catalog attached (it is a cp voter and the group is reachable)
okInitialized, a leader is known, and linearizable reads are safe
leader_idCurrent cp-group Raft leader
is_leaderWhether the queried node is the leader
linearizable_read_safeWhether a linearizable read can be served safely now

Local multi-node dev cluster tooling:

bash
# Validate a set of per-node configs
boilstream-admin cp-metadata validate-config node0.yaml node1.yaml node2.yaml

# Generate deterministic local OS-process configs for N nodes
boilstream-admin cp-metadata generate-local-configs --nodes 3

Consistency And High Availability

  • Quorum availability: a 3-voter group stays available with any 2 healthy voters.
  • Automatic failover: on leader loss, the group re-elects and committed writes survive through quorum; writes route to the new leader transparently.
  • Linearizable reads: per group, via leader leases with a ReadIndex fallback; on leadership loss the node will not serve stale linearizable reads.
  • Fail-closed apply: a node that cannot apply a committed entry latches degraded and stops serving divergent state.
  • Bulk data is separate: ingestion continues through the hot-node → S3 path independent of Raft leadership transitions.

Limits

Multi-Raft is for cluster-wide consistent writes, not for bulk data volume:

  • Use it for metadata, coordination, and correctness-critical tables.
  • Do not route firehose ingestion through it — that is the hot/cold tiering path to S3.
  • Scale write throughput with multi-row INSERT batching first, then a modest number of lanes (~4); more lanes give diminishing returns because every node applies everything.
  • Keep voter sets small, normally three voters.

See Also