Reliability Gates

BoilStream's clustered metadata path is hardened through layered tests that exercise correctness, durability, memory safety, thread safety, and Kubernetes-operational behavior.

The gates are split by boundary:

| Boundary | Gate family | What it proves |
| --- | --- | --- |
| Memory safety | ASAN, UBSAN, LSan, macOS leak checks | Use-after-free, undefined behavior, and leak regressions fail the gate instead of becoming latent cluster faults |
| Thread safety | TSAN builds | Concurrent Raft, transport, delayed-task, and snapshot paths do not introduce data races under sanitizer execution |
| Determinism | Seeded deterministic simulation | The same virtual-time network, fault, tenant, table, and writer schedule produces the same committed trace and canonical state hash |
| Formal model checking | TLA+/TLC models (P1/P2/P3) plus falsifying mutants | Modeled invariants (safety, log ordering, group-commit durability) hold, and deliberately broken mutants are caught |
| Production boundary | C++/DuckDB/Quack boundary probes (probe sweep) | The simulator crosses into real typed mutation and coalescing paths instead of only testing a detached model |
| OS-process behavior | Multi-process mTLS cluster smoke tests | Separate server processes converge, elect leaders, reseed nodes, and keep durable state coherent |
| Chaos | Leader/follower kill, restart, coordinator-loss, and re-election nemeses | A healthy quorum stays responsive and committed state remains convergent through expected pod/node failures |
| Durability | Restart, catchup, NuRaft snapshot reseed, and corrupt-log gates | Nodes recover from local durable state, reseed from peers, or fail loudly instead of silently diverging |
| Workload compatibility | DuckLake metadata workload, ATTACH storm, typed mutation oracle, and standalone-vs-cluster differential tests | Clustered behavior matches the standalone DuckDB/DuckLake contract for the supported metadata workload |
| Snapshot reseed | NuRaft logical snapshot-transfer gates | Killed nodes reseed through real peer-to-peer NuRaft snapshot chunks; wrong-group or backwards installs are rejected |
| Kubernetes operations | Helm render tests, PDB/readiness/PVC chart assertions, rolling-drain behavior | The chart exposes the right services and protects quorum during routine Kubernetes operations |
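
The determinism row can be illustrated with a small sketch. This is not BoilStream's simulator: run_simulation, replay_is_deterministic, the fault rate, and the tenant/table/writer counts are all hypothetical names and values chosen for illustration. It only shows the general technique of routing all nondeterminism through one seed and comparing the committed trace and a canonical state hash across two runs.

```python
import hashlib
import json
import random


def run_simulation(seed, steps=200):
    """Toy virtual-time run (hypothetical): returns (committed_trace, state_hash)."""
    rng = random.Random(seed)  # every random choice below flows from this one seed
    state = {}
    trace = []
    for tick in range(steps):  # the loop counter stands in for virtual time
        tenant = rng.randrange(3)
        table = rng.randrange(4)
        writer = rng.randrange(2)
        if rng.random() < 0.2:  # assumed fault rate: this write is dropped
            continue
        key = f"tenant{tenant}/table{table}"
        state[key] = state.get(key, 0) + 1
        trace.append((tick, key, writer, state[key]))
    # Canonical form: sorted keys and fixed separators, so equal states hash equally.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return trace, hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_is_deterministic(seed):
    """Replay check: two runs with the same seed must match in trace and hash."""
    return run_simulation(seed) == run_simulation(seed)
```

A gate built this way would call replay_is_deterministic over a range of seeds and fail on the first mismatch, which makes any failing schedule reproducible from its seed alone.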

Release Gate Groups

The Quack Multi-Raft (clustered-table and cp-metadata) gates are grouped into repeatable release checks. The exact internal harness names are not part of the public API, but every release candidate must pass the relevant gate groups below.

| Gate group | Purpose |
| --- | --- |
| Quack clustered-table release gate | Runs the top-level clustered-table correctness, durability, deterministic simulation, and product-path checks |
| Linux sanitizer soak | Builds with Ninja/ccache and runs ASAN, TSAN, UBSAN, LSan, deterministic simulation, chaos, DuckLake workload stress, and reseed cycles |
| Sanitizer evidence gate | Produces machine-readable ASAN/TSAN/UBSAN/LSan evidence for release review |
| Memory and product-path gate | Exercises focused memory, reseed, DuckLake, and clustered product-path scenarios |
| Deterministic simulation + TLC gate | Runs seeded virtual-time network/fault schedules with production-boundary probes, plus TLA+/TLC model checking with falsifying mutants |
| Multi-Raft probe sweep | Runs the full probe sweep (SQL contract, transaction, routing, runtime, snapshot, and compatibility probes) across macOS and Linux |
| OS-process cluster gate | Runs separate server processes through quorum, chaos, durable restart, catchup, and corruption scenarios |
| HTTP runtime cluster gate | Exercises a real 3-process mTLS HTTP/2 NuRaft cluster: leader election, coordinator/participant-leader loss, re-election forwarding, and snapshot-transfer reseed |
| SQL contract + ABI fence gate | Verifies the conservative SQL surface, the M7-8 capability fence, and fail-closed behavior on mixed/legacy ABI clusters |
| Helm render gate | Asserts Kubernetes services, ports, generated cp-metadata peers, Quack exposure, PDB, readiness, and PVC wiring |

Property And Fuzz Status

BoilStream currently relies on model checking, deterministic simulation, and chaos testing:

  • Quack Multi-Raft is formally model-checked with TLA+/TLC (P1/P2/P3 models) plus falsifying mutants that must fail.
  • The Quack fork uses a deterministic simulator with tenant/table/writer fanout, replay determinism checks, canonical hashes, and production-boundary probes.
  • A multi-process mTLS HTTP/2 cluster harness runs chaos slices (leader/follower kill, coordinator loss, re-election forwarding) and NuRaft logical snapshot-transfer reseed.

BoilStream does not currently claim libFuzzer or AFL as first-class public gates. Those are natural future additions; the current high-value correctness gates are TLA+/TLC model checking, deterministic simulation, OS-process chaos, and production-boundary probes.
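
The idea behind falsifying mutants can be sketched outside TLA+. The Python analogue below is not the P1/P2/P3 models, and every name in it (log_is_ordered, append_correct, append_mutant, build_log, mutant_is_caught) is hypothetical. It assumes a simplified log-ordering invariant: indexes are contiguous from 1 and terms never decrease. The check passes only if the correct implementation satisfies the invariant and the deliberately broken one violates it.

```python
def log_is_ordered(log):
    """Invariant: indexes run 1, 2, 3, ... and terms never decrease."""
    prev_term = 0
    for expected_index, (index, term) in enumerate(log, start=1):
        if index != expected_index or term < prev_term:
            return False
        prev_term = term
    return True


def append_correct(log, term):
    """Reference behavior: the new entry takes the next free index."""
    return log + [(len(log) + 1, term)]


def append_mutant(log, term):
    """Deliberately broken: reuses the last index once the log is non-empty."""
    return log + [(max(len(log), 1), term)]


def build_log(append, terms):
    """Apply one append implementation to a sequence of terms."""
    log = []
    for term in terms:
        log = append(log, term)
    return log


def mutant_is_caught(terms):
    """True only if the invariant holds for the reference and fails for the mutant."""
    reference_ok = log_is_ordered(build_log(append_correct, terms))
    mutant_ok = log_is_ordered(build_log(append_mutant, terms))
    return reference_ok and not mutant_ok
```

A mutant that slips through would indicate the invariant is too weak to detect that class of bug, which is the reason for requiring that mutants fail.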

Kubernetes Readiness Contract

In Kubernetes, the chart protects clustered correctness through:

  • Stable StatefulSet ordinals and headless-Service DNS for node identity.
  • cp-metadata (Quack Multi-Raft) enabled by default for multi-pod metadata convergence.
  • PVC-backed /data for local DuckDB metadata, catalog files, Raft log/snapshot state, and hot-tier files.
  • PodDisruptionBudget defaulting to maxUnavailable: 1 for 3-replica quorum protection.
  • PostgreSQL-protocol readiness/liveness probes so a pod that accepts TCP but cannot complete the protocol handshake is removed from Gateway backends.
  • preStop delay plus server shutdown handling for drain, S3 flush, and leadership handoff.
  • S3 landing for bulk analytical data via DuckLake hot→cold tiering, which keeps that data outside the Raft write path.
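
The arithmetic behind the PodDisruptionBudget default can be written out as a short sketch. It assumes a standard majority quorum; quorum_size and survives_disruption are hypothetical helper names, not part of the chart.

```python
def quorum_size(voters):
    """Majority quorum: more than half of the voters."""
    return voters // 2 + 1


def survives_disruption(voters, max_unavailable):
    """True if a quorum remains after max_unavailable pods are evicted."""
    return voters - max_unavailable >= quorum_size(voters)
```

With 3 voters the quorum is 2, so survives_disruption(3, 1) holds and survives_disruption(3, 2) does not, which is why a budget of one unavailable pod fits a 3-replica deployment.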

Operational Reading

Healthy clusters should preserve these invariants:

  • With 3 voters, any 2 healthy voters keep the cp-metadata group available.
  • A node that cannot apply committed metadata marks itself degraded and should fail readiness.
  • An empty or killed node reseeds from peers via NuRaft logical snapshot transfer before tailing the live log.
  • Wrong-group, backwards, corrupt, missing, or incompatible snapshot/log artifacts fail closed.
  • Bulk-data S3 availability affects DuckLake reads/writes, not Raft quorum.
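
The fail-closed rule for snapshot artifacts can be sketched as a validation step that raises rather than installing anything doubtful. This is an illustration only: SnapshotMeta, SnapshotRejected, check_install, and the field names are hypothetical, and treating "backwards" as a strictly lower last index is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotMeta:
    """Hypothetical snapshot descriptor used only for this sketch."""
    group_id: str
    last_index: int
    checksum_ok: bool


class SnapshotRejected(Exception):
    """Raised instead of installing a questionable snapshot."""


def check_install(local_group, local_last_index, snap):
    """Return the index to install up to, or raise SnapshotRejected."""
    if snap is None:
        raise SnapshotRejected("missing snapshot")
    if snap.group_id != local_group:
        raise SnapshotRejected("wrong group")
    if snap.last_index < local_last_index:  # assumption: strictly lower is backwards
        raise SnapshotRejected("backwards install")
    if not snap.checksum_ok:
        raise SnapshotRejected("corrupt snapshot")
    return snap.last_index
```

Every rejection path raises, so a caller cannot accidentally proceed with a partial or mismatched install; the only way to get an index back is to pass every check.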

See Multi-Raft, Cluster Mode, and Kubernetes Deployment for the runtime contract.