Reliability Gates
BoilStream's clustered metadata path is hardened through layered tests that exercise correctness, durability, memory safety, thread safety, and Kubernetes-operational behavior.
The gates are split by boundary:
| Boundary | Gate family | What it proves |
|---|---|---|
| Memory safety | ASAN, UBSAN, LSan, macOS leak checks | Use-after-free, undefined behavior, and leak regressions fail the gate instead of becoming latent cluster faults |
| Thread safety | TSAN builds | Concurrent Raft, transport, delayed-task, and snapshot paths do not introduce data races under sanitizer execution |
| Determinism | Seeded deterministic simulation | The same virtual-time network, fault, tenant, table, and writer schedule produces the same committed trace and canonical state hash |
| Formal model checking | TLA+/TLC models (P1/P2/P3) plus falsifying mutants | Modeled invariants (safety, log ordering, group-commit durability) hold, and deliberately broken mutants are caught |
| Production boundary | C++/DuckDB/Quack boundary probes (probe sweep) | The simulator crosses into real typed mutation and coalescing paths instead of only testing a detached model |
| OS-process behavior | Multi-process mTLS cluster smoke tests | Separate server processes converge, elect leaders, reseed nodes, and keep durable state coherent |
| Chaos | Leader/follower kill, restart, coordinator-loss, and re-election nemeses | A healthy quorum stays responsive and committed state remains convergent through expected pod/node failures |
| Durability | Restart, catchup, NuRaft snapshot reseed, and corrupt-log gates | Nodes recover from local durable state, reseed from peers, or fail loudly instead of silently diverging |
| Workload compatibility | DuckLake metadata workload, ATTACH storm, typed mutation oracle, and standalone-vs-cluster differential tests | Clustered behavior matches the standalone DuckDB/DuckLake contract for the supported metadata workload |
| Snapshot reseed | NuRaft logical snapshot-transfer gates | Killed nodes reseed through real peer-to-peer NuRaft snapshot chunks; wrong-group or backwards installs are rejected |
| Kubernetes operations | Helm render tests, PDB/readiness/PVC chart assertions, rolling-drain behavior | The chart exposes the right services and protects quorum during routine Kubernetes operations |
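The determinism row can be illustrated with a toy replay check. This is a minimal sketch, not BoilStream's simulator: `run_simulation`, the fault rate, and the state layout are hypothetical names chosen for illustration. It shows the property being gated: one seed drives every scheduling and fault decision, time is virtual, and the final state is hashed in a canonical form.

```python
import hashlib
import json
import random


def run_simulation(seed: int, steps: int = 200) -> tuple[list, str]:
    """Run a toy virtual-time schedule; return (committed trace, state hash)."""
    rng = random.Random(seed)  # the only source of randomness in the run
    clock = 0                  # virtual time; the wall clock is never read
    state: dict[str, int] = {}
    trace = []
    for _ in range(steps):
        clock += rng.randint(1, 5)        # advance virtual time
        writer = rng.randrange(3)         # hypothetical writer id
        table = f"t{rng.randrange(4)}"    # hypothetical table name
        if rng.random() < 0.1:            # injected fault: the write is dropped
            continue
        state[table] = state.get(table, 0) + 1
        trace.append((clock, writer, table, state[table]))
    # Canonical form: sorted keys and fixed separators, so the hash does not
    # depend on dictionary insertion order.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return trace, hashlib.sha256(canonical.encode()).hexdigest()


def replay_is_deterministic(seed: int) -> bool:
    """Replay the same seed twice and compare both trace and hash."""
    return run_simulation(seed) == run_simulation(seed)
```

A replay gate built this way fails as soon as any code path consults an unseeded random source or real time, because the second run then diverges from the first.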
Release Gate Groups
The Quack Multi-Raft (clustered-table and cp-metadata) gates are grouped into repeatable release checks. The exact internal harness names are not part of the public API, but every release candidate must pass the relevant gate groups below.
| Gate group | Purpose |
|---|---|
| Quack clustered-table release gate | Runs the top-level clustered-table correctness, durability, deterministic simulation, and product-path checks |
| Linux sanitizer soak | Builds with Ninja/ccache and runs ASAN, TSAN, UBSAN, LSan, deterministic simulation, chaos, DuckLake workload stress, and reseed cycles |
| Sanitizer evidence gate | Produces machine-readable ASAN/TSAN/UBSAN/LSan evidence for release review |
| Memory and product-path gate | Exercises focused memory, reseed, DuckLake, and clustered product-path scenarios |
| Deterministic simulation + TLC gate | Runs seeded virtual-time network/fault schedules with production-boundary probes, plus TLA+/TLC model checking with falsifying mutants |
| Multi-Raft probe sweep | Runs the full probe sweep (SQL contract, transaction, routing, runtime, snapshot, and compatibility probes) across macOS and Linux |
| OS-process cluster gate | Runs separate server processes through quorum, chaos, durable restart, catchup, and corruption scenarios |
| HTTP runtime cluster gate | Exercises a real 3-process mTLS HTTP/2 NuRaft cluster: leader election, coordinator/participant-leader loss, re-election forwarding, and snapshot-transfer reseed |
| SQL contract + ABI fence gate | Verifies the conservative SQL surface, the M7-8 capability fence, and fail-closed behavior on mixed/legacy ABI clusters |
| Helm render gate | Asserts Kubernetes services, ports, generated cp-metadata peers, Quack exposure, PDB, readiness, and PVC wiring |
Property And Fuzz Status
BoilStream runs deterministic simulation, formal model checking, and chaos testing today:
- Quack Multi-Raft is formally model-checked with TLA+/TLC (P1/P2/P3 models) plus falsifying mutants that must fail.
- The Quack fork uses a deterministic simulator with tenant/table/writer fanout, replay determinism checks, canonical hashes, and production-boundary probes.
- A multi-process mTLS HTTP/2 cluster harness runs chaos slices (leader/follower kill, coordinator loss, re-election forwarding) and NuRaft logical snapshot-transfer reseed.
BoilStream does not currently claim libFuzzer or AFL as first-class public gates. Those are natural future additions; the current high-value correctness gates are TLA+/TLC model checking, deterministic simulation, OS-process chaos, and production-boundary probes.
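The idea of a falsifying mutant can be sketched outside TLA+ with a small explicit-state search. The model below is a hypothetical three-node toy written in Python, not one of the P1/P2/P3 models: the correct rule advances the commit index only when a majority stores the entry, and the mutant drops that check. The checker must pass the correct model and must find a violation in the mutant.

```python
from collections import deque

MAX_LOG = 3  # bound the state space so exhaustive exploration terminates


def next_states(state, commit_needs_majority=True):
    """Yield successors of (leader log length, follower match indexes, commit index)."""
    log_len, match, commit = state
    if log_len < MAX_LOG:                        # leader appends an entry
        yield (log_len + 1, match, commit)
    for i, m in enumerate(match):                # follower i stores one more entry
        if m < log_len:
            yield (log_len, match[:i] + (m + 1,) + match[i + 1:], commit)
    if commit < log_len:                         # leader tries to advance commit
        on_majority = max(match) >= commit + 1   # leader + one follower = 2 of 3
        if on_majority or not commit_needs_majority:
            yield (log_len, match, commit + 1)


def invariant(state):
    """Every committed entry is stored on a majority (leader plus one follower)."""
    _, match, commit = state
    return commit <= max(match)


def check(commit_needs_majority=True):
    """Breadth-first search of all reachable states; return a violation or None."""
    start = (0, (0, 0), 0)
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for nxt in next_states(state, commit_needs_majority):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None
```

Here `check()` should return `None`, while `check(commit_needs_majority=False)` should return a counterexample state. A mutant that the checker fails to catch would suggest the invariant is too weak to be meaningful.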
Kubernetes Readiness Contract
In Kubernetes, the chart protects clustered correctness through:
- Stable StatefulSet ordinals and headless-Service DNS for node identity.
- cp-metadata (Quack Multi-Raft) enabled by default for multi-pod metadata convergence.
- PVC-backed /data for local DuckDB metadata, catalog files, Raft log/snapshot state, and hot-tier files.
- PodDisruptionBudget defaulting to maxUnavailable: 1 for 3-replica quorum protection.
- PostgreSQL-protocol readiness/liveness probes so a pod that accepts TCP but cannot complete the protocol handshake is removed from Gateway backends.
- preStop delay plus server shutdown handling for drain, S3 flush, and leadership handoff.
- Bulk analytical data lands in S3 via DuckLake hot→cold tiering, kept outside the Raft write path.
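The arithmetic behind the PodDisruptionBudget default can be written out directly. The helper names below are illustrative and not part of the chart; the sketch assumes every replica is a Raft voter, as in the 3-replica case described above.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a Raft majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be down while a majority remains."""
    return voters - majority(voters)


def pdb_preserves_quorum(replicas: int, max_unavailable: int) -> bool:
    """True if evicting max_unavailable pods still leaves a majority running."""
    return replicas - max_unavailable >= majority(replicas)


if __name__ == "__main__":
    print(majority(3))                 # 2
    print(tolerated_failures(3))       # 1
    print(pdb_preserves_quorum(3, 1))  # True
    print(pdb_preserves_quorum(3, 2))  # False
```

A PodDisruptionBudget only limits voluntary disruptions such as node drains, so an unplanned pod failure during a drain can still take a 3-replica group below its majority of 2.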
Operational Reading
Healthy clusters should preserve these invariants:
- With 3 voters, any 2 healthy voters keep the cp-metadata group available.
- A node that cannot apply committed metadata marks itself degraded and should fail readiness.
- An empty or killed node reseeds from peers via NuRaft logical snapshot transfer before tailing the live log.
- Wrong-group, backwards, corrupt, missing, or incompatible snapshot/log artifacts fail closed.
- Bulk-data S3 availability affects DuckLake reads/writes, not Raft quorum.
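The fail-closed rule for snapshot artifacts can be sketched as a validation step that raises rather than installing anything doubtful. This is an assumption-laden illustration: `SnapshotMeta`, its fields, and `validate_install` are hypothetical and do not reflect NuRaft's or BoilStream's actual types.

```python
from dataclasses import dataclass
from typing import Optional


class SnapshotRejected(Exception):
    """Raised instead of installing a snapshot that fails any check."""


@dataclass(frozen=True)
class SnapshotMeta:  # hypothetical fields, for illustration only
    group_id: str
    last_index: int
    format_version: int
    checksum_ok: bool


def validate_install(
    snap: Optional[SnapshotMeta],
    local_group: str,
    local_last_applied: int,
    supported_versions: frozenset,
) -> SnapshotMeta:
    """Return the snapshot only if every check passes; otherwise raise."""
    if snap is None:
        raise SnapshotRejected("missing snapshot")
    if snap.group_id != local_group:
        raise SnapshotRejected("wrong group")
    if snap.last_index < local_last_applied:
        raise SnapshotRejected("backwards install")
    if not snap.checksum_ok:
        raise SnapshotRejected("corrupt snapshot")
    if snap.format_version not in supported_versions:
        raise SnapshotRejected("incompatible format version")
    return snap
```

Raising on every doubtful case means the caller cannot accidentally apply a partial or mismatched snapshot; the node stays out of service until it receives an artifact that passes all checks.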
See Multi-Raft, Cluster Mode, and Kubernetes Deployment for the runtime contract.