Introduction
BoilStream is a Streaming Ingestion Lakehouse — a unified platform that combines high-performance data ingestion, real-time stream processing, full-text search, and managed Data Lake (DuckLake) catalogs. Built with Rust and Apache Arrow, it handles the complete data lifecycle from ingestion to analytics with just SQL.
Key innovations: Concurrent Parquet row group streaming with S3 Multipart Uploads, JIT Avro-to-Arrow decoder (3-5x faster than arrow-avro), materialized views with continuous DuckDB SQL queries, integrated Tantivy full-text search with hot/cold tiering, and a real-time SSE consumer for pushing data to browsers and services.
The Streaming Data Flow
BoilStream is organized around a continuous data flow — from ingestion through transformation, aggregation, search, and consumption. Each stage is built in and driven entirely by SQL.
Ingest → Transform → Aggregate → Search → Consume
Ingest
Stream data in through multiple interfaces at 10M+ rows/second:
- Arrow FlightRPC — High-throughput Arrow-native ingestion
- HTTP/2 Arrow POST — Simple HTTP ingestion with Arrow IPC payloads
- Kafka Protocol — BoilStream speaks Kafka wire protocol — produce with any Kafka client, with Schema Registry and Confluent binary Avro support
- PostgreSQL COPY — COPY ... FROM STDIN via the PGWire protocol
Data lands in hot tier (DuckDB) for immediate queries and streams to cold tier (Parquet on S3) concurrently — no staging, no ETL.
Transform
Apply continuous row-by-row transformations with streaming views:
```sql
CREATE STREAMING VIEW click_events AS
SELECT * FROM events WHERE event_type = 'click';
```

Streaming views filter, project, and transform every row as it arrives. Each view produces its own derived topic with independent hot and cold tiers. Views can cascade — build pipelines by chaining streaming views on streaming views. See Materialized Views for details.
Aggregate
Run windowed aggregations with materialized views:
```sql
CREATE MATERIALIZED VIEW sales_per_minute AS
SELECT SUM(amount) AS total, COUNT(*) AS cnt
FROM orders
WITH (window_type='tumbling', window_size='1 minute', timestamp_column='ts');
```

Tumbling and sliding windows execute DuckDB SQL on each window close. Output flows back through the full ingestion pipeline — hot tier, cold tier, CDC, and downstream views. See Materialized Views for details.
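The arithmetic behind tumbling windows is simple: each timestamp maps to exactly one fixed-size, non-overlapping window, and the aggregate is emitted when that window closes. A pure-Python illustration of the assignment and per-window aggregation (not BoilStream's implementation; epoch-aligned windows are an assumption):

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def window_start(ts, size):
    """Start of the tumbling window containing ts (windows aligned to the epoch)."""
    return EPOCH + (ts - EPOCH) // size * size

orders = [
    (datetime(2024, 1, 1, 12, 0, 10, tzinfo=timezone.utc), 5.0),
    (datetime(2024, 1, 1, 12, 0, 50, tzinfo=timezone.utc), 7.0),
    (datetime(2024, 1, 1, 12, 1, 5, tzinfo=timezone.utc), 3.0),
]

# Accumulate SUM(amount) and COUNT(*) per 1-minute window.
size = timedelta(minutes=1)
windows = defaultdict(lambda: {"total": 0.0, "cnt": 0})
for ts, amount in orders:
    w = windows[window_start(ts, size)]
    w["total"] += amount
    w["cnt"] += 1

for start in sorted(windows):
    print(start.time(), windows[start])
# 12:00:00 {'total': 12.0, 'cnt': 2}
# 12:01:00 {'total': 3.0, 'cnt': 1}
```

A sliding window differs only in that one timestamp can land in several overlapping windows, so each row updates multiple accumulators.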
Search
Query ingested data with integrated Tantivy full-text search:
```sql
ALTER TABLE docs__stream.main.articles SET (
  tantivy_enabled = true, tantivy_text_fields = 'title,body'
);
SELECT title, _score FROM multilake_search(
  'docs__stream', 'articles__tantivy_idx', 'distributed systems'
) ORDER BY _score DESC;
```

Every inserted row is automatically indexed. Hot tier indexes are searchable within seconds; cold tier bundles are uploaded to S3 and registered in DuckLake for durable distributed search. No external search infrastructure required. See Full-Text Search for details.
Consume
Push real-time data to browsers and services:
- SSE Consumer — Server-Sent Events with Arrow IPC batches, automatic reconnection, and catch-up replay
- JavaScript SDK — @boilstream/consumer for browser and Node.js with flechette Arrow decoding
- FlightSQL — Standard Arrow Flight SQL for BI tools and analytics clients
- PostgreSQL Interface — PGWire for Power BI, Tableau, DBeaver, Grafana
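On the wire, an SSE stream is lines of field: value pairs with a blank line terminating each event. A minimal parser for that framing (stdlib only; decoding the Arrow IPC batch carried in each event, which the SDK does with flechette, is omitted here):

```python
def iter_sse_events(lines):
    """Yield the data payload of each Server-Sent Event from an iterable of lines."""
    data = []
    for line in lines:
        if line.startswith("data:"):
            data.append(line[5:].lstrip())
        elif line == "" and data:  # blank line terminates an event
            yield "\n".join(data)
            data = []

# A synthetic SSE stream as a consumer would receive it over HTTP.
stream = [
    "event: batch",
    'data: {"rows": 3}',
    "",
    'data: {"rows": 5}',
    "",
]

events = list(iter_sse_events(stream))
print(events)  # ['{"rows": 3}', '{"rows": 5}']
```

Catch-up replay works on top of this framing: the client sends its last-seen event id on reconnect and the server replays the missed events.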
Platform Capabilities
Beyond the streaming data flow, BoilStream provides a complete lakehouse platform:
Multi-Tenant DuckDB
Full tenant isolation — each user gets their own DuckDB context with isolated secrets, DuckLakes, filesystem, and attachments. Enterprise SSO with SAML (Entra ID), OAuth, SCIM provisioning, MFA, and Passkeys via a built-in Web Auth GUI.
Managed DuckLakes
Personal data lake catalogs auto-provisioned for each user. PostgreSQL-backed metadata with role-based access control. Register Parquet files and query them via standard SQL.
Distributed DuckDB Compute
BoilStream coordinates distributed DuckDB clients — providing DuckLake catalog access and temporary credential vending while clients process data independently:
- Backend Servers — DuckDB + boilstream extension for server-side analytics
- Desktop/CLI Apps — DuckDB + boilstream extension for local data processing
- Browser Clients — duckdb-wasm + boilstream WASM build for in-browser analytics
No central compute bottleneck — BoilStream coordinates access while compute is distributed across all clients.
Hot/Cold Tiered Storage
Data is queryable in DuckDB within ~1 second of ingestion (hot tier). Concurrently, optimized Parquet files stream to S3 via multipart uploads (cold tier). Cold tier hydration API supports >1GB/s rehydration. Tantivy indexes follow the same hot/cold pattern with .bundle segments on S3.
Production Scale
- 10,000+ concurrent sessions tested
- 3 GB/s sustained throughput
- Horizontal cluster mode with S3-based leader election
- Prometheus metrics with Grafana dashboard support
- Multi-cloud: AWS S3, Azure Blob, GCS, MinIO, filesystem
Use Cases
Streaming Analytics
- Real-time dashboards — Materialized views for continuous aggregations, SSE push to browser dashboards
- Stream processing — Filter, transform, and route data with streaming views — no external stream processor needed
- Cross-topic joins — Query across multiple streams in DuckDB
Search & Discovery
- Log search — Index application logs for fast full-text search across cold storage
- Document search — Full-text search over ingested documents, articles, or support tickets
- Product catalog — Search product descriptions and metadata in real-time
Data Lake Ingestion
- Streaming Lakehouse — Unified ingestion, compute, and catalog management
- ETL Replacement — DuckDB SQL transforms + direct Parquet output eliminates complex pipelines
- Personal Data Lakes — Auto-provisioned DuckLake catalogs for teams and users
- Zero-Copy Analytics — Query S3/cloud data directly via distributed DuckDB clients
IoT & Event Sourcing
- High-volume sensor data — Column-partitioned Parquet files with size-based finalization
- Event streams — Capture, transform, aggregate, and search events in a single platform
Next Steps
Ready to get started? Check out our Quick Start Guide to see BoilStream in action.