Introduction

BoilStream is a Streaming Ingestion Lakehouse — a unified platform that combines high-performance data ingestion, real-time stream processing, full-text search, and managed Data Lake (DuckLake) catalogs. Built with Rust and Apache Arrow, it handles the complete data lifecycle from ingestion to analytics with just SQL.

Key innovations: Concurrent Parquet row group streaming with S3 Multipart Uploads, JIT Avro-to-Arrow decoder (3-5x faster than arrow-avro), materialized views with continuous DuckDB SQL queries, integrated Tantivy full-text search with hot/cold tiering, and a real-time SSE consumer for pushing data to browsers and services.

The Streaming Data Flow

BoilStream is organized around a continuous data flow — from ingestion through transformation, aggregation, search, and consumption. Each stage is built into the platform and driven entirely by SQL.

Ingest → Transform → Aggregate → Search → Consume

Ingest

Stream data in through multiple interfaces at 10M+ rows/second.

Data lands in hot tier (DuckDB) for immediate queries and streams to cold tier (Parquet on S3) concurrently — no staging, no ETL.
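As a minimal sketch of that flow (assuming a topic named `events` exposed as a SQL-addressable table, with illustrative column names), a single SQL session can insert a row and query it from the hot tier immediately:

```sql
-- Hypothetical topic and columns, for illustration only
INSERT INTO events (event_type, user_id, ts)
VALUES ('click', 42, now());

-- Hot tier: the row is queryable in DuckDB within ~1 second,
-- while the same data streams to Parquet on S3 in the background
SELECT COUNT(*) FROM events WHERE event_type = 'click';
```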

Transform

Apply continuous row-by-row transformations with streaming views:

sql
CREATE STREAMING VIEW click_events AS
  SELECT * FROM events WHERE event_type = 'click';

Streaming views filter, project, and transform every row as it arrives. Each view produces its own derived topic with independent hot and cold tiers. Views can cascade — build pipelines by chaining streaming views on streaming views. See Materialized Views for details.
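Cascading could look like the following sketch, which chains a second view onto the first; the `page` column and the `/checkout` filter are illustrative assumptions:

```sql
-- Derived topic from the base events stream
CREATE STREAMING VIEW click_events AS
  SELECT * FROM events WHERE event_type = 'click';

-- Chain a second streaming view on the first; it gets its own
-- derived topic with independent hot and cold tiers
CREATE STREAMING VIEW checkout_clicks AS
  SELECT * FROM click_events WHERE page = '/checkout';
```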

Aggregate

Run windowed aggregations with materialized views:

sql
CREATE MATERIALIZED VIEW sales_per_minute AS
  SELECT SUM(amount) AS total, COUNT(*) AS cnt
  FROM orders
  WITH (window_type='tumbling', window_size='1 minute', timestamp_column='ts');

Tumbling and sliding windows execute DuckDB SQL on each window close. Output flows back through the full ingestion pipeline — hot tier, cold tier, CDC, and downstream views. See Materialized Views for details.
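A sliding-window variant might look like the sketch below, mirroring the tumbling example above; the `slide_interval` parameter name is an assumption, not confirmed syntax:

```sql
-- Hypothetical sliding window: every minute, aggregate the last
-- five minutes of orders (slide_interval name is an assumption)
CREATE MATERIALIZED VIEW sales_5min_sliding AS
  SELECT SUM(amount) AS total, COUNT(*) AS cnt
  FROM orders
  WITH (window_type='sliding', window_size='5 minutes',
        slide_interval='1 minute', timestamp_column='ts');
```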

Search

Query ingested data with integrated Tantivy full-text search:

sql
ALTER TABLE docs__stream.main.articles SET (
  tantivy_enabled = true, tantivy_text_fields = 'title,body'
);

SELECT title, _score FROM multilake_search(
  'docs__stream', 'articles__tantivy_idx', 'distributed systems'
) ORDER BY _score DESC;

Every inserted row is automatically indexed. Hot tier indexes are searchable within seconds; cold tier bundles are uploaded to S3 and registered in DuckLake for durable distributed search. No external search infrastructure required. See Full-Text Search for details.

Consume

Push real-time data to browsers and services with the built-in Server-Sent Events (SSE) consumer.

Platform Capabilities

Beyond the streaming data flow, BoilStream provides a complete lakehouse platform:

Multi-Tenant DuckDB

Full tenant isolation — each user gets their own DuckDB context with isolated secrets, DuckLakes, filesystem, and attachments. Enterprise SSO with SAML (Entra ID), OAuth, SCIM provisioning, MFA, and Passkeys via a built-in Web Auth GUI.

Managed DuckLakes

Personal data lake catalogs auto-provisioned for each user. PostgreSQL-backed metadata with role-based access control. Register Parquet files and query them via standard SQL.
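Registering an existing Parquet file could look like this sketch, using the DuckLake extension's `ducklake_add_data_files` call from a standard DuckDB client; the catalog, table, and path names are illustrative:

```sql
-- Register an existing Parquet file into a DuckLake table
-- (catalog, table, and path names are illustrative)
CALL ducklake_add_data_files('my_lake', 'events',
  's3://my-bucket/events/part-001.parquet');

-- Then query it via standard SQL
SELECT COUNT(*) FROM my_lake.events;
```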

Distributed DuckDB Compute

BoilStream coordinates distributed DuckDB clients — providing DuckLake catalog access and temporary credential vending while clients process data independently:

  • Backend Servers — DuckDB + boilstream extension for server-side analytics
  • Desktop/CLI Apps — DuckDB + boilstream extension for local data processing
  • Browser Clients — duckdb-wasm + boilstream WASM build for in-browser analytics

No central compute bottleneck — BoilStream coordinates access while compute is distributed across all clients.
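From any of these clients, a session might look like the following sketch. It uses DuckDB's DuckLake extension `ATTACH` syntax with an assumed metadata connection string and data path; in practice the boilstream extension would supply catalog access and temporary credentials:

```sql
-- In a DuckDB client (connection string and names are illustrative)
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:postgres:host=meta.example.com dbname=lake' AS my_lake
  (DATA_PATH 's3://my-bucket/lake/');

-- Compute runs locally in this client, not on the server
SELECT COUNT(*) FROM my_lake.events;
```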

Hot/Cold Tiered Storage

Data is queryable in DuckDB within ~1 second of ingestion (hot tier). Concurrently, optimized Parquet files stream to S3 via multipart uploads (cold tier). Cold tier hydration API supports >1GB/s rehydration. Tantivy indexes follow the same hot/cold pattern with .bundle segments on S3.

Production Scale

  • 10,000+ concurrent sessions tested
  • 3 GB/s sustained throughput
  • Horizontal cluster mode with S3-based leader election
  • Prometheus metrics with Grafana dashboard support
  • Multi-cloud: AWS S3, Azure Blob, GCS, MinIO, filesystem

Use Cases

Streaming Analytics

  • Real-time dashboards — Materialized views for continuous aggregations, SSE push to browser dashboards
  • Stream processing — Filter, transform, and route data with streaming views — no external stream processor needed
  • Cross-topic joins — Query across multiple streams in DuckDB
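A cross-topic join over hot-tier data might look like this sketch (topic and column names are illustrative):

```sql
-- Join two streams' hot tiers in a single DuckDB query
-- (orders, users, and their columns are illustrative names)
SELECT o.order_id, o.amount, u.country
FROM orders AS o
JOIN users AS u ON o.user_id = u.user_id
WHERE o.ts > now() - INTERVAL 1 HOUR;
```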

Search & Discovery

  • Log search — Index application logs for fast full-text search across cold storage
  • Document search — Full-text search over ingested documents, articles, or support tickets
  • Product catalog — Search product descriptions and metadata in real-time

Data Lake Ingestion

  • Streaming Lakehouse — Unified ingestion, compute, and catalog management
  • ETL Replacement — DuckDB SQL transforms + direct Parquet output eliminate complex pipelines
  • Personal Data Lakes — Auto-provisioned DuckLake catalogs for teams and users
  • Zero-Copy Analytics — Query S3/cloud data directly via distributed DuckDB clients

IoT & Event Sourcing

  • High-volume sensor data — Column-partitioned Parquet files with size-based finalization
  • Event streams — Capture, transform, aggregate, and search events in a single platform

Next Steps

Ready to get started? Check out our Quick Start Guide to see BoilStream in action.