Introduction
BoilStream is a Streaming Ingestion Lakehouse - a unified platform that combines high-performance data ingestion, DuckDB server capabilities, and managed Data Lake (DuckLake) catalogs. Built with Rust and Apache Arrow, it serves as both a streaming data platform and a DuckDB compute layer, providing a remote secrets store for distributed DuckDB clients via the boilstream community extension.
Key innovations: concurrent Parquet row-group streaming with S3 multipart uploads (parts optionally padded to 5 MB for optimal analytics performance), a JIT Avro-to-Arrow decoder (3-5x faster than arrow-avro), user-defined column partitioning, materialized views backed by continuously running real-time DuckDB SQL queries, and size-based file finalization that avoids small-file fragmentation.
What is BoilStream?
BoilStream is a Streaming Ingestion Lakehouse that unifies data ingestion, compute, and catalog management:
Core Capabilities
- Web Auth GUI - Built-in web dashboard for user authentication, credential vending, and platform administration
  - User Dashboard: Get PostgreSQL credentials and JWT tokens after OAuth/SAML login
  - Superadmin Dashboard: Configure SAML SSO providers, manage users, etc.
- DuckDB Server - Remote DuckDB clients connect via the boilstream extension with centralized secrets management (see the installation sketch after this list)
- Managed DuckLakes - Personal data lake catalogs with embedded PostgreSQL, auto-provisioned for each user
- Streaming Ingestion - 10M+ rows/second to cloud storage (diskless) and local DuckDB
- SQL-First - Transform data using familiar DuckDB SQL syntax
- Direct Parquet - Analytics-ready files, no intermediate processing
- Multiple Interfaces - FlightRPC, HTTP/2 Arrow, Kafka protocol, PostgreSQL wire protocol
- BI Tool Ready - Power BI, Tableau, DBeaver via PostgreSQL interface
- Multi-Cloud - AWS S3, Azure Blob, GCS, MinIO, filesystem
- Enterprise Security - TLS encryption, SAML/OAuth SSO, IAM integration
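To ground the DuckDB Server capability above: the boilstream extension installs like any DuckDB community extension. The INSTALL/LOAD statements below are standard DuckDB syntax; how the client then authenticates against a BoilStream deployment is configuration-specific, so treat this as a minimal sketch rather than a complete setup:

```sql
-- Standard DuckDB community-extension installation.
INSTALL boilstream FROM community;
LOAD boilstream;

-- From here, the client authenticates against a BoilStream server to obtain
-- catalog access and temporary credentials; the exact connection steps are
-- deployment-specific (see the Quick Start Guide).
```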
Key Benefits
Unified Lakehouse Architecture
Traditional data platforms require separate systems for ingestion, compute, and catalog management. BoilStream unifies these into a single platform:
Traditional Architecture:
- Stream ingestion clusters → Processing workers → Data lake → Catalog service → Query engine
BoilStream Architecture:
- Single platform: Ingestion + DuckDB compute + DuckLake catalogs + Secrets store
Distributed DuckDB Compute
BoilStream acts as a central coordinator for distributed DuckDB clients - providing DuckLake catalog access and vending temporary credentials while clients process data independently:
BoilStream provides:
- DuckLake Catalog - Central metadata: what data exists, where it's stored
- Credential Vending - Short-lived S3/cloud tokens (no permanent keys on clients)
- Multi-Tenant Isolation - Per-user/team data lake access control
Clients query data directly:
- Backend Servers - DuckDB + boilstream extension for server-side analytics
- Desktop/CLI Apps - DuckDB + boilstream extension for local data processing
- Browser Clients - duckdb-wasm + boilstream WASM build for in-browser analytics
Each client connects to BoilStream for catalog info and temporary credentials, then queries S3/cloud storage directly with full DuckDB power. No central compute bottleneck - BoilStream coordinates access while compute is distributed across all clients.
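To make this flow concrete, here is a hedged sketch of the client side in plain DuckDB SQL, assuming BoilStream has already vended short-lived S3 credentials. CREATE SECRET and read_parquet are standard DuckDB; the bucket path and credential values are hypothetical placeholders, and in practice the boilstream extension can manage the secret on the client's behalf:

```sql
-- Register the vended, short-lived credentials as a DuckDB S3 secret
-- (all values below are hypothetical placeholders).
CREATE SECRET vended_s3 (
    TYPE s3,
    KEY_ID 'ASIA...',          -- temporary key, not a permanent credential
    SECRET '...',
    SESSION_TOKEN '...',
    REGION 'us-east-1'
);

-- Query the lake directly from the client: no central compute bottleneck.
SELECT count(*)
FROM read_parquet('s3://example-lake/events/*.parquet');
```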
Managed DuckLakes
Personal data lake catalogs with automatic provisioning:
- Auto-Provisioned: New users automatically get their own DuckLake catalog
- PostgreSQL-Backed: Embedded PostgreSQL server for catalog metadata
- SQL Interface: Register Parquet files and query them via standard SQL (see the sketch after this list)
- Access Control: Role-based permissions on catalogs and tables
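Because DuckLake is a standard DuckDB extension, working with a managed catalog looks like ordinary DuckLake SQL. A minimal sketch, assuming a hypothetical catalog host and bucket (in a managed setup, BoilStream provisions the embedded PostgreSQL catalog and hands you the connection details):

```sql
INSTALL ducklake;  -- standard DuckDB extension

-- Attach a DuckLake catalog backed by PostgreSQL (host, dbname, and
-- DATA_PATH below are hypothetical placeholders).
ATTACH 'ducklake:postgres:dbname=my_lake host=catalog.example.com' AS lake
    (DATA_PATH 's3://example-lake/');

-- Register Parquet data as a catalog table, then query it like any table.
CREATE TABLE lake.events AS
    SELECT * FROM read_parquet('s3://example-lake/raw/events-*.parquet');

SELECT count(*) FROM lake.events;
```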
Immediate Analytics
- Cloud Storage: Optimized Parquet files ready for analytics
- Local DuckDB: Ultra-fast queries and cross-topic joins (example after this list)
- No ETL Pipeline: Direct streaming to analytics-ready format
- DuckLake Integration: Automatic file registration in catalogs
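As an illustration of the cross-topic joins mentioned above, once topics are materialized as local DuckDB tables they can be joined with ordinary SQL. The clicks and orders tables and their columns here are hypothetical; how topics map to tables depends on your configuration:

```sql
-- Join two ingested streams locally (table and column names are hypothetical).
SELECT o.order_id, o.amount, c.session_id
FROM orders AS o
JOIN clicks AS c USING (user_id)
WHERE o.created_at > now() - INTERVAL 1 HOUR;
```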
Production Scale
Built for enterprise workloads:
- 10,000+ concurrent sessions tested
- 3 GB/s sustained throughput
- Transactional writes that preserve Parquet metadata columns
- Prometheus metrics with Grafana dashboard support
Use Cases
BoilStream is perfect for:
Data Lake Scenarios
- Streaming Lakehouse: Unified platform for ingestion, compute, and catalog management
- Remote DuckDB Compute: Distributed analytics with centralized secrets and catalog
- Personal Data Lakes: Auto-provisioned DuckLake catalogs for teams and users
- Zero-Copy Analytics: Query S3/cloud data directly via DuckDB with no data movement
Real-Time Analytics
- Immediate Analysis: Stream data for instant queries with dual storage (cloud + local)
- Cross-Topic Analytics: Join and analyze data across multiple streams
- BI Tool Integration: Direct connectivity via PostgreSQL wire protocol and FlightSQL
- Time-Series Analysis: Window queries over historical data (on the roadmap)
Data Engineering
- Data Lake Ingestion: Populate data lakes with minimal latency via diskless pipeline
- Log Processing: Ingest and transform log files in real-time
- IoT Data: Handle high-volume sensor data streams
- Event Sourcing: Capture and store event streams with backup-free architecture
- ETL Replacement: DuckDB SQL transforms and direct Parquet output eliminate complex pipelines
Next Steps
Ready to get started? Check out our Quick Start Guide to see BoilStream in action.