
Introduction

BoilStream is a Streaming Ingestion Lakehouse - a unified platform that combines high-performance data ingestion, DuckDB server capabilities, and managed Data Lake (DuckLake) catalogs. Built with Rust and Apache Arrow, it serves as both a streaming data platform and a DuckDB compute layer, with a remote secrets store for distributed DuckDB clients connecting via the boilstream community extension.

Key innovations: concurrent Parquet row-group streaming with S3 multipart uploads (optionally padded to 5 MB for optimal analytics performance), a JIT Avro-to-Arrow decoder (3-5x faster than arrow-avro), user-defined column partitioning, materialized views driven by continuously running real-time DuckDB SQL queries, and size-based file finalization that avoids small-file fragmentation.

What is BoilStream?

BoilStream is a Streaming Ingestion Lakehouse that unifies data ingestion, compute, and catalog management:

Core Capabilities

  • Web Auth GUI - Built-in web dashboard for user authentication, credential vending, and platform administration
    • User Dashboard: Get PostgreSQL credentials and JWT tokens after OAuth/SAML login
    • Superadmin Dashboard: Configure SAML SSO providers, manage users, etc.
  • DuckDB Server - Remote DuckDB clients connect via BoilStream extension with centralized secrets management
  • Managed DuckLakes - Personal data lake catalogs with embedded PostgreSQL, auto-provisioned for each user
  • Streaming Ingestion - 10M+ rows/second to cloud storage (diskless) and local DuckDB (see the ingestion sketch after this list)
  • SQL-First - Transform data using familiar DuckDB SQL syntax
  • Direct Parquet - Analytics-ready files, no intermediate processing
  • Multiple Interfaces - FlightRPC, HTTP/2 Arrow, Kafka protocol, PostgreSQL wire protocol
  • BI Tool Ready - Power BI, Tableau, DBeaver via PostgreSQL interface
  • Multi-Cloud - AWS S3, Azure Blob, GCS, MinIO, filesystem
  • Enterprise Security - TLS encryption, SAML/OAuth SSO, IAM integration
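
As a concrete illustration of the FlightRPC interface, the sketch below pushes a small Arrow record batch to a topic with the standard pyarrow Flight client. The endpoint URL, the topic-as-Flight-path convention, and the column names are illustrative assumptions rather than the definitive API; see the Quick Start Guide for the exact connection details.

```python
import pyarrow as pa
import pyarrow.flight as flight

# Hypothetical BoilStream Flight endpoint; adjust host, port, and TLS
# settings to match your deployment.
client = flight.FlightClient("grpc+tls://boilstream.example.com:50051")

# A small Arrow record batch to ingest.
batch = pa.RecordBatch.from_pydict({
    "device_id": ["sensor-1", "sensor-2"],
    "temperature": [21.5, 22.1],
})

# Assumption for illustration: topics are addressed as Flight descriptor paths.
descriptor = flight.FlightDescriptor.for_path("iot_temperatures")

# DoPut streams record batches to the server.
writer, _ = client.do_put(descriptor, batch.schema)
writer.write_batch(batch)
writer.close()
```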

Key Benefits

Unified Lakehouse Architecture

Traditional data platforms require separate systems for ingestion, compute, and catalog management. BoilStream unifies these into a single platform:

Traditional Architecture:

  • Stream ingestion clusters → Processing workers → Data lake → Catalog service → Query engine

BoilStream Architecture:

  • Single platform: Ingestion + DuckDB compute + DuckLake catalogs + Secrets store

Distributed DuckDB Compute

BoilStream acts as a central coordinator for distributed DuckDB clients - providing DuckLake catalog access and temporary credential vending while clients process data independently:

BoilStream provides:

  • DuckLake Catalog - Central metadata: what data exists, where it's stored
  • Credential Vending - Short-lived S3/cloud tokens (no permanent keys on clients)
  • Multi-Tenant Isolation - Per-user/team data lake access control

Clients query data directly:

  • Backend Servers - DuckDB + boilstream extension for server-side analytics
  • Desktop/CLI Apps - DuckDB + boilstream extension for local data processing
  • Browser Clients - duckdb-wasm + boilstream WASM build for in-browser analytics

Each client connects to BoilStream for catalog information and temporary credentials, then queries S3/cloud storage directly with the full power of DuckDB. There is no central compute bottleneck: BoilStream coordinates access while compute is distributed across all clients.
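
A minimal sketch of this flow from a Python client, assuming the boilstream community extension is available and using a placeholder server endpoint; the ATTACH string below is illustrative, not the extension's documented syntax:

```python
import duckdb

con = duckdb.connect()

# Install and load the boilstream community extension using DuckDB's
# standard community-extension install flow.
con.execute("INSTALL boilstream FROM community")
con.execute("LOAD boilstream")

# Hypothetical: point the extension at a BoilStream server, which exposes
# the DuckLake catalog and vends short-lived cloud credentials.
con.execute("ATTACH 'boilstream://boilstream.example.com:443' AS lake")

# The query runs locally in this DuckDB process; data is read directly
# from S3/cloud storage with the vended temporary credentials.
print(con.execute("SELECT count(*) FROM lake.events").fetchone())
```

The same pattern applies to backend, desktop/CLI, and browser clients; only the extension build (native vs. WASM) differs.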

Managed DuckLakes

Personal data lake catalogs with automatic provisioning:

  • Auto-Provisioned: New users automatically get their own DuckLake catalog
  • PostgreSQL-Backed: Embedded PostgreSQL server for catalog metadata
  • SQL Interface: Register Parquet files and query them via standard SQL (see the sketch after this list)
  • Access Control: Role-based permissions on catalogs and tables
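
As a sketch of that SQL interface, assuming the user's auto-provisioned catalog is already attached (here as a hypothetical "my_lake") and using illustrative bucket, topic, and table names:

```python
import duckdb

con = duckdb.connect()
# Assumes the personal DuckLake catalog is already attached as "my_lake";
# the bucket, topic, and table names below are placeholders.

# Register streamed Parquet output as a catalog table.
con.execute("""
    CREATE TABLE my_lake.sensor_readings AS
    SELECT * FROM read_parquet('s3://my-bucket/topics/sensor_readings/*.parquet')
""")

# Query it like any other table with standard SQL.
rows = con.execute("""
    SELECT device_id, avg(temperature) AS avg_temp
    FROM my_lake.sensor_readings
    GROUP BY device_id
""").fetchall()
print(rows)
```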

Immediate Analytics

  • Cloud Storage: Optimized Parquet files ready for analytics
  • Local DuckDB: Ultra-fast queries and cross-topic joins
  • No ETL Pipeline: Direct streaming to analytics-ready format
  • DuckLake Integration: Automatic file registration in catalogs

Production Scale

Built for enterprise workloads:

  • 10,000+ concurrent sessions tested
  • 3 GB/s sustained throughput
  • Transactions preserving Parquet metadata columns
  • Prometheus metrics with Grafana dashboard support

Use Cases

BoilStream is perfect for:

Data Lake Scenarios

  • Streaming Lakehouse: Unified platform for ingestion, compute, and catalog management
  • Remote DuckDB Compute: Distributed analytics with centralized secrets and catalog
  • Personal Data Lakes: Auto-provisioned DuckLake catalogs for teams and users
  • Zero-Copy Analytics: Query S3/cloud data directly via DuckDB with no data movement

Real-Time Analytics

  • Immediate Analysis: Stream data for instant queries with dual storage (cloud + local)
  • Cross-Topic Analytics: Join and analyze data across multiple streams
  • BI Tool Integration: Direct connectivity via PostgreSQL wire protocol and FlightSQL (see the connection sketch after this list)
  • Time-Series Analysis: Window queries over historical data (future roadmap)
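
For BI-style access, here is a minimal sketch using psycopg2 over the PostgreSQL wire protocol with the per-user credentials issued by the web dashboard; the host, port, database, and table names are placeholders:

```python
import psycopg2

# Placeholder connection details; use the PostgreSQL credentials vended by
# the BoilStream user dashboard after OAuth/SAML login.
conn = psycopg2.connect(
    host="boilstream.example.com",
    port=5432,
    user="alice",
    password="<vended-password>",
    dbname="boilstream",
)

with conn.cursor() as cur:
    # "events" is an illustrative table name.
    cur.execute("SELECT count(*) FROM events")
    print(cur.fetchone())

conn.close()
```

Power BI, Tableau, and DBeaver connect the same way through their standard PostgreSQL drivers.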

Data Engineering

  • Data Lake Ingestion: Populate data lakes with minimal latency via diskless pipeline
  • Log Processing: Ingest and transform log files in real-time
  • IoT Data: Handle high-volume sensor data streams
  • Event Sourcing: Capture and store event streams with a backup-free architecture
  • ETL Replacement: DuckDB SQL transforms and direct Parquet output eliminate complex pipelines

Next Steps

Ready to get started? Check out our Quick Start Guide to see BoilStream in action.