Introduction
BoilStream is a Streaming Ingestion Lakehouse - a unified platform that combines high-performance data ingestion, DuckDB server capabilities, and managed Data Lake (DuckLake) catalogs. Built with Rust and Apache Arrow, it serves as both a streaming data platform and a DuckDB compute layer, providing a remote secrets store for distributed DuckDB clients via the boilstream community extension.
Key innovations: concurrent Parquet row-group streaming with S3 multipart uploads (parts optionally padded to 5 MB for optimal analytics performance), a JIT Avro-to-Arrow decoder (3-5x faster than arrow-avro), user-defined column partitioning, materialized views backed by continuously running real-time DuckDB SQL queries, and size-based file finalization that avoids small-file fragmentation.
What is BoilStream?
BoilStream is a Streaming Ingestion Lakehouse that unifies data ingestion, compute, and catalog management:
Core Capabilities
- Web Auth GUI - Built-in web dashboard for user authentication, credential vending, and platform administration
  - User Dashboard: Get PostgreSQL credentials and JWT tokens after OAuth/SAML login
  - Superadmin Dashboard: Configure SAML SSO providers, manage users, etc.
- DuckDB Server - Remote DuckDB clients connect via the boilstream extension with centralized secrets management (see the installation sketch after this list)
- Managed DuckLakes - Personal data lake catalogs with embedded PostgreSQL, auto-provisioned for each user
- Streaming Ingestion - 10M+ rows/second to cloud storage (diskless) and local DuckDB
- SQL-First - Transform data using familiar DuckDB SQL syntax
- Direct Parquet - Analytics-ready files, no intermediate processing
- Multiple Interfaces - FlightRPC, HTTP/2 Arrow, Kafka protocol, PostgreSQL wire protocol
- BI Tool Ready - Power BI, Tableau, DBeaver via PostgreSQL interface
- Multi-Cloud - AWS S3, Azure Blob, GCS, MinIO, filesystem
- Enterprise Security - TLS encryption, SAML/OAuth SSO, IAM integration
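To ground the DuckDB Server capability above: the boilstream extension installs like any DuckDB community extension. The INSTALL/LOAD statements below are standard DuckDB syntax; how the client then authenticates against a BoilStream deployment is configuration-specific, so treat this as a minimal sketch rather than a complete setup:

```sql
-- Standard DuckDB community-extension installation.
INSTALL boilstream FROM community;
LOAD boilstream;

-- From here, the client authenticates against a BoilStream server to obtain
-- catalog access and temporary credentials; the exact connection steps are
-- deployment-specific (see the Quick Start Guide).
```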
Key Benefits
Unified Lakehouse Architecture
Traditional data platforms require separate systems for ingestion, compute, and catalog management. BoilStream unifies these into a single platform:
Traditional Architecture:
- Stream ingestion clusters → Processing workers → Data lake → Catalog service → Query engine
BoilStream Architecture:
- Single platform: Ingestion + DuckDB compute + DuckLake catalogs + Secrets store
Distributed DuckDB Compute
BoilStream acts as a central coordinator for distributed DuckDB clients - providing DuckLake catalog access and vending temporary credentials while clients process data independently:
BoilStream provides:
- DuckLake Catalog - Central metadata: what data exists, where it's stored
- Credential Vending - Short-lived S3/cloud tokens (no permanent keys on clients)
- Multi-Tenant Isolation - Per-user/team data lake access control
Clients query data directly:
- Backend Servers - DuckDB + boilstream extension for server-side analytics
- Desktop/CLI Apps - DuckDB + boilstream extension for local data processing
- Browser Clients - duckdb-wasm + boilstream WASM build for in-browser analytics
Each client connects to BoilStream for catalog info and temporary credentials, then queries S3/cloud storage directly with full DuckDB power. No central compute bottleneck - BoilStream coordinates access while compute is distributed across all clients.
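To make this flow concrete, here is a hedged sketch of the client side in plain DuckDB SQL, assuming BoilStream has already vended short-lived S3 credentials. CREATE SECRET and read_parquet are standard DuckDB; the bucket path and credential values are hypothetical placeholders, and in practice the boilstream extension can manage the secret on the client's behalf:

```sql
-- Register the vended, short-lived credentials as a DuckDB S3 secret
-- (all values below are hypothetical placeholders).
CREATE SECRET vended_s3 (
    TYPE s3,
    KEY_ID 'ASIA...',          -- temporary key, not a permanent credential
    SECRET '...',
    SESSION_TOKEN '...',
    REGION 'us-east-1'
);

-- Query the lake directly from the client: no central compute bottleneck.
SELECT count(*)
FROM read_parquet('s3://example-lake/events/*.parquet');
```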
Managed DuckLakes
Personal data lake catalogs with automatic provisioning:
- Auto-Provisioned: New users automatically get their own DuckLake catalog
- PostgreSQL-Backed: Embedded PostgreSQL server for catalog metadata
- SQL Interface: Register Parquet files and query them via standard SQL (see the sketch after this list)
- Access Control: Role-based permissions on catalogs and tables
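Because DuckLake is a standard DuckDB extension, working with a managed catalog looks like ordinary DuckLake SQL. A minimal sketch, assuming a hypothetical catalog host and bucket (in a managed setup, BoilStream provisions the embedded PostgreSQL catalog and hands you the connection details):

```sql
INSTALL ducklake;  -- standard DuckDB extension

-- Attach a DuckLake catalog backed by PostgreSQL (host, dbname, and
-- DATA_PATH below are hypothetical placeholders).
ATTACH 'ducklake:postgres:dbname=my_lake host=catalog.example.com' AS lake
    (DATA_PATH 's3://example-lake/');

-- Register Parquet data as a catalog table, then query it like any table.
CREATE TABLE lake.events AS
    SELECT * FROM read_parquet('s3://example-lake/raw/events-*.parquet');

SELECT count(*) FROM lake.events;
```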
Immediate Analytics
- Cloud Storage: Optimized Parquet files ready for analytics
- Local DuckDB: Ultra-fast queries and cross-topic joins (example after this list)
- No ETL Pipeline: Direct streaming to analytics-ready format
- DuckLake Integration: Automatic file registration in catalogs
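As an illustration of the cross-topic joins mentioned above, once topics are materialized as local DuckDB tables they can be joined with ordinary SQL. The clicks and orders tables and their columns here are hypothetical; how topics map to tables depends on your configuration:

```sql
-- Join two ingested streams locally (table and column names are hypothetical).
SELECT o.order_id, o.amount, c.session_id
FROM orders AS o
JOIN clicks AS c USING (user_id)
WHERE o.created_at > now() - INTERVAL 1 HOUR;
```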
Production Scale
Built for enterprise workloads:
- 10,000+ concurrent sessions tested
- 3 GB/s sustained throughput
- Transactional writes that preserve Parquet metadata columns
- Prometheus metrics with Grafana dashboard support
Use Cases
BoilStream is perfect for:
Data Lake Scenarios
- Streaming Lakehouse: Unified platform for ingestion, compute, and catalog management
- Remote DuckDB Compute: Distributed analytics with centralized secrets and catalog
- Personal Data Lakes: Auto-provisioned DuckLake catalogs for teams and users
- Zero-Copy Analytics: Query S3/cloud data directly via DuckDB with no data movement
Real-Time Analytics
- Immediate Analysis: Stream data for instant queries with dual storage (cloud + local)
- Cross-Topic Analytics: Join and analyze data across multiple streams
- BI Tool Integration: Direct connectivity via PostgreSQL wire protocol and FlightSQL
- Time-Series Analysis: Window queries over historical data (on the roadmap)
Data Engineering
- Data Lake Ingestion: Populate data lakes with minimal latency via diskless pipeline
- Log Processing: Ingest and transform log files in real-time
- IoT Data: Handle high-volume sensor data streams
- Event Sourcing: Capture and store event streams with backup-free architecture
- ETL Replacement: DuckDB SQL transforms and direct Parquet output eliminate complex pipelines
Next Steps
Ready to get started? Check out our Quick Start Guide to see BoilStream in action.