Introduction
BoilStream is a high-performance data ingestion system with dual storage capabilities: diskless streaming to your data lake and fast local DuckDB persistence. Built with Rust and Apache Arrow and driven by familiar SQL, it eliminates complex ETL pipelines: data lands as analytics-ready Parquet files immediately, while local ingestion sustains 10+ million rows/second.
BoilStream uses concurrent Parquet row group streaming integrated with S3 Multipart Uploads: row groups are written concurrently into a single large Parquet file, and the file and its multipart upload are finalized once the accumulated size crosses a threshold. This avoids fragmentation into many small S3 objects and makes the initial files on S3 immediately suitable for analytics, for example with DuckLake.
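To make the idea concrete, here is a minimal single-threaded sketch in Python using pyarrow. It is not BoilStream's implementation (which is concurrent and written in Rust); the in-memory buffer and the 64 MiB threshold are stand-ins for an S3 multipart upload and BoilStream's actual finalization policy.

```python
# Illustrative sketch only: append row groups to one Parquet file and
# finalize once the accumulated size crosses a threshold. BoilStream does
# this concurrently in Rust against an S3 multipart upload; the in-memory
# buffer and 64 MiB cutoff here are stand-in assumptions.
import io
import pyarrow as pa
import pyarrow.parquet as pq

FINALIZE_THRESHOLD = 64 * 1024 * 1024  # hypothetical size cutoff

schema = pa.schema([("id", pa.int64()), ("payload", pa.string())])
sink = io.BytesIO()                    # stands in for the multipart upload
writer = pq.ParquetWriter(sink, schema)

def ingest(batches):
    """Write each batch as its own row group until the file is big enough."""
    for batch in batches:
        writer.write_table(pa.Table.from_batches([batch]))  # one row group
        if sink.tell() >= FINALIZE_THRESHOLD:
            break                      # large enough: stop appending
    writer.close()                     # writes the footer -> valid Parquet file
    return sink.getvalue()             # analytics-ready bytes, no rewrite needed

batches = [
    pa.record_batch([pa.array([1, 2]), pa.array(["a", "b"])], schema=schema),
    pa.record_batch([pa.array([3]), pa.array(["c"])], schema=schema),
]
data = ingest(batches)
print(pq.ParquetFile(io.BytesIO(data)).metadata.num_row_groups)  # -> 2
```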
What is BoilStream?
BoilStream bridges the gap between data sources and data lakes by providing:
- Dual Storage Architecture: Diskless S3 streaming + optional high-performance DuckDB persistence
- SQL-First Interface: Use DuckDB's powerful SQL to transform and stream data
- Ultra-High Performance: 10+ million rows/second ingestion into local DuckDB databases
- Direct Parquet Output: Skip intermediate processing steps for immediate analytics
- Cross-Topic Queries: Join and analyze data across multiple topics in shared DuckDB files
- Backup-Free Design: S3 storage eliminates backup infrastructure and lets you create unlimited read replicas
- Massive Concurrency: Handle thousands of concurrent writers into the same topic (table)
- Massive Throughput: FlightRPC streaming sustains several GB/s per node, per port (see the client sketch after this list)
- Future BI Integration: Planned FlightSQL support for direct BI tool connectivity
- Cloud Agnostic: Deploy anywhere with S3-compatible storage; AWS S3 and MinIO are tested
- Horizontal Scalability: Scale out by simply adding nodes, with support for graceful shutdowns and rolling updates
- Secure by Design: Run FlightRPC with TLS for encryption in transit; rely on S3 encryption settings, or optionally encrypt the generated Parquet files for client-controlled encryption at rest. Clients authenticate through a separate Admin FlightRPC API so authentication never slows down the data path
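As a taste of what a FlightRPC ingestion client looks like, here is a minimal sketch using pyarrow.flight. The endpoint URL, topic name, and path-based descriptor are illustrative assumptions, not BoilStream's documented conventions; see the Quick Start Guide for the real client setup.

```python
# Minimal FlightRPC ingestion sketch with pyarrow.flight. The endpoint,
# topic name ("events"), and path-based descriptor are assumptions for
# illustration, not BoilStream's documented API.
import pyarrow as pa
import pyarrow.flight as flight

client = flight.connect("grpc+tls://boilstream.example.com:50051")  # assumed endpoint

schema = pa.schema([("ts", pa.timestamp("us")), ("value", pa.float64())])
descriptor = flight.FlightDescriptor.for_path("events")  # assumed topic naming

writer, _ = client.do_put(descriptor, schema)
batch = pa.record_batch(
    [pa.array([0], type=pa.timestamp("us")), pa.array([42.0])],
    schema=schema,
)
writer.write_batch(batch)  # stream as many batches as needed
writer.close()             # completes the upload
```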
Key Benefits
Simplified Architecture
Traditional streaming pipelines require multiple components:
- Stream ingestion server clusters
- Processing worker clusters that read from the streaming servers and write to the data lake
- Storage transformers that optimize the data already on the data lake
BoilStream replaces this complex pipeline with a single component that handles everything from ingestion to storage.
Immediate Analytics
Data lands in dual storage for immediate access:
S3 Pipeline (Diskless):
- Optimized Parquet files immediately queryable
- No waiting for batch processing
- Direct integration with analytics tools (see the query sketch below)
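For example, the landed Parquet files can be queried in place with DuckDB. The bucket and prefix below are placeholders, and S3 credentials are assumed to be configured separately (e.g. via a DuckDB secret or environment variables).

```python
# Query the Parquet files BoilStream lands on S3, directly with DuckDB.
# Bucket/prefix are placeholders; S3 credentials are configured separately.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")  # DuckDB's S3 support
con.execute("LOAD httpfs;")
rows = con.execute(
    "SELECT count(*) FROM read_parquet('s3://my-bucket/topics/events/*.parquet')"
).fetchall()
print(rows)
```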
DuckDB Persistence (Local):
- 10+ million rows/second ingestion performance
- Cross-topic joins and complex queries (see the join sketch after this list)
- Live queries over past N hours (future roadmap)
- Window functions for time-series analysis (future roadmap)
- Schema validation on ingestion
- Integrated Arrow schema registry backed by Valkey (Redis-compatible)
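A hypothetical cross-topic query against the shared DuckDB file might look like the following; the database path and table names are placeholders, not fixed by BoilStream.

```python
# Hypothetical cross-topic join on the shared local DuckDB database;
# the file path and table names are placeholders, not fixed by BoilStream.
import duckdb

con = duckdb.connect("topics.duckdb", read_only=True)  # e.g. a read replica
rows = con.execute("""
    SELECT o.order_id, o.amount, c.region
    FROM orders AS o                          -- one topic, persisted as a table
    JOIN customers AS c USING (customer_id)   -- another topic in the same file
    WHERE o.ts > now() - INTERVAL 1 HOUR
""").fetchall()
```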
Production Scale
Built for enterprise workloads:
- 10,000+ concurrent sessions tested
- 3 GB/s sustained throughput
- DuckLake integration
- Transactions tracked via Parquet metadata columns, enabling full transaction recovery (including partial transaction data)
- Built-in monitoring and metrics via a Prometheus metrics collection API; the provided Docker Compose file spins up Prometheus and Grafana containers for immediate dashboards (see the config sketch below)
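As a sketch, a Prometheus scrape config for a BoilStream node could look like this; the job name, host, and metrics port are assumptions, not BoilStream defaults.

```yaml
# Minimal Prometheus scrape config sketch; job name, host, and port are
# assumptions, not BoilStream defaults.
scrape_configs:
  - job_name: boilstream
    static_configs:
      - targets: ["boilstream:9090"]  # assumed metrics endpoint
```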
Use Cases
BoilStream is perfect for:
- Real-time Analytics: Stream data for immediate analysis with dual storage
- Data Lake Ingestion: Populate data lakes with minimal latency via diskless pipeline
- Local Database Warehousing: High-performance ingestion into shared DuckDB databases
- Cross-Topic Analytics: Join and analyze data across multiple streams locally
- Time-Series Analysis: Window queries over historical data (future roadmap)
- BI Tool Integration: Direct connectivity via FlightSQL (future roadmap)
- Log Processing: Ingest and transform log files in real time
- IoT Data: Handle high-volume sensor data streams
- Event Sourcing: Capture and store event streams with backup-free architecture
Next Steps
Ready to get started? Check out our Quick Start Guide to see BoilStream in action.