Architecture
BoilStream is a high-performance stream processor with dual storage capabilities: a diskless S3 pipeline for immediate analytics and high-performance DuckDB persistence for local database features. Built with Rust and Apache Arrow, it delivers real-time data transformations with materialized views and supports ingestion rates exceeding 10 million rows per second into local DuckDB databases.
Core Architecture
BoilStream follows a dual-storage architecture that combines a diskless S3 pipeline with optional high-performance DuckDB persistence, eliminating traditional pipeline complexity while providing enterprise-grade performance and reliability.
Key Components
FlightRPC Interface
- Protocol: Apache Arrow FlightRPC for high-throughput data streaming
- Security: TLS encryption with JWT-based authentication
- Performance: Zero-copy data transfers using Apache Arrow format
- Compatibility: Works with DuckDB, browser WASM, and any FlightRPC client
Queue
- Request Buffering: Queues incoming FlightRPC requests for processing
- Load Balancing: Distributes work across multiple stream processors
- Backpressure Management: Prevents system overload during traffic spikes
- Session Management: Tracks concurrent client connections
Stream Processors
- Dual Storage Pipeline: Simultaneous diskless S3 streaming and optional DuckDB persistence
- Diskless S3 Design: No local storage dependencies for S3 pipeline, eliminating failure points
- High-Performance DuckDB: 10+ million rows/second ingestion into local databases
- Rust Performance: Zero-copy processing with memory safety
- Concurrent Sessions: Supports 10,000+ simultaneous connections
- Backup-Free Architecture: S3 storage eliminates backup needs and enables unlimited read replicas
SQL Engine
- Materialized Views: Processes CREATE VIEW transformations for real-time streaming
- Window Queries: Future roadmap feature using shared DuckDB databases for time-based analysis
- DuckDB Integration: Full DuckDB SQL compatibility for familiar syntax
- Real-time Transformations: Applies SQL logic to derive child topics
- Optimized SQL Processing: Prepared statements with batched execution across all views
DuckDB Persistence Engine
- Ultra-High Performance: 10+ million rows/second ingestion rate
- Shared Database Files: Multiple topics stored in shared .duckdb database files
- Cross-Topic Queries: Query and join data across multiple topics in local databases (see the sketch after this list)
- Backup-Free Design: S3 pipeline provides automatic replication, no backup infrastructure needed
- FlightSQL Integration: Future roadmap for BI tool integration with shared databases
- Historical Queries: Live queries over past N hours of ingested data (roadmap)
- Window Functions: Planned source for time-series window queries across topics
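As a sketch of a cross-topic query against a shared database file, assuming a hypothetical topics.duckdb that holds people and orders topics (both names are illustrative):
D ATTACH 'topics.duckdb' AS topics (READ_ONLY);
D SELECT p.name, count(*) AS order_count FROM topics.people p JOIN topics.orders o ON o.customer_name = p.name GROUP BY p.name;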
PGWire Server
- PostgreSQL Protocol: Full PostgreSQL wire protocol compatibility
- BI Tool Integration: Direct connection for DBeaver, Tableau, Power BI, psql (see the sketch after this list)
- Cursor Support: Efficient large result set handling through extended query protocol
- Prepared Statements: Full parameter binding and type inference
- Query Cancellation: Standard PostgreSQL query cancellation support
- TLS Encryption: Optional TLS encryption for secure connections
- Real-time Analytics: Query streaming data through familiar PostgreSQL interface
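A minimal connection sketch with psql, assuming the server listens on localhost:5432 (the port, database name, and credentials are deployment-specific):
$ psql "host=localhost port=5432 dbname=boilstream sslmode=prefer"
boilstream=> SELECT name, age FROM people WHERE age > 50 LIMIT 10;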
Materialized Views Architecture
BoilStream's materialized views create parent-child topic relationships where each view becomes an independent output stream.
D LOAD airport;
D CREATE SECRET airport_boilstream_admin (
type airport,
auth_token 'eyJraWQiOiJ6YWZsU0RY...',
scope 'grpc+tls://localhost:50051/'
);
D ATTACH 'boilstream' (TYPE AIRPORT, location 'grpc+tls://localhost:50051/');
D CREATE TABLE boilstream.s3.people (name VARCHAR, age INT, tags VARCHAR[]);
D CREATE VIEW boilstream.s3.filtered_a AS SELECT * FROM boilstream.s3.people WHERE name LIKE 'a%';
D CREATE VIEW boilstream.s3.filtered_b AS SELECT * FROM boilstream.s3.people WHERE name LIKE 'b%';
D CREATE VIEW boilstream.s3.filtered_adults AS SELECT * FROM boilstream.s3.people WHERE age > 50;
D SELECT table_name, comment FROM duckdb_tables();
┌────────────────────────┬─────────────────────────────────────────────────────────────────────────────┐
│       table_name       │                                   comment                                   │
│        varchar         │                                   varchar                                   │
├────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ people→filtered_adults │ Materialized view: SELECT * FROM boilstream.s3.people WHERE age > 50;       │
│ people→filtered_b │ Materialized view: SELECT * FROM boilstream.s3.people WHERE name LIKE 'b%'; │
│ people→filtered_a │ Materialized view: SELECT * FROM boilstream.s3.people WHERE name LIKE 'a%'; │
│ people                 │ Topic created from DuckDB Airport CREATE TABLE request for table 'people'   │
└────────────────────────┴─────────────────────────────────────────────────────────────────────────────┘
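Data written to the parent topic then flows through every derived view automatically. As a sketch, rows can be ingested from the same session (the Airport extension accepts standard INSERT statements against attached tables; the values are illustrative):
D INSERT INTO boilstream.s3.people VALUES ('alice', 61, ['friend']), ('bob', 33, ['colleague']);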
Topic Naming Convention
- Base Topics: Standard table names (e.g., events, users, orders)
- Derived Topics: Parent→child format (e.g., events→login_events)
- Schema Inheritance: Child topics inherit parent schema with transformations
- Independent Storage: Each topic writes to separate S3 paths
Data Flow Architecture
BoilStream processes data through a streamlined pipeline that eliminates traditional ETL complexity.
Storage Architecture
BoilStream writes analytics-ready Parquet files directly to S3-compatible storage with hive partitioning.
Main topic and derived topic prefixes.
topic=people/
topic=people→filtered_a/
topic=people→filtered_b/
topic=people→filtered_adults/
Full S3 object key examples. Both the topic name and its topic ID are part of the path, so objects can be searched by either. The schema version is also part of the path.
topic=people/id=13542429894004395827/schema=1/year=2025/month=06/day=16/hour=18/13542429894004395827_6c60e772-2882-4c84-b4e1-0c92dda861aa.parquet
topic=people→filtered_a/id=1897081705280914401/schema=1/year=2025/month=06/day=16/hour=18/1897081705280914401_5bc0cb41-03eb-4496-b820-429d6e3d4bf0.parquet
topic=people→filtered_b/id=1534318511235352566/schema=1/year=2025/month=06/day=16/hour=18/1534318511235352566_9d023431-610d-4956-ae95-cc61986aa9e4.parquet
topic=people→filtered_adults/id=7788867887317207037/schema=1/year=2025/month=06/day=16/hour=18/7788867887317207037_35a53f2a-e469-42fa-9587-dfb70d7331df.parquet
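As a sketch, these files can also be queried in place with plain DuckDB, assuming a hypothetical bucket named my-bucket and S3 credentials already configured:
D SELECT name, age FROM read_parquet('s3://my-bucket/topic=people/**/*.parquet', hive_partitioning = true) WHERE year = '2025' AND month = '06' AND day = '16';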
Storage Features
- Schema Evolution: Automatic versioning with backward compatibility
- Atomic Writes: S3 multipart uploads ensure data consistency
- Optimized Files: Row group streaming creates large, analytics-ready files
- No Small Files: S3 multipart uploads with concurrent Parquet row group writing ensure high throughput and large Parquet files. File size is determined by the S3 Flush Interval setting, the actual ingestion throughput, and the Parquet compression ratio relative to the 5 MB minimum S3 multipart part size. The more data you send and the longer the flush interval, the more likely each part reaches the 5 MB minimum and the larger the resulting Parquet files; for example, a stream that compresses to 1 MB/s of Parquet output with a 60-second flush interval yields roughly 60 MB files. So, stream it big!
Deployment Architecture
BoilStream supports flexible deployment patterns from development to enterprise scale. No external load balancers are needed: BoilStream handles balancing internally by reserving an IP+port pair for each client and redirecting clients when necessary, for example during rolling updates or when balancing load between servers.
Each BoilStream node includes a Valkey node, and together these nodes form a cluster.
Performance Characteristics
Throughput
- 10,000+ concurrent sessions tested in production
- 2.5 GB/s sustained throughput (16-core instance)
- A single FlightRPC ingestion core can handle all traffic (multiple ingestion cores are supported for very high-end instances)
- Linear scaling with additional CPU cores (mostly for backend processing)
- Zero-copy pipeline processing minimizes memory overhead
- Zero-copy SQL processing minimizes memory copying
Latency
- Sub-second materialized view updates for streaming transformations
- Immediate S3 writes with atomic commits
- No batch delays - data available as soon as written
- Real-time query responses through FlightRPC
Reliability
- Diskless architecture eliminates local storage failures
- Automatic retries for S3 failures continue indefinitely (such outages are an administrator-level situation)
- Schema validation prevents invalid data ingestion (both topic and data schema validation)
- Graceful degradation under high load with backpressure
- Graceful three-level shutdown sequence, e.g. for rolling updates
Integration Points
DuckDB Integration
- Native FlightRPC support through Airport extension
- Standard SQL syntax - no learning curve
- WASM compatibility for browser-based applications
- Extension ecosystem leverages existing DuckDB extensions
- PostgreSQL Protocol: Query DuckDB persistence through standard PostgreSQL clients
BI Tool Integration
- PostgreSQL Compatibility: Connect any PostgreSQL-compatible BI tool
- DBeaver Support: Tested with full schema browsing and query execution
- Standard Drivers: Use existing PostgreSQL JDBC/ODBC drivers
- Real-time Analytics: Query live streaming data through familiar interfaces
Cloud Integration
- S3-compatible storage (AWS S3, MinIO, Azure Blob, GCS)
- Authentication providers (AWS Cognito, Azure AD, Auth0, Okta)
- Monitoring integration (Prometheus metrics, Grafana dashboards)
Security Architecture
Security Features
- End-to-end TLS encryption for data in transit
- JWT-based authentication with configurable providers
- RBAC/ABAC authorization for fine-grained access control to topics and operations
- Data schema validation prevents malformed data injection and ensures data quality
- IAM integration for fine-grained S3 access control
- Audit logging for compliance and monitoring
This architecture enables BoilStream to provide enterprise-grade streaming data processing with the simplicity of SQL, eliminating the complexity of traditional streaming pipelines while maintaining high performance and reliability.