Architecture

BoilStream is a high-performance stream processor that provides dual storage capabilities: a diskless S3 pipeline for immediate analytics and high-performance DuckDB persistence for local database features. Built with Rust and Apache Arrow, it provides real-time data transformations with materialized views, supporting ingestion rates exceeding 10 million rows per second into local DuckDB databases.

Core Architecture

BoilStream follows a dual-storage architecture that combines a diskless S3 pipeline with optional high-performance DuckDB persistence, eliminating traditional pipeline complexity while providing enterprise-grade performance and reliability.

Key Components

FlightRPC Interface

Protocol: Apache Arrow FlightRPC for high-throughput data streaming
Security: TLS encryption with JWT-based authentication
Performance: Zero-copy data transfers using Apache Arrow format
Compatibility: Works with DuckDB, browser WASM, and any FlightRPC client

Queue

Request Buffering: Queues incoming FlightRPC requests for processing
Load Balancing: Distributes work across multiple stream processors
Backpressure Management: Prevents system overload during traffic spikes
Session Management: Tracks concurrent client connections

Stream Processors

Dual Storage Pipeline: Simultaneous diskless S3 streaming and optional DuckDB persistence
Diskless S3 Design: No local storage dependencies for S3 pipeline, eliminating failure points
High-Performance DuckDB: 10+ million rows/second ingestion into local databases
Rust Performance: Zero-copy processing with memory safety
Concurrent Sessions: Supports 10,000+ simultaneous connections
Backup-Free Architecture: S3 storage eliminates backup needs, create unlimited read replicas

SQL Engine

Materialized Views: Processes CREATE VIEW transformations for real-time streaming
Window Queries: Future roadmap feature using shared DuckDB databases for time-based analysis
DuckDB Integration: Full DuckDB SQL compatibility for familiar syntax
Real-time Transformations: Applies SQL logic to derive child topics
Optimised SQL Processing: Prepared statements, batch executing all views

DuckDB Persistence Engine

Ultra-High Performance: 10+ million rows/second ingestion rate
Shared Database Files: Multiple topics stored in shared .duckdb database files
Cross-Topic Queries: Query and join data across multiple topics in local databases
Backup-Free Design: S3 pipeline provides automatic replication, no backup infrastructure needed
FlightSQL Integration: Future roadmap for BI tool integration with shared databases
Historical Queries: Live queries over past N hours of ingested data (roadmap)
Window Functions: Planned source for time-series window queries across topics

PGWire Server

PostgreSQL Protocol: Full PostgreSQL wire protocol compatibility
BI Tool Integration: Direct connection for DBeaver, Tableau, Power BI, psql
Cursor Support: Efficient large result set handling through extended query protocol
Prepared Statements: Full parameter binding and type inference
Query Cancellation: Standard PostgreSQL query cancellation support
TLS Encryption: Optional TLS encryption for secure connections
Real-time Analytics: Query streaming data through familiar PostgreSQL interface

Materialized Views Architecture

BoilStream's materialized views create parent-child topic relationships where each view becomes an independent output stream.

sql

D LOAD airport;
D CREATE SECRET airport_boilstream_admin (
    type airport,
    auth_token 'eyJraWQiOiJ6YWZsU0RY...',
    scope 'grpc+tls://localhost:50051/'
  );
D ATTACH 'boilstream' (TYPE AIRPORT, location 'grpc+tls://localhost:50051/');
D CREATE TABLE boilstream.s3.people (name VARCHAR, age INT, tags VARCHAR[]);
D CREATE VIEW boilstream.s3.filtered_a AS SELECT * FROM boilstream.s3.people WHERE name LIKE 'a%';
D CREATE VIEW boilstream.s3.filtered_b AS SELECT * FROM boilstream.s3.people WHERE name LIKE 'b%';
D CREATE VIEW boilstream.s3.filtered_adults AS SELECT * FROM boilstream.s3.people WHERE age > 50;
D SELECT table_name, comment FROM duckdb_tables();
┌────────────────────────┬─────────────────────────────────────────────────────────────────────────────┐
│       table_name       │                                   comment                                   │
│        varchar         │                                   varchar                                   │
├────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ people→filtered_adults │ Materialized view: SELECT * FROM boilstream.s3.people WHERE age > 50;       │
│ people→filtered_b      │ Materialized view: SELECT * FROM boilstream.s3.people WHERE name LIKE 'b%'; │
│ people→filtered_a      │ Materialized view: SELECT * FROM boilstream.s3.people WHERE name LIKE 'a%'; │
│ people                 │ Topic created from DuckDB Airport CREATE TABLE request for table 'people'   │
└────────────────────────┴─────────────────────────────────────────────────────────────────────────────┘

Topic Naming Convention

Base Topics: Standard table names (e.g., events, users, orders)
Derived Topics: Parent→child format (e.g., events→login_events)
Schema Inheritance: Child topics inherit parent schema with transformations
Independent Storage: Each topic writes to separate S3 paths

Data Flow Architecture

BoilStream processes data through a streamlined pipeline that eliminates traditional ETL complexity.

Storage Architecture

BoilStream writes analytics-ready Parquet files directly to S3-compatible storage with hive partitioning.

Main topic and derived topic prefixes.

shell

topic=people/
topic=people→filtered_a/
topic=people→filtered_b/
topic=people→filtered_adults/

Full S3 Object Key examples. Both topic name and respective topic id are part of the path for allowing searching with both. Schema version is part of the path.

shell

topic=people/id=13542429894004395827/schema=1/year=2025/month=06/day=16/hour=18/13542429894004395827_6c60e772-2882-4c84-b4e1-0c92dda861aa.parquet
topic=people→filtered_a/id=1897081705280914401/schema=1/year=2025/month=06/day=16/hour=18/1897081705280914401_5bc0cb41-03eb-4496-b820-429d6e3d4bf0.parquet
topic=people→filtered_b/id=1534318511235352566/schema=1/year=2025/month=06/day=16/hour=18/1534318511235352566_9d023431-610d-4956-ae95-cc61986aa9e4.parquet
topic=people→filtered_adults/id=7788867887317207037/schema=1/year=2025/month=06/day=16/hour=18/7788867887317207037_35a53f2a-e469-42fa-9587-dfb70d7331df.parquet

Storage Features

Schema Evolution: Automatic versioning with backward compatibility
Atomic Writes: S3 multipart uploads ensure data consistency
Optimized Files: Row group streaming creates large, analytics-ready files
No Small Files: S3 Multipart uploads with concurrent Parquet row group writing ensures high throughput and large Parquet files. Determined by the S3 Flush Interval setting and actual data ingestion stream throughput and Parquet compression ratio versus minimal 5MB S3 Multipart upload part. The more data you send and the longer the S3 Flush interval the better chance of actually hitting the 5MB S3 multipart upload limit and producing bigger Parquet files. So, stream it big!

Deployment Architecture

BoilStream supports flexible deployment patterns from development to enterprise scale. There is no need for Load Balancers as BoilStream will handle it internally by reserving IP+Port for each client and redirecting clients when necessary like during rolling updates or load balancing between servers.

Each BoilStream node includes Valkey node and they form cluster.

Performance Characteristics

Throughput

10,000+ concurrent sessions tested in production
2.5 GB/s sustained throughput (16-core instance)
FlightRPC performance with single ingestion core able to handle all traffic (supports multiple ingestion cores for very high-end instances)
Linear scaling with additional CPU cores (mostly for backend processing)
Zero-copy pipeline processing minimizes memory overhead
Zero-copy SQL processing minimizes memory copying

Latency

Sub-second materialized view updates for streaming transformations
Immediate S3 writes with atomic commits
No batch delays - data available as soon as written
Real-time query responses through FlightRPC

Reliability

Diskless architecture eliminates local storage failures
Automatic retries for S3 failures forever (Admin situation)
Schema validation prevents invalid data ingestion (both topic and data schema validation)
Graceful degradation under high load with backpressure
Graceful shutdown 3-level shutdown for e.g. rolling updates

Integration Points

DuckDB Integration

Native FlightRPC support through Airport extension
Standard SQL syntax - no learning curve
WASM compatibility for browser-based applications
Extension ecosystem leverages existing DuckDB extensions
PostgreSQL Protocol: Query DuckDB persistence through standard PostgreSQL clients

BI Tool Integration

PostgreSQL Compatibility: Connect any PostgreSQL-compatible BI tool
DBeaver Support: Tested with full schema browsing and query execution
Standard Drivers: Use existing PostgreSQL JDBC/ODBC drivers
Real-time Analytics: Query live streaming data through familiar interfaces

Cloud Integration

S3-compatible storage (AWS S3, MinIO, Azure Blob, GCS)
Authentication providers (AWS Cognito, Azure AD, Auth0, Okta)
Monitoring integration (Prometheus metrics, Grafana dashboards)

Security Architecture

Security Features

End-to-end TLS encryption for data in transit
JWT-based authentication with configurable providers
RBAC/ABAC authorization for fine-grained access control to topics and operations
Data schema validation prevents malformed data injection and ensures data quality
IAM integration for fine-grained S3 access control
Audit logging for compliance and monitoring

This architecture enables BoilStream to provide enterprise-grade streaming data processing with the simplicity of SQL, eliminating the complexity of traditional streaming pipelines while maintaining high performance and reliability.

Architecture ​

Core Architecture ​

Key Components ​

FlightRPC Interface ​

Queue ​

Stream Processors ​

SQL Engine ​

DuckDB Persistence Engine ​

PGWire Server ​

Materialized Views Architecture ​

Topic Naming Convention ​

Data Flow Architecture ​

Storage Architecture ​

Storage Features ​

Deployment Architecture ​

Performance Characteristics ​

Throughput ​

Latency ​

Reliability ​

Integration Points ​

DuckDB Integration ​

BI Tool Integration ​

Cloud Integration ​

Security Architecture ​

Security Features ​