Architecture

BoilStream is a high-performance stream processor with dual storage capabilities: a diskless cloud storage pipeline for immediate analytics and high-performance DuckDB persistence for local database features. Built with Rust and Apache Arrow, it delivers real-time data transformations with materialized views and supports ingestion rates exceeding 10 million rows per second into local DuckDB databases.

Core Architecture

BoilStream follows a dual-storage architecture that combines a diskless cloud storage pipeline with optional high-performance DuckDB persistence, eliminating traditional pipeline complexity while providing enterprise-grade performance and reliability.

Key Components

FlightRPC Interface

  • Protocol: Apache Arrow FlightRPC for high-throughput data streaming
  • Security: TLS encryption with JWT-based authentication
  • Performance: Zero-copy data transfers using Apache Arrow format
  • Compatibility: DuckDB Airport extension, PyArrow, and any FlightRPC client (see the sketch after this list)
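
A minimal ingestion sketch in Python with PyArrow, assuming the grpc+tls://localhost:50051 endpoint and JWT shown elsewhere on this page; the path-style FlightDescriptor for the topic name is an assumption about the descriptor convention BoilStream expects:

```python
import pyarrow as pa
import pyarrow.flight as flight

client = flight.FlightClient("grpc+tls://localhost:50051")
options = flight.FlightCallOptions(
    headers=[(b"authorization", b"Bearer eyJraWQiOiJ6YWZsU0RY...")]
)

# Arrow table matching the 'people' topic schema used later on this page
table = pa.table({
    "name": ["alice", "bob"],
    "age": pa.array([61, 34], type=pa.int32()),
    "tags": [["admin"], ["user", "beta"]],
})

descriptor = flight.FlightDescriptor.for_path("people")
writer, _ = client.do_put(descriptor, table.schema, options=options)
writer.write_table(table)  # Arrow batches stream over the wire without copies
writer.close()
```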

HTTP/2 Arrow Interface

  • Protocol: HTTP/2 with Arrow IPC format for browser-based ingestion
  • Security: TLS with BLAKE3 HMAC token authentication
  • Performance: 40,000+ concurrent TLS connections tested, 2+ GB/s throughput with Arrow payloads
  • Request Size: 128 KiB max (131,072 bytes)
  • Compatibility: JavaScript/TypeScript with the Flechette library, any HTTP/2 client (see the sketch after this list)
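
The same payloads can be sent from any HTTP/2 client. A sketch using Python's httpx with an Arrow IPC body; the URL path and authorization header format are assumptions, so check your deployment's API reference:

```python
import httpx
import pyarrow as pa

table = pa.table({
    "name": ["carol"],
    "age": pa.array([29], type=pa.int32()),
    "tags": [["beta"]],
})

# Serialize the batch as a single Arrow IPC stream for the request body
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
payload = sink.getvalue().to_pybytes()
assert len(payload) <= 131_072  # stay under the 128 KiB request limit

with httpx.Client(http2=True) as client:  # pip install httpx[http2]
    resp = client.post(
        "https://localhost:443/v1/topics/people",  # hypothetical path
        content=payload,
        headers={
            "content-type": "application/vnd.apache.arrow.stream",
            "authorization": "Bearer <blake3-hmac-token>",
        },
    )
    resp.raise_for_status()
```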

Kafka Protocol Interface

  • Protocol: Kafka wire protocol for drop-in compatibility
  • Format Support: Confluent Avro with Schema Registry integration
  • Performance: SIMD-accelerated Avro to Arrow conversion
  • Compatibility: Any Kafka producer library, existing Kafka applications (see the producer sketch after this list)
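
Because the wire protocol is Kafka-compatible, an off-the-shelf producer works unchanged. A sketch with confluent-kafka and Confluent Avro serialization; the broker and Schema Registry addresses are assumptions:

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

schema_str = """
{"type": "record", "name": "Person", "fields": [
  {"name": "name", "type": "string"},
  {"name": "age",  "type": "int"}
]}
"""
registry = SchemaRegistryClient({"url": "http://localhost:8081"})
producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "value.serializer": AvroSerializer(registry, schema_str),
})

# The Confluent wire format (magic byte + schema ID) is produced automatically;
# BoilStream converts the Avro records to Arrow with SIMD acceleration.
producer.produce(topic="people", value={"name": "dave", "age": 52})
producer.flush()
```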

Queue

  • Request Buffering: Queues incoming FlightRPC requests for processing
  • Load Balancing: Distributes work across multiple stream processors
  • Backpressure Management: Prevents system overload during traffic spikes
  • Session Management: Tracks concurrent client connections

Stream Processors

  • Dual Storage Pipeline: Simultaneous diskless cloud storage streaming and optional DuckDB persistence
  • Diskless Cloud Design: No local storage dependencies for cloud pipeline, eliminating failure points
  • High-Performance DuckDB: 10+ million rows/second ingestion into local databases
  • Rust Performance: Zero-copy processing with memory safety
  • Concurrent Sessions: Supports 10,000+ simultaneous connections
  • Backup-Free Architecture: Cloud storage eliminates backup needs and enables unlimited read replicas

SQL Engine

  • Materialized Views: Processes CREATE TABLE AS SELECT (CTAS) transformations for real-time streaming
  • Window Queries: Future roadmap feature using shared DuckDB databases for time-based analysis
  • DuckDB Integration: Full DuckDB SQL compatibility for familiar syntax
  • Real-time Transformations: Applies SQL logic to derive child topics
  • Optimized SQL Processing: Prepared statements with batched execution across all views

DuckDB Persistence Engine

  • Ultra-High Performance: 10+ million rows/second ingestion rate
  • Shared Database Files: Multiple topics stored in shared .duckdb database files
  • Cross-Topic Queries: Query and join data across multiple topics in local databases (see the example after this list)
  • Backup-Free Design: Cloud storage pipeline provides automatic replication, no backup infrastructure needed
  • FlightSQL Integration: BI tool integration with shared databases via FlightSQL protocol
  • Historical Queries: Live queries over past N hours of ingested data (roadmap)
  • Window Functions: Future roadmap feature for time-series window queries across topics
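
A sketch of a cross-topic join against a shared database file with the DuckDB Python client; the file name and join columns are assumptions based on the examples on this page:

```python
import duckdb

# Open the shared database read-only so the persistence engine keeps writing
con = duckdb.connect("topics.duckdb", read_only=True)
rows = con.execute("""
    SELECT p.name, p.age
    FROM people AS p
    JOIN filtered_adults AS f USING (name)
    LIMIT 10
""").fetchall()
print(rows)
```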

PGWire Server

  • PostgreSQL Protocol: Full PostgreSQL wire protocol compatibility
  • BI Tool Integration: Direct connection for DBeaver, Tableau, Power BI, psql
  • Cursor Support: Efficient large result set handling through the extended query protocol (see the client sketch after this list)
  • Prepared Statements: Full parameter binding and type inference
  • Query Cancellation: Standard PostgreSQL query cancellation support
  • TLS Encryption: Optional TLS encryption for secure connections
  • Real-time Analytics: Query streaming data through familiar PostgreSQL interface
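
Any PostgreSQL driver can talk to the PGWire server. A sketch with psycopg2 showing parameter binding and a server-side cursor (the extended query protocol path); host, port, and credentials are assumptions:

```python
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432,
                        dbname="boilstream", user="admin", password="...")

# A named cursor is server-side: large result sets stream in batches
with conn.cursor(name="big_scan") as cur:
    cur.itersize = 10_000
    cur.execute("SELECT name, age FROM people WHERE age > %s", (50,))
    for name, age in cur:
        print(name, age)

conn.close()
```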

Materialized Views Architecture

BoilStream's materialized views create parent-child topic relationships where each view becomes an independent output stream.

```sql
D INSTALL airport FROM community;
D LOAD airport;
D CREATE SECRET airport_boilstream_admin (
    type airport,
    auth_token 'eyJraWQiOiJ6YWZsU0RY...',
    scope 'grpc+tls://localhost:50051/'
  );
D ATTACH 'boilstream' (TYPE AIRPORT, location 'grpc+tls://localhost:50051/');
D SELECT table_name, comment FROM duckdb_tables();
┌────────────────────────┬─────────────────────────────────────────────────────────────────────────────┐
│       table_name       │                                   comment                                   │
│        varchar         │                                   varchar                                   │
├────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ filtered_adults        │ Materialized view: SELECT * FROM boilstream.s3.people WHERE age > 50;       │
│ filtered_b             │ Materialized view: SELECT * FROM boilstream.s3.people WHERE name LIKE 'b%'; │
│ filtered_a             │ Materialized view: SELECT * FROM boilstream.s3.people WHERE name LIKE 'a%'; │
│ people                 │ Topic created from DuckDB Airport CREATE TABLE request for table 'people'   │
└────────────────────────┴─────────────────────────────────────────────────────────────────────────────┘
```

Run the following through the Postgres interface, e.g. with DBeaver or psql. This creates one main topic and three derived topics (real-time materialized views):

```sql
CREATE TABLE boilstream.s3.people (name VARCHAR, age INT, tags VARCHAR[]);
CREATE TABLE boilstream.s3.filtered_a AS SELECT * FROM boilstream.s3.people WHERE name LIKE 'a%';
CREATE TABLE boilstream.s3.filtered_b AS SELECT * FROM boilstream.s3.people WHERE name LIKE 'b%';
CREATE TABLE boilstream.s3.filtered_adults AS SELECT * FROM boilstream.s3.people WHERE age > 50;
```

Topic Naming Convention

  • Base Topics: Standard table names (e.g., events, users, orders)
  • Derived Topics: Custom names for materialized views (e.g., login_events, filtered_adults)
  • Schema Inheritance: Derived topics inherit parent schema with transformations
  • Independent Storage: Each topic writes to separate cloud storage paths

Data Flow Architecture

BoilStream processes data through a streamlined pipeline that eliminates traditional ETL complexity.

Storage Architecture

BoilStream writes analytics-ready Parquet files directly to cloud storage (S3, Azure Blob, GCS, MinIO, or filesystem) with hive partitioning.

Storage Path Structure

```shell
# Base topics
topic=people/
topic=events/
topic=orders/

# Derived topics (materialized views)
topic=filtered_adults/
topic=login_events/
topic=high_value_orders/
```

Full Storage Path Examples

Both topic name and topic ID are included in the path for efficient searching, and the schema version enables evolution tracking. A read-back example follows the path examples below.

```shell
# Base topic
topic=people/id=13542429894004395827/schema=1/year=2025/month=06/day=16/hour=18/13542429894004395827_6c60e772-2882-4c84-b4e1-0c92dda861aa.parquet

# Derived topics
topic=filtered_adults/id=7788867887317207037/schema=1/year=2025/month=06/day=16/hour=18/7788867887317207037_35a53f2a-e469-42fa-9587-dfb70d7331df.parquet
topic=login_events/id=1897081705280914401/schema=1/year=2025/month=06/day=16/hour=18/1897081705280914401_5bc0cb41-03eb-4496-b820-429d6e3d4bf0.parquet
```
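
Reading this layout back is straightforward with any engine that understands hive partitioning. A sketch with the DuckDB Python client; the bucket name is a placeholder and S3 credentials are assumed to be configured:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")  # S3 support
con.execute("LOAD httpfs;")

# Partition columns (topic, year, month, ...) are derived from the path
rows = con.execute("""
    SELECT *
    FROM read_parquet('s3://my-bucket/topic=people/**/*.parquet',
                      hive_partitioning = true)
    WHERE year = '2025' AND month = '06' AND day = '16'
""").fetchall()
```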

Storage Features

  • Schema Evolution: Automatic versioning with backward compatibility
  • Atomic Writes: Cloud storage multipart uploads ensure data consistency
  • Optimized Files: Row group streaming creates large, analytics-ready files
  • No Small Files: Concurrent Parquet row group writing through cloud storage multipart uploads yields high throughput and large, analytics-ready files. File size depends on the flush interval, the actual ingestion throughput, and the Parquet compression ratio relative to the minimum multipart part size: the more data you send and the longer the flush interval, the more likely each flush exceeds the multipart thresholds and produces bigger Parquet files. So, stream it big! (See the sizing sketch below.)
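
A back-of-envelope sizing sketch with hypothetical numbers, comparing one flush against S3's 5 MiB minimum multipart part size:

```python
ingest_mib_per_s = 100   # sustained raw Arrow ingest rate (hypothetical)
flush_interval_s = 60    # configured flush interval (hypothetical)
compression_ratio = 5    # raw bytes per Parquet byte (hypothetical)

parquet_mib = ingest_mib_per_s * flush_interval_s / compression_ratio
print(f"~{parquet_mib:.0f} MiB Parquet per flush")           # ~1200 MiB
print("clears 5 MiB multipart minimum:", parquet_mib >= 5)   # True
```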

Deployment Architecture

BoilStream runs as a single standalone server without external dependencies, making deployment straightforward.

Deployment Options

  • Development: Single binary with filesystem storage
  • Production: Single server with cloud storage backend
  • High Availability: Deploy multiple independent BoilStream instances writing to the same cloud storage

Performance Characteristics

Throughput

  • 40,000+ concurrent HTTP/2 connections tested with TLS encryption
  • 10,000+ concurrent FlightRPC sessions in production
  • 2+ GB/s sustained throughput via HTTP/2 Arrow interface
  • 2.5 GB/s sustained throughput via FlightRPC (16-core instance)
  • Zero-copy pipeline processing minimizes memory overhead
  • Zero-copy SQL processing avoids intermediate data copies during view evaluation

Latency

  • Sub-second materialized view updates for streaming transformations
  • Immediate cloud storage writes with atomic commits
  • No batch delays - data available as soon as written
  • Real-time query responses through FlightRPC

Reliability

  • Diskless architecture eliminates local storage failures
  • Automatic retries for cloud storage failures, repeated indefinitely until an administrator resolves the underlying issue
  • Schema validation prevents invalid data ingestion (both topic and data schema validation)
  • Graceful degradation under high load with backpressure
  • Graceful three-level shutdown for rolling updates and similar operations

Integration Points

DuckDB Integration

  • Native FlightRPC support through Airport extension
  • Standard SQL syntax - no learning curve
  • WASM compatibility for browser-based applications
  • Extension ecosystem leverages existing DuckDB extensions
  • PostgreSQL Protocol: Query DuckDB persistence through standard PostgreSQL clients

BI Tool Integration

  • PostgreSQL Compatibility: Connect any PostgreSQL-compatible BI tool
  • DBeaver Support: Tested with full schema browsing and query execution
  • Standard Drivers: Use existing PostgreSQL JDBC/ODBC drivers
  • Real-time Analytics: Query live streaming data through familiar interfaces

Cloud Integration

  • Multi-cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage, MinIO, filesystem)
  • Authentication providers (AWS Cognito, Azure AD, Auth0, Okta)
  • Monitoring integration (Prometheus metrics, Grafana dashboards)

Security Architecture

Security Features

  • End-to-end TLS encryption for data in transit
  • JWT-based authentication with configurable providers
  • RBAC/ABAC authorization for fine-grained access control to topics and operations
  • Data schema validation prevents malformed data injection and ensures data quality
  • IAM integration for fine-grained cloud storage access control
  • Audit logging for compliance and monitoring

This architecture enables BoilStream to provide enterprise-grade streaming data processing with the simplicity of SQL, eliminating the complexity of traditional streaming pipelines while maintaining high performance and reliability.