DuckDB Persistence

BoilStream provides optional high-performance local DuckDB persistence alongside its core diskless cloud storage pipeline. This dual-storage architecture enables immediate local querying while maintaining cloud-scale analytics capabilities.

Architecture Overview

Dual Storage Design

text
Client Data → BoilStream → Cloud Storage Parquet Files (diskless, primary)
                         → Shared DuckDB Files (local, optional)

Core Cloud Storage Pipeline (Always Active):

  • Diskless streaming to cloud storage (S3, Azure Blob, GCS, MinIO, filesystem)
  • No local storage dependencies
  • Immediate data availability for analytics
  • Eliminates local storage failure points

Optional DuckDB Layer:

  • 10+ million rows/second ingestion performance
  • Shared database files across multiple topics
  • Cross-topic queries and joins
  • No backup infrastructure needed (cloud storage provides replication)

When to Enable DuckDB Persistence

High-Performance Local Queries

  • Ultra-fast ingestion: 10+ million rows/second into local databases
  • Cross-topic analytics: Join data across multiple topics in shared databases
  • Complex queries: Advanced SQL operations without cloud storage latency
  • Development workflows: Local data exploration and testing

Backup-Free Architecture

  • Cloud Replication: Cloud storage eliminates backup infrastructure needs
  • Disaster Recovery: Restore local databases from cloud storage Parquet files
  • Operational Simplicity: No backup schedules or storage management

BI Tool Integration

  • PostgreSQL Interface: Full BI tool connectivity over the PostgreSQL wire protocol
  • FlightSQL: Direct connectivity for FlightSQL-compatible tools
  • Power BI & Tableau: Supported through the PostgreSQL interface, including catalog browsing
  • DBeaver: Recommended management tool for SQL-based administration

Future Roadmap Features

  • Window Queries: Time-series analysis using local database as source
  • Historical Analysis: Live queries over past N hours of ingested data
  • Real-time Dashboards: Low-latency queries for operational monitoring
  • Read Replicas: Create unlimited copies from cloud storage data

Configuration

Basic Configuration

yaml
# Enable high-performance DuckDB persistence
duckdb_persistence:
  enabled: true # Enable DuckDB persistence
  storage_path: "/data/duckdb" # Directory for database files
  max_writers: 10 # Concurrent database writers

# Core cloud storage pipeline continues unchanged
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      # ... cloud storage configuration

Configuration Options

Field                               Type      Default                 Description
duckdb_persistence.enabled          boolean   false                   Enable DuckDB persistence
duckdb_persistence.storage_path     string    "/tmp/duckdb/topics"    Directory for shared database files
duckdb_persistence.max_writers      number    10                      Number of concurrent database writers
duckdb_persistence.dry_run          boolean   false                   Process Arrow data but skip writes
duckdb_persistence.super_dry_run    boolean   false                   Completely skip DuckDB processing

Starting with DuckDB Persistence

bash
# Run BoilStream - it will generate config.yaml on first run
./boilstream

# Edit config.yaml to enable DuckDB persistence
# Then restart BoilStream
./boilstream

Performance Characteristics

Ingestion Performance

Ultra-High Throughput:

  • 10+ million rows/second into local DuckDB databases
  • Concurrent writers scale with CPU cores
  • Memory-efficient Arrow-to-DuckDB conversion
  • Zero-copy data processing where possible

Comparison with Cloud Storage Pipeline:

  • Cloud Storage Pipeline: 3+ GB/s diskless streaming
  • DuckDB Layer: 10M+ rows/s with ~5-10% overhead
  • Combined: Dual persistence with minimal performance impact

Query Performance

Local Database Benefits:

  • Sub-millisecond query response times
  • Cross-topic joins without cloud storage roundtrips
  • Complex aggregations with DuckDB's optimized engine
  • Window functions for time-series analysis

Storage Efficiency:

  • Shared database files eliminate duplication
  • Columnar storage with compression
  • Automatic schema evolution and indexing

Database File Structure

Shared Database Architecture Example

text
/data/duckdb/
├── shared_db_0.duckdb    # Database writer 0
├── shared_db_1.duckdb    # Database writer 1
├── shared_db_2.duckdb    # Database writer 2
└── ...

Key Characteristics:

  • Multiple topics stored in each database file
  • Priority-based load balancing across database writers
  • Cross-topic queries within and across databases
  • Table naming: topic_{id} for each topic
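
This layout can be inspected directly from the DuckDB CLI. A minimal sketch, assuming the file paths and topic table names shown above (they are illustrative):

sql
-- Attach the shared database files produced by the writers
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0 (READ_ONLY);
ATTACH '/data/duckdb/shared_db_1.duckdb' AS db1 (READ_ONLY);

-- Discover which topic tables ended up in which database file
SELECT database_name, table_name, estimated_size AS estimated_rows
FROM duckdb_tables()
WHERE table_name LIKE 'topic_%'
ORDER BY database_name, table_name;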

Query Examples

sql
-- Connect to local DuckDB databases
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0;
ATTACH '/data/duckdb/shared_db_1.duckdb' AS db1;

-- Query single topic
SELECT * FROM db0.topic_12345 WHERE timestamp > NOW() - INTERVAL 1 HOUR;

-- Cross-topic join
SELECT u.user_id, e.event_type, e.timestamp
FROM db0.topic_users u
JOIN db1.topic_events e ON u.user_id = e.user_id
WHERE e.timestamp > NOW() - INTERVAL 1 DAY;

-- Time-series aggregation
SELECT
  DATE_TRUNC('hour', timestamp) as hour,
  COUNT(*) as event_count
FROM db0.topic_events
WHERE timestamp > NOW() - INTERVAL 24 HOURS
GROUP BY hour
ORDER BY hour;

Development Workflows

Local Development Setup

yaml
# Development with DuckDB persistence
duckdb_persistence:
  enabled: true
  storage_path: "./dev-duckdb"
  max_writers: 4

storage:
  backends:
    - name: "local-minio"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "dev-bucket"
      access_key: "minioadmin"
      secret_key: "minioadmin"
      use_path_style: true

Production Deployment

yaml
# Production DuckDB persistence
duckdb_persistence:
  enabled: true
  storage_path: "/data/duckdb"
  max_writers: 16 # Scale with CPU cores (2x cores recommended)

# Ensure cloud storage backend remains primary
storage:
  backends:
    - name: "production-s3"
      backend_type: "s3"
      enabled: true
      primary: true # Cloud storage failures fail the operation
      # ... cloud storage config

Performance Tuning

Storage Configuration:

yaml
# Use fast local storage for optimal performance
duckdb_persistence:
  storage_path: "/nvme/duckdb"  # NVMe SSD recommended

Memory Optimization:

yaml
# Tune for high-memory instances
processing:
  data_processing_threads: 16
  buffer_pool_max_size: 100

Monitoring and Operations

Key Metrics

Monitor DuckDB persistence performance:

  • Ingestion Rate: Rows per second into local databases
  • Write Latency: Time from ingestion to database commit
  • Database Size: Growth rate of database files
  • Query Performance: Local query response times
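
Database size and approximate row counts can be sampled directly with SQL; a minimal sketch, assuming the database path used in the examples above:

sql
-- Attach a shared database file (path from the examples above)
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0 (READ_ONLY);
USE db0;

-- On-disk size and block usage of each attached database
PRAGMA database_size;

-- Estimated row counts per topic table (cheaper than COUNT(*))
SELECT table_name, estimated_size AS estimated_rows
FROM duckdb_tables()
WHERE database_name = 'db0'
ORDER BY estimated_rows DESC;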

Health Checks

bash
# Check database file sizes
ls -lh /data/duckdb/

# Verify database accessibility
duckdb /data/duckdb/shared_db_0.duckdb "SELECT COUNT(*) FROM topic_12345;"

Disaster Recovery

Restore from Cloud Storage:

  1. DuckDB databases are expendable (cloud storage is primary)
  2. Recreate local databases by replaying cloud storage Parquet data
  3. Use cloud storage data for unlimited read replicas
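
A recovery sketch in DuckDB SQL; the bucket path and table name below are illustrative, and BoilStream's actual Parquet prefix layout may differ:

sql
-- httpfs is needed to read Parquet directly from S3-compatible storage
INSTALL httpfs;
LOAD httpfs;
-- If the bucket is private, configure credentials first (e.g. CREATE SECRET)

-- Recreate a local topic table from the cloud storage Parquet files
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0;
CREATE OR REPLACE TABLE db0.topic_12345 AS
SELECT *
FROM read_parquet('s3://your-bucket/topics/topic_12345/*.parquet');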

Scaling and Migration:

  1. Scale up: Add more database writers
  2. Scale out: Add more BoilStream nodes
  3. Storage migration: Move database files to faster storage

Roadmap Features

Window Queries (Planned)

sql
-- Future: Window functions using DuckDB as source
SELECT
  user_id,
  event_type,
  COUNT(*) OVER (
    PARTITION BY user_id
    ORDER BY timestamp
    RANGE BETWEEN INTERVAL 1 HOUR PRECEDING AND CURRENT ROW
  ) as events_last_hour
FROM local_events
WHERE timestamp > NOW() - INTERVAL 24 HOURS;

BI Tool Integration

text
BI Tools → PostgreSQL Interface → Local DuckDB Databases
  ↳ Power BI, Tableau, DBeaver, psql direct connectivity

BI Tools → FlightSQL API → Local DuckDB Databases  
  ↳ FlightSQL-compatible tools direct connectivity
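
Once connected over the PostgreSQL interface, BI tools typically discover tables through standard catalog views and then issue ordinary SQL. A minimal sketch, assuming the interface exposes the usual information_schema views and reusing the illustrative topic_events table from earlier examples:

sql
-- Catalog discovery as a BI tool might perform it on connect
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_name LIKE 'topic_%';

-- Followed by ordinary analytical SQL against the discovered tables
SELECT event_type, COUNT(*) AS events
FROM topic_events
GROUP BY event_type
ORDER BY events DESC;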

Historical Analysis (Planned)

sql
-- Future: Live queries over rolling time windows
SELECT * FROM past_24_hours('topic_events')
WHERE user_type = 'premium';

Best Practices

When to Enable

Enable DuckDB Persistence When:

  • You need cross-topic analytical queries
  • Local query performance is critical
  • Complex time-series analysis is required
  • BI tools need direct database connectivity

Cloud Storage Only May Be Sufficient When:

  • Simple append-only workloads
  • Analytics tools can query cloud storage directly
  • Minimal local storage is preferred
  • Cloud storage analytics performance is enough

Performance Optimization

  1. Scale writers (recommended: 2x the number of CPU cores on the instance)
  2. Use fast local storage (NVMe SSD preferred)
  3. Monitor database file sizes and plan storage capacity
  4. Leverage cloud storage for backup-free architecture

Security Considerations

  • DuckDB files contain same data as cloud storage Parquet files
  • Pro version: FlightSQL with access control for querying DuckDB files

This dual-storage architecture provides the best of both worlds: cloud-scale analytics with cloud storage (S3, Azure Blob, GCS, MinIO, filesystem) and high-performance real-time local queries with DuckDB, all while eliminating backup infrastructure complexity.