DuckDB Persistence

BoilStream provides optional high-performance local DuckDB persistence alongside its core diskless cloud storage pipeline. This dual-storage architecture enables immediate local querying while maintaining cloud-scale analytics capabilities.

Architecture Overview

Dual Storage Design

text
Client Data → BoilStream → Cloud Storage Parquet Files (diskless, primary)
                         → Shared DuckDB Files (local, optional)

Core Cloud Storage Pipeline (Always Active):

  • Diskless streaming to cloud storage (S3, Azure Blob, GCS, MinIO, filesystem)
  • No local storage dependencies
  • Immediate data availability for analytics
  • Eliminates local storage failure points

Optional DuckDB Layer:

  • 10+ million rows/second ingestion performance
  • Shared database files across multiple topics
  • Cross-topic queries and joins
  • No backup infrastructure needed (cloud storage provides replication)

When to Enable DuckDB Persistence

High-Performance Local Queries

  • Ultra-fast ingestion: 10+ million rows/second into local databases
  • Cross-topic analytics: Join data across multiple topics in shared databases
  • Complex queries: Advanced SQL operations without cloud storage latency
  • Development workflows: Local data exploration and testing

Backup-Free Architecture

  • Cloud Replication: Cloud storage eliminates backup infrastructure needs
  • Disaster Recovery: Restore local databases from cloud storage Parquet files
  • Operational Simplicity: No backup schedules or storage management

BI Tool Integration

  • PostgreSQL Interface: Full BI tool connectivity over the PostgreSQL wire protocol
  • FlightSQL: Direct connectivity for FlightSQL-compatible tools
  • Power BI & Tableau: Supported through the PostgreSQL interface, including catalog browsing
  • DBeaver: Recommended management tool for SQL-based administration

Future Roadmap Features

  • Window Queries: Time-series analysis using local database as source
  • Historical Analysis: Live queries over past N hours of ingested data
  • Real-time Dashboards: Low-latency queries for operational monitoring
  • Read Replicas: Create unlimited copies from cloud storage data

Configuration

Basic Configuration

yaml
# Enable high-performance DuckDB persistence
duckdb_persistence:
  enabled: true # Enable DuckDB persistence
  storage_path: "/data/duckdb" # Directory for database files
  max_writers: 10 # Concurrent database writers

# Core cloud storage pipeline continues unchanged
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      # ... cloud storage configuration

Configuration Options

Field                               Type      Default                 Description
duckdb_persistence.enabled          boolean   false                   Enable DuckDB persistence
duckdb_persistence.storage_path     string    "/tmp/duckdb/topics"    Directory for shared database files
duckdb_persistence.max_writers      number    10                      Number of concurrent database writers
duckdb_persistence.dry_run          boolean   false                   Process Arrow data but skip writes
duckdb_persistence.super_dry_run    boolean   false                   Completely skip DuckDB processing

Starting with DuckDB Persistence

bash
# Run BoilStream - it will generate config.yaml on first run
./boilstream

# Edit config.yaml to enable DuckDB persistence
# Then restart BoilStream
./boilstream

Performance Characteristics

Ingestion Performance

Ultra-High Throughput:

  • 10+ million rows/second into local DuckDB databases
  • Concurrent writers scale with CPU cores
  • Memory-efficient Arrow-to-DuckDB conversion
  • Zero-copy data processing where possible

Comparison with Cloud Storage Pipeline:

  • Cloud Storage Pipeline: 3+ GB/s diskless streaming
  • DuckDB Layer: 10M+ rows/s with ~5-10% overhead
  • Combined: Dual persistence with minimal performance impact

Query Performance

Local Database Benefits:

  • Sub-millisecond query response times
  • Cross-topic joins without cloud storage roundtrips
  • Complex aggregations with DuckDB's optimized engine
  • Window functions for time-series analysis

Storage Efficiency:

  • Shared database files eliminate duplication
  • Columnar storage with compression
  • Automatic schema evolution and indexing

Database File Structure

Shared Database Architecture Example

text
/data/duckdb/
├── shared_db_0.duckdb    # Database writer 0
├── shared_db_1.duckdb    # Database writer 1
├── shared_db_2.duckdb    # Database writer 2
└── ...

Key Characteristics:

  • Multiple topics stored in each database file
  • Priority-based load balancing across database writers
  • Cross-topic queries within and across databases
  • Table naming: topic_{id} for each topic
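
This layout can be inspected directly from the DuckDB CLI. A minimal sketch, assuming the file paths and topic table names shown above (they are illustrative):

sql
-- Attach the shared database files produced by the writers
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0 (READ_ONLY);
ATTACH '/data/duckdb/shared_db_1.duckdb' AS db1 (READ_ONLY);

-- Discover which topic tables ended up in which database file
SELECT database_name, table_name, estimated_size AS estimated_rows
FROM duckdb_tables()
WHERE table_name LIKE 'topic_%'
ORDER BY database_name, table_name;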

Query Examples

sql
-- Connect to local DuckDB databases
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0;
ATTACH '/data/duckdb/shared_db_1.duckdb' AS db1;

-- Query single topic
SELECT * FROM db0.topic_12345 WHERE timestamp > NOW() - INTERVAL 1 HOUR;

-- Cross-topic join
SELECT u.user_id, e.event_type, e.timestamp
FROM db0.topic_users u
JOIN db1.topic_events e ON u.user_id = e.user_id
WHERE e.timestamp > NOW() - INTERVAL 1 DAY;

-- Time-series aggregation
SELECT
  DATE_TRUNC('hour', timestamp) as hour,
  COUNT(*) as event_count
FROM db0.topic_events
WHERE timestamp > NOW() - INTERVAL 24 HOURS
GROUP BY hour
ORDER BY hour;

Development Workflows

Local Development Setup

yaml
# Development with DuckDB persistence
duckdb_persistence:
  enabled: true
  storage_path: "./dev-duckdb"
  max_writers: 4

storage:
  backends:
    - name: "local-minio"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "dev-bucket"
      access_key: "minioadmin"
      secret_key: "minioadmin"
      use_path_style: true

Production Deployment

yaml
# Production DuckDB persistence
duckdb_persistence:
  enabled: true
  storage_path: "/data/duckdb"
  max_writers: 16 # Scale with CPU cores (2x cores recommended)

# Ensure cloud storage backend remains primary
storage:
  backends:
    - name: "production-s3"
      backend_type: "s3"
      enabled: true
      primary: true # Cloud storage failures fail the operation
      # ... cloud storage config

Performance Tuning

Storage Configuration:

yaml
# Use fast local storage for optimal performance
duckdb_persistence:
  storage_path: "/nvme/duckdb"  # NVMe SSD recommended

Memory Optimization:

yaml
# Tune for high-memory instances
processing:
  data_processing_threads: 16
  buffer_pool_max_size: 100

Monitoring and Operations

Key Metrics

Monitor DuckDB persistence performance:

  • Ingestion Rate: Rows per second into local databases
  • Write Latency: Time from ingestion to database commit
  • Database Size: Growth rate of database files
  • Query Performance: Local query response times
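
Database size and approximate row counts can be sampled directly with SQL; a minimal sketch, assuming the database path used in the examples above:

sql
-- Attach a shared database file (path from the examples above)
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0 (READ_ONLY);
USE db0;

-- On-disk size and block usage of each attached database
PRAGMA database_size;

-- Estimated row counts per topic table (cheaper than COUNT(*))
SELECT table_name, estimated_size AS estimated_rows
FROM duckdb_tables()
WHERE database_name = 'db0'
ORDER BY estimated_rows DESC;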

Health Checks

bash
# Check database file sizes
ls -lh /data/duckdb/

# Verify database accessibility
duckdb /data/duckdb/shared_db_0.duckdb "SELECT COUNT(*) FROM topic_12345;"

Disaster Recovery

Restore from Cloud Storage:

  1. DuckDB databases are expendable (cloud storage is primary)
  2. Recreate local databases by replaying cloud storage Parquet data
  3. Use cloud storage data for unlimited read replicas
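
A recovery sketch in DuckDB SQL; the bucket path and table name below are illustrative, and BoilStream's actual Parquet prefix layout may differ:

sql
-- httpfs is needed to read Parquet directly from S3-compatible storage
INSTALL httpfs;
LOAD httpfs;
-- If the bucket is private, configure credentials first (e.g. CREATE SECRET)

-- Recreate a local topic table from the cloud storage Parquet files
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0;
CREATE OR REPLACE TABLE db0.topic_12345 AS
SELECT *
FROM read_parquet('s3://your-bucket/topics/topic_12345/*.parquet');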

Scaling and Migration:

  1. Scale up: Add more database writers
  2. Scale out: Add more BoilStream nodes
  3. Storage migration: Move database files to faster storage

Roadmap Features

Window Queries (Planned)

sql
-- Future: Window functions using DuckDB as source
SELECT
  user_id,
  event_type,
  COUNT(*) OVER (
    PARTITION BY user_id
    ORDER BY timestamp
    RANGE BETWEEN INTERVAL 1 HOUR PRECEDING AND CURRENT ROW
  ) as events_last_hour
FROM local_events
WHERE timestamp > NOW() - INTERVAL 24 HOURS;

BI Tool Integration

text
BI Tools → PostgreSQL Interface → Local DuckDB Databases
  ↳ Power BI, Tableau, DBeaver, psql direct connectivity

BI Tools → FlightSQL API → Local DuckDB Databases  
  ↳ FlightSQL-compatible tools direct connectivity
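
Once connected over the PostgreSQL interface, BI tools typically discover tables through standard catalog views and then issue ordinary SQL. A minimal sketch, assuming the interface exposes the usual information_schema views and reusing the illustrative topic_events table from earlier examples:

sql
-- Catalog discovery as a BI tool might perform it on connect
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_name LIKE 'topic_%';

-- Followed by ordinary analytical SQL against the discovered tables
SELECT event_type, COUNT(*) AS events
FROM topic_events
GROUP BY event_type
ORDER BY events DESC;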

Historical Analysis (Planned)

sql
-- Future: Live queries over rolling time windows
SELECT * FROM past_24_hours('topic_events')
WHERE user_type = 'premium';

Best Practices

When to Enable

Enable DuckDB Persistence When:

  • You need cross-topic analytical queries
  • Local query performance is critical
  • Complex time-series analysis is required
  • BI tools need direct database connectivity

Cloud Storage Only May Be Sufficient When:

  • Simple append-only workloads
  • Analytics tools can query cloud storage directly
  • Minimal local storage is preferred
  • Cloud storage analytics performance is enough

Performance Optimization

  1. Scale writers (recommended: 2x the number of CPU cores on the instance)
  2. Use fast local storage (NVMe SSD preferred)
  3. Monitor database file sizes and plan storage capacity
  4. Leverage cloud storage for backup-free architecture

Security Considerations

  • DuckDB files contain same data as cloud storage Parquet files
  • Pro version: FlightSQL with access control for querying DuckDB files

This dual-storage architecture provides the best of both worlds: cloud-scale analytics with cloud storage (S3, Azure Blob, GCS, MinIO, filesystem) and high-performance real-time local queries with DuckDB, all while eliminating backup infrastructure complexity.