DuckDB Persistence

BoilStream provides optional high-performance local DuckDB persistence alongside its core diskless S3 pipeline. This dual-storage architecture enables immediate local querying while maintaining cloud-scale analytics capabilities.

Architecture Overview

Dual Storage Design

text
Client Data → BoilStream → S3 Parquet Files (diskless, primary)
                         → Shared DuckDB Files (local, optional)

Core S3 Pipeline (Always Active):

  • Diskless streaming to S3-compatible storage
  • No local storage dependencies
  • Immediate data availability for analytics
  • Eliminates local storage failure points

Optional DuckDB Layer:

  • 10+ million rows/second ingestion performance
  • Shared database files across multiple topics
  • Cross-topic queries and joins
  • No backup infrastructure needed (S3 provides replication)

When to Enable DuckDB Persistence

High-Performance Local Queries

  • Ultra-fast ingestion: 10+ million rows/second into local databases
  • Cross-topic analytics: Join data across multiple topics in shared databases
  • Complex queries: Advanced SQL operations without S3 latency
  • Development workflows: Local data exploration and testing

Backup-Free Architecture

  • S3 Replication: Cloud storage eliminates backup infrastructure needs
  • Disaster Recovery: Restore local databases from S3 Parquet files
  • Operational Simplicity: No backup schedules or storage management

Future Roadmap Features

  • Window Queries: Time-series analysis using local database as source
  • FlightSQL Integration: Direct BI tool connectivity to local databases
  • Historical Analysis: Live queries over past N hours of ingested data
  • Real-time Dashboards: Low-latency queries for operational monitoring
  • Read Replicas: Create unlimited copies from S3 data

Configuration

Basic Configuration

yaml
# Enable high-performance DuckDB persistence
duckdb_persistence:
  enabled: true # Enable DuckDB persistence
  storage_path: "/data/duckdb" # Directory for database files
  max_writers: 10 # Concurrent database writers

# Core S3 pipeline continues unchanged
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      # ... S3 configuration

Configuration Options

Field                              Type     Default                Description
duckdb_persistence.enabled         boolean  false                  Enable DuckDB persistence
duckdb_persistence.storage_path    string   "/tmp/duckdb/topics"   Directory for shared database files
duckdb_persistence.max_writers     number   10                     Number of concurrent database writers
duckdb_persistence.dry_run         boolean  false                  Process Arrow data but skip writes
duckdb_persistence.super_dry_run   boolean  false                  Completely skip DuckDB processing

Environment Variable Configuration

bash
# Enable DuckDB persistence via environment variables
export DUCKDB_PERSISTENCE_ENABLED=true
export DUCKDB_PERSISTENCE_STORAGE_PATH=/data/duckdb
export DUCKDB_PERSISTENCE_MAX_WRITERS=16

# Run BoilStream with dual storage
boilstream --config config.yaml

Performance Characteristics

Ingestion Performance

Ultra-High Throughput:

  • 10+ million rows/second into local DuckDB databases
  • Concurrent writers scale with CPU cores
  • Memory-efficient Arrow-to-DuckDB conversion
  • Zero-copy data processing where possible

Comparison with S3 Pipeline:

  • S3 Pipeline: 3+ GB/s diskless streaming
  • DuckDB Layer: 10M+ rows/s with ~5-10% overhead
  • Combined: Dual persistence with minimal performance impact

Query Performance

Local Database Benefits:

  • Sub-millisecond query response times
  • Cross-topic joins without S3 roundtrips
  • Complex aggregations with DuckDB's optimized engine
  • Window functions for time-series analysis

Storage Efficiency:

  • Shared database files eliminate duplication
  • Columnar storage with compression
  • Automatic schema evolution and indexing

Database File Structure

Shared Database Architecture Example

text
/data/duckdb/
├── shared_db_0.duckdb    # Database writer 0
├── shared_db_1.duckdb    # Database writer 1
├── shared_db_2.duckdb    # Database writer 2
└── ...

Key Characteristics:

  • Multiple topics stored in each database file
  • Priority-based load balancing across database writers
  • Cross-topic queries within and across databases
  • Table naming: topic_{id} for each topic (see the discovery query after this list)
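
The discovery query below is a minimal sketch of how these shared files can be explored, assuming the storage path shown earlier; the file names and topic IDs are illustrative:

sql
-- Attach the shared database files (file names are illustrative)
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0 (READ_ONLY);
ATTACH '/data/duckdb/shared_db_1.duckdb' AS db1 (READ_ONLY);

-- List every topic table and the database file it lives in
SELECT database_name, table_name, estimated_size
FROM duckdb_tables()
WHERE table_name LIKE 'topic_%'
ORDER BY database_name, table_name;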

Query Examples

sql
-- Connect to local DuckDB databases
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0;
ATTACH '/data/duckdb/shared_db_1.duckdb' AS db1;

-- Query single topic
SELECT * FROM db0.topic_12345 WHERE timestamp > NOW() - INTERVAL 1 HOUR;

-- Cross-topic join
SELECT u.user_id, e.event_type, e.timestamp
FROM db0.topic_users u
JOIN db1.topic_events e ON u.user_id = e.user_id
WHERE e.timestamp > NOW() - INTERVAL 1 DAY;

-- Time-series aggregation
SELECT
  DATE_TRUNC('hour', timestamp) as hour,
  COUNT(*) as event_count
FROM db0.topic_events
WHERE timestamp > NOW() - INTERVAL 24 HOURS
GROUP BY hour
ORDER BY hour;

Development Workflows

Local Development Setup

yaml
# Development with DuckDB persistence
duckdb_persistence:
  enabled: true
  storage_path: "./dev-duckdb"
  max_writers: 4

storage:
  backends:
    - name: "local-minio"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "dev-bucket"
      access_key: "minioadmin"
      secret_key: "minioadmin"
      use_path_style: true

Production Deployment

yaml
# Production DuckDB persistence
duckdb_persistence:
  enabled: true
  storage_path: "/data/duckdb"
  max_writers: 16 # Scale with CPU cores (2x cores recommended)

# Ensure S3 backend remains primary
storage:
  backends:
    - name: "production-s3"
      backend_type: "s3"
      enabled: true
      primary: true # S3 failures fail the operation
      # ... S3 config

Performance Tuning

Storage Configuration:

yaml
# Use fast local storage for optimal performance
duckdb_persistence:
  storage_path: "/nvme/duckdb"  # NVMe SSD recommended

Memory Optimization:

yaml
# Tune for high-memory instances
processing:
  data_processing_threads: 16
  buffer_pool_max_size: 100

Monitoring and Operations

Key Metrics

Monitor DuckDB persistence performance:

  • Ingestion Rate: Rows per second into local databases
  • Write Latency: Time from ingestion to database commit
  • Database Size: Growth rate of database files (see the sketch after this list)
  • Query Performance: Local query response times
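
A minimal monitoring sketch, run with the duckdb CLI against one of the shared database files; it assumes the topic tables carry a timestamp column as in the query examples above, and the topic ID is illustrative:

sql
-- Approximate ingestion rate: rows committed during the last minute
SELECT COUNT(*) / 60.0 AS rows_per_second
FROM topic_12345
WHERE timestamp > NOW() - INTERVAL 1 MINUTE;

-- Database file growth: on-disk size of the attached database
PRAGMA database_size;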

Health Checks

bash
# Check database file sizes
ls -lh /data/duckdb/

# Verify database accessibility
duckdb /data/duckdb/shared_db_0.duckdb "SELECT COUNT(*) FROM topic_12345;"

Disaster Recovery

Restore from S3:

  1. DuckDB databases are expendable (S3 is primary)
  2. Recreate local databases by replaying S3 Parquet data (see the sketch after this list)
  3. Use S3 data for unlimited read replicas
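
A minimal restore sketch using DuckDB's httpfs extension and read_parquet; the bucket, prefix, and topic ID are illustrative, and S3 credentials must be configured separately (for example with CREATE SECRET):

sql
-- Rebuild a topic table in a local database from the S3 Parquet files
INSTALL httpfs;
LOAD httpfs;

CREATE OR REPLACE TABLE topic_12345 AS
SELECT *
FROM read_parquet('s3://your-bucket/topic_12345/*.parquet');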

Scaling and Migration:

  1. Scale up: Add more database writers
  2. Scale out: Add more BoilStream nodes
  3. Storage migration: Move database files to faster storage

Roadmap Features

Window Queries (Planned)

sql
-- Future: Window functions using DuckDB as source
SELECT
  user_id,
  event_type,
  COUNT(*) OVER (
    PARTITION BY user_id
    ORDER BY timestamp
    RANGE BETWEEN INTERVAL 1 HOUR PRECEDING AND CURRENT ROW
  ) as events_last_hour
FROM local_events
WHERE timestamp > NOW() - INTERVAL 24 HOURS;

FlightSQL Integration (Planned)

text
BI Tools → FlightSQL API → Local DuckDB Databases
  ↳ Tableau, PowerBI, Grafana direct connectivity

Historical Analysis (Planned)

sql
-- Future: Live queries over rolling time windows
SELECT * FROM past_24_hours('topic_events')
WHERE user_type = 'premium';

Best Practices

When to Enable

Enable DuckDB Persistence When:

  • You need cross-topic analytical queries
  • Local query performance is critical
  • Complex time-series analysis is required
  • BI tools need direct database connectivity

S3-Only May Be Sufficient When:

  • Simple append-only workloads
  • Analytics tools can query S3 directly (see the sketch after this list)
  • Minimal local storage is preferred
  • Query performance directly against S3 is sufficient
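
For the S3-only case, a minimal sketch of querying the Parquet output directly with DuckDB instead of a local database; httpfs and credentials must be configured, and the bucket layout is illustrative:

sql
-- Query the S3 Parquet output directly, without local DuckDB persistence
SELECT DATE_TRUNC('hour', timestamp) AS hour, COUNT(*) AS event_count
FROM read_parquet('s3://your-bucket/topic_events/*.parquet')
WHERE timestamp > NOW() - INTERVAL 24 HOURS
GROUP BY hour
ORDER BY hour;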

Performance Optimization

  1. Scale writers with CPU cores (2x the core count is recommended)
  2. Use fast local storage (NVMe SSD preferred)
  3. Monitor database file sizes and plan storage capacity
  4. Leverage S3 for backup-free architecture

Security Considerations

  • DuckDB files contain the same data as the S3 Parquet files, so the local storage path warrants equivalent access controls
  • FlightSQL with access control for querying DuckDB files is available in the pro version

This dual-storage architecture provides the best of both worlds: cloud-scale analytics with S3 and high-performance real-time local queries with DuckDB, all while eliminating backup infrastructure complexity.