DuckDB Persistence
BoilStream provides optional high-performance local DuckDB persistence alongside its core diskless S3 pipeline. This dual-storage architecture enables immediate local querying while maintaining cloud-scale analytics capabilities.
Architecture Overview
Dual Storage Design
Client Data → BoilStream → S3 Parquet Files (diskless, primary)
                         → Shared DuckDB Files (local, optional)
Core S3 Pipeline (Always Active):
- Diskless streaming to S3-compatible storage
- No local storage dependencies
- Immediate data availability for analytics
- Eliminates local storage failure points
Optional DuckDB Layer:
- 10+ million rows/second ingestion performance
- Shared database files across multiple topics
- Cross-topic queries and joins
- No backup infrastructure needed (S3 provides replication)
When to Enable DuckDB Persistence
High-Performance Local Queries
- Ultra-fast ingestion: 10+ million rows/second into local databases
- Cross-topic analytics: Join data across multiple topics in shared databases
- Complex queries: Advanced SQL operations without S3 latency
- Development workflows: Local data exploration and testing
Backup-Free Architecture
- S3 Replication: Cloud storage eliminates backup infrastructure needs
- Disaster Recovery: Restore local databases from S3 Parquet files
- Operational Simplicity: No backup schedules or storage management
Future Roadmap Features
- Window Queries: Time-series analysis using local database as source
- FlightSQL Integration: Direct BI tool connectivity to local databases
- Historical Analysis: Live queries over past N hours of ingested data
- Real-time Dashboards: Low-latency queries for operational monitoring
- Read Replicas: Create unlimited copies from S3 data
Configuration
Basic Configuration
# Enable high-performance DuckDB persistence
duckdb_persistence:
  enabled: true                 # Enable DuckDB persistence
  storage_path: "/data/duckdb"  # Directory for database files
  max_writers: 10               # Concurrent database writers

# Core S3 pipeline continues unchanged
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      # ... S3 configuration
Configuration Options
| Field | Type | Default | Description |
|---|---|---|---|
| duckdb_persistence.enabled | boolean | false | Enable DuckDB persistence |
| duckdb_persistence.storage_path | string | "/tmp/duckdb/topics" | Directory for shared database files |
| duckdb_persistence.max_writers | number | 10 | Number of concurrent database writers |
| duckdb_persistence.dry_run | boolean | false | Process Arrow data but skip writes |
| duckdb_persistence.super_dry_run | boolean | false | Completely skip DuckDB processing |
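The two dry-run flags are useful for isolating performance: dry_run exercises the full Arrow-to-DuckDB conversion path while skipping disk writes, and super_dry_run bypasses the DuckDB layer entirely. A minimal benchmarking sketch (values shown are illustrative, not recommendations):

# Benchmark Arrow conversion without touching disk
duckdb_persistence:
  enabled: true
  storage_path: "/tmp/duckdb/topics"
  max_writers: 10
  dry_run: true         # Process Arrow data but skip writes
  super_dry_run: false  # Set true to skip DuckDB processing entirely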
Environment Variable Configuration
# Enable DuckDB persistence via environment variables
export DUCKDB_PERSISTENCE_ENABLED=true
export DUCKDB_PERSISTENCE_STORAGE_PATH=/data/duckdb
export DUCKDB_PERSISTENCE_MAX_WRITERS=16

# Run BoilStream with dual storage
boilstream --config config.yaml
Performance Characteristics
Ingestion Performance
Ultra-High Throughput:
- 10+ million rows/second into local DuckDB databases
- Concurrent writers scale with CPU cores
- Memory-efficient Arrow-to-DuckDB conversion
- Zero-copy data processing where possible
Comparison with S3 Pipeline:
- S3 Pipeline: 3+ GB/s diskless streaming
- DuckDB Layer: 10M+ rows/s with ~5-10% overhead
- Combined: Dual persistence with minimal performance impact
Query Performance
Local Database Benefits:
- Sub-millisecond query response times
- Cross-topic joins without S3 roundtrips
- Complex aggregations with DuckDB's optimized engine (see the sketch after this list)
- Window functions for time-series analysis
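As an example of the aggregation point above, a sketch of a local percentile query (the topic table and column names are illustrative):

-- p99 latency per event type, computed locally without S3 roundtrips
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0;
SELECT
  event_type,
  approx_quantile(latency_ms, 0.99) AS p99_latency_ms
FROM db0.topic_events
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY event_type;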
Storage Efficiency:
- Shared database files eliminate duplication
- Columnar storage with compression
- Automatic schema evolution and indexing
Database File Structure
Shared Database Architecture Example
/data/duckdb/
├── shared_db_0.duckdb # Database writer 0
├── shared_db_1.duckdb # Database writer 1
├── shared_db_2.duckdb # Database writer 2
└── ...
Key Characteristics:
- Multiple topics stored in each database file
- Priority-based load balancing across database writers
- Cross-topic queries within and across databases
- Table naming: topic_{id} for each topic (see the discovery sketch below)
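Because table names embed the topic ID, standard catalog queries show which topics landed in a given file. A small sketch (the file path is illustrative):

-- Discover topic tables inside one shared database file
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0;
SELECT table_name
FROM information_schema.tables
WHERE table_catalog = 'db0'
  AND table_name LIKE 'topic_%';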
Query Examples
-- Connect to local DuckDB databases
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0;
ATTACH '/data/duckdb/shared_db_1.duckdb' AS db1;
-- Query single topic
SELECT * FROM db0.topic_12345 WHERE timestamp > NOW() - INTERVAL 1 HOUR;
-- Cross-topic join
SELECT u.user_id, e.event_type, e.timestamp
FROM db0.topic_users u
JOIN db1.topic_events e ON u.user_id = e.user_id
WHERE e.timestamp > NOW() - INTERVAL 1 DAY;
-- Time-series aggregation
SELECT
  DATE_TRUNC('hour', timestamp) AS hour,
  COUNT(*) AS event_count
FROM db0.topic_events
WHERE timestamp > NOW() - INTERVAL 24 HOURS
GROUP BY hour
ORDER BY hour;
Development Workflows
Local Development Setup
# Development with DuckDB persistence
duckdb_persistence:
  enabled: true
  storage_path: "./dev-duckdb"
  max_writers: 4

storage:
  backends:
    - name: "local-minio"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "dev-bucket"
      access_key: "minioadmin"
      secret_key: "minioadmin"
      use_path_style: true
Production Deployment
Recommended Settings
# Production DuckDB persistence
duckdb_persistence:
  enabled: true
  storage_path: "/data/duckdb"
  max_writers: 16  # Scale with CPU cores (2x cores recommended)

# Ensure S3 backend remains primary
storage:
  backends:
    - name: "production-s3"
      backend_type: "s3"
      enabled: true
      primary: true  # S3 failures fail the operation
      # ... S3 config
Performance Tuning
Storage Configuration:
# Use fast local storage for optimal performance
duckdb_persistence:
  storage_path: "/nvme/duckdb"  # NVMe SSD recommended
Memory Optimization:
# Tune for high-memory instances
processing:
  data_processing_threads: 16
  buffer_pool_max_size: 100
Monitoring and Operations
Key Metrics
Monitor DuckDB persistence performance:
- Ingestion Rate: Rows per second into local databases
- Write Latency: Time from ingestion to database commit
- Database Size: Growth rate of database files (see the sampling sketch below)
- Query Performance: Local query response times
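On-disk growth can also be sampled from inside DuckDB, alongside the shell checks below. A monitoring sketch (the file path is illustrative):

-- Report size, block usage, and WAL size for attached databases
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0;
PRAGMA database_size;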
Health Checks
# Check database file sizes
ls -lh /data/duckdb/
# Verify database accessibility
duckdb /data/duckdb/shared_db_0.duckdb "SELECT COUNT(*) FROM topic_12345;"
Disaster Recovery
Restore from S3:
- DuckDB databases are expendable (S3 is primary)
- Recreate local databases by replaying S3 Parquet data (see the sketch below)
- Use S3 data for unlimited read replicas
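A restore sketch using DuckDB's httpfs extension to replay the primary Parquet data (the bucket layout and table name are illustrative, and S3 credentials are assumed to be configured already):

-- Rebuild a topic table from the S3 Parquet files
INSTALL httpfs;
LOAD httpfs;
ATTACH '/data/duckdb/shared_db_0.duckdb' AS db0;
CREATE OR REPLACE TABLE db0.topic_12345 AS
SELECT * FROM read_parquet('s3://production-bucket/topics/12345/*.parquet');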
Migration:
- Scale up: Add more database writers
- Scale out: Add more BoilStream nodes
- Storage migration: Move database files to faster storage
Roadmap Features
Window Queries (Planned)
-- Future: Window functions using DuckDB as source
SELECT
  user_id,
  event_type,
  COUNT(*) OVER (
    PARTITION BY user_id
    ORDER BY timestamp
    RANGE BETWEEN INTERVAL 1 HOUR PRECEDING AND CURRENT ROW
  ) AS events_last_hour
FROM local_events
WHERE timestamp > NOW() - INTERVAL 24 HOURS;
FlightSQL Integration (Planned)
BI Tools → FlightSQL API → Local DuckDB Databases
                           ↳ Tableau, PowerBI, Grafana direct connectivity
Historical Analysis (Planned)
-- Future: Live queries over rolling time windows
SELECT * FROM past_24_hours('topic_events')
WHERE user_type = 'premium';
Best Practices
When to Enable
Enable DuckDB Persistence When:
- You need cross-topic analytical queries
- Local query performance is critical
- Complex time-series analysis is required
- BI tools need direct database connectivity
S3-Only May Be Sufficient When:
- Workloads are simple and append-only
- Analytics tools can query S3 directly (see the sketch below)
- Minimal local storage is preferred
- Query performance against S3 is sufficient
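For the S3-only path, DuckDB itself can query BoilStream's Parquet output in place. A sketch (the bucket path is illustrative, and S3 credentials are assumed to be configured):

-- Query the S3 Parquet output directly, with no local persistence
INSTALL httpfs;
LOAD httpfs;
SELECT COUNT(*) AS row_count
FROM read_parquet('s3://production-bucket/topics/12345/*.parquet');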
Performance Optimization
- Scale writers with instance size (2x the number of CPU cores is recommended)
- Use fast local storage (NVMe SSD preferred)
- Monitor database file sizes and plan storage capacity
- Leverage S3 for backup-free architecture
Security Considerations
- DuckDB files contain the same data as the S3 Parquet files
- (Pro version) FlightSQL with access control for querying DuckDB files
This dual-storage architecture provides the best of both worlds: cloud-scale analytics with S3 and high-performance real-time local queries with DuckDB, all while eliminating backup infrastructure complexity.