DuckLake Integration
BoilStream provides managed DuckLakes with automatic credential management and optional streaming ingestion capabilities.
Catalog Types
DuckLakes are created through the boilstream extension. The catalog name determines its capabilities:
| Catalog Type | Name Pattern | Features |
|---|---|---|
| Streaming | *__stream | Ingestion schemas, meta columns, topic provisioning |
| Warehouse | No suffix | Pure data lake, standard DuckLake features |
Examples:
- analytics__stream - Streaming catalog with ingestion interfaces
- analytics - Standard data warehouse catalog
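For example, the same creation pragma (covered under "Using the BoilStream Extension" below) produces either type; only the catalog name decides the behavior. A minimal sketch, with illustrative names and description strings:

```sql
-- Streaming catalog: ingestion schemas, meta columns, topic provisioning
PRAGMA boilstream_create_ducklake('analytics__stream', 'Streaming analytics catalog');

-- Warehouse catalog: standard DuckLake features only
PRAGMA boilstream_create_ducklake('analytics', 'Analytics data warehouse');
```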
Streaming Catalogs
Catalogs with the __stream suffix automatically get:
- Ingestion schemas - Tables for Kafka, HTTP, and FlightRPC data ingestion
- Meta columns - System columns (__boils_meta_*) for tracking
- Topic provisioning - Automatic topic creation for tables
- Time partitioning - Automatic year/month/day/hour partitioning
- User-defined partitions - Custom partition columns (e.g., tenant_id, region) with automatic time partitioning appended
Ingestion interfaces:
- Kafka - v2 RecordBatch message format, Confluent Schema Registry, Avro payload
- HTTP - REST API for data ingestion
- FlightRPC - Arrow Flight protocol (use with airport extension)
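The exact schema and table names provisioned for these interfaces are not listed here, but once you are authenticated with the boilstream extension (see below) they can be discovered with standard catalog metadata queries. A minimal sketch, assuming a streaming catalog named analytics__stream:

```sql
-- List the schemas and tables BoilStream provisioned in a streaming catalog
USE analytics__stream;
SELECT table_schema, table_name
FROM information_schema.tables
ORDER BY table_schema, table_name;
```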
High-Performance Avro Decoding
BoilStream uses a just-in-time (JIT) compiled Avro-to-Arrow decoder that achieves 3-5x higher throughput than the Rust arrow-avro crate. The decoder generates optimized machine code at runtime for each Avro schema, with bounds checking on all field accesses for security.
Partitioning
Automatic Time Partitioning
Streaming catalogs use Iceberg-style epoch-based time partitioning:
| Transform | Example | Meaning |
|---|---|---|
| year | 55 | 2025 (years since 1970) |
| month | 660 | January 2025 |
| day | 20088 | Days since epoch |
| hour | 482112 | Hours since epoch |
Path structure: year=55/month=660/day=20088/hour=482112/
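The transform values can be reproduced with ordinary date arithmetic. A minimal DuckDB sketch, for illustration only; BoilStream derives these values internally from the ingestion timestamp:

```sql
-- Derive Iceberg-style epoch-based partition values from a timestamp
SELECT
    date_part('year', ts) - 1970                                       AS year_value,   -- 55 for 2025
    (date_part('year', ts) - 1970) * 12 + (date_part('month', ts) - 1) AS month_value,  -- 660 for January 2025
    datediff('day',  TIMESTAMP '1970-01-01', ts)                       AS day_value,    -- days since epoch
    datediff('hour', TIMESTAMP '1970-01-01', ts)                       AS hour_value    -- hours since epoch
FROM (SELECT TIMESTAMP '2025-01-15 08:00:00' AS ts);
```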
User-Defined Partition Columns
You can add custom partition columns (e.g., tenant_id, customer, region) which appear before time partitioning:
region=us-east-1/year=55/month=660/day=20088/hour=482112/
Partition Limits
- Maximum 100 concurrent partition writers per topic per thread
- Exceeding this limit routes data to the __DEFAULT__ partition
Special Partition Values
| Value | Description |
|---|---|
| __NULL__ | NULL column values |
| __EMPTY__ | Empty string values |
| __DEFAULT__ | Overflow partition (>100 partitions) |
URL Encoding
Partition values containing special characters (/, =, %, space) are automatically URL-encoded to ensure valid path names. For example, a region value of us/east 1 is written as region=us%2Feast%201/ in the path.
Using the BoilStream Extension
The boilstream community extension provides secure credential vending and DuckLake management.
Installation
```sql
INSTALL boilstream FROM community;
LOAD boilstream;
```
Authentication
Option 1: Email/Password Login
```sql
-- Login with email, password, and MFA code
PRAGMA boilstream_login('https://your-server.com/user@example.com',
    'password', '123456');
```
Option 2: Bootstrap Token (recommended for automation)
```sql
-- Authenticate using a bootstrap token from the web dashboard
PRAGMA boilstream_bootstrap_session(
    'https://your-server.com/secrets:<token>');
```
Bootstrap Tokens
Bootstrap tokens can be generated from the BoilStream web dashboard. They're ideal for scripts, CI/CD pipelines, and automated workflows where interactive MFA login isn't practical.
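A minimal sketch of a non-interactive session, such as an init script run in CI; the server URL, token placeholder, and catalog/table names are illustrative:

```sql
-- ci_init.sql: non-interactive setup using a bootstrap token
INSTALL boilstream FROM community;
LOAD boilstream;

-- Token generated from the BoilStream web dashboard
PRAGMA boilstream_bootstrap_session('https://your-server.com/secrets:<token>');

-- Catalogs become available after authentication
FROM boilstream_ducklakes();
USE my_analytics__stream;
SELECT count(*) FROM events;
```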
Managing DuckLakes
```sql
-- List available ducklakes
FROM boilstream_ducklakes();

-- Create a new ducklake (add __stream suffix for streaming features)
PRAGMA boilstream_create_ducklake('my_analytics__stream',
    'Analytics streaming catalog');

-- List cached secrets with expiration
FROM boilstream_secrets();
```
Creating Tables with Partitioning
Create tables with user-defined partition columns. Time partitioning (year/month/day/hour) is automatically added - you only need to specify your custom partition columns:
```sql
USE my_analytics__stream;

-- Create table
CREATE TABLE events (
    tenant_id VARCHAR,
    region VARCHAR,
    event_type VARCHAR,
    payload VARCHAR
);

-- Add partitioning: only specify your custom columns
-- Time partitioning (year/month/day/hour) is added automatically
ALTER TABLE events SET PARTITIONED BY (tenant_id);
```
Data will be stored with paths like:
tenant_id=acme/year=55/month=660/day=20088/hour=482112/
Multiple user partition columns:
```sql
ALTER TABLE events SET PARTITIONED BY (tenant_id, region);
```
Path structure with multiple user columns:
tenant_id=acme/region=us-east-1/year=55/month=660/day=20088/hour=482112/
Automatic Time Partitioning
For streaming catalogs, BoilStream automatically appends year/month/day/hour time partitions based on __boils_meta_timestamp. You don't need to (and shouldn't) specify these manually. If you do specify them, they will be replaced with the standard time partitions.
Querying Data
After login, DuckLake catalogs are automatically available:
```sql
USE my_analytics__stream;
SELECT * FROM events LIMIT 10;
```
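Queries can also filter on user-defined partition columns and the ingestion timestamp. A minimal sketch against the events table created above (whether the partition layout is used for file pruning is an assumption here, not a documented guarantee):

```sql
-- Filter on a user-defined partition column plus the ingestion timestamp
SELECT event_type, count(*) AS events_seen
FROM events
WHERE tenant_id = 'acme'
  AND __boils_meta_timestamp >= NOW() - INTERVAL '1 day'
GROUP BY event_type
ORDER BY events_seen DESC;
```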
Available Commands
```sql
-- Show all available commands
PRAGMA boilstream_help;
```
| Command | Description |
|---|---|
| boilstream_bootstrap_session(url) | Authenticate with bootstrap token |
| boilstream_register_user(url, password) | Register new user with MFA |
| boilstream_verify_mfa(totp_code) | Complete registration |
| boilstream_login(url, password, mfa) | Email/password/MFA login |
| boilstream_ducklakes() | List available ducklakes |
| boilstream_secrets() | List cached secrets |
| boilstream_create_ducklake(name, desc) | Create a new ducklake |
| ... | ... |
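For first-time setup, the registration commands above are used in two steps. A minimal sketch; the URL format is assumed to mirror boilstream_login, and the credentials and TOTP code are placeholders:

```sql
-- Step 1: register a new user (URL format assumed to mirror boilstream_login)
PRAGMA boilstream_register_user('https://your-server.com/user@example.com', 'password');

-- Step 2: complete registration with the TOTP code from your authenticator app
PRAGMA boilstream_verify_mfa('123456');
```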
Hot/Cold Tiered Storage
Streaming catalogs use a two-tier storage architecture for optimal query latency:
Architecture Overview
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Kafka/HTTP/   │     │    Hot Tier     │     │    Cold Tier    │
│    FlightRPC    │────▶│    (DuckDB)     │────▶│  (S3 Parquet)   │
│    Ingestion    │     │   ~1s commit    │     │  1-10 min snap  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │  pgwire Query   │
                        │  Interception   │
                        │ (Hot Inlining)  │
                        └─────────────────┘
```
Hot Tier (Near Real-Time)
The hot tier provides ~1 second data visibility for freshly ingested data:
- Data ingested via Kafka, HTTP/2, or FlightRPC is immediately written to DuckDB
- 1-second commit intervals register new data with DuckLake metadata
- Remote DuckDB clients querying via pgwire automatically see hot data
- Zero additional configuration required - works out of the box
Query Interception
When querying streaming tables through BoilStream's pgwire interface, queries are automatically intercepted and the hot tier data is inlined with cold tier results. This provides unified query results spanning both tiers transparently.
Cold Tier (Durable Storage)
The cold tier provides long-term durable storage on S3:
- Parquet snapshots written at configurable intervals (1-10 minutes)
- Registered with DuckLake catalog for standard Iceberg-compatible access
- Optimized for analytical queries over historical data
- Supports time-travel and snapshot isolation
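The time-travel support mentioned above can be exercised with DuckLake's snapshot syntax. A minimal sketch, assuming the standard DuckLake AT clause is available on these catalogs:

```sql
-- Query an earlier snapshot of the events table
SELECT count(*) FROM events AT (VERSION => 3);

-- Or query the table as of a point in time
SELECT count(*) FROM events AT (TIMESTAMP => now() - INTERVAL '1 hour');
```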
Latency Comparison
| Access Method | Data Visibility | Use Case |
|---|---|---|
| Hot Tier (pgwire) | ~1 second | Real-time dashboards, alerts |
| Cold Tier (Direct DuckLake) | 1-10 minutes | Batch analytics, ETL pipelines |
Example: Real-Time Query
```sql
-- Connect to BoilStream via pgwire (hot tier enabled)
-- psql -h your-server.com -U user -d my_analytics__stream

-- Query sees data ingested within the last second
SELECT event_type, count(*) AS count
FROM events
WHERE __boils_meta_timestamp > NOW() - INTERVAL '5 minutes'
GROUP BY event_type;
```
Configuration
Hot tier commit interval and cold tier snapshot frequency can be configured:
```yaml
# boilstream.yaml
ducklake:
  hot_tier_commit_interval_ms: 1000   # 1 second (default)
  cold_tier_snapshot_interval_s: 60   # 1 minute (default)
```
Performance Note
The hot tier is optimized for streaming workloads. For heavy analytical queries over large historical datasets, query the cold tier directly using a DuckDB client with the DuckLake extension.
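A minimal sketch of this pattern using the boilstream extension (which attaches DuckLake catalogs after login) from a plain DuckDB client rather than pgwire; the catalog, table, and meta column follow the examples above:

```sql
-- Heavy analytical scan served by the cold tier (direct DuckLake access)
LOAD boilstream;
PRAGMA boilstream_bootstrap_session('https://your-server.com/secrets:<token>');

USE my_analytics__stream;
SELECT tenant_id,
       count(*)                    AS events,
       min(__boils_meta_timestamp) AS first_seen,
       max(__boils_meta_timestamp) AS last_seen
FROM events
GROUP BY tenant_id
ORDER BY events DESC;
```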
Best Practices
- Use the __stream suffix only when you need ingestion interfaces
- Use the boilstream extension for automatic credential management
- Credentials auto-expire for security - extension handles renewal
- Separate catalogs for different use cases (streaming vs analytics)
- Use pgwire for real-time queries - automatic hot tier inlining provides ~1s latency
- Use direct DuckLake for batch analytics - cold tier optimized for large scans