DuckLake Integration

BoilStream provides managed DuckLakes with automatic credential management and optional streaming ingestion capabilities.

Catalog Types

DuckLakes are created through the boilstream extension. A catalog's name determines its capabilities:

Catalog Type | Name Pattern | Features
Streaming    | *__stream    | Ingestion schemas, meta columns, topic provisioning
Warehouse    | No suffix    | Pure data lake, standard DuckLake features

Examples:

  • analytics__stream - Streaming catalog with ingestion interfaces
  • analytics - Standard data warehouse catalog

Streaming Catalogs

Catalogs with the __stream suffix automatically get:

  • Ingestion schemas - Tables for Kafka, HTTP, and FlightRPC data ingestion
  • Meta columns - System columns (__boils_meta_*) for tracking
  • Topic provisioning - Automatic topic creation for tables
  • Time partitioning - Automatic year/month/day/hour partitioning
  • User-defined partitions - Custom partition columns (e.g., tenant_id, region) with automatic time partitioning appended

Ingestion interfaces:

  • Kafka - v2 RecordBatch message format, Confluent Schema Registry, Avro payload
  • HTTP - REST API for data ingestion
  • FlightRPC - Arrow Flight protocol (use with airport extension)

High-Performance Avro Decoding

BoilStream uses a JIT-compiled (Just-In-Time) Avro-to-Arrow decoder that achieves a 3-5x throughput improvement over the Rust arrow-avro crate. The decoder generates optimized machine code at runtime for each Avro schema, with bounds checking on all field accesses for security.

Partitioning

Automatic Time Partitioning

Streaming catalogs use Iceberg-style epoch-based time partitioning:

Transform | Example | Meaning
year      | 55      | 2025 (years since 1970)
month     | 660     | January 2025
day       | 20088   | Days since epoch
hour      | 482112  | Hours since epoch

Path structure: year=55/month=660/day=20088/hour=482112/
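
To make the transform arithmetic concrete, the following plain DuckDB query derives the four partition values from a sample timestamp. This is only an illustration of the epoch math, not the code BoilStream runs internally:

sql
-- Illustrative only: Iceberg-style epoch-based partition values
-- for the sample timestamp 2025-01-01 12:00:00 UTC
SELECT
    date_part('year', ts) - 1970                                      AS year_part,   -- 55
    (date_part('year', ts) - 1970) * 12 + date_part('month', ts) - 1  AS month_part,  -- 660
    date_diff('day', DATE '1970-01-01', ts::DATE)                     AS day_part,    -- 20089
    date_diff('hour', TIMESTAMP '1970-01-01', ts)                     AS hour_part    -- 482148
FROM (SELECT TIMESTAMP '2025-01-01 12:00:00' AS ts);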

User-Defined Partition Columns

You can add custom partition columns (e.g., tenant_id, customer, region) which appear before time partitioning:

region=us-east-1/year=55/month=660/day=20088/hour=482112/

Partition Limits

  • Maximum 100 concurrent partition writers per topic per thread
  • Exceeding this limit routes data to the __DEFAULT__ partition

Special Partition Values

Value       | Description
__NULL__    | NULL column values
__EMPTY__   | Empty string values
__DEFAULT__ | Overflow partition (>100 partitions)

URL Encoding

Partition values containing special characters (/, =, %, space) are automatically URL-encoded to ensure valid path names.
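
For example, a region value of us/east 1 (containing both a slash and a space) produces a path segment like:

region=us%2Feast%201/year=55/month=660/day=20088/hour=482112/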

Using the BoilStream Extension

The boilstream community extension provides secure credential vending and DuckLake management.

Installation

sql
INSTALL boilstream FROM community;
LOAD boilstream;

Authentication

Option 1: Email/Password Login

sql
-- Login with email, password, and MFA code
PRAGMA boilstream_login('https://your-server.com/user@example.com',
    'password', '123456');

Option 2: Bootstrap Token (recommended for automation)

sql
-- Authenticate using a bootstrap token from the web dashboard
PRAGMA boilstream_bootstrap_session(
    'https://your-server.com/secrets:<token>');

Bootstrap Tokens

Bootstrap tokens can be generated from the BoilStream web dashboard. They're ideal for scripts, CI/CD pipelines, and automated workflows where interactive MFA login isn't practical.

Managing DuckLakes

sql
-- List available ducklakes
FROM boilstream_ducklakes();

-- Create a new ducklake (add __stream suffix for streaming features)
PRAGMA boilstream_create_ducklake('my_analytics__stream',
    'Analytics streaming catalog');

-- List cached secrets with expiration
FROM boilstream_secrets();
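
The same pragma without the __stream suffix creates a plain warehouse catalog (the name and description below are just examples):

sql
-- Create a standard warehouse catalog (no ingestion interfaces)
PRAGMA boilstream_create_ducklake('my_analytics',
    'Analytics warehouse catalog');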

Creating Tables with Partitioning

Create tables with user-defined partition columns. Time partitioning (year/month/day/hour) is automatically added - you only need to specify your custom partition columns:

sql
USE my_analytics__stream;

-- Create table
CREATE TABLE events (
    tenant_id VARCHAR,
    region VARCHAR,
    event_type VARCHAR,
    payload VARCHAR
);

-- Add partitioning: only specify your custom columns
-- Time partitioning (year/month/day/hour) is added automatically
ALTER TABLE events SET PARTITIONED BY (tenant_id);

Data will be stored with paths like:

tenant_id=acme/year=55/month=660/day=20088/hour=482112/

Multiple user partition columns:

sql
ALTER TABLE events SET PARTITIONED BY (tenant_id, region);

Path structure with multiple user columns:

tenant_id=acme/region=us-east-1/year=55/month=660/day=20088/hour=482112/

Automatic Time Partitioning

For streaming catalogs, BoilStream automatically appends year/month/day/hour time partitions based on __boils_meta_timestamp. You don't need to (and shouldn't) specify these manually. If you do specify them, they will be replaced with the standard time partitions.

Querying Data

After login, DuckLake catalogs are automatically available:

sql
USE my_analytics__stream;
SELECT * FROM events LIMIT 10;

Available Commands

sql
-- Show all available commands
PRAGMA boilstream_help;

Command                                  | Description
boilstream_bootstrap_session(url)        | Authenticate with bootstrap token
boilstream_register_user(url, password)  | Register new user with MFA
boilstream_verify_mfa(totp_code)         | Complete registration
boilstream_login(url, password, mfa)     | Email/password/MFA login
boilstream_ducklakes()                   | List available ducklakes
boilstream_secrets()                     | List cached secrets
boilstream_create_ducklake(name, desc)   | Create a new ducklake
...                                      | ...
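
As a sketch, first-time registration combines two of the commands above. The URL format is assumed to mirror the boilstream_login example; check your deployment for the exact flow:

sql
-- Register a new user (URL format assumed to mirror boilstream_login)
PRAGMA boilstream_register_user('https://your-server.com/user@example.com',
    'password');

-- Complete registration with the TOTP code from your authenticator app
PRAGMA boilstream_verify_mfa('123456');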

Hot/Cold Tiered Storage

Streaming catalogs use a two-tier storage architecture for optimal query latency:

Architecture Overview

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Kafka/HTTP/    │     │    Hot Tier     │     │    Cold Tier    │
│  FlightRPC      │────▶│    (DuckDB)     │────▶│  (S3 Parquet)   │
│  Ingestion      │     │   ~1s commit    │     │  1-10 min snap  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 │                       │
                                 └───────────┬───────────┘
                                             │
                                    ┌─────────────────┐
                                    │  pgwire Query   │
                                    │  Interception   │
                                    │ (Hot Inlining)  │
                                    └─────────────────┘

Hot Tier (Near Real-Time)

The hot tier provides ~1 second data visibility for freshly ingested data:

  • Data ingested via Kafka, HTTP/2, or FlightRPC is immediately written to DuckDB
  • 1-second commit intervals register new data with DuckLake metadata
  • Remote DuckDB clients querying via pgwire automatically see hot data
  • Zero additional configuration required - works out of the box

Query Interception

When querying streaming tables through BoilStream's pgwire interface, queries are automatically intercepted and the hot tier data is inlined with cold tier results. This provides unified query results spanning both tiers transparently.

Cold Tier (Durable Storage)

The cold tier provides long-term durable storage on S3:

  • Parquet snapshots written at configurable intervals (1-10 minutes)
  • Registered with DuckLake catalog for standard Iceberg-compatible access
  • Optimized for analytical queries over historical data
  • Supports time-travel and snapshot isolation
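
Snapshot queries can use DuckDB's AT clause for DuckLake time travel; a minimal sketch, assuming AT (VERSION => ...) / AT (TIMESTAMP => ...) support in your DuckDB version (the version number and interval are placeholders):

sql
-- Query an earlier snapshot of a table
SELECT * FROM events AT (VERSION => 3);

-- Or query the table as of a point in time
SELECT * FROM events AT (TIMESTAMP => NOW() - INTERVAL '1 day');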

Latency Comparison

Access Method               | Data Visibility | Use Case
Hot Tier (pgwire)           | ~1 second       | Real-time dashboards, alerts
Cold Tier (Direct DuckLake) | 1-10 minutes    | Batch analytics, ETL pipelines

Example: Real-Time Query

sql
-- Connect to BoilStream via pgwire (hot tier enabled)
-- psql -h your-server.com -U user -d my_analytics__stream

-- Query sees data ingested within the last second
SELECT event_type, COUNT(*) as count
FROM events
WHERE __boils_meta_timestamp > NOW() - INTERVAL '5 minutes'
GROUP BY event_type;

Configuration

Hot tier commit interval and cold tier snapshot frequency can be configured:

yaml
# boilstream.yaml
ducklake:
  hot_tier_commit_interval_ms: 1000    # 1 second (default)
  cold_tier_snapshot_interval_s: 60    # 1 minute (default)

Performance Note

The hot tier is optimized for streaming workloads. For heavy analytical queries over large historical datasets, query the cold tier directly using a DuckDB client with the DuckLake extension.
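
A minimal sketch of that direct cold tier path, assuming a standalone DuckDB client with the ducklake extension; the metadata connection string and data path are placeholders for your deployment's actual values:

sql
-- Placeholders: <metadata-connection> and the S3 data path must be
-- replaced with your deployment's actual values
INSTALL ducklake;
LOAD ducklake;

ATTACH 'ducklake:<metadata-connection>' AS my_analytics__stream
    (DATA_PATH 's3://<bucket>/<prefix>/');

USE my_analytics__stream;
SELECT event_type, COUNT(*) AS count
FROM events
GROUP BY event_type;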

Best Practices

  1. Use __stream suffix only when you need ingestion interfaces
  2. Use boilstream extension for automatic credential management
  3. Credentials auto-expire for security - extension handles renewal
  4. Separate catalogs for different use cases (streaming vs analytics)
  5. Use pgwire for real-time queries - automatic hot tier inlining provides ~1s latency
  6. Use direct DuckLake for batch analytics - cold tier optimized for large scans