# DuckLake Integration
DuckLake provides metadata management for your data lake. BoilStream integrates with DuckLake to automatically register Parquet files in your catalog as they are written to storage. When Parquet files are added to DuckLake, the corresponding DuckLake table is created if it does not already exist: the topic name becomes the table name, and the table schema is inferred from the uploaded Parquet file.
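For example, once a topic has been ingested, any DuckDB client that attaches the same catalog can query the auto-created table directly. A minimal sketch, assuming a hypothetical topic named `events` and the `pg_secret` / `my_ducklake` setup shown in the configuration below:

```sql
-- Assumes the same extensions and secrets as in the 'attach' block below
-- have already been created in this DuckDB session
ATTACH 'ducklake:pg_secret' AS my_ducklake;

-- The 'events' topic (hypothetical) appears as a table named 'events';
-- its schema was inferred from the uploaded Parquet files
SELECT count(*) FROM my_ducklake.events;
```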
## Overview

The DuckLake integration provides:

- Automatic file registration: Parquet files are automatically registered with DuckLake catalogs after successful upload
  - Note: DuckDB client-side SQL transactions do not map one-to-one to DuckLake catalog updates per Parquet file, since BoilStream aggregates data into the Parquet files.
- Metadata management: Centralized schema and table metadata through PostgreSQL or other catalogs
- Multi-catalog support: Configure multiple DuckLake catalogs for different use cases
- Reconciliation: Automatically sync missing files between storage and catalog. BoilStream does not stop if the catalog connection fails (e.g. the PostgreSQL server goes down).
## Configuration

DuckLake is configured in the `ducklake` section of your configuration file. You can configure multiple DuckLake catalogs simultaneously.
### Basic Configuration

```yaml
ducklake:
  - name: my_ducklake
    data_path: "s3://ingestion-data/"
    attach: |
      FROM duckdb_secrets();

      -- Load extensions
      INSTALL ducklake;
      INSTALL postgres;
      INSTALL aws;
      INSTALL httpfs;
      LOAD ducklake;
      LOAD postgres;
      LOAD aws;
      LOAD httpfs;

      -- S3 access for MinIO
      CREATE OR REPLACE SECRET secretForDirectS3Access (
          TYPE S3,
          KEY_ID 'minioadmin',
          SECRET 'minioadmin',
          REGION 'us-east-1',
          ENDPOINT 'localhost:9000',
          USE_SSL false,
          URL_STYLE 'path',
          SCOPE 's3://ingestion-data/'
      );

      -- PostgreSQL secret for DuckLake catalog backend
      CREATE OR REPLACE SECRET postgres (
          TYPE POSTGRES,
          HOST 'localhost',
          PORT 5432,
          DATABASE 'boilstream',
          USER 'postgres',
          PASSWORD 'postgres'
      );

      -- DuckLake Postgres catalog
      CREATE OR REPLACE SECRET pg_secret (
          TYPE DUCKLAKE,
          METADATA_PATH '',
          DATA_PATH 's3://ingestion-data/',
          METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'postgres'}
      );

      -- DuckLake attachment with Postgres catalog
      ATTACH 'ducklake:pg_secret' AS my_ducklake;
    reconciliation:
      on_startup: true
      interval_minutes: 60
      max_concurrent_registrations: 10
```
### Configuration Fields

| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | - | Unique identifier for this DuckLake catalog |
| `data_path` | string | - | S3 path where Parquet files are stored (must match the storage backend) |
| `attach` | string | - | SQL statements to execute for DuckLake attachment and setup |
| `topics` | array | all topics | Optional: specify which topics to include in this catalog |
| `reconciliation.on_startup` | boolean | `true` | Run reconciliation when the application starts |
| `reconciliation.interval_minutes` | number | `60` | Check for missing files every N minutes |
| `reconciliation.max_concurrent_registrations` | number | `10` | Parallel registration limit |
## Storage Backend Integration
DuckLake catalogs must be configured with storage backends that have DuckLake integration enabled:
```yaml
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "ingestion-data"
      # ... other S3 configuration

      # DuckLake integration - register files with these catalogs
      ducklake: ["my_ducklake"]

ducklake:
  - name: my_ducklake
    data_path: "s3://ingestion-data/"
    # ... DuckLake configuration
```
The `ducklake` field in storage backends specifies which DuckLake catalogs should automatically register files written by that backend.
## Setup Examples

### MinIO + PostgreSQL Catalog

```yaml
# Storage backend configuration
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      access_key: "minioadmin"
      secret_key: "minioadmin"
      region: "us-east-1"
      use_path_style: true
      bucket: "ingestion-data"
      prefix: "/"
      ducklake: ["my_ducklake"]

# DuckLake configuration
ducklake:
  - name: my_ducklake
    data_path: "s3://ingestion-data/"
    attach: |
      FROM duckdb_secrets();

      -- Load required extensions
      INSTALL ducklake;
      INSTALL postgres;
      INSTALL aws;
      INSTALL httpfs;
      LOAD ducklake;
      LOAD postgres;
      LOAD aws;
      LOAD httpfs;

      -- S3 access for MinIO
      CREATE OR REPLACE SECRET secretForDirectS3Access (
          TYPE S3,
          KEY_ID 'minioadmin',
          SECRET 'minioadmin',
          REGION 'us-east-1',
          ENDPOINT 'localhost:9000',
          USE_SSL false,
          URL_STYLE 'path',
          SCOPE 's3://ingestion-data/'
      );

      -- PostgreSQL secret for DuckLake catalog backend
      CREATE OR REPLACE SECRET postgres (
          TYPE POSTGRES,
          HOST 'localhost',
          PORT 5433,
          DATABASE 'boilstream',
          USER 'postgres',
          PASSWORD 'postgres'
      );

      -- DuckLake Postgres catalog
      CREATE OR REPLACE SECRET pg_secret (
          TYPE DUCKLAKE,
          METADATA_PATH '',
          DATA_PATH 's3://ingestion-data/',
          METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'postgres'}
      );

      -- DuckLake attachment with Postgres catalog
      ATTACH 'ducklake:pg_secret' AS my_ducklake;
    reconciliation:
      on_startup: true
      interval_minutes: 60
      max_concurrent_registrations: 10
```
### AWS S3 + RDS PostgreSQL

```yaml
storage:
  backends:
    - name: "production-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "https://s3.amazonaws.com"
      bucket: "my-production-bucket"
      access_key: "${AWS_ACCESS_KEY_ID}"
      secret_key: "${AWS_SECRET_ACCESS_KEY}"
      region: "us-east-1"
      ducklake: ["production_catalog"]

ducklake:
  - name: production_catalog
    data_path: "s3://my-production-bucket/"
    attach: |
      -- Load extensions
      INSTALL ducklake;
      INSTALL postgres;
      INSTALL aws;
      LOAD ducklake;
      LOAD postgres;
      LOAD aws;

      -- AWS S3 access
      CREATE OR REPLACE SECRET s3_access (
          TYPE S3,
          KEY_ID '${AWS_ACCESS_KEY_ID}',
          SECRET '${AWS_SECRET_ACCESS_KEY}',
          REGION 'us-east-1',
          SCOPE 's3://my-production-bucket/'
      );

      -- RDS PostgreSQL catalog
      CREATE OR REPLACE SECRET rds_postgres (
          TYPE POSTGRES,
          HOST 'my-rds-instance.region.rds.amazonaws.com',
          PORT 5432,
          DATABASE 'ducklake_catalog',
          USER 'ducklake_user',
          PASSWORD '${RDS_PASSWORD}'
      );

      -- DuckLake catalog configuration
      CREATE OR REPLACE SECRET production_secret (
          TYPE DUCKLAKE,
          METADATA_PATH '',
          DATA_PATH 's3://my-production-bucket/',
          METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'rds_postgres'}
      );

      -- Attach DuckLake catalog
      ATTACH 'ducklake:production_secret' AS production_catalog;
    reconciliation:
      on_startup: true
      interval_minutes: 30
      max_concurrent_registrations: 20
```
### Multi-Catalog Setup

```yaml
storage:
  backends:
    - name: "main-storage"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "ingestion-data"
      # Register with both catalogs
      ducklake: ["analytics_catalog", "operational_catalog"]

ducklake:
  # Analytics catalog for BI tools
  - name: analytics_catalog
    data_path: "s3://ingestion-data/"
    # Include only analytics topics
    topics: ["events", "user_interactions", "metrics"]
    attach: |
      -- ... setup for the analytics PostgreSQL database
      ATTACH 'ducklake:analytics_secret' AS analytics_catalog;
    reconciliation:
      on_startup: true
      interval_minutes: 60

  # Operational catalog for real-time monitoring
  - name: operational_catalog
    data_path: "s3://ingestion-data/"
    # Include only operational topics
    topics: ["logs", "alerts", "health_checks"]
    attach: |
      -- ... setup for the operational PostgreSQL database
      ATTACH 'ducklake:operational_secret' AS operational_catalog;
    reconciliation:
      on_startup: true
      interval_minutes: 15
```
## File Registration
When DuckLake integration is enabled, BoilStream automatically:
- Writes Parquet files to the configured storage backend
- Registers files with all specified DuckLake catalogs after successful upload
- Handles failures gracefully - catalog registration failures don't affect data ingestion
- Provides logging for all registration attempts and results
### Registration Process

- The file is successfully written to the storage backend
- For each DuckLake catalog listed in the backend's `ducklake` field:
  - Connect to DuckLake using the `attach` SQL statements
  - Execute `CALL ducklake_add_data_files()` with the file path (sketched below)
  - Log success or failure
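Conceptually, the registration step is equivalent to running the following SQL against the attached catalog. A minimal sketch with a hypothetical topic and file path; the exact `ducklake_add_data_files` parameters may vary between DuckLake versions:

```sql
-- Attach the catalog exactly as configured in the 'attach' block
ATTACH 'ducklake:pg_secret' AS my_ducklake;

-- Register an uploaded Parquet file with the table for its topic
-- (the 'events' table name and the file path are hypothetical examples)
CALL ducklake_add_data_files(
    'my_ducklake',                                   -- catalog name
    'events',                                        -- table name, derived from the topic
    's3://ingestion-data/events/part-0001.parquet'   -- file written by BoilStream
);
```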
### Error Handling

- Storage failures: Prevent file registration (the file must be written successfully first)
- Catalog failures: Logged but don't affect the ingestion pipeline
- Connection failures: Retried automatically with exponential backoff
- SQL failures: Logged with full error details for debugging
## Reconciliation
DuckLake reconciliation ensures that all files in storage are properly registered in the catalog. This is useful for:
- Startup recovery: Register files that may have been missed during downtime
- Periodic maintenance: Catch any registration failures that weren't retried
- Manual verification: Ensure storage and catalog are synchronized
### Reconciliation Configuration

```yaml
ducklake:
  - name: my_ducklake
    # ... other configuration
    reconciliation:
      on_startup: true                  # Run reconciliation when application starts
      interval_minutes: 60              # Check for missing files every hour
      max_concurrent_registrations: 10  # Parallel registration limit
```
### Reconciliation Process

- List files in the storage backend's data path
- Check the catalog for each file's registration status
- Register missing files using `CALL ducklake_add_data_files()`
- Log results for monitoring and debugging
## Monitoring and Troubleshooting

### Logging
DuckLake operations are logged with detailed information:
```
INFO ducklake: Registering file with catalog my_ducklake: s3://bucket/path/file.parquet
INFO ducklake: Successfully registered file with catalog my_ducklake
WARN ducklake: Failed to register file with catalog my_ducklake: Table does not exist
```
### Common Issues

Connection Failures:

- Verify PostgreSQL connection parameters in the `attach` SQL
- Check network connectivity between BoilStream and PostgreSQL
- Ensure the DuckLake extension is installed and loaded (a manual connectivity check is sketched below)
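One quick way to narrow down connection problems is to run the catalog's `attach` statements by hand in a DuckDB shell and check whether the catalog becomes visible. A minimal sketch, reusing the secrets from the basic configuration above:

```sql
-- Run interactively (e.g. in the duckdb CLI) using the same statements
-- as the 'attach' block of the affected catalog
INSTALL ducklake;
INSTALL postgres;
LOAD ducklake;
LOAD postgres;

CREATE OR REPLACE SECRET postgres (
    TYPE POSTGRES,
    HOST 'localhost',
    PORT 5432,
    DATABASE 'boilstream',
    USER 'postgres',
    PASSWORD 'postgres'
);

CREATE OR REPLACE SECRET pg_secret (
    TYPE DUCKLAKE,
    METADATA_PATH '',
    DATA_PATH 's3://ingestion-data/',
    METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'postgres'}
);

ATTACH 'ducklake:pg_secret' AS my_ducklake;

-- If the attach succeeds, the catalog and its tables should be visible
SHOW DATABASES;
SHOW ALL TABLES;
```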
Permission Errors:
- Verify S3 access credentials and permissions
- Check PostgreSQL user permissions for catalog operations
- Ensure DuckLake schema exists and is accessible
Schema Mismatches:
- Verify that data paths match between storage backends and DuckLake configuration
- Check that table schemas are compatible with DuckLake requirements
- Ensure Parquet files are valid and readable (a quick way to inspect a file is sketched below)
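To confirm that an uploaded file is readable and to see which schema will be inferred from it, you can inspect the file directly with DuckDB. A minimal sketch with a hypothetical file path; it assumes the httpfs/aws extensions and an S3 secret, as in the `attach` examples above:

```sql
-- Show the low-level Parquet schema of a single file
SELECT * FROM parquet_schema('s3://ingestion-data/events/part-0001.parquet');

-- Or look at the inferred column names and types
DESCRIBE SELECT * FROM read_parquet('s3://ingestion-data/events/part-0001.parquet');
```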
Performance Issues:
- Adjust `max_concurrent_registrations` based on your PostgreSQL capacity
- Monitor reconciliation intervals to avoid overwhelming the catalog
- Consider separating high-volume topics into different catalogs
## Environment Variables
DuckLake configuration supports environment variable substitution:
```yaml
ducklake:
  - name: my_ducklake
    data_path: "s3://${S3_BUCKET}/"
    attach: |
      CREATE OR REPLACE SECRET postgres (
          TYPE POSTGRES,
          HOST '${POSTGRES_HOST}',
          PORT ${POSTGRES_PORT},
          DATABASE '${POSTGRES_DB}',
          USER '${POSTGRES_USER}',
          PASSWORD '${POSTGRES_PASSWORD}'
      );
```
Set the environment variables:
```bash
export S3_BUCKET=my-data-bucket
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DB=ducklake_catalog
export POSTGRES_USER=ducklake_user
export POSTGRES_PASSWORD=secret_password
```
## Best Practices
- Use environment variables for sensitive configuration like passwords
- Configure reconciliation appropriately for your data volume and requirements
- Monitor DuckLake logs for registration failures and performance issues
- Test catalog connectivity before deploying to production
- Use separate catalogs for different use cases (analytics vs operational)
- Backup your catalog database regularly since it contains critical metadata
- Consider PostgreSQL performance when setting concurrent registration limits