
DuckLake Integration

DuckLake provides metadata management for your data lake. BoilStream integrates with DuckLake to automatically register Parquet files in your catalog as they are written to storage. When a Parquet file is registered, a corresponding DuckLake table is created if it does not already exist: the topic name becomes the table name, and the table schema is inferred from the uploaded Parquet file.
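
For example, if producers write to a topic named events, a table with that name appears in the attached catalog and can be queried from any DuckDB client. A minimal sketch, assuming the catalog and secret names used in the configuration below (the topic name is illustrative):

sql
-- Attach the same DuckLake catalog from an interactive DuckDB session
-- (reuse the ATTACH statements from your BoilStream configuration).
ATTACH 'ducklake:pg_secret' AS my_ducklake;

-- The "events" topic was ingested by BoilStream, so a table of the same name
-- exists and its schema was inferred from the uploaded Parquet files.
DESCRIBE my_ducklake.events;
SELECT count(*) FROM my_ducklake.events;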

Overview

DuckLake integration provides:

  • Automatic file registration: Parquet files are automatically registered with DuckLake catalogs after each successful upload
    • NOTE: DuckDB client-side SQL transactions do not map one-to-one to DuckLake catalog updates, because BoilStream aggregates data from many transactions into each Parquet file.
  • Metadata management: Centralized schema and table metadata through PostgreSQL or other catalogs
  • Multi-catalog support: Configure multiple DuckLake catalogs for different use cases
  • Reconciliation: Automatically sync missing files between storage and catalog. BoilStream keeps running if the catalog connection fails (e.g. the PostgreSQL server goes down).

Configuration

DuckLake is configured in the ducklake section of your configuration file. You can configure multiple DuckLake catalogs simultaneously.

Basic Configuration

yaml
ducklake:
  - name: my_ducklake
    data_path: "s3://ingestion-data/"
    attach: |
      FROM duckdb_secrets();
      -- Load extensions
      INSTALL ducklake;
      INSTALL postgres;
      INSTALL aws;
      INSTALL httpfs;
      LOAD ducklake;
      LOAD postgres;
      LOAD aws;
      LOAD httpfs;
      -- S3 access for MinIO
      CREATE OR REPLACE SECRET secretForDirectS3Access (
          TYPE S3,
          KEY_ID 'minioadmin',
          SECRET 'minioadmin',
          REGION 'us-east-1',
          ENDPOINT 'localhost:9000',
          USE_SSL false,
          URL_STYLE 'path',
          SCOPE 's3://ingestion-data/'
      );
      -- PostgreSQL secret for DuckLake catalog backend
      CREATE OR REPLACE SECRET postgres (
        TYPE POSTGRES,
        HOST 'localhost',
        PORT 5432,
        DATABASE 'boilstream',
        USER 'postgres',
        PASSWORD 'postgres'
      );
      -- DuckLake Postgres catalog
      CREATE OR REPLACE SECRET pg_secret (
        TYPE DUCKLAKE,
        METADATA_PATH '',
        DATA_PATH 's3://ingestion-data/',
        METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'postgres'}
      );
      -- DuckLake attachment with Postgres catalog
      ATTACH 'ducklake:pg_secret' AS my_ducklake;
    reconciliation:
      on_startup: true
      interval_minutes: 60
      max_concurrent_registrations: 10

Configuration Fields

Field | Type | Default | Description
----- | ---- | ------- | -----------
name | string | - | Unique identifier for this DuckLake catalog
data_path | string | - | S3 path where Parquet files are stored (must match the storage backend)
attach | string | - | SQL statements to execute for DuckLake attachment and setup
topics | array | all topics | Optional: specify which topics to include in this catalog
reconciliation.on_startup | boolean | true | Run reconciliation when the application starts
reconciliation.interval_minutes | number | 60 | Check for missing files every N minutes
reconciliation.max_concurrent_registrations | number | 10 | Parallel registration limit

Storage Backend Integration

DuckLake catalogs must be configured with storage backends that have DuckLake integration enabled:

yaml
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "ingestion-data"
      # ... other S3 configuration
      # DuckLake integration - register files with these catalogs
      ducklake: ["my_ducklake"]

ducklake:
  - name: my_ducklake
    data_path: "s3://ingestion-data/"
    # ... DuckLake configuration

The ducklake field in storage backends specifies which DuckLake catalogs should automatically register files written by that backend.

Setup Examples

MinIO + PostgreSQL Catalog

yaml
# Storage backend configuration
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      access_key: "minioadmin"
      secret_key: "minioadmin"
      region: "us-east-1"
      use_path_style: true
      bucket: "ingestion-data"
      prefix: "/"
      ducklake: ["my_ducklake"]

# DuckLake configuration
ducklake:
  - name: my_ducklake
    data_path: "s3://ingestion-data/"
    attach: |
      FROM duckdb_secrets();
      -- Load required extensions
      INSTALL ducklake;
      INSTALL postgres;
      INSTALL aws;
      INSTALL httpfs;
      LOAD ducklake;
      LOAD postgres;
      LOAD aws;
      LOAD httpfs;

      -- S3 access for MinIO
      CREATE OR REPLACE SECRET secretForDirectS3Access (
          TYPE S3,
          KEY_ID 'minioadmin',
          SECRET 'minioadmin',
          REGION 'us-east-1',
          ENDPOINT 'localhost:9000',
          USE_SSL false,
          URL_STYLE 'path',
          SCOPE 's3://ingestion-data/'
      );

      -- PostgreSQL secret for DuckLake catalog backend
      CREATE OR REPLACE SECRET postgres (
        TYPE POSTGRES,
        HOST 'localhost',
        PORT 5433,
        DATABASE 'boilstream',
        USER 'postgres',
        PASSWORD 'postgres'
      );

      -- DuckLake Postgres catalog
      CREATE OR REPLACE SECRET pg_secret (
        TYPE DUCKLAKE,
        METADATA_PATH '',
        DATA_PATH 's3://ingestion-data/',
        METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'postgres'}
      );

      -- DuckLake attachment with Postgres catalog
      ATTACH 'ducklake:pg_secret' AS my_ducklake;

    reconciliation:
      on_startup: true
      interval_minutes: 60
      max_concurrent_registrations: 10

AWS S3 + RDS PostgreSQL

yaml
storage:
  backends:
    - name: "production-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "https://s3.amazonaws.com"
      bucket: "my-production-bucket"
      access_key: "${AWS_ACCESS_KEY_ID}"
      secret_key: "${AWS_SECRET_ACCESS_KEY}"
      region: "us-east-1"
      ducklake: ["production_catalog"]

ducklake:
  - name: production_catalog
    data_path: "s3://my-production-bucket/"
    attach: |
      -- Load extensions
      INSTALL ducklake;
      INSTALL postgres;
      INSTALL aws;
      LOAD ducklake;
      LOAD postgres;
      LOAD aws;

      -- AWS S3 access
      CREATE OR REPLACE SECRET s3_access (
          TYPE S3,
          KEY_ID '${AWS_ACCESS_KEY_ID}',
          SECRET '${AWS_SECRET_ACCESS_KEY}',
          REGION 'us-east-1',
          SCOPE 's3://my-production-bucket/'
      );

      -- RDS PostgreSQL catalog
      CREATE OR REPLACE SECRET rds_postgres (
        TYPE POSTGRES,
        HOST 'my-rds-instance.region.rds.amazonaws.com',
        PORT 5432,
        DATABASE 'ducklake_catalog',
        USER 'ducklake_user',
        PASSWORD '${RDS_PASSWORD}'
      );

      -- DuckLake catalog configuration
      CREATE OR REPLACE SECRET production_secret (
        TYPE DUCKLAKE,
        METADATA_PATH '',
        DATA_PATH 's3://my-production-bucket/',
        METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'rds_postgres'}
      );

      -- Attach DuckLake catalog
      ATTACH 'ducklake:production_secret' AS production_catalog;

    reconciliation:
      on_startup: true
      interval_minutes: 30
      max_concurrent_registrations: 20

Multi-Catalog Setup

yaml
storage:
  backends:
    - name: "main-storage"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "ingestion-data"
      # Register with both catalogs
      ducklake: ["analytics_catalog", "operational_catalog"]

ducklake:
  # Analytics catalog for BI tools
  - name: analytics_catalog
    data_path: "s3://ingestion-data/"
    # Include only analytics topics
    topics: ["events", "user_interactions", "metrics"]
    attach: |
      -- ... setup for analytics PostgreSQL database
      ATTACH 'ducklake:analytics_secret' AS analytics_catalog;
    reconciliation:
      on_startup: true
      interval_minutes: 60

  # Operational catalog for real-time monitoring
  - name: operational_catalog
    data_path: "s3://ingestion-data/"
    # Include only operational topics
    topics: ["logs", "alerts", "health_checks"]
    attach: |
      -- ... setup for operational PostgreSQL database
      ATTACH 'ducklake:operational_secret' AS operational_catalog;
    reconciliation:
      on_startup: true
      interval_minutes: 15

File Registration

When DuckLake integration is enabled, BoilStream automatically:

  1. Writes Parquet files to the configured storage backend
  2. Registers files with all specified DuckLake catalogs after successful upload
  3. Handles failures gracefully - catalog registration failures don't affect data ingestion
  4. Provides logging for all registration attempts and results

Registration Process

  1. File is successfully written to storage backend
  2. For each DuckLake catalog listed in the backend's ducklake field:
    • Connect to DuckLake using the attach SQL statements
    • Execute CALL ducklake_add_data_files() with the file path
    • Log success or failure
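
The same registration can be performed by hand from a DuckDB session, which is handy when checking that a specific file is compatible with an existing table. A minimal sketch, assuming the catalog is attached as my_ducklake and the topic/table is events (the file path is illustrative, and the exact ducklake_add_data_files signature may differ between DuckLake versions):

sql
-- Manually register one Parquet file with an existing DuckLake table.
-- Catalog, table, and path are illustrative; substitute your own values.
CALL ducklake_add_data_files(
    'my_ducklake',                                    -- attached catalog name
    'events',                                         -- target table (topic name)
    's3://ingestion-data/events/part-0001.parquet'    -- file written by BoilStream
);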

Error Handling

  • Storage failures: Prevent file registration (file must be written successfully first)
  • Catalog failures: Logged but don't affect ingestion pipeline
  • Connection failures: Retried automatically with exponential backoff
  • SQL failures: Logged with full error details for debugging

Reconciliation

DuckLake reconciliation ensures that all files in storage are properly registered in the catalog. This is useful for:

  • Startup recovery: Register files that may have been missed during downtime
  • Periodic maintenance: Catch any registration failures that weren't retried
  • Manual verification: Ensure storage and catalog are synchronized

Reconciliation Configuration

yaml
ducklake:
  - name: my_ducklake
    # ... other configuration
    reconciliation:
      on_startup: true # Run reconciliation when application starts
      interval_minutes: 60 # Check for missing files every hour
      max_concurrent_registrations: 10 # Parallel registration limit

Reconciliation Process

  1. List files in the storage backend's data path
  2. Check catalog for each file's registration status
  3. Register missing files using CALL ducklake_add_data_files()
  4. Log results for monitoring and debugging
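
A manual spot check along the same lines can be run from a DuckDB session: list what is actually in storage, then register anything the catalog is missing. A rough sketch, assuming the S3 secret and attached catalog from the earlier examples are in place (the path pattern and table name are illustrative, and the ducklake_add_data_files signature may vary by DuckLake version):

sql
-- List the Parquet files BoilStream has written for one topic in object storage.
SELECT file FROM glob('s3://ingestion-data/events/*.parquet');

-- Register a file that exists in storage but is missing from the catalog.
CALL ducklake_add_data_files(
    'my_ducklake',
    'events',
    's3://ingestion-data/events/part-0042.parquet'  -- illustrative path
);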

Monitoring and Troubleshooting

Logging

DuckLake operations are logged with detailed information:

INFO ducklake: Registering file with catalog my_ducklake: s3://bucket/path/file.parquet
INFO ducklake: Successfully registered file with catalog my_ducklake
WARN ducklake: Failed to register file with catalog my_ducklake: Table does not exist

Common Issues

Connection Failures:

  • Verify PostgreSQL connection parameters in the attach SQL
  • Check network connectivity between BoilStream and PostgreSQL
  • Ensure DuckLake extension is installed and loaded
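
A quick way to rule out connectivity problems is to attach the catalog database directly from a DuckDB shell on the BoilStream host. A sketch, using illustrative connection parameters (substitute the values from your attach SQL):

sql
-- Verify PostgreSQL connectivity from the BoilStream host.
INSTALL postgres;
LOAD postgres;
-- ATTACH fails immediately if the host is unreachable or credentials are wrong.
ATTACH 'host=localhost port=5432 dbname=boilstream user=postgres password=postgres'
    AS pg_check (TYPE POSTGRES);
DETACH pg_check;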

Permission Errors:

  • Verify S3 access credentials and permissions
  • Check PostgreSQL user permissions for catalog operations
  • Ensure DuckLake schema exists and is accessible

Schema Mismatches:

  • Verify that data paths match between storage backends and DuckLake configuration
  • Check that table schemas are compatible with DuckLake requirements
  • Ensure Parquet files are valid and readable

Performance Issues:

  • Adjust max_concurrent_registrations based on your PostgreSQL capacity
  • Monitor reconciliation intervals to avoid overwhelming the catalog
  • Consider separating high-volume topics into different catalogs

Environment Variables

DuckLake configuration supports environment variable substitution:

yaml
ducklake:
  - name: my_ducklake
    data_path: "s3://${S3_BUCKET}/"
    attach: |
      CREATE OR REPLACE SECRET postgres (
        TYPE POSTGRES,
        HOST '${POSTGRES_HOST}',
        PORT ${POSTGRES_PORT},
        DATABASE '${POSTGRES_DB}',
        USER '${POSTGRES_USER}',
        PASSWORD '${POSTGRES_PASSWORD}'
      );

Set the environment variables:

bash
export S3_BUCKET=my-data-bucket
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DB=ducklake_catalog
export POSTGRES_USER=ducklake_user
export POSTGRES_PASSWORD=secret_password

Best Practices

  1. Use environment variables for sensitive configuration like passwords
  2. Configure reconciliation appropriately for your data volume and requirements
  3. Monitor DuckLake logs for registration failures and performance issues
  4. Test catalog connectivity before deploying to production
  5. Use separate catalogs for different use cases (analytics vs operational)
  6. Backup your catalog database regularly since it contains critical metadata
  7. Consider PostgreSQL performance when setting concurrent registration limits