
DuckLake Integration

DuckLake provides metadata management for your data lake. BoilStream integrates with DuckLake to automatically register Parquet files in your catalog as they are written to storage. When a Parquet file is registered, a corresponding DuckLake table is created if it does not already exist: the topic name becomes the table name, and the table schema is inferred from the uploaded Parquet file.
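
For example, if producers write to a topic named events, a table with that name appears in the attached catalog and can be queried from any DuckDB client. A minimal sketch, assuming the catalog and secret names used in the configuration below (the topic name is illustrative):

sql
-- Attach the same DuckLake catalog from an interactive DuckDB session
-- (reuse the ATTACH statements from your BoilStream configuration).
ATTACH 'ducklake:pg_secret' AS my_ducklake;

-- The "events" topic was ingested by BoilStream, so a table of the same name
-- exists and its schema was inferred from the uploaded Parquet files.
DESCRIBE my_ducklake.events;
SELECT count(*) FROM my_ducklake.events;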

Overview

DuckLake integration provides:

  • Automatic file registration: Parquet files are automatically registered with DuckLake catalogs after each successful upload
    • NOTE: DuckDB client-side SQL transactions do not map one-to-one to DuckLake catalog updates, because BoilStream aggregates data from many transactions into each Parquet file.
  • Metadata management: Centralized schema and table metadata through PostgreSQL or other catalogs
  • Multi-catalog support: Configure multiple DuckLake catalogs for different use cases
  • Reconciliation: Automatically sync missing files between storage and catalog. BoilStream keeps running if the catalog connection fails (e.g. the PostgreSQL server goes down).

Configuration

DuckLake is configured in the ducklake section of your configuration file. You can configure multiple DuckLake catalogs simultaneously.

Basic Configuration

yaml
ducklake:
  - name: my_ducklake
    data_path: "s3://ingestion-data/"
    attach: |
      FROM duckdb_secrets();
      -- Load extensions
      INSTALL ducklake;
      INSTALL postgres;
      INSTALL aws;
      INSTALL httpfs;
      LOAD ducklake;
      LOAD postgres;
      LOAD aws;
      LOAD httpfs;
      -- S3 access for MinIO
      CREATE OR REPLACE SECRET secretForDirectS3Access (
          TYPE S3,
          KEY_ID 'minioadmin',
          SECRET 'minioadmin',
          REGION 'us-east-1',
          ENDPOINT 'localhost:9000',
          USE_SSL false,
          URL_STYLE 'path',
          SCOPE 's3://ingestion-data/'
      );
      -- PostgreSQL secret for DuckLake catalog backend
      CREATE OR REPLACE SECRET postgres (
        TYPE POSTGRES,
        HOST 'localhost',
        PORT 5432,
        DATABASE 'boilstream',
        USER 'postgres',
        PASSWORD 'postgres'
      );
      -- DuckLake Postgres catalog
      CREATE OR REPLACE SECRET pg_secret (
        TYPE DUCKLAKE,
        METADATA_PATH '',
        DATA_PATH 's3://ingestion-data/',
        METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'postgres'}
      );
      -- DuckLake attachment with Postgres catalog
      ATTACH 'ducklake:pg_secret' AS my_ducklake;
    reconciliation:
      on_startup: true
      interval_minutes: 60
      max_concurrent_registrations: 10

Configuration Fields

Field | Type | Default | Description
----- | ---- | ------- | -----------
name | string | - | Unique identifier for this DuckLake catalog
data_path | string | - | S3 path where Parquet files are stored (must match the storage backend)
attach | string | - | SQL statements to execute for DuckLake attachment and setup
topics | array | all topics | Optional: specify which topics to include in this catalog
reconciliation.on_startup | boolean | true | Run reconciliation when the application starts
reconciliation.interval_minutes | number | 60 | Check for missing files every N minutes
reconciliation.max_concurrent_registrations | number | 10 | Parallel registration limit

Storage Backend Integration

DuckLake catalogs must be configured with storage backends that have DuckLake integration enabled:

yaml
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "ingestion-data"
      # ... other S3 configuration
      # DuckLake integration - register files with these catalogs
      ducklake: ["my_ducklake"]

ducklake:
  - name: my_ducklake
    data_path: "s3://ingestion-data/"
    # ... DuckLake configuration

The ducklake field in storage backends specifies which DuckLake catalogs should automatically register files written by that backend.

Setup Examples

MinIO + PostgreSQL Catalog

yaml
# Storage backend configuration
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      access_key: "minioadmin"
      secret_key: "minioadmin"
      region: "us-east-1"
      use_path_style: true
      bucket: "ingestion-data"
      prefix: "/"
      ducklake: ["my_ducklake"]

# DuckLake configuration
ducklake:
  - name: my_ducklake
    data_path: "s3://ingestion-data/"
    attach: |
      FROM duckdb_secrets();
      -- Load required extensions
      INSTALL ducklake;
      INSTALL postgres;
      INSTALL aws;
      INSTALL httpfs;
      LOAD ducklake;
      LOAD postgres;
      LOAD aws;
      LOAD httpfs;

      -- S3 access for MinIO
      CREATE OR REPLACE SECRET secretForDirectS3Access (
          TYPE S3,
          KEY_ID 'minioadmin',
          SECRET 'minioadmin',
          REGION 'us-east-1',
          ENDPOINT 'localhost:9000',
          USE_SSL false,
          URL_STYLE 'path',
          SCOPE 's3://ingestion-data/'
      );

      -- PostgreSQL secret for DuckLake catalog backend
      CREATE OR REPLACE SECRET postgres (
        TYPE POSTGRES,
        HOST 'localhost',
        PORT 5433,
        DATABASE 'boilstream',
        USER 'postgres',
        PASSWORD 'postgres'
      );

      -- DuckLake Postgres catalog
      CREATE OR REPLACE SECRET pg_secret (
        TYPE DUCKLAKE,
        METADATA_PATH '',
        DATA_PATH 's3://ingestion-data/',
        METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'postgres'}
      );

      -- DuckLake attachment with Postgres catalog
      ATTACH 'ducklake:pg_secret' AS my_ducklake;

    reconciliation:
      on_startup: true
      interval_minutes: 60
      max_concurrent_registrations: 10

AWS S3 + RDS PostgreSQL

yaml
storage:
  backends:
    - name: "production-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "https://s3.amazonaws.com"
      bucket: "my-production-bucket"
      access_key: "${AWS_ACCESS_KEY_ID}"
      secret_key: "${AWS_SECRET_ACCESS_KEY}"
      region: "us-east-1"
      ducklake: ["production_catalog"]

ducklake:
  - name: production_catalog
    data_path: "s3://my-production-bucket/"
    attach: |
      -- Load extensions
      INSTALL ducklake;
      INSTALL postgres;
      INSTALL aws;
      LOAD ducklake;
      LOAD postgres;
      LOAD aws;

      -- AWS S3 access
      CREATE OR REPLACE SECRET s3_access (
          TYPE S3,
          KEY_ID '${AWS_ACCESS_KEY_ID}',
          SECRET '${AWS_SECRET_ACCESS_KEY}',
          REGION 'us-east-1',
          SCOPE 's3://my-production-bucket/'
      );

      -- RDS PostgreSQL catalog
      CREATE OR REPLACE SECRET rds_postgres (
        TYPE POSTGRES,
        HOST 'my-rds-instance.region.rds.amazonaws.com',
        PORT 5432,
        DATABASE 'ducklake_catalog',
        USER 'ducklake_user',
        PASSWORD '${RDS_PASSWORD}'
      );

      -- DuckLake catalog configuration
      CREATE OR REPLACE SECRET production_secret (
        TYPE DUCKLAKE,
        METADATA_PATH '',
        DATA_PATH 's3://my-production-bucket/',
        METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'rds_postgres'}
      );

      -- Attach DuckLake catalog
      ATTACH 'ducklake:production_secret' AS production_catalog;

    reconciliation:
      on_startup: true
      interval_minutes: 30
      max_concurrent_registrations: 20

Multi-Catalog Setup

yaml
storage:
  backends:
    - name: "main-storage"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "ingestion-data"
      # Register with both catalogs
      ducklake: ["analytics_catalog", "operational_catalog"]

ducklake:
  # Analytics catalog for BI tools
  - name: analytics_catalog
    data_path: "s3://ingestion-data/"
    # Include only analytics topics
    topics: ["events", "user_interactions", "metrics"]
    attach: |
      -- ... setup for analytics PostgreSQL database
      ATTACH 'ducklake:analytics_secret' AS analytics_catalog;
    reconciliation:
      on_startup: true
      interval_minutes: 60

  # Operational catalog for real-time monitoring
  - name: operational_catalog
    data_path: "s3://ingestion-data/"
    # Include only operational topics
    topics: ["logs", "alerts", "health_checks"]
    attach: |
      -- ... setup for operational PostgreSQL database
      ATTACH 'ducklake:operational_secret' AS operational_catalog;
    reconciliation:
      on_startup: true
      interval_minutes: 15

File Registration

When DuckLake integration is enabled, BoilStream automatically:

  1. Writes Parquet files to the configured storage backend
  2. Registers files with all specified DuckLake catalogs after successful upload
  3. Handles failures gracefully - catalog registration failures don't affect data ingestion
  4. Provides logging for all registration attempts and results

Registration Process

  1. File is successfully written to storage backend
  2. For each DuckLake catalog listed in the backend's ducklake field:
    • Connect to DuckLake using the attach SQL statements
    • Execute CALL ducklake_add_data_files() with the file path
    • Log success or failure
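
The same registration can be performed by hand from a DuckDB session, which is handy when checking that a specific file is compatible with an existing table. A minimal sketch, assuming the catalog is attached as my_ducklake and the topic/table is events (the file path is illustrative, and the exact ducklake_add_data_files signature may differ between DuckLake versions):

sql
-- Manually register one Parquet file with an existing DuckLake table.
-- Catalog, table, and path are illustrative; substitute your own values.
CALL ducklake_add_data_files(
    'my_ducklake',                                    -- attached catalog name
    'events',                                         -- target table (topic name)
    's3://ingestion-data/events/part-0001.parquet'    -- file written by BoilStream
);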

Error Handling

  • Storage failures: Prevent file registration (file must be written successfully first)
  • Catalog failures: Logged but don't affect ingestion pipeline
  • Connection failures: Retried automatically with exponential backoff
  • SQL failures: Logged with full error details for debugging

Reconciliation

DuckLake reconciliation ensures that all files in storage are properly registered in the catalog. This is useful for:

  • Startup recovery: Register files that may have been missed during downtime
  • Periodic maintenance: Catch any registration failures that weren't retried
  • Manual verification: Ensure storage and catalog are synchronized

Reconciliation Configuration

yaml
ducklake:
  - name: my_ducklake
    # ... other configuration
    reconciliation:
      on_startup: true # Run reconciliation when application starts
      interval_minutes: 60 # Check for missing files every hour
      max_concurrent_registrations: 10 # Parallel registration limit

Reconciliation Process

  1. List files in the storage backend's data path
  2. Check catalog for each file's registration status
  3. Register missing files using CALL ducklake_add_data_files()
  4. Log results for monitoring and debugging
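
A manual spot check along the same lines can be run from a DuckDB session: list what is actually in storage, then register anything the catalog is missing. A rough sketch, assuming the S3 secret and attached catalog from the earlier examples are in place (the path pattern and table name are illustrative, and the ducklake_add_data_files signature may vary by DuckLake version):

sql
-- List the Parquet files BoilStream has written for one topic in object storage.
SELECT file FROM glob('s3://ingestion-data/events/*.parquet');

-- Register a file that exists in storage but is missing from the catalog.
CALL ducklake_add_data_files(
    'my_ducklake',
    'events',
    's3://ingestion-data/events/part-0042.parquet'  -- illustrative path
);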

Monitoring and Troubleshooting

Logging

DuckLake operations are logged with detailed information:

INFO ducklake: Registering file with catalog my_ducklake: s3://bucket/path/file.parquet
INFO ducklake: Successfully registered file with catalog my_ducklake
WARN ducklake: Failed to register file with catalog my_ducklake: Table does not exist

Common Issues

Connection Failures:

  • Verify PostgreSQL connection parameters in the attach SQL
  • Check network connectivity between BoilStream and PostgreSQL
  • Ensure DuckLake extension is installed and loaded
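
A quick way to rule out connectivity problems is to attach the catalog database directly from a DuckDB shell on the BoilStream host. A sketch, using illustrative connection parameters (substitute the values from your attach SQL):

sql
-- Verify PostgreSQL connectivity from the BoilStream host.
INSTALL postgres;
LOAD postgres;
-- ATTACH fails immediately if the host is unreachable or credentials are wrong.
ATTACH 'host=localhost port=5432 dbname=boilstream user=postgres password=postgres'
    AS pg_check (TYPE POSTGRES);
DETACH pg_check;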

Permission Errors:

  • Verify S3 access credentials and permissions
  • Check PostgreSQL user permissions for catalog operations
  • Ensure DuckLake schema exists and is accessible

Schema Mismatches:

  • Verify that data paths match between storage backends and DuckLake configuration
  • Check that table schemas are compatible with DuckLake requirements
  • Ensure Parquet files are valid and readable

Performance Issues:

  • Adjust max_concurrent_registrations based on your PostgreSQL capacity
  • Monitor reconciliation intervals to avoid overwhelming the catalog
  • Consider separating high-volume topics into different catalogs

Environment Variables

DuckLake configuration supports environment variable substitution:

yaml
ducklake:
  - name: my_ducklake
    data_path: "s3://${S3_BUCKET}/"
    attach: |
      CREATE OR REPLACE SECRET postgres (
        TYPE POSTGRES,
        HOST '${POSTGRES_HOST}',
        PORT ${POSTGRES_PORT},
        DATABASE '${POSTGRES_DB}',
        USER '${POSTGRES_USER}',
        PASSWORD '${POSTGRES_PASSWORD}'
      );

Set the environment variables:

bash
export S3_BUCKET=my-data-bucket
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DB=ducklake_catalog
export POSTGRES_USER=ducklake_user
export POSTGRES_PASSWORD=secret_password

Best Practices

  1. Use environment variables for sensitive configuration like passwords
  2. Configure reconciliation appropriately for your data volume and requirements
  3. Monitor DuckLake logs for registration failures and performance issues
  4. Test catalog connectivity before deploying to production
  5. Use separate catalogs for different use cases (analytics vs operational)
  6. Backup your catalog database regularly since it contains critical metadata
  7. Consider PostgreSQL performance when setting concurrent registration limits