Configuration

BoilStream supports flexible configuration through YAML files and environment variables. This page covers all configuration options and how to use them.

Configuration Loading Priority

Configuration is loaded in the following priority order (highest to lowest):

  1. Command line --config file
  2. CONFIG_FILE environment variable
  3. Built-in defaults (if no config file specified)

Environment variables always override YAML file settings regardless of the config file source.

Specifying Configuration Files

Command Line Option

bash
# Specify config file via command line
boilstream --config my-config.yaml

# Alternative syntax
boilstream --config=my-config.yaml

Environment Variable

bash
# Specify config file via environment variable
CONFIG_FILE=my-config.yaml boilstream

No Configuration File

bash
# Use only built-in defaults + environment variables
S3_BUCKET=my-bucket boilstream

YAML Configuration Format

Here's a complete example configuration file:

yaml
# AWS Configuration
aws:
  region: "eu-west-1"
  # access_key_id: "your-access-key"  # Optional - can use AWS CLI/IAM roles
  # secret_access_key: "your-secret-key"  # Optional - can use AWS CLI/IAM roles
  https_conn_pool_size: 100

# DuckDB Persistence Configuration (Optional High-Performance Local Storage)
duckdb_persistence:
  enabled: true # Enable high-performance local DuckDB persistence (10M+ rows/s)
  storage_path: "/tmp/duckdb/topics" # Directory for shared DuckDB database files
  max_writers: 10 # Number of concurrent database writers for optimal performance

# Storage Configuration
storage:
  # Multiple storage backends can be configured simultaneously
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true # Primary backend - operations must succeed here
      # S3-specific configuration
      endpoint: "http://localhost:9000" # For MinIO/custom S3
      bucket: "ingestion-data"
      prefix: "/"
      access_key: "minioadmin"
      secret_key: "minioadmin"
      region: "us-east-1"
      use_path_style: true # Required for MinIO
      max_concurrent_uploads: 10
      upload_id_pool_capacity: 100
      max_retries: 3
      initial_backoff_ms: 100
      max_retry_attempts: 3
      flush_interval_ms: 250
      max_multipart_object_size: 104857600 # 100 MB
    - name: "backup-filesystem"
      backend_type: "filesystem"
      enabled: true
      primary: false # Secondary backend - failures are logged but not fatal
      # Filesystem-specific configuration
      prefix: "/tmp/storage"
    # - name: "debug-noop"
    #   backend_type: "noop"
    #   enabled: false
    #   primary: false  # For testing/benchmarking without actual storage

# Server Configuration
server:
  # tokio_worker_threads: 16  # Optional - defaults to system CPU count
  flight_thread_count: 1
  flight_base_port: 50050
  admin_flight_port: 50160
  consumer_flight_port: 50250
  valkey_url: "redis://localhost:6379"

# Data Processing Configuration
processing:
  data_processing_threads: 8
  buffer_pool_max_size: 50
  window_queue_capacity: 30000
  window_ms: 10000
  include_metadata_columns: true
  schema_validation_enabled: true
  parquet:
    compression: "ZSTD"
    dictionary_enabled: true

# Rate Limiting Configuration
rate_limiting:
  disabled: false
  max_requests: 15000000
  burst_limit: 20000000
  global_limit: 150000000
  base_size_bytes: 4096

# TLS Configuration
tls:
  disabled: true # Disabled for development
  # cert_path: "/path/to/cert.pem"
  # key_path: "/path/to/key.pem"
  # cert_pem: "-----BEGIN CERTIFICATE-----\n..."
  # key_pem: "-----BEGIN PRIVATE KEY-----\n..."
  # grpc_default_ssl_roots_file_path: "/path/to/ca-certificates.crt"

# Authentication Configuration
auth:
  providers: [] # Empty for development - no auth
  authorization_enabled: false
  admin_groups: []
  read_only_groups: []
  cognito:
    # user_pool_id: "us-east-1_example"
    # region: "us-east-1"
    # audience: "client-id"
  azure:
    # tenant_id: "tenant-id"
    # client_id: "client-id"
    allow_multi_tenant: false
  gcp:
    # client_id: "client-id"
    # project_id: "project-id"
    require_workspace_domain: false
  auth0:
    # tenant: "your-tenant.auth0.com"
    # audience: "your-api-identifier"
    # groups_namespace: "https://your-app.com/groups"
  okta:
    # org_domain: "your-org.okta.com"
    # audience: "api://your-audience"
    # auth_server_id: "your-auth-server"

# Metrics Configuration
metrics:
  port: 8081
  flush_interval_ms: 1000

# Logging Configuration
logging:
  rust_log: "info"

Configuration Sections

AWS Configuration

Configure AWS credentials for S3 backends:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| aws.region | string | "us-east-1" | AWS region |
| aws.access_key_id | string | null | AWS access key (optional) |
| aws.secret_access_key | string | null | AWS secret key (optional) |
| aws.https_conn_pool_size | number | 100 | HTTP connection pool size |

Note: S3 configuration is now done per-backend in the storage.backends section.
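
For example, a minimal aws section that relies on the AWS CLI / IAM role credential chain (the region value is illustrative):

yaml
aws:
  region: "eu-west-1"
  https_conn_pool_size: 100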

DuckDB Persistence Configuration

BoilStream provides optional high-performance local DuckDB persistence alongside its diskless S3 pipeline. When enabled, data is simultaneously written to both S3 (diskless) and local DuckDB databases (shared across topics).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| duckdb_persistence.enabled | boolean | false | Enable DuckDB persistence (10M+ rows/s) |
| duckdb_persistence.storage_path | string | "/tmp/duckdb/topics" | Directory for shared DuckDB database files |
| duckdb_persistence.max_writers | number | 10 | Number of concurrent database writers |
| duckdb_persistence.dry_run | boolean | false | Process Arrow data but skip actual writes |
| duckdb_persistence.super_dry_run | boolean | false | Completely skip DuckDB processing |

DuckDB Persistence Benefits

High Performance:

  • 10+ million rows/second ingestion rate into local databases
  • Shared database files allow cross-topic queries and joins
  • No backup infrastructure needed - S3 provides automatic replication

Architecture Integration:

  • Roadmap: Source for window queries and time-series analysis
  • Roadmap: FlightSQL integration for direct BI tool connectivity
  • Roadmap: Live queries over past N hours of ingested data

Example Configuration:

yaml
# High-performance dual storage: S3 (diskless) + DuckDB (local)
duckdb_persistence:
  enabled: true
  storage_path: "/data/duckdb"
  max_writers: 16 # Scale with CPU cores

# Continues to write to S3 backends simultaneously
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      # ... S3 configuration

DuckLake Configuration

BoilStream integrates with DuckLake: after a Parquet file is successfully uploaded to storage, it is automatically registered in the configured DuckLake catalogs.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| ducklake[].name | string | - | Unique identifier for this DuckLake catalog |
| ducklake[].data_path | string | - | S3 path where Parquet files are stored |
| ducklake[].attach | string | - | SQL statements for DuckLake attachment and setup |
| ducklake[].topics | array | all topics | Optional: specify which topics to include |
| ducklake[].reconciliation.on_startup | boolean | true | Run reconciliation when application starts |
| ducklake[].reconciliation.interval_minutes | number | 60 | Check for missing files every N minutes |
| ducklake[].reconciliation.max_concurrent_registrations | number | 10 | Parallel registration limit |

Example Configuration:

yaml
ducklake:
  - name: my_ducklake
    data_path: "s3://ingestion-data/"
    attach: |
      INSTALL ducklake; INSTALL postgres; INSTALL aws;
      LOAD ducklake; LOAD postgres; LOAD aws;
      CREATE SECRET s3_access (TYPE S3, KEY_ID 'key', SECRET 'secret');
      CREATE SECRET postgres (TYPE POSTGRES, HOST 'localhost', DATABASE 'catalog');
      CREATE SECRET pg_secret (TYPE DUCKLAKE, DATA_PATH 's3://ingestion-data/', 
                               METADATA_PARAMETERS MAP {'TYPE': 'postgres', 'SECRET': 'postgres'});
      ATTACH 'ducklake:pg_secret' AS my_ducklake;
    reconciliation:
      on_startup: true
      interval_minutes: 60

DuckLake Integration with Storage Backends:

Storage backends can automatically register files with DuckLake catalogs:

yaml
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      # ... S3 configuration
      ducklake: ["my_ducklake"] # Auto-register files with this catalog

See the DuckLake Integration guide for detailed setup instructions.

Storage Configuration

BoilStream supports multiple concurrent storage backends, allowing you to write data to several destinations simultaneously. This enables scenarios like:

  • Primary + backup storage (S3 + filesystem)
  • Multi-cloud redundancy (S3 + another cloud provider)
  • Testing and auditing (production storage + debug/noop storage)
  • DuckLake integration (automatic catalog registration)

Storage Backends

Configure multiple storage backends in the storage.backends array:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| storage.backends[].name | string | - | Unique identifier for this backend |
| storage.backends[].backend_type | string | - | Backend type: "s3", "filesystem", or "noop" |
| storage.backends[].enabled | boolean | - | Whether this backend is active |
| storage.backends[].primary | boolean | - | If true, operations must succeed on this backend |
| storage.backends[].ducklake | array | [] | List of DuckLake catalogs to register files with |

Backend Types:

  • s3 - AWS S3 or S3-compatible storage (MinIO, etc.)
  • filesystem - Local or network filesystem storage
  • noop - No-operation storage for testing/benchmarking

Primary vs Secondary Backends:

  • Primary backends (primary: true) must succeed for the operation to be considered successful
  • Secondary backends (primary: false) are best-effort; failures are logged but don't fail the operation

Backend-Specific Configuration

S3 Backend Configuration:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| endpoint | string | null | S3 endpoint URL (required for S3 backends) |
| bucket | string | null | S3 bucket name (required for S3 backends) |
| prefix | string | "" | Base prefix for S3 uploads (optional) |
| access_key | string | null | S3 access key (required for S3 backends) |
| secret_key | string | null | S3 secret key (required for S3 backends) |
| region | string | "us-east-1" | AWS region (optional for S3 backends) |
| use_path_style | boolean | auto-detected | Use path-style addressing (auto-detects MinIO) |
| max_concurrent_uploads | number | 10 | Maximum concurrent uploads |
| upload_id_pool_capacity | number | 100 | Upload ID pool capacity |
| max_retries | number | 3 | Maximum retry attempts |
| initial_backoff_ms | number | 100 | Initial backoff in milliseconds |
| max_retry_attempts | number | 3 | Maximum retry attempts |
| flush_interval_ms | number | 250 | Data sync interval in milliseconds |
| max_multipart_object_size | number | 104857600 | Maximum multipart object size (100 MB) |

Filesystem Backend Configuration:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| prefix | string | "./storage" | Base directory path for filesystem storage (required for filesystem backends) |

MinIO Configuration

MinIO is supported through the S3 backend type. To configure MinIO, use backend_type: "s3" with these specific settings:

yaml
storage:
  backends:
    - name: "minio-storage"
      backend_type: "s3" # Use S3 backend type for MinIO
      enabled: true
      primary: true
      endpoint: "http://localhost:9000" # MinIO endpoint
      bucket: "your-bucket-name"
      prefix: "/"
      access_key: "minioadmin"
      secret_key: "minioadmin"
      region: "us-east-1"
      use_path_style: true # Required for MinIO

MinIO-specific notes:

  • Always set use_path_style: true for MinIO compatibility
  • Use backend_type: "s3" (not a separate MinIO type)
  • The system automatically detects MinIO endpoints and sets path-style addressing if not explicitly configured

Environment Variable Configuration

You can configure multiple backends via the STORAGE_BACKENDS environment variable:

bash
# Enable S3 and filesystem backends (S3 primary, filesystem secondary)
STORAGE_BACKENDS="s3,filesystem" boilstream

# Enable only filesystem storage
STORAGE_BACKENDS="filesystem" boilstream

# Enable S3, filesystem, and noop for testing
STORAGE_BACKENDS="s3,filesystem,noop" boilstream

Example Configurations

Primary S3 + Backup Filesystem:

yaml
storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "https://s3.amazonaws.com"
      bucket: "my-production-bucket"
      prefix: ""
      access_key: "${AWS_ACCESS_KEY_ID}"
      secret_key: "${AWS_SECRET_ACCESS_KEY}"
      region: "us-east-1"
      use_path_style: false
    - name: "backup-filesystem"
      backend_type: "filesystem"
      enabled: true
      primary: false
      prefix: "/backup/storage"

MinIO Development Setup:

yaml
storage:
  backends:
    - name: "minio-dev"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "ingestion-data"
      prefix: "/"
      access_key: "minioadmin"
      secret_key: "minioadmin"
      region: "us-east-1"
      use_path_style: true # Required for MinIO

Development with NoOp for Performance Testing:

yaml
storage:
  backends:
    - name: "main-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "test-bucket"
      prefix: ""
      access_key: "minioadmin"
      secret_key: "minioadmin"
      use_path_style: true
    - name: "perf-test"
      backend_type: "noop"
      enabled: true
      primary: false

Filesystem Only (Local Development):

yaml
storage:
  backends:
    - name: "local-dev"
      backend_type: "filesystem"
      enabled: true
      primary: true
      prefix: "./local-storage"

Server Configuration

Configure server ports and threading:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| server.tokio_worker_threads | number | null | Number of Tokio worker threads |
| server.flight_thread_count | number | 1 | Number of FlightRPC threads |
| server.flight_base_port | number | 50050 | Base port for FlightRPC servers |
| server.admin_flight_port | number | 50160 | Admin service port |
| server.consumer_flight_port | number | 50250 | Consumer service port |
| server.valkey_url | string | "redis://localhost:6379" | Valkey/Redis connection URL |
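
For example, a server section that shifts the FlightRPC ports and points at a remote Valkey instance (host and port values are illustrative):

yaml
server:
  flight_thread_count: 2
  flight_base_port: 51050
  admin_flight_port: 51160
  consumer_flight_port: 51250
  valkey_url: "redis://valkey.internal:6379"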

Processing Configuration

Configure data processing behavior:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| processing.data_processing_threads | number | 8 | Number of data processing threads |
| processing.buffer_pool_max_size | number | 50 | Maximum buffer pool size |
| processing.window_queue_capacity | number | 30000 | Window queue capacity |
| processing.window_ms | number | 10000 | Window duration in milliseconds |
| processing.dry_run | boolean | false | Enable dry run mode |
| processing.include_metadata_columns | boolean | true | Include metadata columns |
| processing.schema_validation_enabled | boolean | true | Enable schema validation |

Parquet Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| processing.parquet.compression | string | "ZSTD" | Parquet compression algorithm |
| processing.parquet.dictionary_enabled | boolean | true | Enable dictionary encoding |
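
For example, a processing section with more threads and a shorter window, keeping the default ZSTD Parquet compression (values are illustrative):

yaml
processing:
  data_processing_threads: 16
  buffer_pool_max_size: 100
  window_queue_capacity: 50000
  window_ms: 5000
  include_metadata_columns: true
  schema_validation_enabled: true
  parquet:
    compression: "ZSTD"
    dictionary_enabled: true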

Rate Limiting Configuration

Configure request rate limiting:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| rate_limiting.disabled | boolean | false | Disable rate limiting |
| rate_limiting.max_requests | number | 15000000 | Max requests per second per producer |
| rate_limiting.burst_limit | number | 20000000 | Burst limit |
| rate_limiting.global_limit | number | 150000000 | Global requests per second |
| rate_limiting.base_size_bytes | number | 4096 | Base size for rate limiting tokens |
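
For example, a lower-throughput profile for a small development deployment (all numbers are illustrative):

yaml
rate_limiting:
  disabled: false
  max_requests: 1000000
  burst_limit: 2000000
  global_limit: 10000000
  base_size_bytes: 4096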

TLS Configuration

Configure TLS encryption (Pro tier only):

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| tls.disabled | boolean | false | Disable TLS |
| tls.cert_path | string | null | Path to certificate file |
| tls.key_path | string | null | Path to private key file |
| tls.cert_pem | string | null | Certificate as PEM string |
| tls.key_pem | string | null | Private key as PEM string |
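
For example, enabling TLS with certificate and key files on disk (paths are illustrative):

yaml
tls:
  disabled: false
  cert_path: "/etc/ssl/certs/server.crt"
  key_path: "/etc/ssl/private/server.key"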

Authentication Configuration

Configure authentication providers (Pro tier only):

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| auth.providers | array | [] | List of authentication providers |
| auth.authorization_enabled | boolean | false | Enable authorization |
| auth.admin_groups | array | [] | Admin group names |
| auth.read_only_groups | array | [] | Read-only group names |
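
For example, enabling Cognito with authorization and group mappings (the pool ID, client ID, and group names are illustrative):

yaml
auth:
  providers: ["cognito"]
  authorization_enabled: true
  admin_groups: ["admin"]
  read_only_groups: ["analysts"]
  cognito:
    user_pool_id: "us-east-1_example"
    region: "us-east-1"
    audience: "client-id"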

See the Authentication & Authorization section for detailed provider configuration.

Metrics Configuration

Configure metrics collection:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| metrics.port | number | 8081 | Metrics server port |
| metrics.flush_interval_ms | number | 1000 | Metrics flush interval in milliseconds |
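
For example, to expose metrics on a different port with more frequent flushes (values are illustrative):

yaml
metrics:
  port: 9091
  flush_interval_ms: 500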

PGWire Server Configuration

BoilStream includes a built-in PostgreSQL wire protocol server, so PostgreSQL clients and BI tools such as psql, DBeaver, and Tableau can connect directly to your streaming data over the standard PostgreSQL protocol.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| pgwire.enabled | boolean | true | Enable PGWire PostgreSQL protocol server |
| pgwire.port | number | 5432 | Port for PostgreSQL protocol connections |
| pgwire.username | string | "boilstream" | Username for PostgreSQL authentication |
| pgwire.password | string | "boilstream" | Password for PostgreSQL authentication |
| pgwire.refresh_interval_seconds | number | 5 | Database refresh interval in seconds |
| pgwire.initialization_sql | string | "" | SQL commands to execute on DuckDB init |
| pgwire.tls.enabled | boolean | false | Enable TLS for PostgreSQL connections (Pro tier only) |
| pgwire.tls.cert_path | string | null | Path to TLS certificate file (Pro tier only) |
| pgwire.tls.key_path | string | null | Path to TLS private key file (Pro tier only) |
| pgwire.tls.cert_pem | string | null | TLS certificate as PEM string (Pro tier only) |
| pgwire.tls.key_pem | string | null | TLS private key as PEM string (Pro tier only) |

Key Features:

  • Full PostgreSQL Protocol Support: Compatible with any PostgreSQL client
  • Cursor Support: Handles large result sets efficiently through extended query protocol
  • Text and Binary Encoding: Supports both text and binary data formats
  • Prepared Statements: Full prepared statement support with parameter binding
  • Query Cancellation: Standard PostgreSQL query cancellation support
  • TLS Encryption: Optional TLS encryption for secure connections (Pro tier only)

Example Configuration:

yaml
# PostgreSQL Protocol Server
pgwire:
  enabled: true
  port: 5432
  username: "analyst"
  password: "secure_password"
  refresh_interval_seconds: 10
  initialization_sql: |
    INSTALL icu;
    LOAD icu;
    SET timezone = 'UTC';
  tls:
    enabled: true  # Pro tier only
    cert_path: "/etc/ssl/certs/pgwire.crt"  # Pro tier only
    key_path: "/etc/ssl/private/pgwire.key"  # Pro tier only

Integration with DuckDB Persistence:

The PGWire server automatically integrates with DuckDB persistence when enabled, providing:

  • Live Query Access: Query streaming data through PostgreSQL protocol
  • Cross-Topic Joins: Join data across different topics using standard SQL
  • BI Tool Compatibility: Connect any PostgreSQL-compatible BI tool directly

Environment Variable Overrides:

bash
# Enable PGWire server
PGWIRE_ENABLED=true
PGWIRE_PORT=5432
PGWIRE_USERNAME=analyst
PGWIRE_PASSWORD=secure_password

# TLS Configuration (Pro tier only)
PGWIRE_TLS_ENABLED=true
PGWIRE_TLS_CERT_PATH=/etc/ssl/certs/pgwire.crt
PGWIRE_TLS_KEY_PATH=/etc/ssl/private/pgwire.key

# Or use PEM strings directly (Pro tier only)
PGWIRE_TLS_CERT_PEM="-----BEGIN CERTIFICATE-----..."
PGWIRE_TLS_KEY_PEM="-----BEGIN PRIVATE KEY-----..."

See the PGWire Server Guide for detailed setup instructions and BI tool integration examples.

Logging Configuration

Configure logging levels:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| logging.rust_log | string | "info" | Log level configuration |
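
For example, a more verbose setting for local debugging; the per-module filter below assumes the field accepts standard RUST_LOG filter syntax, which is not confirmed here:

yaml
logging:
  rust_log: "info,boilstream=debug" # assumed RUST_LOG-style filter; plain levels like "debug" also work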

Environment Variable Override

All YAML configuration fields can be overridden with environment variables. The environment variable names follow this pattern:

  • Nested fields are joined with underscores
  • All uppercase
  • Boolean values: "true", "false", "1", "0"
  • Arrays: comma-separated values

Examples

bash
# Override AWS region
AWS_REGION=us-west-2

# Override S3 bucket
S3_BUCKET=my-production-bucket

# Override server port
FLIGHT_BASE_PORT=8080

# Override processing settings
DATA_PROCESSING_THREADS=16
INCLUDE_METADATA_COLUMNS=false

# Override storage backends (comma-separated)
STORAGE_BACKENDS=s3,filesystem
STORAGE_FILESYSTEM_PREFIX=/data/storage

# Override Valkey/Redis connection
VALKEY_URL=redis://production-redis:6379

# Override authentication providers (comma-separated)
AUTH_PROVIDERS=cognito,azure
ADMIN_GROUPS=admin,superuser

Development vs Production

Development Configuration

For local development, create a dev-config.yaml:

yaml
aws:
  region: "us-east-1"

storage:
  backends:
    - name: "local-filesystem"
      backend_type: "filesystem"
      enabled: true
      primary: true
      prefix: "./dev-storage"
    # Optional: Add S3 for testing cloud integration
    # - name: "dev-s3"
    #   backend_type: "s3"
    #   enabled: false
    #   primary: false
    #   endpoint: "http://localhost:9000"
    #   bucket: "dev-bucket"
    #   prefix: ""
    #   access_key: "minioadmin"
    #   secret_key: "minioadmin"
    #   use_path_style: true

server:
  tokio_worker_threads: 16
  valkey_url: "redis://localhost:6379"

tls:
  disabled: true

auth:
  providers: []

logging:
  rust_log: "debug"

Production Configuration

For production, create a prod-config.yaml:

yaml
aws:
  region: "eu-west-1"

storage:
  backends:
    - name: "primary-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "https://s3.amazonaws.com"
      bucket: "my-production-bucket"
      prefix: ""
      access_key: "${AWS_ACCESS_KEY_ID}"
      secret_key: "${AWS_SECRET_ACCESS_KEY}"
      region: "eu-west-1"
      use_path_style: false
    - name: "backup-filesystem"
      backend_type: "filesystem"
      enabled: true
      primary: false # Secondary for backup/audit
      prefix: "/data/backup-storage"

server:
  tokio_worker_threads: 16
  valkey_url: "redis://localhost:6379"

processing:
  data_processing_threads: 16
  window_queue_capacity: 100000

rate_limiting:
  max_requests: 50000000
  burst_limit: 75000000

tls:
  disabled: false
  cert_path: "/etc/ssl/certs/server.crt"
  key_path: "/etc/ssl/private/server.key"

auth:
  providers: ["cognito"]
  authorization_enabled: true
  admin_groups: ["admin"]

logging:
  rust_log: "info"

Usage Examples

Basic Development Setup

bash
# Create config file
cat > dev-config.yaml << EOF
aws:
  region: "us-east-1"
storage:
  backends:
    - name: "dev-s3"
      backend_type: "s3"
      enabled: true
      primary: true
      endpoint: "http://localhost:9000"
      bucket: "my-dev-bucket"
      prefix: ""
      access_key: "minioadmin"
      secret_key: "minioadmin"
      region: "us-east-1"
      use_path_style: true
server:
  valkey_url: "redis://localhost:6379"
logging:
  rust_log: "debug"
EOF

# Run with config file
boilstream --config dev-config.yaml

Production with Environment Overrides

bash
# Use production config but override bucket via environment
S3_BUCKET=production-bucket-2024 boilstream --config prod-config.yaml

Environment Variables Only

bash
# No config file, all via environment
AWS_REGION=eu-west-1 \
S3_BUCKET=my-bucket \
TLS_DISABLED=true \
boilstream

Multi-Backend Examples

bash
# Primary S3 + backup filesystem
STORAGE_BACKENDS="s3,filesystem" \
STORAGE_FILESYSTEM_PREFIX="/backup" \
S3_BUCKET=my-bucket \
VALKEY_URL=redis://localhost:6379 \
boilstream

# Filesystem only for local development
STORAGE_BACKENDS="filesystem" \
STORAGE_FILESYSTEM_PREFIX="./local-storage" \
VALKEY_URL=redis://localhost:6379 \
boilstream

# S3 + NoOp for performance testing
STORAGE_BACKENDS="s3,noop" \
S3_BUCKET=perf-test-bucket \
boilstream

Validation

BoilStream validates configuration on startup and will exit with an error if:

  • Required fields are missing (e.g., S3_BUCKET)
  • Invalid values are provided (e.g., port 0)
  • Referenced files don't exist (e.g., TLS certificates)

Check the logs for detailed validation error messages.