Full-Text Search (Tantivy)
BoilStream integrates Tantivy, a high-performance full-text search engine, directly into the streaming ingestion pipeline. When enabled on a streaming table, every inserted row is automatically indexed for full-text search — no external search infrastructure required.
How It Works
Full-text search uses a two-tier architecture:
- Hot tier — Local disk indexes that are searchable within seconds of ingestion
- Cold tier — Packed segment bundles uploaded to S3 and registered in DuckLake for durable, distributed search
A shadow DuckLake table (named `{table}__tantivy_idx`) tracks all cold tier bundles alongside your regular Parquet data files.
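Since the shadow table follows a fixed naming convention, it can be referenced directly from SQL. A minimal sketch (assuming the shadow table for a table named `documents` is visible in the catalog like any other DuckLake table; whether a plain `SELECT` over it is supported depends on the server):

```sql
-- Sketch: reference the shadow index table that tracks cold tier bundles
-- for the "documents" streaming table
SELECT * FROM my_catalog__stream.main.documents__tantivy_idx LIMIT 5;
```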
Enabling Full-Text Search
Enable tantivy indexing on any streaming table with ALTER TABLE:
```sql
ALTER TABLE my_catalog__stream.main.documents SET (
  tantivy_enabled = true,
  tantivy_text_fields = 'title,body'
);
```

| Parameter | Description |
|---|---|
| `tantivy_enabled` | `true` or `false` — toggles indexing for this table |
| `tantivy_text_fields` | Comma-separated column names to tokenize for full-text search. If empty, all string columns are stored as exact-match only (no tokenization). |
| `parquet_enabled` | `true` or `false` — toggles S3 Parquet persistence for this table (default: `true`) |
WARNING
At least one of tantivy_enabled or parquet_enabled must remain true. Disabling both is rejected with an error.
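For example, a statement that disables both sinks at once would be rejected (a sketch; the exact error message is server-defined):

```sql
-- Rejected: would leave the table with no storage backend
ALTER TABLE my_catalog__stream.main.documents SET (
  tantivy_enabled = false,
  parquet_enabled = false
);
```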
When you enable tantivy on a table, BoilStream:
- Derives a tantivy schema from the table's Arrow schema
- Creates a shadow DuckLake table named `{table_name}__tantivy_idx`
- Begins indexing all subsequent inserts
To disable indexing:
```sql
ALTER TABLE my_catalog__stream.main.documents SET (
  tantivy_enabled = false
);
```

Tantivy-Only Mode
You can disable Parquet persistence and use tantivy as the sole storage backend for a table:
```sql
ALTER TABLE my_catalog__stream.main.documents SET (
  tantivy_enabled = true,
  tantivy_text_fields = 'title,body',
  parquet_enabled = false
);
```

In this mode, data is indexed into tantivy's hot tier (local disk) and cold tier (S3 bundles) without producing Parquet files. Durability acks are sent from the tantivy sink directly, so ingestion latency is not affected. This is useful for workloads where full-text search is the primary query pattern and columnar analytics are not needed.
Querying
multilake_search()
Use the multilake_search() table function to run full-text queries against the shadow tantivy table:
```sql
SELECT * FROM multilake_search(
  'my_catalog__stream',      -- catalog name
  'documents__tantivy_idx',  -- shadow table name
  'distributed systems'      -- search query
);
```

The function returns all columns from matching rows plus a `_score` relevance column. An optional 4th parameter limits the number of results:
```sql
-- Return at most 10 results
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'Rust', 10);
```

Query Syntax
Queries support field-prefixed syntax to target specific indexed columns:
```sql
-- Search the "title" field only
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'title:Rust');

-- Search "body" for a phrase
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'body:distributed systems');

-- Unqualified terms search across all TEXT fields
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'performance');
```

Example: Search with Relevance Ranking
```sql
SELECT id, title, body, _score
FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'title:Rust')
ORDER BY _score DESC;
```

Column Type Support
BoilStream automatically maps Arrow column types to tantivy field types:
| Arrow Type | Tantivy Type | Index Mode | Notes |
|---|---|---|---|
| `Utf8` / `LargeUtf8` (in text_fields) | TEXT | Tokenized, stored | Full-text searchable |
| `Utf8` / `LargeUtf8` (not in text_fields) | STRING | Exact match, fast columnar | Not tokenized |
| `Int8` / `Int16` / `Int32` / `Int64` | I64 | Indexed, fast | Numeric range queries |
| `UInt8` / `UInt16` / `UInt32` / `UInt64` | U64 | Indexed, fast | Numeric range queries |
| `Float32` / `Float64` | F64 | Indexed, fast | Numeric range queries |
| `Timestamp` | Date | Indexed, fast | Date range queries |
| `Boolean` | Bool | Indexed, fast | Boolean filter |
| `Binary`, `List`, `Struct`, etc. | — | Skipped | Not indexed |
The tantivy_text_fields parameter controls which string columns are tokenized for full-text search versus stored as exact-match strings.
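To illustrate the distinction, assume a hypothetical `status` string column that is not listed in `tantivy_text_fields` (indexed as STRING, so only exact values match) alongside a `title` column that is listed (indexed as TEXT, so individual tokens match):

```sql
-- STRING field (not in tantivy_text_fields): matches the exact stored value only
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'status:published');

-- TEXT field (in tantivy_text_fields): tokenized, matches individual terms
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'title:rust');
```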
Architecture
Hot Tier
The hot tier maintains local tantivy indexes on disk for low-latency indexing:
- Sharded writers — Each topic gets multiple writer threads (default: 4) for parallel indexing
- Memory arena — Configurable per-shard memory buffer (default: 50 MB)
- Commit cycle — Segments are committed at a configurable interval (default: 30 seconds)
- Data path — `{data_dir}/{catalog_id}/{schema_name}/{table_name}/`
After each commit, the segment is handed off to the cold writer for S3 upload.
Cold Tier (S3)
The cold writer packs committed segments into .bundle files and uploads them to S3:
- Pack — Tantivy segment files are packed into a single `.bundle` archive
- Upload — Bundle is uploaded to S3 with retries (default: up to 10 retries)
- Register — Bundle is registered as a tantivy-format data file in DuckLake
- Cleanup — After cold confirmation, old hot segments are cleaned up (keeping the most recent ones locally)
S3 path format:
`{table_data_path}/tantivy/{YYYY-MM-DD}/{segment-uuid}.bundle`

Segment retention: After a segment is confirmed in S3, it remains in the hot tier until it falls outside the `hot_segments_to_keep` window (default: 2 most recent segments kept).
Configuration
All tantivy settings are under the tantivy section of your BoilStream config:
| Setting | Default | Description |
|---|---|---|
| `enabled` | `false` | Global toggle for the tantivy subsystem |
| `data_dir` | `./data/tantivy` | Local directory for hot tier indexes |
| `arena_mb` | `50` | Per-shard memory arena in MB |
| `commit_interval_secs` | `30` | Seconds between segment commits |
| `writer_threads_per_topic` | `4` | Number of writer shards per topic |
| `hot_segments_to_keep` | `2` | Hot segments retained after cold confirmation |
| `s3_upload_enabled` | `true` | Enable cold tier upload to S3 |
| `bundle_buffer_pool_size` | `8` | Pooled buffers for segment bundle packing |
| `bundle_buffer_size_mb` | `64` | Initial buffer size per bundle in MB |
| `channel_capacity` | `32` | Bounded channel capacity for the TantivySink |
| `upload_retry_max` | `10` | Maximum S3 upload retries |
| `upload_retry_base_delay_ms` | `1000` | Retry backoff base delay in milliseconds |
| `enable_for_all_topics` | `false` | Enable tantivy for all provisioned topics |
Example YAML:
```yaml
tantivy:
  enabled: true
  data_dir: "./data/tantivy"
  arena_mb: 100
  commit_interval_secs: 15
  s3_upload_enabled: true
```

Use Cases
- Log search — Index application logs for fast error investigation across cold storage
- Document search — Full-text search over ingested documents, articles, or support tickets
- Product catalog — Search product descriptions and metadata in real-time
- Event analysis — Find specific events across large streaming datasets by content
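As a concrete sketch of the log-search use case, assuming a hypothetical `logs` streaming table with `message` in its `tantivy_text_fields` and `ts`/`level` columns (all names here are illustrative, not part of any shipped schema):

```sql
-- Hypothetical: surface the 50 most relevant timeout-related log lines
SELECT ts, level, message, _score
FROM multilake_search('my_catalog__stream', 'logs__tantivy_idx', 'message:timeout', 50)
ORDER BY _score DESC;
```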
Limitations
WARNING
- Streaming tables only — Full-text search is only available on tables in `__stream` DuckLake catalogs.
- No automatic schema migration — Changing `tantivy_text_fields` after initial setup does not re-index existing data. Only new inserts use the updated field configuration.
- Commit latency — Data becomes searchable after the next commit cycle (default: 30 seconds from insertion).
- Cargo feature gate — The tantivy subsystem requires the `tantivy` Cargo feature to be enabled at build time.
Next Steps
- Learn about Materialized Views for real-time streaming transformations
- Explore the DuckLake Integration for catalog management
- Check out Configuration for full server settings