Full-Text Search (Tantivy)
BoilStream integrates Tantivy, a high-performance full-text search engine, directly into the streaming ingestion pipeline. When enabled on a streaming table, every inserted row is automatically indexed for full-text search — no external search infrastructure required.
How It Works
Full-text search uses a two-tier architecture:
- Hot tier — Local disk indexes that are searchable within seconds of ingestion
- Cold tier — Packed segment bundles uploaded to S3 and registered in DuckLake for durable, distributed search
A shadow DuckLake table (named `{table}__tantivy_idx`) tracks all cold tier bundles alongside your regular Parquet data files.
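Since the shadow table follows a fixed naming convention, it can be referenced directly from SQL. A minimal sketch (assuming the shadow table for a table named `documents` is visible in the catalog like any other DuckLake table; whether a plain `SELECT` over it is supported depends on the server):

```sql
-- Sketch: reference the shadow index table that tracks cold tier bundles
-- for the "documents" streaming table
SELECT * FROM my_catalog__stream.main.documents__tantivy_idx LIMIT 5;
```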
Enabling Full-Text Search
Enable tantivy indexing on any streaming table with ALTER TABLE:
```sql
ALTER TABLE my_catalog__stream.main.documents SET (
  tantivy_enabled = true,
  tantivy_text_fields = 'title,body'
);
```

| Parameter | Description |
|---|---|
| `tantivy_enabled` | `true` or `false` — toggles indexing for this table |
| `tantivy_text_fields` | Comma-separated column names to tokenize for full-text search. If empty, all string columns are stored as exact-match only (no tokenization). |
| `parquet_enabled` | `true` or `false` — toggles S3 Parquet persistence for this table (default: `true`) |
WARNING
At least one of tantivy_enabled or parquet_enabled must remain true. Disabling both is rejected with an error.
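For example, a statement that disables both sinks at once would be rejected (a sketch; the exact error message is server-defined):

```sql
-- Rejected: would leave the table with no storage backend
ALTER TABLE my_catalog__stream.main.documents SET (
  tantivy_enabled = false,
  parquet_enabled = false
);
```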
When you enable tantivy on a table, BoilStream:
- Derives a tantivy schema from the table's Arrow schema
- Creates a shadow DuckLake table named `{table_name}__tantivy_idx`
- Begins indexing all subsequent inserts
To disable indexing:
```sql
ALTER TABLE my_catalog__stream.main.documents SET (
  tantivy_enabled = false
);
```

Tantivy-Only Mode
You can disable Parquet persistence and use tantivy as the sole storage backend for a table:
```sql
ALTER TABLE my_catalog__stream.main.documents SET (
  tantivy_enabled = true,
  tantivy_text_fields = 'title,body',
  parquet_enabled = false
);
```

In this mode, data is indexed into tantivy's hot tier (local disk) and cold tier (S3 bundles) without producing Parquet files. Durability acks are sent from the tantivy sink directly, so ingestion latency is not affected. This is useful for workloads where full-text search is the primary query pattern and columnar analytics are not needed.
Querying
multilake_search()
Use the multilake_search() table function to run full-text queries against the shadow tantivy table:
```sql
SELECT * FROM multilake_search(
  'my_catalog__stream',      -- catalog name
  'documents__tantivy_idx',  -- shadow table name
  'distributed systems'      -- search query
);
```

The function returns all columns from matching rows plus a `_score` relevance column. An optional 4th parameter limits the number of results:
```sql
-- Return at most 10 results
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'Rust', 10);
```

Query Syntax
Queries support field-prefixed syntax to target specific indexed columns:
```sql
-- Search the "title" field only
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'title:Rust');

-- Search "body" for a phrase
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'body:distributed systems');

-- Unqualified terms search across all TEXT fields
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'performance');
```

Example: Search with Relevance Ranking
```sql
SELECT id, title, body, _score
FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'title:Rust')
ORDER BY _score DESC;
```

Column Type Support
BoilStream automatically maps Arrow column types to tantivy field types:
| Arrow Type | Tantivy Type | Index Mode | Notes |
|---|---|---|---|
| `Utf8` / `LargeUtf8` (in text_fields) | TEXT | Tokenized, stored | Full-text searchable |
| `Utf8` / `LargeUtf8` (not in text_fields) | STRING | Exact match, fast columnar | Not tokenized |
| `Int8` / `Int16` / `Int32` / `Int64` | I64 | Indexed, fast | Numeric range queries |
| `UInt8` / `UInt16` / `UInt32` / `UInt64` | U64 | Indexed, fast | Numeric range queries |
| `Float32` / `Float64` | F64 | Indexed, fast | Numeric range queries |
| `Timestamp` | Date | Indexed, fast | Date range queries |
| `Boolean` | Bool | Indexed, fast | Boolean filter |
| `Binary`, `List`, `Struct`, etc. | — | Skipped | Not indexed |
The tantivy_text_fields parameter controls which string columns are tokenized for full-text search versus stored as exact-match strings.
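To illustrate the distinction, assume a hypothetical `status` string column that is not listed in `tantivy_text_fields` (indexed as STRING, so only exact values match) alongside a `title` column that is listed (indexed as TEXT, so individual tokens match):

```sql
-- STRING field (not in tantivy_text_fields): matches the exact stored value only
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'status:published');

-- TEXT field (in tantivy_text_fields): tokenized, matches individual terms
SELECT * FROM multilake_search('my_catalog__stream', 'documents__tantivy_idx', 'title:rust');
```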
Architecture
Hot Tier
The hot tier maintains local tantivy indexes on disk for low-latency indexing:
- Sharded writers — Each topic gets multiple writer threads (default: 4) for parallel indexing
- Memory arena — Configurable per-shard memory buffer (default: 50 MB)
- Commit cycle — Segments are committed at a configurable interval (default: 30 seconds)
- Data path — `{data_dir}/{catalog_id}/{schema_name}/{table_name}/`
After each commit, the segment is handed off to the cold writer for S3 upload.
Cold Tier (S3)
The cold writer packs committed segments into .bundle files and uploads them to S3:
- Pack — Tantivy segment files are packed into a single `.bundle` archive
- Upload — Bundle is uploaded to S3 with retries (default: up to 10 retries)
- Register — Bundle is registered as a tantivy-format data file in DuckLake
- Cleanup — After cold confirmation, old hot segments are cleaned up (keeping the most recent ones locally)
S3 path format:
`{table_data_path}/tantivy/{YYYY-MM-DD}/{segment-uuid}.bundle`

Segment retention: After a segment is confirmed in S3, it remains in the hot tier until it falls outside the `hot_segments_to_keep` window (default: 2 most recent segments kept).
Configuration
All tantivy settings are under the tantivy section of your BoilStream config:
| Setting | Default | Description |
|---|---|---|
| `enabled` | `false` | Global toggle for the tantivy subsystem |
| `data_dir` | `./data/tantivy` | Local directory for hot tier indexes |
| `arena_mb` | `50` | Per-shard memory arena in MB |
| `commit_interval_secs` | `30` | Seconds between segment commits |
| `writer_threads_per_topic` | `4` | Number of writer shards per topic |
| `hot_segments_to_keep` | `2` | Hot segments retained after cold confirmation |
| `s3_upload_enabled` | `true` | Enable cold tier upload to S3 |
| `bundle_buffer_pool_size` | `8` | Pooled buffers for segment bundle packing |
| `bundle_buffer_size_mb` | `64` | Initial buffer size per bundle in MB |
| `channel_capacity` | `32` | Bounded channel capacity for the TantivySink |
| `upload_retry_max` | `10` | Maximum S3 upload retries |
| `upload_retry_base_delay_ms` | `1000` | Retry backoff base delay in milliseconds |
| `enable_for_all_topics` | `false` | Enable tantivy for all provisioned topics |
Example YAML:
```yaml
tantivy:
  enabled: true
  data_dir: "./data/tantivy"
  arena_mb: 100
  commit_interval_secs: 15
  s3_upload_enabled: true
```

Use Cases
- Log search — Index application logs for fast error investigation across cold storage
- Document search — Full-text search over ingested documents, articles, or support tickets
- Product catalog — Search product descriptions and metadata in real-time
- Event analysis — Find specific events across large streaming datasets by content
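As a concrete sketch of the log-search use case, assuming a hypothetical `logs` streaming table with `message` in its `tantivy_text_fields` and `ts`/`level` columns (all names here are illustrative, not part of any shipped schema):

```sql
-- Hypothetical: surface the 50 most relevant timeout-related log lines
SELECT ts, level, message, _score
FROM multilake_search('my_catalog__stream', 'logs__tantivy_idx', 'message:timeout', 50)
ORDER BY _score DESC;
```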
Limitations
WARNING
- Streaming tables only — Full-text search is only available on tables in `__stream` DuckLake catalogs.
- No automatic schema migration — Changing `tantivy_text_fields` after initial setup does not re-index existing data. Only new inserts use the updated field configuration.
- Commit latency — Data becomes searchable after the next commit cycle (default: 30 seconds from insertion).
- Cargo feature gate — The tantivy subsystem requires the `tantivy` Cargo feature to be enabled at build time.
Next Steps
- Learn about Materialized Views for real-time streaming transformations
- Explore the DuckLake Integration for catalog management
- Check out Configuration for full server settings