Metrics API
BoilStream exposes Prometheus-compatible metrics for monitoring and observability.
Metrics Endpoint
GET http://localhost:8081/metrics
The metrics are exposed in Prometheus text format and can be scraped by Prometheus or compatible monitoring systems.
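To check the endpoint from a script rather than from Prometheus, a minimal Python sketch such as the following could fetch the text and keep only the sample lines. The helper names `fetch_metrics` and `sample_lines` are illustrative and not part of BoilStream; the URL is the one shown above.

```python
from urllib.request import urlopen

METRICS_URL = "http://localhost:8081/metrics"  # endpoint documented above


def fetch_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> str:
    """Return the raw Prometheus text exposition served at `url`."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def sample_lines(text: str) -> list[str]:
    """Keep sample lines; drop blanks and '# HELP' / '# TYPE' comments."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
```

With a running instance, `sample_lines(fetch_metrics())` would return one string per exposed sample.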
Key Metrics
Ingestion Metrics
Metric | Type | Description |
---|---|---|
ingestion_requests_total | Counter | Total number of ingestion requests |
ingestion_bytes_total | Counter | Total bytes ingested |
ingestion_records_total | Counter | Total records ingested |
ingestion_duration_seconds | Histogram | Request processing duration |
ingestion_errors_total | Counter | Total ingestion errors by type |
FlightRPC Metrics
Metric | Type | Description |
---|---|---|
flight_requests_total | Counter | Total FlightRPC requests |
flight_bytes_received | Counter | Bytes received via FlightRPC |
flight_duration_seconds | Histogram | FlightRPC request duration |
flight_active_connections | Gauge | Current active FlightRPC connections |
HTTP Ingestion Metrics
Metric | Type | Description |
---|---|---|
http_requests_total | Counter | Total HTTP requests |
http_request_duration_seconds | Histogram | HTTP request latency |
http_active_connections | Gauge | Current HTTP/2 connections |
http_bytes_received | Counter | Total bytes received via HTTP |
Kafka Interface Metrics
Metric | Type | Description |
---|---|---|
kafka_messages_received | Counter | Total Kafka messages received |
kafka_bytes_processed | Counter | Bytes processed from Kafka |
kafka_schema_cache_hits | Counter | Total schema cache hits |
kafka_conversion_duration_seconds | Histogram | Avro to Arrow conversion time |
Storage Metrics
Metric | Type | Description |
---|---|---|
s3_uploads_total | Counter | Total S3 uploads |
s3_upload_duration_seconds | Histogram | S3 upload duration |
s3_upload_bytes | Counter | Bytes uploaded to S3 |
parquet_files_written | Counter | Parquet files created |
parquet_row_groups_written | Counter | Parquet row groups written |
System Metrics
Metric | Type | Description |
---|---|---|
window_queue_depth | Gauge | Current queue depth |
window_backpressure | Gauge | Queue backpressure (0-1) |
buffer_pool_available | Gauge | Available buffers in pool |
buffer_pool_total | Gauge | Total buffer pool size |
rate_limit_throttled | Counter | Rate limited requests |
Grafana Dashboard
BoilStream includes a pre-configured Grafana dashboard. Deploy it using Docker Compose:
bash
# Clone the repository
git clone https://github.com/boilingdata/boilstream.git
cd boilstream
# Start Prometheus and Grafana
docker-compose up -d prometheus grafana
# Access Grafana at http://localhost:3000
# Default credentials: admin/admin
Dashboard Features
- Real-time throughput: Records/sec, MB/sec
- Connection monitoring: Active connections by protocol
- Error tracking: Error rates and types
- Latency percentiles: P50, P95, P99
- Storage performance: Upload rates and sizes
- System health: CPU, memory, queue depths
Custom Metrics
Adding Labels
Metrics include labels for detailed analysis:
prometheus
# Example with labels
ingestion_requests_total{protocol="http",topic="events",status="success"} 12345
ingestion_errors_total{protocol="kafka",error_type="schema_mismatch"} 10
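To make the label structure concrete, here is a small Python sketch that splits one sample line into its metric name, label set, and value. It is a simplified reading of the Prometheus text format (escape sequences inside label values are matched but not unescaped, and an optional trailing timestamp is ignored); `parse_sample` is a hypothetical helper, not a BoilStream API.

```python
import re

# name{labels} value [timestamp]; the label block and timestamp are optional
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?\s*$'
)
# key="value" pairs; backslash escapes are allowed inside the quotes
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> tuple[str, dict[str, str], float]:
    """Split one exposition line into (metric name, labels, value)."""
    m = _SAMPLE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, label_block, value = m.groups()
    labels = dict(_LABEL.findall(label_block or ""))
    return name, labels, float(value)
```

Applied to the first example line above, this yields the name `ingestion_requests_total` with the labels `protocol`, `topic`, and `status` as a dictionary.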
Querying Metrics
Example Prometheus queries:
promql
# Ingestion rate (records/sec)
rate(ingestion_records_total[1m])
# P95 latency (aggregate buckets by le before taking the quantile)
histogram_quantile(0.95, sum by (le) (rate(ingestion_duration_seconds_bucket[5m])))
# Error rate percentage (sum both sides so their differing label sets still match)
sum(rate(ingestion_errors_total[5m])) / sum(rate(ingestion_requests_total[5m])) * 100
# Active connections per protocol (FlightRPC, then HTTP)
sum(flight_active_connections)
sum(http_active_connections)
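The P95 query relies on `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The Python sketch below reproduces that idea for classic cumulative buckets. It is a simplified illustration that assumes non-negative bounds, as for latencies; it is not Prometheus's actual implementation, which handles more edge cases. The bucket bounds used in the note after the code are made up for the example.

```python
import math


def histogram_quantile(q: float, buckets: dict[float, float]) -> float:
    """Estimate the q-quantile from cumulative bucket counts.

    `buckets` maps each upper bound (`le`) to its cumulative count and
    must include math.inf for the `+Inf` bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    bounds = sorted(buckets)
    if not bounds or bounds[-1] != math.inf:
        raise ValueError("the +Inf bucket is required")
    total = buckets[math.inf]
    if total == 0:
        return math.nan  # no observations

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assumed lower edge of first bucket
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if bound == math.inf:
                # Target lies in the open-ended bucket: report the
                # highest finite bound, since no upper edge is known.
                return bounds[-2] if len(bounds) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's edges.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan
```

For cumulative counts of 50, 90, 100, and 100 at bounds 0.1, 0.5, 1.0, and +Inf, the 0.95 quantile falls halfway through the (0.5, 1.0] bucket. The result is an estimate whose precision depends on how the bucket boundaries are chosen.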
Alerting Rules
Example Prometheus alerting rules:
yaml
groups:
  - name: boilstream_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(ingestion_errors_total[5m]) > 0.01
        for: 5m
        annotations:
          summary: "High error rate detected"
      - alert: QueueBackpressure
        expr: window_backpressure > 0.8
        for: 2m
        annotations:
          summary: "Queue experiencing backpressure"
      - alert: LowBufferPool
        expr: buffer_pool_available / buffer_pool_total < 0.1
        for: 1m
        annotations:
          summary: "Buffer pool running low"
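The `for:` clause in each rule means the alert fires only after the expression has stayed true across consecutive evaluations for that long; a single false evaluation resets the timer. The Python sketch below simulates this behaviour for a simple threshold rule such as `window_backpressure > 0.8`. The function name and the sample data in the note are illustrative assumptions.

```python
from typing import Iterable, Optional


def first_firing_time(
    samples: Iterable[tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> Optional[float]:
    """Return the evaluation time at which the alert would start firing.

    `samples` are (timestamp_seconds, value) pairs in time order, one per
    rule evaluation. The alert is pending while value > threshold and
    fires once that has held for `for_seconds` without interruption.
    """
    pending_since: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= for_seconds:
                return ts
        else:
            pending_since = None  # condition cleared, timer resets
    return None
```

With evaluations every 30 seconds and `for_seconds=120`, a dip below the threshold at any point restarts the two-minute wait, which is why short spikes do not trigger `QueueBackpressure`.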
Performance Impact
- Metrics collection has minimal overhead (<0.1% CPU)
- Metrics are updated asynchronously
- Recommended Prometheus scrape interval: 15s
- Metric cardinality is controlled to prevent explosion
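To illustrate why cardinality matters, the Python sketch below computes the worst-case number of time series for one metric as the product of the distinct values per label. The label names follow the examples in this document; the counts are hypothetical.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    `label_values` maps each label name to its number of distinct values.
    A metric with no labels has exactly one series.
    """
    return prod(label_values.values())


# Hypothetical counts: 3 protocols, 50 topics, 2 statuses
bounded = series_count({"protocol": 3, "topic": 50, "status": 2})

# Adding an unbounded label (for example a per-user ID with 10,000
# values) multiplies the total by that count.
unbounded = series_count({"protocol": 3, "topic": 50, "status": 2, "user_id": 10_000})
```

Here `bounded` is 300 series while `unbounded` is 3,000,000, which is the kind of growth that bounded label sets avoid.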
Configuration
Configure metrics in your YAML:
yaml
metrics:
  port: 8081
  flush_interval_ms: 1000
  # Optional: custom labels
  labels:
    environment: "production"
    region: "us-east-1"
Or via environment variables:
bash
export METRICS_PORT=8081
export METRICS_FLUSH_INTERVAL_MS=1000
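A deployment or health-check script might read the same environment variables to work out where to scrape. The Python sketch below does this with basic validation. The fallback values 8081 and 1000 are simply the example values shown above and are assumed, not confirmed, to be BoilStream's defaults; the function names are illustrative.

```python
import os
from typing import Mapping


def metrics_settings(env: Mapping[str, str] = os.environ) -> dict[str, int]:
    """Read the metrics port and flush interval from the environment."""
    port = int(env.get("METRICS_PORT", "8081"))  # assumed default
    flush_ms = int(env.get("METRICS_FLUSH_INTERVAL_MS", "1000"))  # assumed default
    if not 0 < port < 65536:
        raise ValueError(f"METRICS_PORT out of range: {port}")
    if flush_ms <= 0:
        raise ValueError(f"METRICS_FLUSH_INTERVAL_MS must be positive: {flush_ms}")
    return {"port": port, "flush_interval_ms": flush_ms}


def metrics_url(settings: Mapping[str, int], host: str = "localhost") -> str:
    """Build the scrape URL for the configured port."""
    return f"http://{host}:{settings['port']}/metrics"
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to test without touching the real environment.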
Integration Examples
Datadog
yaml
# datadog-agent.yaml
instances:
  - prometheus_url: http://localhost:8081/metrics
    namespace: boilstream
    metrics:
      - ingestion_*
      - flight_*
      - http_*
CloudWatch
Use the CloudWatch agent with Prometheus support:
json
{
  "metrics": {
    "namespace": "BoilStream",
    "metrics_collected": {
      "prometheus": {
        "prometheus_config_path": "/opt/prometheus.yml",
        "emf_processor": {
          "metric_namespace": "BoilStream"
        }
      }
    }
  }
}