
Monitoring

Starter+Enterprise

The Monitoring sub-module provides real-time metrics, aggregated logs, and configurable alerts for every service in your Calabi platform — all surfaced through the Calabi Monitoring stack embedded within Cloud Operations. You get a single place to understand platform health, investigate incidents, and get paged before users notice a problem.


Architecture Overview

Metrics are scraped every 15 seconds. Logs are collected in real time via Fluent Bit and indexed in Calabi Logs with a 30-day retention window by default.
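
The 15-second interval corresponds to a Prometheus-style scrape configuration. A minimal sketch for reference — the job name, service-discovery role, and relabeling are illustrative assumptions, not the shipped Calabi configuration:

```yaml
# Illustrative scrape config -- job name and relabeling are
# assumptions, not the actual Calabi Monitoring configuration.
global:
  scrape_interval: 15s          # matches the 15-second interval above
scrape_configs:
  - job_name: calabi-services
    kubernetes_sd_configs:
      - role: pod               # discover Calabi service pods
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        regex: master-prod-de   # keep only the Calabi namespace
        action: keep
```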


Pre-built Dashboards

Calabi ships with a curated set of dashboards that cover the most important operational views. All dashboards are read-only by default; you can clone and customise them.

Platform Health Dashboard

The top-level overview dashboard. It shows:

| Panel | Description |
| --- | --- |
| Overall health score | Aggregate of service availability across all Calabi pods |
| Active alerts | Count of currently firing alert rules by severity |
| Pod restart count (24h) | Total pod restarts across the platform in the last 24 hours |
| HTTP error rate (5xx) | Platform-wide 5xx rate as a percentage of all requests |
| P99 response latency | 99th-percentile response time across all Calabi API endpoints |

Per-Service Resource Usage Dashboard

Shows CPU, memory, and network I/O for each individual Calabi service pod. Useful for identifying resource pressure before it becomes an outage.

| Panel | Source | Unit |
| --- | --- | --- |
| CPU utilisation | `container_cpu_usage_seconds_total` | Cores / millicores |
| Memory usage | `container_memory_working_set_bytes` | MiB |
| Network receive | `container_network_receive_bytes_total` | MB/s |
| Network transmit | `container_network_transmit_bytes_total` | MB/s |
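
The panel sources above can be queried directly. For example, per-pod CPU and network-receive rates — the queries are a sketch using the namespace that appears in the LogQL examples later on this page:

```promql
# Per-pod CPU usage in cores (rate over the cAdvisor counter)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="master-prod-de"}[5m]))

# Per-pod network receive throughput in bytes/s (divide by 1e6 for MB/s)
sum by (pod) (rate(container_network_receive_bytes_total{namespace="master-prod-de"}[5m]))
```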

Pipeline Monitoring Dashboard

Tracks Calabi Pipelines operational health:

| Panel | Description |
| --- | --- |
| DAG run success rate | Percentage of DAG runs that completed successfully in the last 24 hours |
| Task failure count | Count of failed tasks by DAG name |
| Scheduler heartbeat | Time since the Airflow scheduler last heartbeated |
| Zombie task count | Tasks stuck in running state for more than 30 minutes |
| Queue depth | Number of tasks waiting to be picked up by a worker |
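
Using the metric names from the Key Metrics Reference, the heartbeat and failure panels could be expressed roughly as follows. These queries are a sketch: the exact label names and whether the heartbeat metric exposes a timestamp depend on the Airflow exporter in use.

```promql
# Seconds since the last scheduler heartbeat (assumes the metric
# exposes a heartbeat timestamp)
time() - airflow_scheduler_heartbeat

# Failed DAG runs in the last 24 hours, grouped by DAG
# (the dag_id label is an assumption)
sum by (dag_id) (increase(airflow_dag_run_failed[24h]))
```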

BI Analytics (CalabiIQ) Dashboard

| Panel | Description |
| --- | --- |
| Query latency (P50 / P95 / P99) | SQL query execution time percentiles |
| Cache hit rate | Percentage of queries served from result cache |
| Active users (15m window) | Unique users who ran a query in the last 15 minutes |
| Failed queries | Count of queries that returned an error |

Key Metrics Reference

| Metric name | Source service | Alert threshold (default) |
| --- | --- | --- |
| `calabi_http_requests_5xx_rate` | All services | > 5% of requests over 5 min |
| `calabi_http_p99_latency_seconds` | All services | > 5 seconds over 5 min |
| `kube_pod_container_status_restarts_total` | Kubernetes | > 3 restarts in 1 hour |
| `container_cpu_usage_cores` | All pods | > 90% of CPU limit for 10 min |
| `container_memory_working_set_bytes` | All pods | > 90% of memory limit for 5 min |
| `airflow_scheduler_heartbeat` | Calabi Pipelines | No heartbeat for > 60 seconds |
| `airflow_dag_run_failed` | Calabi Pipelines | Any failure in critical DAG |
| `calabiiq_query_error_count` | CalabiIQ | > 10 errors in 5 min |
| `calabi_ml_experiment_run_failed` | Calabi ML | Any run failure |
| `calabi_connect_sync_lag_seconds` | Calabi Connect | > 3600 seconds (1 hour) |
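
As an illustration of how a default threshold maps onto an alerting rule, the 5xx condition could be written as a Prometheus-style rule. The group, alert, and label names here are illustrative, not the rules Calabi ships:

```yaml
groups:
  - name: calabi-platform            # illustrative group name
    rules:
      - alert: CalabiHigh5xxRate
        # Fires when the platform-wide 5xx rate exceeds 5% of requests
        expr: calabi_http_requests_5xx_rate > 0.05
        for: 5m                      # must hold for the full 5-minute window
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% for 5 minutes"
```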

Log Aggregation

All Calabi service pods write structured JSON logs to stdout. Fluent Bit collects these logs and forwards them to Calabi Logs. You can explore logs in the Log Explorer tab within Monitoring.

Log Explorer

The Log Explorer uses LogQL to search and filter logs across all services.

Example LogQL queries:

```logql
# All error-level logs from Calabi Pipelines in the last hour
{namespace="master-prod-de", app="airflow"} |= "ERROR"

# 500 errors from the CalabiIQ service
{namespace="master-prod-de", app="calabi-iq"} | json | status >= 500

# Slow queries (> 5 seconds) from any service
{namespace="master-prod-de"} | json | duration > 5000

# Logs from a specific pod
{namespace="master-prod-de", pod="calabi-connect-7b8f9d-xk2m9"}
```

Log retention

| Tier | Default log retention |
| --- | --- |
| Starter | 7 days |
| Pro | 30 days |
| Enterprise | 90 days (configurable up to 365 days) |

Alert Channels

Alerts are triggered by Calabi Monitoring when a metric crosses a defined threshold. Calabi supports three alert delivery channels.

Slack

  1. Go to Cloud Operations > Configure > Notification Channels.
  2. Click Add Channel > Slack.
  3. Paste the Incoming Webhook URL for your Slack workspace.
  4. Set the default channel (e.g., #calabi-alerts).
  5. Click Test to send a test message.
  6. Click Save.

PagerDuty

  1. Go to Cloud Operations > Configure > Notification Channels.
  2. Click Add Channel > PagerDuty.
  3. Enter your PagerDuty Integration Key (from a PagerDuty service integration).
  4. Set the severity mapping: Calabi Critical maps to PagerDuty critical, High maps to error.
  5. Click Save.

Email

  1. Go to Cloud Operations > Configure > Notification Channels.
  2. Click Add Channel > Email.
  3. Enter one or more recipient email addresses.
  4. Set a subject prefix (e.g., [Calabi Alert]).
  5. Click Save.

Alert Routing Rules

You can route different alert types to different channels. For example: route Critical alerts to PagerDuty, High to Slack, and everything else to email.

| Rule condition | Channel |
| --- | --- |
| severity = critical | PagerDuty |
| severity = high | Slack — #calabi-on-call |
| severity = medium OR low | Email — ops@yourcompany.com |
| service = airflow AND severity = critical | Slack — #data-engineering |
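
A routing table like the one above could be sketched as an Alertmanager-style routing tree. The receiver names and matcher syntax are assumptions for illustration, not Calabi's actual routing configuration format; note that the first matching child route wins, so the more specific airflow rule is listed before the generic critical rule:

```yaml
# Illustrative Alertmanager-style routing tree -- receiver names
# are assumptions, not Calabi's actual configuration.
route:
  receiver: ops-email                # default: medium / low severity
  routes:
    - matchers: [severity="critical", service="airflow"]
      receiver: slack-data-engineering
    - matchers: [severity="critical"]
      receiver: pagerduty
    - matchers: [severity="high"]
      receiver: slack-on-call
```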

Adding Custom Dashboards and Panels

You can add custom dashboards and panels to extend the built-in monitoring coverage.

Clone and extend a built-in dashboard

  1. Open the dashboard you want to extend.
  2. Click the dashboard menu (three dots) and select Clone.
  3. Give the clone a new name and save it to a folder.
  4. Edit the clone — add, remove, or modify panels freely without affecting the original.

Create a new panel

  1. Open a custom dashboard.
  2. Click Add panel.
  3. Enter a PromQL query for metrics or a LogQL query for logs.
  4. Choose the visualization type (time series, stat, gauge, bar chart, table, logs).
  5. Set the panel title, axes labels, and thresholds.
  6. Click Apply.

Example PromQL for a custom panel

```promql
# HTTP 5xx rate for the CalabiIQ service only.
# sum() on both sides is required: without it, the status label on the
# numerator prevents the division from matching any denominator series.
sum(rate(calabi_http_requests_total{app="calabi-iq", status=~"5.."}[5m]))
/
sum(rate(calabi_http_requests_total{app="calabi-iq"}[5m]))

# Memory pressure across all Calabi pods (usage as a fraction of limit)
sum by (pod) (
  container_memory_working_set_bytes{namespace="master-prod-de"}
)
/
sum by (pod) (
  kube_pod_container_resource_limits{namespace="master-prod-de", resource="memory"}
)
```