Monitoring
The Monitoring sub-module provides real-time metrics, aggregated logs, and configurable alerts for every service in your Calabi platform — all surfaced through the Calabi Monitoring stack embedded within Cloud Operations. You get a single place to understand platform health, investigate incidents, and get paged before users notice a problem.
Architecture Overview
Metrics are scraped every 15 seconds. Logs are collected in real time via Fluent Bit and indexed in Calabi Logs. Log retention depends on your tier: 30 days on Pro, with other tiers listed under Log retention.
Pre-built Dashboards
Calabi ships with a curated set of dashboards that cover the most important operational views. All dashboards are read-only by default; you can clone and customise them.
Platform Health Dashboard
The top-level overview dashboard. It shows:
| Panel | Description |
|---|---|
| Overall health score | Aggregate of service availability across all Calabi pods |
| Active alerts | Count of currently firing alert rules by severity |
| Pod restart count (24h) | Total pod restarts across the platform in the last 24 hours |
| HTTP error rate (5xx) | Platform-wide 5xx rate as a percentage of all requests |
| P99 response latency | 99th-percentile response time across all Calabi API endpoints |
Per-Service Resource Usage Dashboard
Shows CPU, memory, and network I/O for each individual Calabi service pod. Useful for identifying resource pressure before it becomes an outage.
| Panel | Source | Unit |
|---|---|---|
| CPU utilisation | container_cpu_usage_seconds_total | Cores / millicores |
| Memory usage | container_memory_working_set_bytes | MiB |
| Network receive | container_network_receive_bytes_total | MB/s |
| Network transmit | container_network_transmit_bytes_total | MB/s |
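The CPU and network sources above are cumulative counters, so each panel has to plot a rate of change, not the raw value. The sketch below shows that conversion for two consecutive scrapes. The `Sample` type, the helper names, and the sample values are illustrative assumptions and not part of Calabi.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # Unix time in seconds
    value: float      # cumulative counter value at that time


def per_second_rate(earlier: Sample, later: Sample) -> float:
    """Average per-second increase of a counter between two samples."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = later.value - earlier.value
    if delta < 0:
        # The counter went backwards, so it was reset (for example, the
        # container restarted). Count only the growth since the reset.
        delta = later.value
    return delta / elapsed


def bytes_to_mib(num_bytes: float) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


# Two scrapes 15 seconds apart (values are made up for illustration).
cpu = per_second_rate(Sample(0, 120.0), Sample(15, 127.5))
rx = per_second_rate(Sample(0, 0.0), Sample(15, 30_000_000.0))
print(f"CPU: {cpu} cores ({cpu * 1000:.0f} millicores)")
print(f"Network receive: {rx / 1_000_000} MB/s")
print(f"Memory: {bytes_to_mib(536_870_912)} MiB")
```

A rate over CPU seconds is directly a number of cores, which is why the CPU panel can show cores or millicores without further scaling.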
Pipeline Monitoring Dashboard
Tracks Calabi Pipelines operational health:
| Panel | Description |
|---|---|
| DAG run success rate | Percentage of DAG runs that completed successfully in the last 24 hours |
| Task failure count | Count of failed tasks by DAG name |
| Scheduler heartbeat | Time elapsed since the Airflow scheduler's last heartbeat |
| Zombie task count | Tasks stuck in running state for more than 30 minutes |
| Queue depth | Number of tasks waiting to be picked up by a worker |
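To make two of these panel definitions concrete, here is a minimal sketch of the zombie-task and success-rate calculations as the table describes them. The record layout (`task_id`, `state`, `started_at`) is an assumption made for illustration; it is not the Airflow or Calabi data model.

```python
from datetime import datetime, timedelta, timezone

ZOMBIE_AFTER = timedelta(minutes=30)  # threshold from the table above


def zombie_tasks(tasks, now):
    """Return IDs of tasks that have been running for more than 30 minutes."""
    return [
        t["task_id"]
        for t in tasks
        if t["state"] == "running" and now - t["started_at"] > ZOMBIE_AFTER
    ]


def success_rate(dag_runs):
    """Percentage of finished DAG runs whose state is 'success'."""
    finished = [r for r in dag_runs if r["state"] in ("success", "failed")]
    if not finished:
        return None  # nothing has finished yet
    ok = sum(1 for r in finished if r["state"] == "success")
    return 100 * ok / len(finished)


# Hypothetical task records, relative to a fixed "now".
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tasks = [
    {"task_id": "extract", "state": "running", "started_at": now - timedelta(minutes=45)},
    {"task_id": "load", "state": "running", "started_at": now - timedelta(minutes=5)},
    {"task_id": "report", "state": "success", "started_at": now - timedelta(hours=2)},
]
print(zombie_tasks(tasks, now))
```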
BI Analytics (CalabiIQ) Dashboard
| Panel | Description |
|---|---|
| Query latency (P50 / P95 / P99) | SQL query execution time percentiles |
| Cache hit rate | Percentage of queries served from result cache |
| Active users (15m window) | Unique users who ran a query in the last 15 minutes |
| Failed queries | Count of queries that returned an error |
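The latency and cache panels reduce to two small calculations. The sketch below uses the nearest-rank percentile definition, which is an assumption; the dashboard may interpolate or estimate percentiles from histogram buckets instead, so treat this as an illustration of what P50, P95, and P99 mean.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def cache_hit_rate(hits, total):
    """Percentage of queries served from the result cache."""
    if total == 0:
        return 0.0
    return 100 * hits / total


# Hypothetical query durations in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
print(f"Cache hit rate: {cache_hit_rate(75, 100)}%")
```

A wide gap between P50 and P99 usually points to a small number of very slow queries, which is why the panel shows all three percentiles together.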
Key Metrics Reference
| Metric name | Source service | Alert threshold (default) |
|---|---|---|
| calabi_http_requests_5xx_rate | All services | > 5% of requests over 5 min |
| calabi_http_p99_latency_seconds | All services | > 5 seconds over 5 min |
| kube_pod_container_status_restarts_total | Kubernetes | > 3 restarts in 1 hour |
| container_cpu_usage_cores | All pods | > 90% of CPU limit for 10 min |
| container_memory_working_set_bytes | All pods | > 90% of memory limit for 5 min |
| airflow_scheduler_heartbeat | Calabi Pipelines | No heartbeat for > 60 seconds |
| airflow_dag_run_failed | Calabi Pipelines | Any failure in critical DAG |
| calabiiq_query_error_count | CalabiIQ | > 10 errors in 5 min |
| calabi_ml_experiment_run_failed | Calabi ML | Any run failure |
| calabi_connect_sync_lag_seconds | Calabi Connect | > 3600 seconds (1 hour) |
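Several thresholds above combine a limit with a duration, such as "> 90% of CPU limit for 10 min". One plausible reading, sketched below, is that the alert fires only when every sample in the trailing window exceeds the limit. That reading and the function name are assumptions; the evaluation logic Calabi Monitoring actually uses is not described here.

```python
def sustained_breach(samples, threshold, duration_seconds, now):
    """True if the value stayed above threshold for the whole trailing window.

    samples: list of (unix_timestamp, value) pairs in any order.
    """
    window_start = now - duration_seconds
    if not samples or min(t for t, _ in samples) > window_start:
        return False  # history does not cover the full window yet
    recent = [v for t, v in samples if window_start <= t <= now]
    return bool(recent) and all(v > threshold for v in recent)


# CPU at 95% of its limit, scraped every 15 seconds for 10 minutes.
cpu = [(t, 0.95) for t in range(0, 601, 15)]
print(sustained_breach(cpu, threshold=0.90, duration_seconds=600, now=600))

# A single dip below the threshold inside the window prevents the alert.
cpu_dip = [(t, 0.50 if t == 300 else 0.95) for t in range(0, 601, 15)]
print(sustained_breach(cpu_dip, threshold=0.90, duration_seconds=600, now=600))
```

Requiring the condition to hold for the full window is what keeps short spikes from paging anyone.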
Log Aggregation
All Calabi service pods write structured JSON logs to stdout. Fluent Bit collects these logs and forwards them to Calabi Logs. You can explore logs in the Log Explorer tab within Monitoring.
Log Explorer
The Log Explorer uses LogQL to search and filter logs across all services.
Example LogQL queries:
# Log lines containing "ERROR" from Calabi Pipelines (the time range, e.g. the last hour, is set outside the query)
{namespace="master-prod-de", app="airflow"} |= "ERROR"
# 5xx errors from the CalabiIQ service (status is parsed from the JSON log line)
{namespace="master-prod-de", app="calabi-iq"} | json | status >= 500
# Slow queries (> 5 seconds) from any service; assumes duration is logged in milliseconds
{namespace="master-prod-de"} | json | duration > 5000
# Logs from a specific pod
{namespace="master-prod-de", pod="calabi-connect-7b8f9d-xk2m9"}
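For readers new to LogQL, the `| json | status >= 500` and `| json | duration > 5000` stages above behave roughly like the Python filter below: parse each line as JSON, then keep the records that match a condition. This is only an analogy run against made-up log lines. The `status` and `duration` field names come from the queries above; everything else is assumed.

```python
import json


def filter_logs(lines, predicate):
    """Yield parsed JSON records for which predicate(record) is true."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and predicate(record):
            yield record


# Hypothetical structured log lines as a pod might write them to stdout.
lines = [
    '{"app": "calabi-iq", "status": 200, "duration": 120}',
    '{"app": "calabi-iq", "status": 503, "duration": 7400}',
    'plain text line that is not JSON',
    '{"app": "airflow", "status": 200, "duration": 6100}',
]

server_errors = list(filter_logs(lines, lambda r: r.get("status", 0) >= 500))
slow = list(filter_logs(lines, lambda r: r.get("duration", 0) > 5000))
print(server_errors)
print(slow)
```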
Log retention
| Tier | Default log retention |
|---|---|
| Starter | 7 days |
| Pro | 30 days |
| Enterprise | 90 days (configurable up to 365 days) |
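As a quick way to reason about which logs are still searchable, the sketch below turns the table into a lookup and computes the oldest retained timestamp. The function names are hypothetical, and the only validation applied to a custom Enterprise value is the 365-day ceiling stated above.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = {"starter": 7, "pro": 30, "enterprise": 90}
ENTERPRISE_MAX_DAYS = 365  # upper bound from the table above


def retention_days(tier, enterprise_days=None):
    """Return the retention window in days for a tier.

    enterprise_days is an optional custom value, allowed on Enterprise only.
    """
    tier = tier.lower()
    if tier not in DEFAULT_RETENTION_DAYS:
        raise ValueError(f"unknown tier: {tier}")
    if enterprise_days is None:
        return DEFAULT_RETENTION_DAYS[tier]
    if tier != "enterprise":
        raise ValueError("only Enterprise retention is configurable")
    if not 0 < enterprise_days <= ENTERPRISE_MAX_DAYS:
        raise ValueError("retention must be between 1 and 365 days")
    return enterprise_days


def oldest_retained(now, tier, enterprise_days=None):
    """Timestamp before which logs are no longer retained."""
    return now - timedelta(days=retention_days(tier, enterprise_days))


now = datetime(2024, 3, 31, tzinfo=timezone.utc)
print(oldest_retained(now, "Pro"))
print(oldest_retained(now, "Enterprise", enterprise_days=365))
```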
Alert Channels
Alerts are triggered by Calabi Monitoring when a metric crosses a defined threshold. Calabi supports three alert delivery channels.
Slack
- Go to Cloud Operations > Configure > Notification Channels.
- Click Add Channel > Slack.
- Paste the Incoming Webhook URL for your Slack workspace.
- Set the default channel (e.g., #calabi-alerts).
- Click Test to send a test message.
- Click Save.
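If the Test button does not deliver a message, it can help to check the webhook independently of Calabi. Slack Incoming Webhooks accept an HTTP POST with a JSON body containing a `text` field. The sketch below only builds that request; the URL is a placeholder, and the message Calabi itself sends may be formatted differently.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack Incoming Webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: substitute the Incoming Webhook URL from your workspace.
request = build_slack_request(
    "https://hooks.slack.com/services/T000/B000/XXXX",
    "Test message from the monitoring setup check",
)
print(request.get_method(), request.full_url)

# To actually send it, uncomment the next line:
# urllib.request.urlopen(request, timeout=10)
```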
PagerDuty
- Go to Cloud Operations > Configure > Notification Channels.
- Click Add Channel > PagerDuty.
- Enter your PagerDuty Integration Key (from a PagerDuty service integration).
- Set the severity mapping: Calabi Critical maps to PagerDuty critical, and High maps to error.
- Click Save.
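The severity mapping above can be pictured as a lookup applied when an event is built. The sketch below shapes an event the way the PagerDuty Events API v2 expects (`routing_key`, `event_action`, and a `payload` with `summary`, `source`, and `severity`). Only the two mappings documented above are included; the function name and the sample values are assumptions, and the exact event Calabi emits is not documented here.

```python
# Calabi severity -> PagerDuty severity, as described in the steps above.
SEVERITY_MAP = {"critical": "critical", "high": "error"}


def build_pagerduty_event(integration_key, summary, source, calabi_severity):
    """Build an Events API v2 'trigger' event as a plain dict."""
    key = calabi_severity.lower()
    if key not in SEVERITY_MAP:
        raise ValueError(f"no PagerDuty mapping for severity: {calabi_severity}")
    return {
        "routing_key": integration_key,  # the PagerDuty Integration Key
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": SEVERITY_MAP[key],
        },
    }


# Hypothetical alert; the key is a placeholder, not a real Integration Key.
event = build_pagerduty_event(
    "YOUR_INTEGRATION_KEY", "5xx rate above 5% for 5 min", "calabi-iq", "High"
)
print(event)
```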
Email
- Go to Cloud Operations > Configure > Notification Channels.
- Click Add Channel > Email.
- Enter one or more recipient email addresses.
- Set a subject prefix (e.g., [Calabi Alert]).
- Click Save.
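The subject prefix makes alert mail easy to filter in a mailbox. As an illustration of how the prefix and recipient list end up in a message, the sketch below builds one with Python's standard `email` package. The sender address, alert name, and body are made up, and the actual layout of Calabi's alert emails may differ.

```python
from email.message import EmailMessage


def build_alert_email(sender, recipients, prefix, alert_name, body):
    """Build (but do not send) an alert email with a prefixed subject."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message["Subject"] = f"{prefix} {alert_name}"
    message.set_content(body)
    return message


# All addresses and the alert text below are illustrative placeholders.
message = build_alert_email(
    sender="monitoring@yourcompany.com",
    recipients=["ops@yourcompany.com", "oncall@yourcompany.com"],
    prefix="[Calabi Alert]",
    alert_name="High memory usage",
    body="Memory usage exceeded 90% of the limit for 5 minutes.",
)
print(message["Subject"])
print(message["To"])
```

A mailbox rule that matches on the prefix can then move every alert into a dedicated folder.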
Alert Routing Rules
You can route different alert types to different channels. For example: route Critical alerts to PagerDuty, High to Slack, and everything else to email.
| Rule condition | Channel |
|---|---|
| severity = critical | PagerDuty |
| severity = high | Slack (#calabi-on-call) |
| severity = medium OR low | Email (ops@yourcompany.com) |
| service = airflow AND severity = critical | Slack (#data-engineering) |
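The table can be read as a list of condition and channel pairs. The sketch below encodes it that way and assumes that every matching rule delivers, so a critical Airflow alert reaches both PagerDuty and #data-engineering. Whether Calabi delivers to all matches or stops at the first one is not stated here, so treat that behavior as an assumption.

```python
# Each rule pairs a condition on the alert with a destination channel.
RULES = [
    (lambda a: a["severity"] == "critical", "PagerDuty"),
    (lambda a: a["severity"] == "high", "Slack #calabi-on-call"),
    (lambda a: a["severity"] in ("medium", "low"), "Email ops@yourcompany.com"),
    (
        lambda a: a.get("service") == "airflow" and a["severity"] == "critical",
        "Slack #data-engineering",
    ),
]


def route(alert):
    """Return every channel whose rule condition matches the alert."""
    return [channel for condition, channel in RULES if condition(alert)]


print(route({"service": "airflow", "severity": "critical"}))
print(route({"service": "calabi-iq", "severity": "low"}))
```

Under first-match semantics the last rule would never fire, because the broader `severity = critical` rule precedes it; in that case the more specific rule would need to be listed first.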
Adding Custom Dashboards and Panels
You can add custom dashboards and panels to extend the built-in monitoring coverage.
Clone and extend a built-in dashboard
- Open the dashboard you want to extend.
- Click the dashboard menu (three dots) and select Clone.
- Give the clone a new name and save it to a folder.
- Edit the clone — add, remove, or modify panels freely without affecting the original.
Create a new panel
- Open a custom dashboard.
- Click Add panel.
- Enter a PromQL query for metrics or a LogQL query for logs.
- Choose the visualisation type (time series, stat, gauge, bar chart, table, logs).
- Set the panel title, axis labels, and thresholds.
- Click Apply.
Example PromQL for a custom panel
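# HTTP 5xx rate for the CalabiIQ service only, as a fraction of all requests
# (sum() collapses the per-status series so the two sides of the division match)
sum(rate(calabi_http_requests_total{app="calabi-iq", status=~"5.."}[5m]))
/
sum(rate(calabi_http_requests_total{app="calabi-iq"}[5m]))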
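# Memory pressure per pod: working set as a fraction of the memory limit
# (container!="" skips the pod-level series that cAdvisor also exports, which would otherwise be double-counted)
sum by (pod) (
container_memory_working_set_bytes{namespace="master-prod-de", container!=""}
)
/
sum by (pod) (
kube_pod_container_resource_limits{namespace="master-prod-de", resource="memory"}
)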
Related Pages
- Configure — Set up notification channels and alert routing rules
- Cloud Operations Overview — Return to the module overview