
Monitoring

Starter+Enterprise

The Monitoring sub-module provides real-time metrics, aggregated logs, and configurable alerts for every service in your Calabi platform — all surfaced through the Calabi Monitoring stack embedded within Cloud Operations. You get a single place to understand platform health, investigate incidents, and get paged before users notice a problem.


Architecture Overview

Metrics are scraped every 15 seconds. Logs are collected in real time via Fluent Bit and indexed in Calabi Logs with a 30-day retention window by default.
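
The 15-second interval corresponds to a Prometheus-style scrape configuration. A minimal sketch for reference — the job name, service-discovery role, and relabeling are illustrative assumptions, not the shipped Calabi configuration:

```yaml
# Illustrative scrape config -- job name and relabeling are
# assumptions, not the actual Calabi Monitoring configuration.
global:
  scrape_interval: 15s          # matches the 15-second interval above
scrape_configs:
  - job_name: calabi-services
    kubernetes_sd_configs:
      - role: pod               # discover Calabi service pods
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        regex: master-prod-de   # keep only the Calabi namespace
        action: keep
```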


Pre-built Dashboards

Calabi ships with a curated set of dashboards that cover the most important operational views. All dashboards are read-only by default; you can clone and customise them.

Platform Health Dashboard

The top-level overview dashboard. It shows:

| Panel | Description |
| --- | --- |
| Overall health score | Aggregate of service availability across all Calabi pods |
| Active alerts | Count of currently firing alert rules by severity |
| Pod restart count (24h) | Total pod restarts across the platform in the last 24 hours |
| HTTP error rate (5xx) | Platform-wide 5xx rate as a percentage of all requests |
| P99 response latency | 99th-percentile response time across all Calabi API endpoints |

Per-Service Resource Usage Dashboard

Shows CPU, memory, and network I/O for each individual Calabi service pod. Useful for identifying resource pressure before it becomes an outage.

| Panel | Source | Unit |
| --- | --- | --- |
| CPU utilisation | `container_cpu_usage_seconds_total` | Cores / millicores |
| Memory usage | `container_memory_working_set_bytes` | MiB |
| Network receive | `container_network_receive_bytes_total` | MB/s |
| Network transmit | `container_network_transmit_bytes_total` | MB/s |
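
The panel sources above can be queried directly. For example, per-pod CPU and network-receive rates — the queries are a sketch using the namespace that appears in the LogQL examples later on this page:

```promql
# Per-pod CPU usage in cores (rate over the cAdvisor counter)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="master-prod-de"}[5m]))

# Per-pod network receive throughput in bytes/s (divide by 1e6 for MB/s)
sum by (pod) (rate(container_network_receive_bytes_total{namespace="master-prod-de"}[5m]))
```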

Pipeline Monitoring Dashboard

Tracks Calabi Pipelines operational health:

| Panel | Description |
| --- | --- |
| DAG run success rate | Percentage of DAG runs that completed successfully in the last 24 hours |
| Task failure count | Count of failed tasks by DAG name |
| Scheduler heartbeat | Time since the Airflow scheduler last heartbeated |
| Zombie task count | Tasks stuck in running state for more than 30 minutes |
| Queue depth | Number of tasks waiting to be picked up by a worker |
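
Using the metric names from the Key Metrics Reference, the heartbeat and failure panels could be expressed roughly as follows. These queries are a sketch: the exact label names and whether the heartbeat metric exposes a timestamp depend on the Airflow exporter in use.

```promql
# Seconds since the last scheduler heartbeat (assumes the metric
# exposes a heartbeat timestamp)
time() - airflow_scheduler_heartbeat

# Failed DAG runs in the last 24 hours, grouped by DAG
# (the dag_id label is an assumption)
sum by (dag_id) (increase(airflow_dag_run_failed[24h]))
```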

BI Analytics (CalabiIQ) Dashboard

| Panel | Description |
| --- | --- |
| Query latency (P50 / P95 / P99) | SQL query execution time percentiles |
| Cache hit rate | Percentage of queries served from result cache |
| Active users (15m window) | Unique users who ran a query in the last 15 minutes |
| Failed queries | Count of queries that returned an error |

Key Metrics Reference

| Metric name | Source service | Alert threshold (default) |
| --- | --- | --- |
| `calabi_http_requests_5xx_rate` | All services | > 5% of requests over 5 min |
| `calabi_http_p99_latency_seconds` | All services | > 5 seconds over 5 min |
| `kube_pod_container_status_restarts_total` | Kubernetes | > 3 restarts in 1 hour |
| `container_cpu_usage_cores` | All pods | > 90% of CPU limit for 10 min |
| `container_memory_working_set_bytes` | All pods | > 90% of memory limit for 5 min |
| `airflow_scheduler_heartbeat` | Calabi Pipelines | No heartbeat for > 60 seconds |
| `airflow_dag_run_failed` | Calabi Pipelines | Any failure in critical DAG |
| `calabiiq_query_error_count` | CalabiIQ | > 10 errors in 5 min |
| `calabi_ml_experiment_run_failed` | Calabi ML | Any run failure |
| `calabi_connect_sync_lag_seconds` | Calabi Connect | > 3600 seconds (1 hour) |
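
As an illustration of how a default threshold maps onto an alerting rule, the 5xx condition could be written as a Prometheus-style rule. The group, alert, and label names here are illustrative, not the rules Calabi ships:

```yaml
groups:
  - name: calabi-platform            # illustrative group name
    rules:
      - alert: CalabiHigh5xxRate
        # Fires when the platform-wide 5xx rate exceeds 5% of requests
        expr: calabi_http_requests_5xx_rate > 0.05
        for: 5m                      # must hold for the full 5-minute window
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% for 5 minutes"
```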

Log Aggregation

All Calabi service pods write structured JSON logs to stdout. Fluent Bit collects these logs and forwards them to Calabi Logs. You can explore logs in the Log Explorer tab within Monitoring.

Log Explorer

The Log Explorer uses LogQL to search and filter logs across all services.

Example LogQL queries:

```logql
# All error-level logs from Calabi Pipelines in the last hour
{namespace="master-prod-de", app="airflow"} |= "ERROR"

# 500 errors from the CalabiIQ service
{namespace="master-prod-de", app="calabi-iq"} | json | status >= 500

# Slow queries (> 5 seconds) from any service
{namespace="master-prod-de"} | json | duration > 5000

# Logs from a specific pod
{namespace="master-prod-de", pod="calabi-connect-7b8f9d-xk2m9"}
```

Log retention

| Tier | Default log retention |
| --- | --- |
| Starter | 7 days |
| Pro | 30 days |
| Enterprise | 90 days (configurable up to 365 days) |

Alert Channels

Alerts are triggered by Calabi Monitoring when a metric crosses a defined threshold. Calabi supports three alert delivery channels.

Slack

  1. Go to Cloud Operations > Configure > Notification Channels.
  2. Click Add Channel > Slack.
  3. Paste the Incoming Webhook URL for your Slack workspace.
  4. Set the default channel (e.g., #calabi-alerts).
  5. Click Test to send a test message.
  6. Click Save.

PagerDuty

  1. Go to Cloud Operations > Configure > Notification Channels.
  2. Click Add Channel > PagerDuty.
  3. Enter your PagerDuty Integration Key (from a PagerDuty service integration).
  4. Set the severity mapping: Calabi Critical maps to PagerDuty critical, High maps to error.
  5. Click Save.

Email

  1. Go to Cloud Operations > Configure > Notification Channels.
  2. Click Add Channel > Email.
  3. Enter one or more recipient email addresses.
  4. Set a subject prefix (e.g., [Calabi Alert]).
  5. Click Save.

Alert Routing Rules

You can route different alert types to different channels. For example: route Critical alerts to PagerDuty, High to Slack, and everything else to email.

| Rule condition | Channel |
| --- | --- |
| severity = critical | PagerDuty |
| severity = high | Slack — #calabi-on-call |
| severity = medium OR low | Email — ops@yourcompany.com |
| service = airflow AND severity = critical | Slack — #data-engineering |
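
A routing table like the one above could be sketched as an Alertmanager-style routing tree. The receiver names and matcher syntax are assumptions for illustration, not Calabi's actual routing configuration format; note that the first matching child route wins, so the more specific airflow rule is listed before the generic critical rule:

```yaml
# Illustrative Alertmanager-style routing tree -- receiver names
# are assumptions, not Calabi's actual configuration.
route:
  receiver: ops-email                # default: medium / low severity
  routes:
    - matchers: [severity="critical", service="airflow"]
      receiver: slack-data-engineering
    - matchers: [severity="critical"]
      receiver: pagerduty
    - matchers: [severity="high"]
      receiver: slack-on-call
```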

Adding Custom Dashboards and Panels

You can add custom dashboards and panels to extend the built-in monitoring coverage.

Clone and extend a built-in dashboard

  1. Open the dashboard you want to extend.
  2. Click the dashboard menu (three dots) and select Clone.
  3. Give the clone a new name and save it to a folder.
  4. Edit the clone — add, remove, or modify panels freely without affecting the original.

Create a new panel

  1. Open a custom dashboard.
  2. Click Add panel.
  3. Enter a PromQL query for metrics or a LogQL query for logs.
  4. Choose the visualization type (time series, stat, gauge, bar chart, table, logs).
  5. Set the panel title, axes labels, and thresholds.
  6. Click Apply.

Example PromQL for a custom panel

```promql
# HTTP 5xx rate for the CalabiIQ service only.
# sum() on both sides is required: without it, the status label on the
# numerator prevents the division from matching any denominator series.
sum(rate(calabi_http_requests_total{app="calabi-iq", status=~"5.."}[5m]))
/
sum(rate(calabi_http_requests_total{app="calabi-iq"}[5m]))

# Memory pressure across all Calabi pods (usage as a fraction of limit)
sum by (pod) (
  container_memory_working_set_bytes{namespace="master-prod-de"}
)
/
sum by (pod) (
  kube_pod_container_resource_limits{namespace="master-prod-de", resource="memory"}
)
```