Platform Monitoring
Calabi includes a comprehensive monitoring stack that gives administrators visibility into platform health, Kubernetes pod status, pipeline run metrics, user activity, and system-level resource utilization. This page covers the built-in health dashboards, metric categories, alert configuration, and CloudWatch integration.
Health Dashboard
The Health Dashboard is accessible from Admin → Monitoring → Platform Health. It provides a real-time status overview of every Calabi service.
Service Status Panel
| Service | Status Indicators | Refresh Interval |
|---|---|---|
| Calabi API | HTTP response time, error rate, p99 latency | 30 seconds |
| CalabiIQ | Query engine status, cache hit rate, active connections | 30 seconds |
| Calabi Catalogue | Indexing queue depth, search response time | 30 seconds |
| Calabi Pipelines | Scheduler heartbeat, DAG import errors, task queue depth | 1 minute |
| Calabi Connect | Active syncs, connector error count, last sync time per connector | 1 minute |
| Calabi Automate | Active workflows, execution queue length, failure rate | 30 seconds |
| Calabi AI Builder | LLM request queue, local model pod status, average response time | 1 minute |
| Calabi ML | Tracking server status, artifact store connectivity | 1 minute |
| AI Agent | Request rate, tool call latency, error rate | 30 seconds |
| Database (RDS) | CPU, connections, read/write IOPS, free storage | 1 minute |
| Redis | Memory utilization, command throughput, eviction rate | 1 minute |
| Kubernetes | Pod health by namespace, PVC usage, node CPU/memory | 1 minute |
Each service shows a color-coded status:
- Green — all metrics within normal bounds
- Yellow — one or more metrics approaching warning thresholds
- Red — one or more metrics exceeding critical thresholds or service is unreachable
Kubernetes Pod Health
The Infrastructure tab shows real-time Kubernetes pod health for the Calabi tenant namespace.
Pod Status Table
| Column | Description |
|---|---|
| Pod Name | Kubernetes pod name |
| Service | The Calabi service the pod belongs to |
| Status | Running, Pending, CrashLoopBackOff, OOMKilled, Evicted |
| Restarts | Number of container restarts (high values indicate instability) |
| CPU Request / Limit | Configured CPU request and limit alongside actual CPU usage |
| Memory Request / Limit | Configured memory request and limit alongside actual memory usage |
| Age | Time since pod was created |
| Node | Kubernetes node the pod is scheduled on |
Common Pod Issues and Resolutions
| Status | Likely Cause | Resolution |
|---|---|---|
| CrashLoopBackOff | Application error on startup | Check pod logs: `kubectl logs <pod> -n calabi-tenant-<id>` |
| OOMKilled | Memory limit exceeded | Increase memory limits in Helm values |
| Pending | No available nodes with sufficient resources | Scale up the Kubernetes node group |
| Evicted | Node ran out of memory or disk | Review node pressure; consider adding nodes or increasing storage |
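The same pod information shown in the Infrastructure tab can be cross-checked from the command line. A minimal sketch, assuming you have `kubectl` access to the tenant namespace (replace `<id>` and `<pod>` with your own values):

```bash
# List all pods in the tenant namespace with status, restart count, and node placement
kubectl get pods -n calabi-tenant-<id> -o wide

# Inspect a problematic pod; the Events section usually explains Pending or Evicted states
kubectl describe pod <pod> -n calabi-tenant-<id>

# Compare actual CPU/memory usage against requests and limits (requires metrics-server)
kubectl top pods -n calabi-tenant-<id>

# Tail logs from a crashing container, including the previous (crashed) instance
kubectl logs <pod> -n calabi-tenant-<id> --previous
```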
Pipeline Run Metrics
The Pipelines monitoring tab surfaces Calabi Pipelines execution metrics.
Key Pipeline Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| DAG Success Rate (24h) | Percentage of DAG runs that completed successfully | Alert if < 95% |
| DAG Failure Count (1h) | Number of failed DAG runs in the past hour | Alert if > 3 |
| Task Duration P95 | 95th percentile task execution time | Alert if > 2× 30-day average |
| Task Queue Depth | Number of tasks waiting for a worker | Alert if > 50 |
| Scheduler Heartbeat | Seconds since the Calabi Pipelines scheduler last checked in | Alert if > 60s |
| Active DAG Runs | Number of DAGs currently executing | Informational |
| SLA Miss Rate | Percentage of runs that exceeded SLA | Alert if > 5% |
| Zombie Tasks | Tasks that started but have no heartbeat | Alert if > 0 |
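Because these metrics are also exported to CloudWatch (see CloudWatch Integration below), they can be pulled with the AWS CLI for ad-hoc checks. A minimal sketch, assuming the scheduler heartbeat metric is published under the `Calabi/<tenant-id>` namespace with the same name used in the alert rule example later on this page:

```bash
# Fetch the scheduler heartbeat age over the last 15 minutes at 1-minute resolution
# (GNU date syntax shown for the timestamp arguments)
aws cloudwatch get-metric-statistics \
  --namespace "Calabi/<tenant-id>" \
  --metric-name calabi_pipelines_scheduler_heartbeat_age_seconds \
  --statistics Maximum \
  --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60
```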
User Activity Audit Logs
All user actions in Calabi are recorded in the Audit Log. Access it from Admin → Audit Logs.
Logged Events
| Category | Events Captured |
|---|---|
| Authentication | Login success, login failure, logout, MFA attempt, session revocation |
| User Management | User created, updated, deactivated; role assigned/removed |
| Data Access | SQL query executed, chart viewed, dashboard accessed, data exported |
| Asset Management | Asset description updated, tag applied, owner changed, quality test modified |
| Pipeline Operations | DAG triggered, paused, unpaused; task cleared |
| Automation | Workflow activated/deactivated, credential created/deleted |
| Admin Actions | Helm configuration changed, SSO configured, SCIM token generated |
| AI Agent | Conversation started, tool called, file downloaded |
Audit Log Schema
Each audit event contains:
```json
{
  "event_id": "evt_01HXYZ...",
  "timestamp": "2026-04-06T14:23:11.453Z",
  "event_type": "data.export.csv",
  "actor": {
    "user_id": "usr_abc123",
    "email": "jane.smith@acme.com",
    "role": "Analyst",
    "ip_address": "203.0.113.42",
    "user_agent": "Mozilla/5.0 (Macintosh; ...)"
  },
  "resource": {
    "type": "chart",
    "id": "chart_xyz789",
    "name": "Q1 Revenue by Region"
  },
  "outcome": "success",
  "metadata": {
    "row_count": 15420,
    "file_format": "csv",
    "query_duration_ms": 1243
  }
}
```
Filtering and Exporting Audit Logs
- Filter by: event type, actor email, date range, resource type, outcome (success/failure).
- Export the filtered log as CSV for compliance reporting.
- Audit logs are retained for 90 days in the Calabi UI; for longer retention, configure CloudWatch export (see below).
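For retention beyond 90 days, or for ad-hoc investigation, exported audit events can be queried directly in CloudWatch Logs Insights. A minimal sketch; the log group name follows the audit-log path listed under CloudWatch Integration below, and the field names match the audit event schema above:

```bash
# Find failed data-export events over the last 24 hours
aws logs start-query \
  --log-group-name "/calabi/<tenant-id>/audit" \
  --start-time "$(date -u -d '24 hours ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string 'fields @timestamp, event_type, actor.email, outcome
                  | filter event_type like /data.export/ and outcome = "failure"
                  | sort @timestamp desc
                  | limit 100'

# Retrieve results once the query finishes, using the queryId returned above
aws logs get-query-results --query-id <query-id>
```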
Key Metrics to Monitor
| Metric | Source | Warning | Critical | Notes |
|---|---|---|---|---|
| API gateway error rate (5xx) | Calabi API | > 1% | > 5% | Indicates service-level failures |
| CalabiIQ query latency P99 | CalabiIQ | > 10s | > 30s | Affects analyst experience |
| RDS CPU utilization | CloudWatch | > 70% | > 90% | May need vertical scaling |
| RDS free storage | CloudWatch | < 20 GB | < 5 GB | Provision more storage before breach |
| Redis memory utilization | CloudWatch | > 70% | > 90% | Evictions cause session/cache issues |
| Kubernetes node CPU | CloudWatch | > 70% | > 85% | Scale node group before throttling |
| Kubernetes node memory | CloudWatch | > 75% | > 90% | OOMKill risk above 90% |
| Pipeline failure rate (1h) | Calabi Pipelines | > 10% | > 25% | Likely upstream data issue or schema change |
| Calabi Connect sync failures | Calabi Connect | > 1 | > 3 | Source system or credential issue |
| AI Agent error rate | AI Agent | > 5% | > 15% | Check LLM API keys and quota |
| Local model pod memory | Kubernetes | > 80% | > 95% | Model too large for node |
| Audit log ingestion lag | CloudWatch | > 60s | > 300s | Log pipeline issue |
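Most of these thresholds can be encoded directly as CloudWatch alarms. The Terraform module described later on this page provisions the recommended set, but a single alarm can also be created ad hoc. A minimal sketch for the RDS free-storage critical threshold, assuming an existing SNS topic for notifications and your RDS instance identifier:

```bash
# Alarm when RDS free storage drops below 5 GB (the critical threshold above)
aws cloudwatch put-metric-alarm \
  --alarm-name "calabi-rds-free-storage-critical" \
  --namespace AWS/RDS \
  --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=<rds-instance-id> \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 5368709120 \
  --comparison-operator LessThanThreshold \
  --alarm-actions <sns-topic-arn>
```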
Setting Up Alerts
PagerDuty Integration
- Navigate to Admin → Monitoring → Alerts → + New Alert Channel.
- Select PagerDuty.
- Enter your PagerDuty Integration Key (from PagerDuty → Services → Integrations → Events API v2).
- Click Test to send a test event, then Save.
- Configure alert rules:
- Navigate to Alert Rules → + New Rule.
- Select the metric, threshold, and duration.
- Assign the PagerDuty channel as the notification target.
- Set severity: Warning triggers low-urgency PagerDuty; Critical triggers high-urgency PagerDuty.
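To confirm the integration key is valid outside of Calabi's Test button, you can send an event straight to the PagerDuty Events API v2. A minimal sketch; the fields shown are the standard Events API v2 payload format, not necessarily the exact payload Calabi emits:

```bash
curl -s -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "<your-integration-key>",
    "event_action": "trigger",
    "payload": {
      "summary": "Calabi alert channel test",
      "source": "calabi-monitoring",
      "severity": "warning"
    }
  }'
```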
Slack Integration
- Navigate to Admin → Monitoring → Alerts → + New Alert Channel.
- Select Slack.
- Enter the Slack Webhook URL (from Slack → Incoming Webhooks → Add to Slack).
- Choose the target channel (e.g., #platform-alerts).
- Configure which severity levels to send:
  - Warning: #platform-warnings
  - Critical: #platform-alerts (multi-channel alerts supported)
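As with PagerDuty, the webhook can be verified directly before wiring it to alert rules. A minimal sketch using the standard Slack incoming-webhook payload:

```bash
curl -s -X POST "<your-slack-webhook-url>" \
  -H "Content-Type: application/json" \
  -d '{"text": "Calabi alert channel test: warning notifications will be posted to #platform-warnings."}'
```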
Alert Rule Configuration Example
```yaml
# Configured via Admin UI or Helm values
alert_rules:
  - name: "Pipeline failure spike"
    metric: "calabi_pipelines_failure_rate_1h"
    condition: "> 0.1"          # 10% failure rate
    duration: "5m"
    severity: "warning"
    channels: ["slack-warnings"]
  - name: "RDS storage critical"
    metric: "aws_rds_free_storage_bytes"
    condition: "< 5368709120"   # 5 GB
    duration: "1m"
    severity: "critical"
    channels: ["pagerduty", "slack-alerts"]
  - name: "Calabi Pipelines scheduler offline"
    metric: "calabi_pipelines_scheduler_heartbeat_age_seconds"
    condition: "> 60"
    duration: "2m"
    severity: "critical"
    channels: ["pagerduty", "slack-alerts"]
```
CloudWatch Integration
Calabi exports all metrics and logs to AWS CloudWatch, enabling long-term retention, custom dashboards, and integration with your organization's existing AWS monitoring infrastructure.
What Gets Exported
| Category | CloudWatch Namespace | Retention |
|---|---|---|
| Application metrics | Calabi/<tenant-id> | Configurable (default: 15 months) |
| Kubernetes pod metrics | ContainerInsights | 15 months |
| RDS metrics | AWS/RDS | 15 months |
| Application logs | CloudWatch Logs: /calabi/<tenant-id>/app | 90 days (configurable) |
| Audit logs | CloudWatch Logs: /calabi/<tenant-id>/audit | 7 years (configurable) |
| Kubernetes logs | CloudWatch Logs: /calabi/<tenant-id>/k8s | 30 days |
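Once export is enabled, the log groups above can be followed directly with the AWS CLI (v2), which is useful when the Calabi UI itself is degraded. A minimal sketch, with the log group name taken from the table above:

```bash
# Live-tail application logs for the tenant; add --since 1h to backfill recent history
aws logs tail "/calabi/<tenant-id>/app" --follow --region us-east-1
```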
Configuring CloudWatch Export
In your Calabi Helm values (client/values.yaml):
```yaml
monitoring:
  cloudwatch:
    enabled: true
    region: "us-east-1"
    logRetentionDays: 90
    auditLogRetentionDays: 2555   # 7 years for compliance
    metrics:
      enabled: true
      namespace: "Calabi/prod"
    logs:
      enabled: true
      logGroupPrefix: "/calabi/prod"
```
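Applying the change requires a Helm upgrade of the Calabi release. A minimal sketch, assuming the release is named `calabi` and installed in the tenant namespace; substitute your actual release name, chart reference, and namespace:

```bash
# Roll the updated monitoring values out to the running deployment
helm upgrade calabi <calabi-chart> \
  -f client/values.yaml \
  -n calabi-tenant-<id>
```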
CloudWatch Alarms via Terraform
Calabi ships a Terraform module that provisions the recommended CloudWatch alarms:
```bash
cd calabi-infra/modules/cloudwatch-alarms
terraform apply \
  -var="tenant_id=my-company" \
  -var="pagerduty_sns_arn=arn:aws:sns:us-east-1:123456789:pd-critical" \
  -var="slack_sns_arn=arn:aws:sns:us-east-1:123456789:slack-warnings"
```
Related Pages
- Roles & Permissions — Who can access monitoring data
- Helm Configuration Reference — Configure monitoring settings in Helm
- Backup & Recovery — Monitor backup job health