Platform Monitoring

All Tiers

Calabi includes a comprehensive monitoring stack that gives administrators visibility into platform health, Kubernetes pod status, pipeline run metrics, user activity, and system-level resource utilization. This page covers the built-in health dashboards, metric categories, alert configuration, and CloudWatch integration.


Health Dashboard

The Health Dashboard is accessible from Admin → Monitoring → Platform Health. It provides a real-time status overview of every Calabi service.

Service Status Panel

| Service | Status Indicators | Refresh Interval |
|---|---|---|
| Calabi API | HTTP response time, error rate, p99 latency | 30 seconds |
| CalabiIQ | Query engine status, cache hit rate, active connections | 30 seconds |
| Calabi Catalogue | Indexing queue depth, search response time | 30 seconds |
| Calabi Pipelines | Scheduler heartbeat, DAG import errors, task queue depth | 1 minute |
| Calabi Connect | Active syncs, connector error count, last sync time per connector | 1 minute |
| Calabi Automate | Active workflows, execution queue length, failure rate | 30 seconds |
| Calabi AI Builder | LLM request queue, local model pod status, average response time | 1 minute |
| Calabi ML | Tracking server status, artifact store connectivity | 1 minute |
| AI Agent | Request rate, tool call latency, error rate | 30 seconds |
| Database (RDS) | CPU, connections, read/write IOPS, free storage | 1 minute |
| Redis | Memory utilization, command throughput, eviction rate | 1 minute |
| Kubernetes | Pod health by namespace, PVC usage, node CPU/memory | 1 minute |

Each service shows a color-coded status:

  • Green — all metrics within normal bounds
  • Yellow — one or more metrics approaching warning thresholds
  • Red — one or more metrics exceeding critical thresholds or service is unreachable

Kubernetes Pod Health

The Infrastructure tab shows real-time Kubernetes pod health for the Calabi tenant namespace.

Pod Status Table

| Column | Description |
|---|---|
| Pod Name | Kubernetes pod name |
| Service | The Calabi service the pod belongs to |
| Status | Running, Pending, CrashLoopBackOff, OOMKilled, Evicted |
| Restarts | Number of container restarts (high values indicate instability) |
| CPU Request / Limit | Configured vs. actual CPU usage |
| Memory Request / Limit | Configured vs. actual memory usage |
| Age | Time since the pod was created |
| Node | Kubernetes node the pod is scheduled on |

Common Pod Issues and Resolutions

| Status | Likely Cause | Resolution |
|---|---|---|
| CrashLoopBackOff | Application error on startup | Check pod logs: `kubectl logs <pod> -n calabi-tenant-<id>` |
| OOMKilled | Memory limit exceeded | Increase memory limits in Helm values |
| Pending | No available nodes with sufficient resources | Scale up the Kubernetes node group |
| Evicted | Node ran out of memory or disk | Review node pressure; consider adding nodes or increasing storage |
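For the OOMKilled case, the fix is usually a memory-limit bump in your Helm values. A sketch of what that override might look like — the exact key paths are illustrative and should be checked against your chart's values.yaml:

```yaml
# Illustrative Helm values override — confirm the actual key paths
# (service name, resources block) in your deployed chart before applying.
calabiPipelines:
  worker:
    resources:
      requests:
        memory: "2Gi"
      limits:
        memory: "4Gi"   # raise above the observed peak to stop OOMKills
```

Apply with `helm upgrade` and watch the Restarts column to confirm the pod stabilizes.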

Pipeline Run Metrics

The Pipelines monitoring tab surfaces Calabi Pipelines execution metrics.

Key Pipeline Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| DAG Success Rate (24h) | Percentage of DAG runs that completed successfully | Alert if < 95% |
| DAG Failure Count (1h) | Number of failed DAG runs in the past hour | Alert if > 3 |
| Task Duration P95 | 95th percentile task execution time | Alert if > 2× 30-day average |
| Task Queue Depth | Number of tasks waiting for a worker | Alert if > 50 |
| Scheduler Heartbeat | Seconds since the Calabi Pipelines scheduler last checked in | Alert if > 60s |
| Active DAG Runs | Number of DAGs currently executing | Informational |
| SLA Miss Rate | Percentage of runs that exceeded their SLA | Alert if > 5% |
| Zombie Tasks | Tasks that started but have no heartbeat | Alert if > 0 |
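The thresholds above can be checked mechanically. A minimal sketch in Python — the metric key names and sample values are illustrative, not a Calabi API:

```python
# Sketch: evaluate the pipeline alert thresholds from the table above.
# Metric key names and sample values are illustrative, not a Calabi API.

def pipeline_alerts(metrics, p95_baseline):
    """Return the names of metrics that breach their alert thresholds."""
    alerts = []
    if metrics["dag_success_rate_24h"] < 0.95:
        alerts.append("DAG Success Rate (24h)")
    if metrics["dag_failure_count_1h"] > 3:
        alerts.append("DAG Failure Count (1h)")
    if metrics["task_duration_p95"] > 2 * p95_baseline:  # 2x 30-day average
        alerts.append("Task Duration P95")
    if metrics["task_queue_depth"] > 50:
        alerts.append("Task Queue Depth")
    if metrics["scheduler_heartbeat_age_s"] > 60:
        alerts.append("Scheduler Heartbeat")
    if metrics["sla_miss_rate"] > 0.05:
        alerts.append("SLA Miss Rate")
    if metrics["zombie_tasks"] > 0:
        alerts.append("Zombie Tasks")
    return alerts

sample = {
    "dag_success_rate_24h": 0.91,
    "dag_failure_count_1h": 2,
    "task_duration_p95": 180.0,
    "task_queue_depth": 12,
    "scheduler_heartbeat_age_s": 75,
    "sla_miss_rate": 0.02,
    "zombie_tasks": 0,
}
print(pipeline_alerts(sample, p95_baseline=120.0))
# → ['DAG Success Rate (24h)', 'Scheduler Heartbeat']
```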

User Activity Audit Logs

All user actions in Calabi are recorded in the Audit Log. Access it from Admin → Audit Logs.

Logged Events

| Category | Events Captured |
|---|---|
| Authentication | Login success, login failure, logout, MFA attempt, session revocation |
| User Management | User created, updated, deactivated; role assigned/removed |
| Data Access | SQL query executed, chart viewed, dashboard accessed, data exported |
| Asset Management | Asset description updated, tag applied, owner changed, quality test modified |
| Pipeline Operations | DAG triggered, paused, unpaused; task cleared |
| Automation | Workflow activated/deactivated, credential created/deleted |
| Admin Actions | Helm configuration changed, SSO configured, SCIM token generated |
| AI Agent | Conversation started, tool called, file downloaded |

Audit Log Schema

Each audit event contains:

{
  "event_id": "evt_01HXYZ...",
  "timestamp": "2026-04-06T14:23:11.453Z",
  "event_type": "data.export.csv",
  "actor": {
    "user_id": "usr_abc123",
    "email": "jane.smith@acme.com",
    "role": "Analyst",
    "ip_address": "203.0.113.42",
    "user_agent": "Mozilla/5.0 (Macintosh; ...)"
  },
  "resource": {
    "type": "chart",
    "id": "chart_xyz789",
    "name": "Q1 Revenue by Region"
  },
  "outcome": "success",
  "metadata": {
    "row_count": 15420,
    "file_format": "csv",
    "query_duration_ms": 1243
  }
}
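Events in this schema are straightforward to consume programmatically. A small sketch that parses a (shortened, illustrative) event and builds a one-line summary — field names follow the documented schema:

```python
import json

# Sketch: parse an audit event in the documented schema. The record is a
# shortened, illustrative version of the example above, not real data.
event = json.loads("""
{
  "event_id": "evt_01HXYZ",
  "timestamp": "2026-04-06T14:23:11.453Z",
  "event_type": "data.export.csv",
  "actor": {"user_id": "usr_abc123", "email": "jane.smith@acme.com", "role": "Analyst"},
  "resource": {"type": "chart", "id": "chart_xyz789", "name": "Q1 Revenue by Region"},
  "outcome": "success",
  "metadata": {"row_count": 15420, "file_format": "csv"}
}
""")

summary = (f"{event['actor']['email']} performed {event['event_type']} "
           f"on {event['resource']['type']} '{event['resource']['name']}' "
           f"({event['outcome']})")
print(summary)
# → jane.smith@acme.com performed data.export.csv on chart 'Q1 Revenue by Region' (success)
```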

Filtering and Exporting Audit Logs

  • Filter by: event type, actor email, date range, resource type, outcome (success/failure).
  • Export the filtered log as CSV for compliance reporting.
  • Audit logs are retained for 90 days in the Calabi UI; for longer retention, configure CloudWatch export (see below).
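The same filter-then-export flow can be reproduced outside the UI once events are exported. A sketch using flattened event dicts (the field names mirror the schema above; this is not a Calabi API):

```python
import csv
import io

# Sketch: filter exported audit events by outcome, then write them as CSV
# for compliance reporting. Event dicts are flattened and illustrative.
events = [
    {"event_type": "auth.login", "actor_email": "jane.smith@acme.com", "outcome": "failure"},
    {"event_type": "data.export.csv", "actor_email": "jane.smith@acme.com", "outcome": "success"},
    {"event_type": "auth.login", "actor_email": "bob@acme.com", "outcome": "success"},
]

# Filter: outcome == failure (the UI also supports event type, actor,
# date range, and resource type filters).
failures = [e for e in events if e["outcome"] == "failure"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["event_type", "actor_email", "outcome"])
writer.writeheader()
writer.writerows(failures)
print(buf.getvalue())
```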

Key Metrics to Monitor

| Metric | Source | Warning | Critical | Notes |
|---|---|---|---|---|
| API gateway error rate (5xx) | Calabi API | > 1% | > 5% | Indicates service-level failures |
| CalabiIQ query latency P99 | CalabiIQ | > 10s | > 30s | Affects analyst experience |
| RDS CPU utilization | CloudWatch | > 70% | > 90% | May need vertical scaling |
| RDS free storage | CloudWatch | < 20 GB | < 5 GB | Provision more storage before breach |
| Redis memory utilization | CloudWatch | > 70% | > 90% | Evictions cause session/cache issues |
| Kubernetes node CPU | CloudWatch | > 70% | > 85% | Scale node group before throttling |
| Kubernetes node memory | CloudWatch | > 75% | > 90% | OOMKill risk above 90% |
| Pipeline failure rate (1h) | Calabi Pipelines | > 10% | > 25% | Likely upstream data issue or schema change |
| Calabi Connect sync failures | Calabi Connect | > 1 | > 3 | Source system or credential issue |
| AI Agent error rate | AI Agent | > 5% | > 15% | Check LLM API keys and quota |
| Local model pod memory | Kubernetes | > 80% | > 95% | Model too large for node |
| Audit log ingestion lag | CloudWatch | > 60s | > 300s | Log pipeline issue |
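Note that threshold direction differs by metric: most alert when the value is high, but RDS free storage alerts when it is low. A small sketch of a classifier that handles both directions (illustrative, not Calabi's implementation):

```python
# Sketch: classify a metric sample against warning/critical thresholds
# from the table. Most metrics alert when high; pass low_is_bad=True for
# metrics like RDS free storage that alert when low.

def classify(value, warning, critical, low_is_bad=False):
    """Return 'ok', 'warning', or 'critical' for a metric sample."""
    if low_is_bad:
        # Negate everything so "lower is worse" becomes "higher is worse".
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"

print(classify(0.03, 0.01, 0.05))            # API 5xx rate at 3%
print(classify(12, 20, 5, low_is_bad=True))  # 12 GB free RDS storage
print(classify(0.92, 0.70, 0.90))            # Redis memory at 92%
# → warning, warning, critical
```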

Setting Up Alerts

PagerDuty Integration

  1. Navigate to Admin → Monitoring → Alerts → + New Alert Channel.
  2. Select PagerDuty.
  3. Enter your PagerDuty Integration Key (from PagerDuty → Services → Integrations → Events API v2).
  4. Click Test to send a test event, then Save.
  5. Configure alert rules:
    • Navigate to Alert Rules → + New Rule.
    • Select the metric, threshold, and duration.
    • Assign the PagerDuty channel as the notification target.
    • Set severity: Warning triggers low-urgency PagerDuty; Critical triggers high-urgency PagerDuty.
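Under the hood, a PagerDuty notification is an Events API v2 call. A sketch of the payload a critical alert would produce — the routing-key placeholder, source name, and summary text are illustrative, while the envelope fields follow PagerDuty's documented Events API v2 format:

```python
import json

# Sketch: build a PagerDuty Events API v2 payload for a Calabi alert.
# Routing key, source, and summary are illustrative placeholders.
def pagerduty_event(routing_key, summary, severity):
    # Calabi "Critical" maps to PagerDuty severity "critical" (high urgency);
    # everything else is sent as "warning" (low urgency).
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "calabi-monitoring",
            "severity": "critical" if severity == "critical" else "warning",
        },
    }

event = pagerduty_event("YOUR_INTEGRATION_KEY", "RDS free storage below 5 GB", "critical")
print(json.dumps(event, indent=2))
# This JSON would be POSTed to https://events.pagerduty.com/v2/enqueue
```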

Slack Integration

  1. Navigate to Admin → Monitoring → Alerts → + New Alert Channel.
  2. Select Slack.
  3. Enter the Slack Webhook URL (from Slack → Incoming Webhooks → Add to Slack).
  4. Choose the target channel (e.g., #platform-alerts).
  5. Configure which severity levels to send:
    • Warning: #platform-warnings
    • Critical: #platform-alerts (multi-channel alerts supported)
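A Slack incoming webhook accepts a JSON body with a "text" field. A sketch of the message an alert might post, with the severity-to-channel routing described above (the message format and emoji are illustrative):

```python
import json

# Sketch: build a Slack incoming-webhook message for a Calabi alert.
# Each webhook is bound to a channel, so routing picks which webhook
# to use; message text and emoji are illustrative.
def slack_message(severity, metric, value):
    channel = "#platform-alerts" if severity == "critical" else "#platform-warnings"
    body = {"text": f":rotating_light: [{severity.upper()}] {metric} = {value}"}
    return channel, body

channel, body = slack_message("critical", "redis_memory_utilization", "92%")
print(channel, json.dumps(body))
```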

Alert Rule Configuration Example

# Configured via Admin UI or Helm values
alert_rules:
  - name: "Pipeline failure spike"
    metric: "calabi_pipelines_failure_rate_1h"
    condition: "> 0.1"  # 10% failure rate
    duration: "5m"
    severity: "warning"
    channels: ["slack-warnings"]

  - name: "RDS storage critical"
    metric: "aws_rds_free_storage_bytes"
    condition: "< 5368709120"  # 5 GB
    duration: "1m"
    severity: "critical"
    channels: ["pagerduty", "slack-alerts"]

  - name: "Calabi Pipelines scheduler offline"
    metric: "calabi_pipelines_scheduler_heartbeat_age_seconds"
    condition: "> 60"
    duration: "2m"
    severity: "critical"
    channels: ["pagerduty", "slack-alerts"]
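To make the rule semantics concrete, here is a sketch of how a rule's condition string could be evaluated against a metric sample. The rule shape mirrors the YAML above; the evaluator itself is illustrative, not Calabi's implementation:

```python
import operator

# Sketch: evaluate a rule's "condition" string against a metric sample.
# The rule dict mirrors the YAML above; this evaluator is illustrative.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def breaches(rule, value):
    """True if the sampled value violates the rule's condition."""
    op, threshold = rule["condition"].split()
    return OPS[op](value, float(threshold))

rule = {
    "name": "Pipeline failure spike",
    "metric": "calabi_pipelines_failure_rate_1h",
    "condition": "> 0.1",
    "duration": "5m",      # the alert fires only after 5 continuous minutes
    "severity": "warning",
}
print(breaches(rule, 0.14))
# → True
```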

CloudWatch Integration

Calabi exports all metrics and logs to AWS CloudWatch, enabling long-term retention, custom dashboards, and integration with your organization's existing AWS monitoring infrastructure.

What Gets Exported

| Category | CloudWatch Namespace | Retention |
|---|---|---|
| Application metrics | Calabi/<tenant-id> | Configurable (default: 15 months) |
| Kubernetes pod metrics | ContainerInsights | 15 months |
| RDS metrics | AWS/RDS | 15 months |
| Application logs | CloudWatch Logs: /calabi/<tenant-id>/app | 90 days (configurable) |
| Audit logs | CloudWatch Logs: /calabi/<tenant-id>/audit | 7 years (configurable) |
| Kubernetes logs | CloudWatch Logs: /calabi/<tenant-id>/k8s | 30 days |

Configuring CloudWatch Export

In your Calabi Helm values (client/values.yaml):

monitoring:
  cloudwatch:
    enabled: true
    region: "us-east-1"
    logRetentionDays: 90
    auditLogRetentionDays: 2555  # 7 years for compliance
    metrics:
      enabled: true
      namespace: "Calabi/prod"
    logs:
      enabled: true
      logGroupPrefix: "/calabi/prod"

CloudWatch Alarms via Terraform

Calabi ships a Terraform module that provisions the recommended CloudWatch alarms:

cd calabi-infra/modules/cloudwatch-alarms
terraform apply \
-var="tenant_id=my-company" \
-var="pagerduty_sns_arn=arn:aws:sns:us-east-1:123456789:pd-critical" \
-var="slack_sns_arn=arn:aws:sns:us-east-1:123456789:slack-warnings"