Platform Monitoring
Calabi includes a comprehensive monitoring stack that gives administrators visibility into platform health, Kubernetes pod status, pipeline run metrics, user activity, and system-level resource utilization. This page covers the built-in health dashboards, metric categories, alert configuration, and CloudWatch integration.
Health Dashboard
The Health Dashboard is accessible from Admin → Monitoring → Platform Health. It provides a real-time status overview of every Calabi service.
Service Status Panel
| Service | Status Indicators | Refresh Interval |
|---|---|---|
| Calabi API | HTTP response time, error rate, p99 latency | 30 seconds |
| CalabiIQ | Query engine status, cache hit rate, active connections | 30 seconds |
| Calabi Catalogue | Indexing queue depth, search response time | 30 seconds |
| Calabi Pipelines | Scheduler heartbeat, DAG import errors, task queue depth | 1 minute |
| Calabi Connect | Active syncs, connector error count, last sync time per connector | 1 minute |
| Calabi Automate | Active workflows, execution queue length, failure rate | 30 seconds |
| Calabi AI Builder | LLM request queue, local model pod status, average response time | 1 minute |
| Calabi ML | Tracking server status, artifact store connectivity | 1 minute |
| AI Agent | Request rate, tool call latency, error rate | 30 seconds |
| Database (RDS) | CPU, connections, read/write IOPS, free storage | 1 minute |
| Redis | Memory utilization, command throughput, eviction rate | 1 minute |
| Kubernetes | Pod health by namespace, PVC usage, node CPU/memory | 1 minute |
Each service shows a color-coded status:
- Green — all metrics within normal bounds
- Yellow — one or more metrics approaching warning thresholds
- Red — one or more metrics exceeding critical thresholds or service is unreachable
Kubernetes Pod Health
The Infrastructure tab shows real-time Kubernetes pod health for the Calabi tenant namespace.
Pod Status Table
| Column | Description |
|---|---|
| Pod Name | Kubernetes pod name |
| Service | The Calabi service the pod belongs to |
| Status | Running, Pending, CrashLoopBackOff, OOMKilled, Evicted |
| Restarts | Number of container restarts (high values indicate instability) |
| CPU Request / Limit | Configured CPU request and limit alongside actual CPU usage |
| Memory Request / Limit | Configured memory request and limit alongside actual memory usage |
| Age | Time since pod was created |
| Node | Kubernetes node the pod is scheduled on |
Common Pod Issues and Resolutions
| Status | Likely Cause | Resolution |
|---|---|---|
| CrashLoopBackOff | Application error on startup | Check pod logs: `kubectl logs <pod> -n calabi-tenant-<id>` |
| OOMKilled | Memory limit exceeded | Increase memory limits in Helm values |
| Pending | No available nodes with sufficient resources | Scale up the Kubernetes node group |
| Evicted | Node ran out of memory or disk | Review node pressure; consider adding nodes or increasing storage |
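The same pod information shown in the Infrastructure tab can be cross-checked from the command line. A minimal sketch, assuming you have `kubectl` access to the tenant namespace (replace `<id>` and `<pod>` with your own values):

```bash
# List all pods in the tenant namespace with status, restart count, and node placement
kubectl get pods -n calabi-tenant-<id> -o wide

# Inspect a problematic pod; the Events section usually explains Pending or Evicted states
kubectl describe pod <pod> -n calabi-tenant-<id>

# Compare actual CPU/memory usage against requests and limits (requires metrics-server)
kubectl top pods -n calabi-tenant-<id>

# Tail logs from a crashing container, including the previous (crashed) instance
kubectl logs <pod> -n calabi-tenant-<id> --previous
```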
Pipeline Run Metrics
The Pipelines monitoring tab surfaces Calabi Pipelines execution metrics.
Key Pipeline Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| DAG Success Rate (24h) | Percentage of DAG runs that completed successfully | Alert if < 95% |
| DAG Failure Count (1h) | Number of failed DAG runs in the past hour | Alert if > 3 |
| Task Duration P95 | 95th percentile task execution time | Alert if > 2× 30-day average |
| Task Queue Depth | Number of tasks waiting for a worker | Alert if > 50 |
| Scheduler Heartbeat | Seconds since the Calabi Pipelines scheduler last checked in | Alert if > 60s |
| Active DAG Runs | Number of DAGs currently executing | Informational |
| SLA Miss Rate | Percentage of runs that exceeded SLA | Alert if > 5% |
| Zombie Tasks | Tasks that started but have no heartbeat | Alert if > 0 |
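Because these metrics are also exported to CloudWatch (see CloudWatch Integration below), they can be pulled with the AWS CLI for ad-hoc checks. A minimal sketch, assuming the scheduler heartbeat metric is published under the `Calabi/<tenant-id>` namespace with the same name used in the alert rule example later on this page:

```bash
# Fetch the scheduler heartbeat age over the last 15 minutes at 1-minute resolution
# (GNU date syntax shown for the timestamp arguments)
aws cloudwatch get-metric-statistics \
  --namespace "Calabi/<tenant-id>" \
  --metric-name calabi_pipelines_scheduler_heartbeat_age_seconds \
  --statistics Maximum \
  --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60
```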
User Activity Audit Logs
All user actions in Calabi are recorded in the Audit Log. Access it from Admin → Audit Logs.
Logged Events
| Category | Events Captured |
|---|---|
| Authentication | Login success, login failure, logout, MFA attempt, session revocation |
| User Management | User created, updated, deactivated; role assigned/removed |
| Data Access | SQL query executed, chart viewed, dashboard accessed, data exported |
| Asset Management | Asset description updated, tag applied, owner changed, quality test modified |
| Pipeline Operations | DAG triggered, paused, unpaused; task cleared |
| Automation | Workflow activated/deactivated, credential created/deleted |
| Admin Actions | Helm configuration changed, SSO configured, SCIM token generated |
| AI Agent | Conversation started, tool called, file downloaded |
Audit Log Schema
Each audit event contains:
```json
{
  "event_id": "evt_01HXYZ...",
  "timestamp": "2026-04-06T14:23:11.453Z",
  "event_type": "data.export.csv",
  "actor": {
    "user_id": "usr_abc123",
    "email": "jane.smith@acme.com",
    "role": "Analyst",
    "ip_address": "203.0.113.42",
    "user_agent": "Mozilla/5.0 (Macintosh; ...)"
  },
  "resource": {
    "type": "chart",
    "id": "chart_xyz789",
    "name": "Q1 Revenue by Region"
  },
  "outcome": "success",
  "metadata": {
    "row_count": 15420,
    "file_format": "csv",
    "query_duration_ms": 1243
  }
}
```
Filtering and Exporting Audit Logs
- Filter by: event type, actor email, date range, resource type, outcome (success/failure).
- Export the filtered log as CSV for compliance reporting.
- Audit logs are retained for 90 days in the Calabi UI; for longer retention, configure CloudWatch export (see below).
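For retention beyond 90 days, or for ad-hoc investigation, exported audit events can be queried directly in CloudWatch Logs Insights. A minimal sketch; the log group name follows the audit-log path listed under CloudWatch Integration below, and the field names match the audit event schema above:

```bash
# Find failed data-export events over the last 24 hours
aws logs start-query \
  --log-group-name "/calabi/<tenant-id>/audit" \
  --start-time "$(date -u -d '24 hours ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string 'fields @timestamp, event_type, actor.email, outcome
                  | filter event_type like /data.export/ and outcome = "failure"
                  | sort @timestamp desc
                  | limit 100'

# Retrieve results once the query finishes, using the queryId returned above
aws logs get-query-results --query-id <query-id>
```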
Key Metrics to Monitor
| Metric | Source | Warning | Critical | Notes |
|---|---|---|---|---|
| API gateway error rate (5xx) | Calabi API | > 1% | > 5% | Indicates service-level failures |
| CalabiIQ query latency P99 | CalabiIQ | > 10s | > 30s | Affects analyst experience |
| RDS CPU utilization | CloudWatch | > 70% | > 90% | May need vertical scaling |
| RDS free storage | CloudWatch | < 20 GB | < 5 GB | Provision more storage before breach |
| Redis memory utilization | CloudWatch | > 70% | > 90% | Evictions cause session/cache issues |
| Kubernetes node CPU | CloudWatch | > 70% | > 85% | Scale node group before throttling |
| Kubernetes node memory | CloudWatch | > 75% | > 90% | OOMKill risk above 90% |
| Pipeline failure rate (1h) | Calabi Pipelines | > 10% | > 25% | Likely upstream data issue or schema change |
| Calabi Connect sync failures | Calabi Connect | > 1 | > 3 | Source system or credential issue |
| AI Agent error rate | AI Agent | > 5% | > 15% | Check LLM API keys and quota |
| Local model pod memory | Kubernetes | > 80% | > 95% | Model too large for node |
| Audit log ingestion lag | CloudWatch | > 60s | > 300s | Log pipeline issue |
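Most of these thresholds can be encoded directly as CloudWatch alarms. The Terraform module described later on this page provisions the recommended set, but a single alarm can also be created ad hoc. A minimal sketch for the RDS free-storage critical threshold, assuming an existing SNS topic for notifications and your RDS instance identifier:

```bash
# Alarm when RDS free storage drops below 5 GB (the critical threshold above)
aws cloudwatch put-metric-alarm \
  --alarm-name "calabi-rds-free-storage-critical" \
  --namespace AWS/RDS \
  --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=<rds-instance-id> \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 5368709120 \
  --comparison-operator LessThanThreshold \
  --alarm-actions <sns-topic-arn>
```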
Setting Up Alerts
PagerDuty Integration
- Navigate to Admin → Monitoring → Alerts → + New Alert Channel.
- Select PagerDuty.
- Enter your PagerDuty Integration Key (from PagerDuty → Services → Integrations → Events API v2).
- Click Test to send a test event, then Save.
- Configure alert rules:
- Navigate to Alert Rules → + New Rule.
- Select the metric, threshold, and duration.
- Assign the PagerDuty channel as the notification target.
- Set severity: Warning triggers low-urgency PagerDuty; Critical triggers high-urgency PagerDuty.
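To confirm the integration key is valid outside of Calabi's Test button, you can send an event straight to the PagerDuty Events API v2. A minimal sketch; the fields shown are the standard Events API v2 payload format, not necessarily the exact payload Calabi emits:

```bash
curl -s -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "<your-integration-key>",
    "event_action": "trigger",
    "payload": {
      "summary": "Calabi alert channel test",
      "source": "calabi-monitoring",
      "severity": "warning"
    }
  }'
```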
Slack Integration
- Navigate to Admin → Monitoring → Alerts → + New Alert Channel.
- Select Slack.
- Enter the Slack Webhook URL (from Slack → Incoming Webhooks → Add to Slack).
- Choose the target channel (e.g., #platform-alerts).
- Configure which severity levels to send:
  - Warning: #platform-warnings
  - Critical: #platform-alerts (multi-channel alerts supported)
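As with PagerDuty, the webhook can be verified directly before wiring it to alert rules. A minimal sketch using the standard Slack incoming-webhook payload:

```bash
curl -s -X POST "<your-slack-webhook-url>" \
  -H "Content-Type: application/json" \
  -d '{"text": "Calabi alert channel test: warning notifications will be posted to #platform-warnings."}'
```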
Alert Rule Configuration Example
```yaml
# Configured via Admin UI or Helm values
alert_rules:
  - name: "Pipeline failure spike"
    metric: "calabi_pipelines_failure_rate_1h"
    condition: "> 0.1"          # 10% failure rate
    duration: "5m"
    severity: "warning"
    channels: ["slack-warnings"]
  - name: "RDS storage critical"
    metric: "aws_rds_free_storage_bytes"
    condition: "< 5368709120"   # 5 GB
    duration: "1m"
    severity: "critical"
    channels: ["pagerduty", "slack-alerts"]
  - name: "Calabi Pipelines scheduler offline"
    metric: "calabi_pipelines_scheduler_heartbeat_age_seconds"
    condition: "> 60"
    duration: "2m"
    severity: "critical"
    channels: ["pagerduty", "slack-alerts"]
```
CloudWatch Integration
Calabi exports all metrics and logs to AWS CloudWatch, enabling long-term retention, custom dashboards, and integration with your organization's existing AWS monitoring infrastructure.
What Gets Exported
| Category | CloudWatch Namespace | Retention |
|---|---|---|
| Application metrics | Calabi/<tenant-id> | Configurable (default: 15 months) |
| Kubernetes pod metrics | ContainerInsights | 15 months |
| RDS metrics | AWS/RDS | 15 months |
| Application logs | CloudWatch Logs: /calabi/<tenant-id>/app | 90 days (configurable) |
| Audit logs | CloudWatch Logs: /calabi/<tenant-id>/audit | 7 years (configurable) |
| Kubernetes logs | CloudWatch Logs: /calabi/<tenant-id>/k8s | 30 days |
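Once export is enabled, the log groups above can be followed directly with the AWS CLI (v2), which is useful when the Calabi UI itself is degraded. A minimal sketch, with the log group name taken from the table above:

```bash
# Live-tail application logs for the tenant; add --since 1h to backfill recent history
aws logs tail "/calabi/<tenant-id>/app" --follow --region us-east-1
```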
Configuring CloudWatch Export
In your Calabi Helm values (client/values.yaml):
```yaml
monitoring:
  cloudwatch:
    enabled: true
    region: "us-east-1"
    logRetentionDays: 90
    auditLogRetentionDays: 2555   # 7 years for compliance
    metrics:
      enabled: true
      namespace: "Calabi/prod"
    logs:
      enabled: true
      logGroupPrefix: "/calabi/prod"
```
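Applying the change requires a Helm upgrade of the Calabi release. A minimal sketch, assuming the release is named `calabi` and installed in the tenant namespace; substitute your actual release name, chart reference, and namespace:

```bash
# Roll the updated monitoring values out to the running deployment
helm upgrade calabi <calabi-chart> \
  -f client/values.yaml \
  -n calabi-tenant-<id>
```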
CloudWatch Alarms via Terraform
Calabi ships a Terraform module that provisions the recommended CloudWatch alarms:
```bash
cd calabi-infra/modules/cloudwatch-alarms
terraform apply \
  -var="tenant_id=my-company" \
  -var="pagerduty_sns_arn=arn:aws:sns:us-east-1:123456789:pd-critical" \
  -var="slack_sns_arn=arn:aws:sns:us-east-1:123456789:slack-warnings"
```
Related Pages
- Roles & Permissions — Who can access monitoring data
- Helm Configuration Reference — Configure monitoring settings in Helm
- Backup & Recovery — Monitor backup job health