Backup & Recovery

All Tiers

Calabi performs automated daily backups of all critical platform data. This page documents what is backed up, backup frequency and retention, how to restore from backup, RTO/RPO targets, and a disaster recovery checklist.


What Calabi Backs Up

Calabi's backup system protects all stateful components of the platform. The following table details every backed-up component and its importance.

| Component | What Is Backed Up | Storage Location | Criticality |
| --- | --- | --- | --- |
| Metadata Database (RDS) | All Calabi configuration, user accounts, roles, Calabi Catalogue asset metadata, CalabiIQ charts/dashboards, Calabi Automate workflows and credentials, AI Builder chatflows | S3 (encrypted) | Critical |
| Calabi Catalogue Search Index | Search index of all catalogued assets, tags, lineage, and quality results | S3 snapshot | High |
| Calabi Pipelines State | DAG definitions, connection configurations, variables, Calabi Pipelines database (run history, task instances, XComs) | S3 (encrypted) | High |
| Calabi ML Artifacts | Experiment metadata, model weights, plots, and data samples stored in the Calabi ML artifact store | S3 (versioned) | High |
| AI Builder Vector Stores | All pgvector embeddings and document chunks used by RAG agents | RDS snapshot + S3 | High |
| Kubernetes Persistent Volumes | Local model cache, Redis persistence files | EBS snapshots | Medium |
| Helm Configuration | Deployed Helm values files (per-tenant) | S3 (versioned) | Medium |
| Audit Logs | Complete user activity audit log (separate long-retention path) | CloudWatch Logs + S3 | Compliance |

What Is NOT Backed Up

| Component | Reason |
| --- | --- |
| Source data in your data warehouse | Your warehouse (Redshift, Snowflake, BigQuery, etc.) has its own backup mechanisms outside of Calabi |
| Raw files in user-managed S3 buckets | Not managed by Calabi |
| Calabi Connect sync state (in-flight records) | Connectors re-sync from source on restart; data is not lost |

Backup Frequency and Retention

Automated Backup Schedule

| Component | Schedule | Retention |
| --- | --- | --- |
| RDS (full snapshot) | Daily at 02:00 UTC | 30 daily, 12 weekly, 24 monthly |
| RDS (transaction logs) | Continuous (5-minute intervals) | 7 days (enables point-in-time recovery) |
| Search index snapshot | Daily at 02:30 UTC | 14 days |
| Calabi Pipelines DB | Daily at 02:00 UTC | 30 days |
| ML artifact store | Daily at 03:00 UTC | 90 days |
| EBS volumes | Daily at 04:00 UTC | 7 days |
| Helm configuration | On every Helm deployment | Unlimited (versioned S3) |
| Audit logs (CloudWatch → S3) | Continuous export | 7 years |
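
The continuous transaction-log stream is what enables point-in-time recovery. To check the PITR window currently available for your tenant, you can ask RDS for the latest restorable time; a quick sketch, assuming the instance identifier convention used in the restore procedures below:

aws rds describe-db-instances \
  --db-instance-identifier calabi-<tenant-id>-rds \
  --query 'DBInstances[0].LatestRestorableTime'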

Backup Storage

All backups are stored in a dedicated S3 bucket per tenant:

s3://calabi-backups-<tenant-id>/
├── rds/
│   ├── snapshots/
│   └── pitr/
├── search-index/
├── pipelines/
├── ml-artifacts/
├── ebs-snapshots/
└── helm/
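
You can browse this layout directly with the AWS CLI:

aws s3 ls s3://calabi-backups-<tenant-id>/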

All S3 backup objects are:

  • Encrypted at rest using AWS KMS (per-tenant key).
  • Cross-region replicated to a secondary AWS region (configurable).
  • Versioned — accidental overwrites do not cause data loss.
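
These bucket properties can be spot-checked from the AWS CLI; a minimal sketch, assuming the tenant bucket name shown above:

# Confirm versioning, default encryption, and cross-region replication
aws s3api get-bucket-versioning --bucket calabi-backups-<tenant-id>
aws s3api get-bucket-encryption --bucket calabi-backups-<tenant-id>
aws s3api get-bucket-replication --bucket calabi-backups-<tenant-id>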

Configuring Backup Settings

In your Helm values (client/values.yaml):

backup:
  enabled: true

  schedule:
    metadata: "0 2 * * *"        # 2:00 AM UTC
    pipelineState: "0 2 * * *"
    mlArtifacts: "0 3 * * *"
    ebsVolumes: "0 4 * * *"

  destination:
    s3:
      bucket: "calabi-backups-acme-corp"
      prefix: "backups/"
      region: "us-east-1"
      kmsKeyArn: "arn:aws:kms:us-east-1:123456789:key/abc-123"
      replicationRegion: "us-west-2"   # Cross-region replication target

  retention:
    dailyBackups: 30
    weeklyBackups: 12
    monthlyBackups: 24
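
Changes take effect on the next deployment. A hedged example of applying them (the release and chart names here are assumptions; substitute your own):

# Apply the updated backup settings to the tenant release
helm upgrade calabi calabi/calabi \
  -n calabi-tenant-<id> \
  -f client/values.yaml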

RTO and RPO Targets

| Scenario | Recovery Point Objective (RPO) | Recovery Time Objective (RTO) |
| --- | --- | --- |
| Full platform restore (entire tenant) | 24 hours (last daily backup) | 2–4 hours |
| Point-in-time recovery (RDS only) | 5 minutes (PITR window) | 1–2 hours |
| Single table or chart restore | 24 hours | 30 minutes |
| Calabi Pipelines state restore | 24 hours | 1 hour |
| ML artifact restore | 24 hours | 30 minutes |
| Audit log restore | Near-zero (continuous export) | 1 hour |
RTO Assumes Healthy Infrastructure

RTO values assume the Kubernetes cluster and AWS infrastructure are healthy. Restoring to a new cluster (true disaster recovery) requires an additional 30–90 minutes for infrastructure provisioning.


Restore Procedures

Restore the Metadata Database (RDS)

From a daily snapshot:

  1. Identify the snapshot to restore:

    aws rds describe-db-snapshots \
      --db-instance-identifier calabi-<tenant-id>-rds \
      --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
      --output table

  2. Restore the snapshot to a new RDS instance:

    aws rds restore-db-instance-from-db-snapshot \
      --db-instance-identifier calabi-<tenant-id>-restore \
      --db-snapshot-identifier <snapshot-id> \
      --db-instance-class db.r6g.xlarge \
      --no-publicly-accessible

  3. Once the new instance is available, update the Calabi Helm values to point to the restored instance:

    database:
      managedRds: false
      external:
        host: "calabi-<tenant-id>-restore.us-east-1.rds.amazonaws.com"

  4. Run helm upgrade with the updated values (a pre-upgrade connectivity check is sketched after these steps).
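
Before running the upgrade, it is worth confirming the restored instance accepts connections and contains the expected schema. A minimal sketch, assuming psql is available and that the database and user are both named calabi (substitute the names from your values file):

# List tables on the restored instance to confirm it is reachable and populated
psql -h calabi-<tenant-id>-restore.us-east-1.rds.amazonaws.com \
  -U calabi -d calabi -c '\dt'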

Point-in-time recovery:

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier calabi-<tenant-id>-rds \
  --target-db-instance-identifier calabi-<tenant-id>-pitr \
  --restore-time 2026-04-06T01:30:00Z   # the exact timestamp to restore to
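
The PITR restore runs asynchronously. To block until the new instance is ready before repointing Calabi at it, you can use the AWS CLI waiter:

aws rds wait db-instance-available \
  --db-instance-identifier calabi-<tenant-id>-pitr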

Restore Calabi Catalogue Search Index

  1. List available search index snapshots:

    curl -X GET \
      "https://<search-endpoint>/_snapshot/calabi-s3-repo/_all" \
      -H "Content-Type: application/json"

  2. Restore the snapshot:

    curl -X POST \
      "https://<search-endpoint>/_snapshot/calabi-s3-repo/<snapshot-name>/_restore" \
      -H "Content-Type: application/json" \
      -d '{
        "indices": "calabi_catalogue",
        "ignore_unavailable": true,
        "include_global_state": false
      }'

  3. Monitor restore progress (a per-shard alternative is sketched after these steps):

    curl "https://<search-endpoint>/_cat/recovery?v"

Restore Calabi Pipelines State

  1. Stop the Calabi Pipelines scheduler to prevent new run attempts:

    kubectl scale deploy/calabi-pipelines-scheduler \
      -n calabi-tenant-<id> --replicas=0

  2. Restore the Calabi Pipelines database from the S3 backup:

    # Download the latest backup
    aws s3 cp \
      s3://calabi-backups-<tenant-id>/pipelines/$(date +%Y-%m-%d)/pipelines_db.sql.gz \
      /tmp/pipelines_db.sql.gz

    # Restore to the database
    gunzip -c /tmp/pipelines_db.sql.gz | \
      psql -h <rds-host> -U calabi -d calabi_pipelines

  3. Restart the scheduler (a health-check sketch follows these steps):

    kubectl scale deploy/calabi-pipelines-scheduler \
      -n calabi-tenant-<id> --replicas=2
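
After scaling back up, confirm the scheduler is healthy before expecting new task runs; a quick check, assuming the deployment name used above:

# Wait for the scheduler rollout to complete, then inspect recent logs
kubectl rollout status deploy/calabi-pipelines-scheduler \
  -n calabi-tenant-<id>
kubectl logs deploy/calabi-pipelines-scheduler \
  -n calabi-tenant-<id> --tail=50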

Restore ML Artifacts

ML artifacts are stored in a versioned S3 path. Restoring a specific experiment:

# List available artifact backups
aws s3 ls s3://calabi-backups-<tenant-id>/ml-artifacts/

# Sync artifacts from the backup date
aws s3 sync \
  s3://calabi-backups-<tenant-id>/ml-artifacts/2026-04-05/ \
  s3://calabi-<tenant-id>-mlartifacts/
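
Because aws s3 sync overwrites matching objects at the destination, consider a dry run first; the --dryrun flag prints the operations without performing them:

# Preview the restore without copying anything
aws s3 sync --dryrun \
  s3://calabi-backups-<tenant-id>/ml-artifacts/2026-04-05/ \
  s3://calabi-<tenant-id>-mlartifacts/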

For individual run artifact recovery, use the Calabi ML UI:

  1. Navigate to Calabi ML → Experiments.
  2. Select the experiment → the run.
  3. The artifact list reflects the current S3 state. If artifacts are missing, restore the specific S3 prefix.

Verifying Backups

Backup health is reported in the Calabi monitoring dashboard:

  1. Navigate to Admin → Monitoring → Backups.
  2. The dashboard shows the last successful backup time and size for each component.
  3. A failed backup is highlighted in red and generates an alert (if configured).

Manual Backup Verification

Run a verification test quarterly to confirm backups are restorable:

# Trigger a test restore of the latest RDS snapshot to a temporary instance
kubectl exec -n calabi-tenant-<id> job/calabi-backup-verify -- \
  python verify_restore.py --component rds --date $(date +%Y-%m-%d)

The verification job:

  1. Restores the snapshot to a temporary RDS instance.
  2. Runs a set of read queries against the restored database.
  3. Compares row counts with the production database.
  4. Destroys the temporary instance.
  5. Reports success/failure to the monitoring dashboard and sends a Slack notification.

Disaster Recovery Checklist

Use this checklist when executing a full disaster recovery from a catastrophic failure:

  • Assess the failure — determine which components are affected and the last known-good state.
  • Identify the recovery point — choose the appropriate backup date/time. For PITR, identify the exact timestamp.
  • Provision infrastructure (if needed) — provision a new Kubernetes cluster and RDS/ElastiCache instances using the Calabi Terraform modules.
  • Restore metadata database — RDS snapshot or PITR restore (procedure above).
  • Restore Calabi Pipelines state — from S3 backup (procedure above).
  • Restore search index — from snapshot (procedure above).
  • Restore ML artifacts — S3 sync from backup (procedure above).
  • Update Helm values — point to restored database instances; verify all settings.
  • Deploy Calabi — run helm upgrade with updated values.
  • Verify all services — check the Health Dashboard; all services should show green.
  • Smoke test key workflows — verify CalabiIQ loads, Calabi Catalogue search works, Calabi Pipelines scheduler is active (a sketch follows this checklist).
  • Notify stakeholders — communicate the recovery status and any data loss window (RPO impact).
  • Post-incident review — document what happened, root cause, and preventive measures.
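
A minimal smoke-test sketch for the verification steps above; the health endpoint path and hostname are assumptions, so substitute your tenant's actual endpoints:

# All pods running in the tenant namespace?
kubectl get pods -n calabi-tenant-<id>

# Scheduler deployment present and ready?
kubectl get deploy/calabi-pipelines-scheduler -n calabi-tenant-<id>

# Platform responding? (health endpoint path is an assumption)
curl -fsS https://calabi.<your-domain>/health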