Backup & Recovery
Calabi performs automated daily backups of all critical platform data. This page documents what is backed up, backup frequency and retention, how to restore from backup, RTO/RPO targets, and a disaster recovery checklist.
What Calabi Backs Up
Calabi's backup system protects all stateful components of the platform. The following table details every backed-up component and its importance.
| Component | What Is Backed Up | Storage Location | Criticality |
|---|---|---|---|
| Metadata Database (RDS) | All Calabi configuration, user accounts, roles, Calabi Catalogue asset metadata, CalabiIQ charts/dashboards, Calabi Automate workflows and credentials, AI Builder chatflows | S3 (encrypted) | Critical |
| Calabi Catalogue Search Index | Search index of all catalogued assets, tags, lineage, and quality results | S3 snapshot | High |
| Calabi Pipelines State | DAG definitions, connection configurations, variables, Calabi Pipelines database (run history, task instances, XComs) | S3 (encrypted) | High |
| Calabi ML Artifacts | Experiment metadata, model weights, plots, and data samples stored in the Calabi ML artifact store | S3 (versioned) | High |
| AI Builder Vector Stores | All pgvector embeddings and document chunks used by RAG agents | RDS snapshot + S3 | High |
| Kubernetes Persistent Volumes | Local model cache, Redis persistence files | EBS snapshots | Medium |
| Helm Configuration | Deployed Helm values files (per-tenant) | S3 (versioned) | Medium |
| Audit Logs | Complete user activity audit log (separate long-retention path) | CloudWatch Logs + S3 | Compliance |
What Is NOT Backed Up
| Component | Reason |
|---|---|
| Source data in your data warehouse | Your warehouse (Redshift, Snowflake, BigQuery, etc.) has its own backup mechanisms outside of Calabi |
| Raw files in user-managed S3 buckets | Not managed by Calabi |
| Calabi Connect sync state (in-flight records) | Connectors re-sync from source on restart; data is not lost |
Backup Frequency and Retention
Automated Backup Schedule
| Component | Schedule | Retention |
|---|---|---|
| RDS (full snapshot) | Daily at 02:00 UTC | 30 daily, 12 weekly, 24 monthly |
| RDS (transaction logs) | Continuous (5-minute intervals) | 7 days (enables point-in-time recovery) |
| Search index snapshot | Daily at 02:30 UTC | 14 days |
| Calabi Pipelines DB | Daily at 02:00 UTC | 30 days |
| ML artifact store | Daily at 03:00 UTC | 90 days |
| EBS volumes | Daily at 04:00 UTC | 7 days |
| Helm configuration | On every Helm deployment | Unlimited (versioned S3) |
| Audit logs (CloudWatch → S3) | Continuous export | 7 years |
Backup Storage
All backups are stored in a dedicated S3 bucket per tenant:
s3://calabi-backups-<tenant-id>/
├── rds/
│   ├── snapshots/
│   └── pitr/
├── search-index/
├── pipelines/
├── ml-artifacts/
├── ebs-snapshots/
└── helm/
All S3 backup objects are:
- Encrypted at rest using AWS KMS (per-tenant key).
- Cross-region replicated to a secondary AWS region (configurable).
- Versioned — accidental overwrites do not cause data loss.
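If you want to confirm these protections on your tenant's backup bucket, you can inspect the bucket configuration directly with the AWS CLI; this is a read-only check and uses the per-tenant bucket name pattern shown above:

```bash
# Inspect encryption, versioning, and replication settings on the backup bucket
aws s3api get-bucket-encryption --bucket calabi-backups-<tenant-id>
aws s3api get-bucket-versioning --bucket calabi-backups-<tenant-id>
aws s3api get-bucket-replication --bucket calabi-backups-<tenant-id>
```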
Configuring Backup Settings
In your Helm values (client/values.yaml):
backup:
  enabled: true
  schedule:
    metadata: "0 2 * * *"        # 2:00 AM UTC
    pipelineState: "0 2 * * *"
    mlArtifacts: "0 3 * * *"
    ebsVolumes: "0 4 * * *"
  destination:
    s3:
      bucket: "calabi-backups-acme-corp"
      prefix: "backups/"
      region: "us-east-1"
      kmsKeyArn: "arn:aws:kms:us-east-1:123456789:key/abc-123"
      replicationRegion: "us-west-2"   # Cross-region replication target
  retention:
    dailyBackups: 30
    weeklyBackups: 12
    monthlyBackups: 24
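After editing the values file, apply the change with a standard Helm upgrade. The release, chart, and namespace names below are placeholders; substitute the names used by your deployment:

```bash
# Apply the updated backup settings (release/chart/namespace names are placeholders)
helm upgrade calabi calabi/calabi-platform \
  -n calabi-tenant-<id> \
  -f client/values.yaml
```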
RTO and RPO Targets
| Scenario | Recovery Point Objective (RPO) | Recovery Time Objective (RTO) |
|---|---|---|
| Full platform restore (entire tenant) | 24 hours (last daily backup) | 2–4 hours |
| Point-in-time recovery (RDS only) | 5 minutes (PITR window) | 1–2 hours |
| Single table or chart restore | 24 hours | 30 minutes |
| Calabi Pipelines state restore | 24 hours | 1 hour |
| ML artifact restore | 24 hours | 30 minutes |
| Audit log restore | Near-zero (continuous export) | 1 hour |
RTO values assume the Kubernetes cluster and AWS infrastructure are healthy. Restoring to a new cluster (true disaster recovery) requires an additional 30–90 minutes for infrastructure provisioning.
Restore Procedures
Restore the Metadata Database (RDS)
From a daily snapshot:
- Identify the snapshot to restore:

aws rds describe-db-snapshots \
  --db-instance-identifier calabi-<tenant-id>-rds \
  --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
  --output table

- Restore the snapshot to a new RDS instance:

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier calabi-<tenant-id>-restore \
  --db-snapshot-identifier <snapshot-id> \
  --db-instance-class db.r6g.xlarge \
  --no-publicly-accessible

- Once the new instance is available, update the Calabi Helm values to point to the restored instance (see the waiter sketch after this list if you want to block until it is ready):

database:
  managedRds: false
  external:
    host: "calabi-<tenant-id>-restore.us-east-1.rds.amazonaws.com"

- Run helm upgrade with the updated values.
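Before switching the Helm values in the third step, you can optionally block until the restored instance is ready by using the AWS CLI waiter; the instance identifier matches the restore command above:

```bash
# Wait until the restored RDS instance reports "available"
aws rds wait db-instance-available \
  --db-instance-identifier calabi-<tenant-id>-restore
```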
Point-in-time recovery:
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier calabi-<tenant-id>-rds \
  --target-db-instance-identifier calabi-<tenant-id>-pitr \
  --restore-time 2026-04-06T01:30:00Z   # The exact timestamp to restore to
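Point-in-time recovery is only possible within the 7-day PITR window, up to the instance's latest restorable time. You can check that timestamp before choosing a restore time:

```bash
# Show the most recent point in time the instance can be restored to
aws rds describe-db-instances \
  --db-instance-identifier calabi-<tenant-id>-rds \
  --query 'DBInstances[0].LatestRestorableTime'
```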
Restore Calabi Catalogue Search Index
- List available search index snapshots:

curl -X GET \
  "https://<search-endpoint>/_snapshot/calabi-s3-repo/_all" \
  -H "Content-Type: application/json"

- Restore the snapshot (if the live index still exists, see the note after this list):

curl -X POST \
  "https://<search-endpoint>/_snapshot/calabi-s3-repo/<snapshot-name>/_restore" \
  -H "Content-Type: application/json" \
  -d '{
    "indices": "calabi_catalogue",
    "ignore_unavailable": true,
    "include_global_state": false
  }'

- Monitor restore progress:

curl "https://<search-endpoint>/_cat/recovery?v"
Restore Calabi Pipelines State
- Stop the Calabi Pipelines scheduler to prevent new run attempts:
kubectl scale deploy/calabi-pipelines-scheduler \
  -n calabi-tenant-<id> --replicas=0

- Restore the Calabi Pipelines database from the S3 backup:

# Download the latest backup
aws s3 cp \
  s3://calabi-backups-<tenant-id>/pipelines/$(date +%Y-%m-%d)/pipelines_db.sql.gz \
  /tmp/pipelines_db.sql.gz

# Restore to the database
gunzip -c /tmp/pipelines_db.sql.gz | \
  psql -h <rds-host> -U calabi -d calabi_pipelines

- Restart the scheduler:

kubectl scale deploy/calabi-pipelines-scheduler \
  -n calabi-tenant-<id> --replicas=2
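Once the scheduler is back up, it is worth confirming that run history was actually restored. A quick spot check against the restored database might look like the following; the dag_run table name assumes an Airflow-style schema, so adjust it to match your actual Calabi Pipelines schema:

```bash
# Count restored pipeline runs as a sanity check (table name is an assumption)
psql -h <rds-host> -U calabi -d calabi_pipelines \
  -c "SELECT count(*) FROM dag_run;"
```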
Restore ML Artifacts
ML artifacts are stored in a versioned S3 path. To restore the full artifact store from a backup date:
# List available artifact backups
aws s3 ls s3://calabi-backups-<tenant-id>/ml-artifacts/
# Sync artifacts from the backup date
aws s3 sync \
  s3://calabi-backups-<tenant-id>/ml-artifacts/2026-04-05/ \
  s3://calabi-<tenant-id>-mlartifacts/
For individual run artifact recovery, use the Calabi ML UI:
- Navigate to Calabi ML → Experiments.
- Select the experiment → the run.
- The artifact list reflects the current S3 state. If artifacts are missing, restore the specific S3 prefix.
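For example, to bring back a single run's artifacts rather than the whole store, sync just that run's prefix from the backup. The experiment-id/run-id layout shown here is illustrative; match the prefixes you actually see in the backup bucket:

```bash
# Restore one run's artifacts from the backup (path layout is illustrative)
aws s3 sync \
  s3://calabi-backups-<tenant-id>/ml-artifacts/2026-04-05/<experiment-id>/<run-id>/ \
  s3://calabi-<tenant-id>-mlartifacts/<experiment-id>/<run-id>/
```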
Verifying Backups
Backup health is reported in the Calabi monitoring dashboard:
- Navigate to Admin → Monitoring → Backups.
- The dashboard shows the last successful backup time and size for each component.
- A failed backup is highlighted in red and generates an alert (if configured).
Manual Backup Verification
Run a verification test quarterly to confirm backups are restorable:
# Trigger a test restore of the latest RDS snapshot to a temporary instance
kubectl exec -n calabi-tenant-<id> job/calabi-backup-verify -- \
  python verify_restore.py --component rds --date $(date +%Y-%m-%d)
The verification job:
- Restores the snapshot to a temporary RDS instance.
- Runs a set of read queries against the restored database.
- Compares row counts with the production database.
- Destroys the temporary instance.
- Reports success/failure to the monitoring dashboard and sends a Slack notification.
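If you want to spot-check a restore by hand between quarterly runs, a row-count comparison in the same spirit can be done with psql. The host and database placeholders and the dashboards table name below are illustrative, not the verification job's actual queries:

```bash
# Compare row counts for one table between production and the restored instance
PROD=$(psql -h <prod-rds-host> -U calabi -d <metadata-db> -tAc "SELECT count(*) FROM dashboards;")
RESTORED=$(psql -h <restore-rds-host> -U calabi -d <metadata-db> -tAc "SELECT count(*) FROM dashboards;")
if [ "$PROD" = "$RESTORED" ]; then
  echo "Row counts match: $PROD"
else
  echo "MISMATCH: production=$PROD restored=$RESTORED"
fi
```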
Disaster Recovery Checklist
Use this checklist when executing a full disaster recovery from a catastrophic failure:
- Assess the failure — determine which components are affected and the last known-good state.
- Identify the recovery point — choose the appropriate backup date/time. For PITR, identify the exact timestamp.
- Provision infrastructure (if needed) — provision a new Kubernetes cluster and RDS/ElastiCache instances using the Calabi Terraform modules.
- Restore metadata database — RDS snapshot or PITR restore (procedure above).
- Restore Calabi Pipelines state — from S3 backup (procedure above).
- Restore search index — from snapshot (procedure above).
- Restore ML artifacts — S3 sync from backup (procedure above).
- Update Helm values — point to restored database instances; verify all settings.
- Deploy Calabi — run helm upgrade with updated values.
- Verify all services — check the Health Dashboard; all services should show green.
- Smoke test key workflows — verify CalabiIQ loads, Calabi Catalogue search works, Calabi Pipelines scheduler is active (a command-line spot check follows this checklist).
- Notify stakeholders — communicate the recovery status and any data loss window (RPO impact).
- Post-incident review — document what happened, root cause, and preventive measures.
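For the service verification and smoke-test steps, a minimal command-line spot check might look like the following; the namespace and deployment names follow the patterns used earlier on this page, and the UI checks (CalabiIQ, Calabi Catalogue search) still need to be done in the browser:

```bash
# Any pod not in the Running phase indicates a service that has not recovered
kubectl get pods -n calabi-tenant-<id> --field-selector=status.phase!=Running

# Confirm the Calabi Pipelines scheduler is scaled back up
kubectl get deploy calabi-pipelines-scheduler -n calabi-tenant-<id>
```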
Related Pages
- Helm Configuration Reference — Configure backup settings in Helm
- Platform Monitoring — Monitor backup job health and set up backup failure alerts
- Multi-Tenancy — Per-tenant backup isolation