Backup & Recovery

All Tiers

Calabi performs automated daily backups of all critical platform data. This page documents what is backed up, backup frequency and retention, how to restore from backup, RTO/RPO targets, and a disaster recovery checklist.


What Calabi Backs Up

Calabi's backup system protects all stateful components of the platform. The following table details every backed-up component and its importance.

| Component | What Is Backed Up | Storage Location | Criticality |
| --- | --- | --- | --- |
| Metadata Database (RDS) | All Calabi configuration, user accounts, roles, Calabi Catalogue asset metadata, CalabiIQ charts/dashboards, Calabi Automate workflows and credentials, AI Builder chatflows | S3 (encrypted) | Critical |
| Calabi Catalogue Search Index | Search index of all catalogued assets, tags, lineage, and quality results | S3 snapshot | High |
| Calabi Pipelines State | DAG definitions, connection configurations, variables, Calabi Pipelines database (run history, task instances, XComs) | S3 (encrypted) | High |
| Calabi ML Artifacts | Experiment metadata, model weights, plots, and data samples stored in the Calabi ML artifact store | S3 (versioned) | High |
| AI Builder Vector Stores | All pgvector embeddings and document chunks used by RAG agents | RDS snapshot + S3 | High |
| Kubernetes Persistent Volumes | Local model cache, Redis persistence files | EBS snapshots | Medium |
| Helm Configuration | Deployed Helm values files (per-tenant) | S3 (versioned) | Medium |
| Audit Logs | Complete user activity audit log (separate long-retention path) | CloudWatch Logs + S3 | Compliance |

What Is NOT Backed Up

| Component | Reason |
| --- | --- |
| Source data in your data warehouse | Your warehouse (Redshift, Snowflake, BigQuery, etc.) has its own backup mechanisms outside of Calabi |
| Raw files in user-managed S3 buckets | Not managed by Calabi |
| Calabi Connect sync state (in-flight records) | Connectors re-sync from source on restart; data is not lost |

Backup Frequency and Retention

Automated Backup Schedule

| Component | Schedule | Retention |
| --- | --- | --- |
| RDS (full snapshot) | Daily at 02:00 UTC | 30 daily, 12 weekly, 24 monthly |
| RDS (transaction logs) | Continuous (5-minute intervals) | 7 days (enables point-in-time recovery) |
| Search index snapshot | Daily at 02:30 UTC | 14 days |
| Calabi Pipelines DB | Daily at 02:00 UTC | 30 days |
| ML artifact store | Daily at 03:00 UTC | 90 days |
| EBS volumes | Daily at 04:00 UTC | 7 days |
| Helm configuration | On every Helm deployment | Unlimited (versioned S3) |
| Audit logs (CloudWatch → S3) | Continuous export | 7 years |
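
The continuous transaction-log stream is what enables point-in-time recovery. To check the PITR window currently available for your tenant, you can ask RDS for the latest restorable time; a quick sketch, assuming the instance identifier convention used in the restore procedures below:

aws rds describe-db-instances \
  --db-instance-identifier calabi-<tenant-id>-rds \
  --query 'DBInstances[0].LatestRestorableTime'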

Backup Storage

All backups are stored in a dedicated S3 bucket per tenant:

s3://calabi-backups-<tenant-id>/
├── rds/
│   ├── snapshots/
│   └── pitr/
├── search-index/
├── pipelines/
├── ml-artifacts/
├── ebs-snapshots/
└── helm/
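
You can browse this layout directly with the AWS CLI:

aws s3 ls s3://calabi-backups-<tenant-id>/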

All S3 backup objects are:

  • Encrypted at rest using AWS KMS (per-tenant key).
  • Cross-region replicated to a secondary AWS region (configurable).
  • Versioned — accidental overwrites do not cause data loss.
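
These bucket properties can be spot-checked from the AWS CLI; a minimal sketch, assuming the tenant bucket name shown above:

# Confirm versioning, default encryption, and cross-region replication
aws s3api get-bucket-versioning --bucket calabi-backups-<tenant-id>
aws s3api get-bucket-encryption --bucket calabi-backups-<tenant-id>
aws s3api get-bucket-replication --bucket calabi-backups-<tenant-id>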

Configuring Backup Settings

In your Helm values (client/values.yaml):

backup:
  enabled: true

  schedule:
    metadata: "0 2 * * *"        # 2:00 AM UTC
    pipelineState: "0 2 * * *"
    mlArtifacts: "0 3 * * *"
    ebsVolumes: "0 4 * * *"

  destination:
    s3:
      bucket: "calabi-backups-acme-corp"
      prefix: "backups/"
      region: "us-east-1"
      kmsKeyArn: "arn:aws:kms:us-east-1:123456789:key/abc-123"
      replicationRegion: "us-west-2"   # Cross-region replication target

  retention:
    dailyBackups: 30
    weeklyBackups: 12
    monthlyBackups: 24
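
Changes take effect on the next deployment. A hedged example of applying them (the release and chart names here are assumptions; substitute your own):

# Apply the updated backup settings to the tenant release
helm upgrade calabi calabi/calabi \
  -n calabi-tenant-<id> \
  -f client/values.yaml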

RTO and RPO Targets

| Scenario | Recovery Point Objective (RPO) | Recovery Time Objective (RTO) |
| --- | --- | --- |
| Full platform restore (entire tenant) | 24 hours (last daily backup) | 2–4 hours |
| Point-in-time recovery (RDS only) | 5 minutes (PITR window) | 1–2 hours |
| Single table or chart restore | 24 hours | 30 minutes |
| Calabi Pipelines state restore | 24 hours | 1 hour |
| ML artifact restore | 24 hours | 30 minutes |
| Audit log restore | Near-zero (continuous export) | 1 hour |
RTO Assumes Healthy Infrastructure

RTO values assume the Kubernetes cluster and AWS infrastructure are healthy. Restoring to a new cluster (true disaster recovery) requires an additional 30–90 minutes for infrastructure provisioning.


Restore Procedures

Restore the Metadata Database (RDS)

From a daily snapshot:

  1. Identify the snapshot to restore:

    aws rds describe-db-snapshots \
      --db-instance-identifier calabi-<tenant-id>-rds \
      --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
      --output table

  2. Restore the snapshot to a new RDS instance:

    aws rds restore-db-instance-from-db-snapshot \
      --db-instance-identifier calabi-<tenant-id>-restore \
      --db-snapshot-identifier <snapshot-id> \
      --db-instance-class db.r6g.xlarge \
      --no-publicly-accessible

  3. Once the new instance is available, update the Calabi Helm values to point to the restored instance:

    database:
      managedRds: false
      external:
        host: "calabi-<tenant-id>-restore.us-east-1.rds.amazonaws.com"

  4. Run helm upgrade with the updated values (a pre-upgrade connectivity check is sketched after these steps).
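
Before running the upgrade, it is worth confirming the restored instance accepts connections and contains the expected schema. A minimal sketch, assuming psql is available and that the database and user are both named calabi (substitute the names from your values file):

# List tables on the restored instance to confirm it is reachable and populated
psql -h calabi-<tenant-id>-restore.us-east-1.rds.amazonaws.com \
  -U calabi -d calabi -c '\dt'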

Point-in-time recovery:

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier calabi-<tenant-id>-rds \
  --target-db-instance-identifier calabi-<tenant-id>-pitr \
  --restore-time 2026-04-06T01:30:00Z   # the exact timestamp to restore to
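
The PITR restore runs asynchronously. To block until the new instance is ready before repointing Calabi at it, you can use the AWS CLI waiter:

aws rds wait db-instance-available \
  --db-instance-identifier calabi-<tenant-id>-pitr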

Restore Calabi Catalogue Search Index

  1. List available search index snapshots:

    curl -X GET \
      "https://<search-endpoint>/_snapshot/calabi-s3-repo/_all" \
      -H "Content-Type: application/json"

  2. Restore the snapshot:

    curl -X POST \
      "https://<search-endpoint>/_snapshot/calabi-s3-repo/<snapshot-name>/_restore" \
      -H "Content-Type: application/json" \
      -d '{
        "indices": "calabi_catalogue",
        "ignore_unavailable": true,
        "include_global_state": false
      }'

  3. Monitor restore progress (a per-shard alternative is sketched after these steps):

    curl "https://<search-endpoint>/_cat/recovery?v"

Restore Calabi Pipelines State

  1. Stop the Calabi Pipelines scheduler to prevent new run attempts:

    kubectl scale deploy/calabi-pipelines-scheduler \
      -n calabi-tenant-<id> --replicas=0

  2. Restore the Calabi Pipelines database from the S3 backup:

    # Download the latest backup
    aws s3 cp \
      s3://calabi-backups-<tenant-id>/pipelines/$(date +%Y-%m-%d)/pipelines_db.sql.gz \
      /tmp/pipelines_db.sql.gz

    # Restore to the database
    gunzip -c /tmp/pipelines_db.sql.gz | \
      psql -h <rds-host> -U calabi -d calabi_pipelines

  3. Restart the scheduler (a health-check sketch follows these steps):

    kubectl scale deploy/calabi-pipelines-scheduler \
      -n calabi-tenant-<id> --replicas=2
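
After scaling back up, confirm the scheduler is healthy before expecting new task runs; a quick check, assuming the deployment name used above:

# Wait for the scheduler rollout to complete, then inspect recent logs
kubectl rollout status deploy/calabi-pipelines-scheduler \
  -n calabi-tenant-<id>
kubectl logs deploy/calabi-pipelines-scheduler \
  -n calabi-tenant-<id> --tail=50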

Restore ML Artifacts

ML artifacts are stored in a versioned S3 path. Restoring a specific experiment:

# List available artifact backups
aws s3 ls s3://calabi-backups-<tenant-id>/ml-artifacts/

# Sync artifacts from the backup date
aws s3 sync \
  s3://calabi-backups-<tenant-id>/ml-artifacts/2026-04-05/ \
  s3://calabi-<tenant-id>-mlartifacts/
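
Because aws s3 sync overwrites matching objects at the destination, consider a dry run first; the --dryrun flag prints the operations without performing them:

# Preview the restore without copying anything
aws s3 sync --dryrun \
  s3://calabi-backups-<tenant-id>/ml-artifacts/2026-04-05/ \
  s3://calabi-<tenant-id>-mlartifacts/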

For individual run artifact recovery, use the Calabi ML UI:

  1. Navigate to Calabi ML → Experiments.
  2. Select the experiment → the run.
  3. The artifact list reflects the current S3 state. If artifacts are missing, restore the specific S3 prefix.

Verifying Backups

Backup health is reported in the Calabi monitoring dashboard:

  1. Navigate to Admin → Monitoring → Backups.
  2. The dashboard shows the last successful backup time and size for each component.
  3. A failed backup is highlighted in red and generates an alert (if configured).

Manual Backup Verification

Run a verification test quarterly to confirm backups are restorable:

# Trigger a test restore of the latest RDS snapshot to a temporary instance
kubectl exec -n calabi-tenant-<id> job/calabi-backup-verify -- \
  python verify_restore.py --component rds --date $(date +%Y-%m-%d)

The verification job:

  1. Restores the snapshot to a temporary RDS instance.
  2. Runs a set of read queries against the restored database.
  3. Compares row counts with the production database.
  4. Destroys the temporary instance.
  5. Reports success/failure to the monitoring dashboard and sends a Slack notification.

Disaster Recovery Checklist

Use this checklist when executing a full disaster recovery from a catastrophic failure:

  • Assess the failure — determine which components are affected and the last known-good state.
  • Identify the recovery point — choose the appropriate backup date/time. For PITR, identify the exact timestamp.
  • Provision infrastructure (if needed) — provision a new Kubernetes cluster and RDS/ElastiCache instances using the Calabi Terraform modules.
  • Restore metadata database — RDS snapshot or PITR restore (procedure above).
  • Restore Calabi Pipelines state — from S3 backup (procedure above).
  • Restore search index — from snapshot (procedure above).
  • Restore ML artifacts — S3 sync from backup (procedure above).
  • Update Helm values — point to restored database instances; verify all settings.
  • Deploy Calabi — run helm upgrade with updated values.
  • Verify all services — check the Health Dashboard; all services should show green.
  • Smoke test key workflows — verify CalabiIQ loads, Calabi Catalogue search works, Calabi Pipelines scheduler is active (a sketch follows this checklist).
  • Notify stakeholders — communicate the recovery status and any data loss window (RPO impact).
  • Post-incident review — document what happened, root cause, and preventive measures.
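
A minimal smoke-test sketch for the verification steps above; the health endpoint path and hostname are assumptions, so substitute your tenant's actual endpoints:

# All pods running in the tenant namespace?
kubectl get pods -n calabi-tenant-<id>

# Scheduler deployment present and ready?
kubectl get deploy/calabi-pipelines-scheduler -n calabi-tenant-<id>

# Platform responding? (health endpoint path is an assumption)
curl -fsS https://calabi.<your-domain>/health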