
Data Management Overview

All Tiers

Data management in Calabi covers the complete journey of data — from the moment it arrives from an external source, through transformation and quality validation, to its final governed state as a trusted, documented asset available to analysts and AI.

Every stage of this lifecycle is handled by a purpose-built Calabi component, and each component is designed to hand off cleanly to the next. Lineage, quality signals, and ownership metadata flow automatically from stage to stage — you never have to wire them up manually.


The Calabi Data Lifecycle

External Sources
  • Databases: Postgres · MySQL · Oracle · SQL Server
  • SaaS APIs: Salesforce · HubSpot · Stripe · Zendesk
  • Files & Object Storage: CSV · Parquet · JSON · Google Sheets
  • Event Streams: Kinesis · Kafka · SNS · EventBridge

↓ Sync

Calabi Connect (Ingestion Layer)
  • 90+ Connectors
  • Batch Sync
  • Change Data Capture
  • Real-time Streaming
  • Schema Evolution

↓ Load

Raw Storage (Bronze Layer)
  • S3 Data Lake: bronze/ prefix · Parquet · partitioned
  • Data Warehouse: Redshift · Snowflake · raw_* schemas

↓ Transform

Calabi Transform (Silver & Gold Layers)
  • SQL Models: staging → marts
  • Column-level Tests: not_null · unique · referential
  • Lineage Tracking: auto-generated from dbt graph
  • Incremental Builds: fast re-runs on changed data

↓ Orchestrate

Calabi Pipelines (Orchestration)
  • DAG Scheduling: cron · event · manual trigger
  • Dependency Management: task dependencies · SLAs
  • Monitoring & Alerting: Slack · PagerDuty · email
  • Retry & Backfill: configurable retry policies

↓ Validate

Data Quality (Tests · Profiling · Anomaly Detection)
  • Automated Test Suites: column · row · referential checks
  • Data Profiling: distributions · null rates · cardinality
  • Anomaly Detection: ML-based freshness & volume alerts
  • Quality Score: per-table pass rate tracked over time

↓ Catalogue

Calabi Catalogue (Governance & Discovery)
  • Asset Discovery: tables · dashboards · ML models · topics
  • Lineage Graph: column-level end-to-end provenance
  • Business Glossary: terms → assets · approval workflow
  • Classification: PII · Sensitive · Certified · Golden
  • Data Products: domain-scoped curated collections

End-to-end data lifecycle — every stage hands off lineage, quality signals, and ownership automatically

Data Ingestion — Calabi Connect

Starter · Professional · Enterprise

Calabi Connect is the ingestion layer of the platform. It connects to over 90 data sources — relational databases, cloud APIs, SaaS tools, files, and event streams — and loads data into your S3 data lake or warehouse.

Key capabilities

  • 90+ pre-built connectors — covers the most common enterprise data sources without custom code
  • Batch sync — scheduled full or incremental syncs on any cadence (hourly, daily, custom cron)
  • Change Data Capture (CDC) — real-time or near-real-time replication from databases that support log-based CDC (Postgres, MySQL, SQL Server, Oracle)
  • Schema evolution — automatically detects and propagates upstream schema changes without breaking pipelines
  • Per-connector configuration — normalisation, column selection, sync frequency, and destination schema all configurable per source
  • Data residency — all synced data lands in your AWS account; nothing passes through Calabi-operated infrastructure
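The core idea behind an incremental batch sync can be sketched in plain Python: each run loads only rows whose update timestamp is newer than a stored cursor, then advances the cursor. This is a conceptual illustration, not the Calabi Connect implementation; the sample rows and the `updated_at` field name are hypothetical.

```python
from datetime import datetime

# Hypothetical source rows; in practice these come from the connector's source query.
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 9)},
]

def incremental_sync(rows, cursor):
    """Return rows changed since `cursor`, plus the advanced cursor value."""
    new_rows = [r for r in rows if cursor is None or r["updated_at"] > cursor]
    new_cursor = max((r["updated_at"] for r in new_rows), default=cursor)
    return new_rows, new_cursor

# First run: no cursor yet, so everything is synced (a full sync).
batch, cursor = incremental_sync(rows, None)

# Second run: nothing has changed since the cursor, so the batch is empty.
batch2, _ = incremental_sync(rows, cursor)
```

A log-based CDC connector replaces the timestamp cursor with a position in the database's write-ahead log, which also captures deletes that a timestamp-based sync would miss.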

Connector categories

  • Relational databases: Postgres, MySQL, MS SQL Server, Oracle, Snowflake, Redshift
  • NoSQL & document stores: DynamoDB, MongoDB, Firestore
  • SaaS CRM & marketing: Salesforce, HubSpot, Marketo, Intercom
  • SaaS finance: Stripe, QuickBooks, NetSuite, Chargebee
  • Productivity & files: Google Sheets, S3, SFTP, Azure Blob Storage
  • Event streams: Kinesis, Kafka, Pub/Sub
  • Custom sources: HTTP API (generic), JDBC, webhook

Go to Calabi Connect documentation →


Data Transformation — Calabi Transform

Starter · Professional · Enterprise

Calabi Transform is the SQL transformation layer. It provides a structured, version-controlled environment for authoring, testing, and deploying data models that turn raw Bronze-layer data into clean Silver and business-ready Gold assets.

Key capabilities

  • SQL-first modelling — write standard SQL SELECT statements; Calabi Transform handles materialisation (tables, views, incremental models)
  • Automated testing — built-in tests for uniqueness, not-null constraints, referential integrity, and custom SQL assertions
  • Column-level lineage — every model run pushes fine-grained lineage into Calabi Catalogue, showing which source columns feed which output columns
  • Documentation-as-code — model descriptions and column definitions live alongside SQL in version control, auto-published to Calabi Catalogue
  • Environments — separate dev, staging, and prod environments with schema isolation
  • Incremental models — efficiently process only new or changed records, not full table scans
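The merge semantics behind incremental models can be illustrated with a small sketch: only new or changed records are upserted into the existing target by primary key, instead of rebuilding the whole table. This is a conceptual illustration under assumed data shapes, not how the Calabi Transform engine materialises models.

```python
def incremental_merge(target, new_records, key="id"):
    """Upsert new or changed records into the target table by primary key."""
    merged = {row[key]: row for row in target}
    for row in new_records:
        merged[row[key]] = row  # insert a new row or overwrite a changed one
    return sorted(merged.values(), key=lambda r: r[key])

# Hypothetical existing target table and a small batch of changed records.
target = [{"id": 1, "status": "old"}, {"id": 2, "status": "ok"}]
new = [{"id": 1, "status": "updated"}, {"id": 3, "status": "new"}]

result = incremental_merge(target, new)
```

The payoff is that the run cost scales with the size of the changed batch, not the size of the full table.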

Project structure

calabi-transform/
├── models/
│   ├── silver/                  # Cleaned, conformed models
│   │   ├── commerce_clean/
│   │   │   ├── customers.sql
│   │   │   └── customers.yml
│   │   └── finance_clean/
│   └── gold/                    # Business aggregates and marts
│       ├── commerce_mart/
│       │   ├── fact_orders.sql
│       │   └── dim_customer.sql
│       └── finance_mart/
├── tests/                       # Custom SQL assertion tests
├── macros/                      # Reusable SQL macros
└── dbt_project.yml

Go to Calabi Transform documentation →


Data Orchestration — Calabi Pipelines

Starter · Professional · Enterprise

Calabi Pipelines is the workflow orchestration engine. It schedules, executes, monitors, and alerts on all data pipeline DAGs — from ingestion triggers to transformation runs to ML model training jobs.

Key capabilities

  • Python-based DAG authoring — write pipelines as code with full Python expressiveness
  • Dependency management — define explicit task dependencies; tasks execute in the correct order with automatic retry on failure
  • Rich scheduling — cron expressions, data-aware scheduling (trigger when upstream dataset updates), or manual triggers
  • Pre-built operators — native operators for Calabi Connect syncs, Calabi Transform runs, S3 operations, Redshift queries, and AWS Glue
  • Monitoring & alerting — real-time DAG run status, task duration metrics, SLA miss detection, and email/Slack alerting on failure
  • Role-based DAG access — control which teams can view, trigger, or edit specific DAGs
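The automatic-retry behaviour described above follows a standard pattern: re-run a failing task up to a configured limit, with an optional backoff delay between attempts. A minimal sketch in generic Python (the retry policy knobs are illustrative, not Calabi Pipelines configuration keys):

```python
import time

def run_with_retries(task, max_retries=3, backoff_seconds=0.0):
    """Run `task`, retrying on any exception up to `max_retries` extra attempts."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > max_retries:
                raise  # retries exhausted: surface the failure to alerting
            time.sleep(backoff_seconds * attempts)  # linear backoff between attempts

# A hypothetical task that fails transiently on its first two attempts.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result, attempts = run_with_retries(flaky)
```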

Example DAG structure

# calabi-pipelines/dags/medallion_commerce_daily.py

from calabi.operators import CalabiConnectOperator, CalabiTransformOperator

with DAG("medallion_commerce_daily", schedule="0 6 * * *", catchup=False):

    ingest = CalabiConnectOperator(
        task_id="sync_all_commerce_sources",
        connection_ids=["salesforce_prod", "stripe_prod", "postgres_orders"],
    )

    transform_silver = CalabiTransformOperator(
        task_id="run_silver_models",
        select="tag:silver,tag:commerce",
        test=True,  # Block Gold if Silver tests fail
    )

    transform_gold = CalabiTransformOperator(
        task_id="run_gold_models",
        select="tag:gold,tag:commerce",
        test=True,
    )

    ingest >> transform_silver >> transform_gold

Go to Calabi Pipelines documentation →


Data Quality

Professional · Enterprise

Data quality in Calabi operates at two levels: inline tests embedded in Calabi Transform models (run at pipeline time), and continuous quality monitoring via profiling and anomaly detection on live tables.

Inline testing (pipeline-time)

Every Calabi Transform model can declare tests in its .yml configuration. Tests run automatically after each transformation job. Failures block downstream models from running and surface alerts in Calabi Pipelines.

Built-in test types:

  • not_null: column contains no null values
  • unique: column values are distinct
  • accepted_values: column only contains values from a defined list
  • relationships: foreign key exists in the referenced table
  • Custom SQL: any business logic expressible as a SQL WHERE clause
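The first four built-in checks can be sketched as plain predicates over rows. This is a conceptual illustration of what each test asserts, not the Calabi Transform implementation; the sample tables and column names are hypothetical. Each check returns the offending values, so an empty result means the test passes.

```python
def not_null(rows, column):
    """Rows where the column is null."""
    return [r for r in rows if r.get(column) is None]

def unique(rows, column):
    """Values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        if r[column] in seen:
            dupes.add(r[column])
        seen.add(r[column])
    return dupes

def accepted_values(rows, column, allowed):
    """Values outside the allowed list."""
    return {r[column] for r in rows} - set(allowed)

def relationships(rows, column, parent_rows, parent_column):
    """Rows whose foreign key has no match in the referenced table."""
    parent_keys = {p[parent_column] for p in parent_rows}
    return [r for r in rows if r[column] not in parent_keys]

# Hypothetical sample data with one violation of each kind.
orders = [
    {"id": 1, "customer_id": 1, "status": "paid"},
    {"id": 2, "customer_id": 2, "status": "refunded"},
    {"id": 2, "customer_id": 9, "status": "paid"},   # duplicate id, orphan FK
    {"id": 3, "customer_id": 1, "status": None},     # null status
]
customers = [{"id": 1}, {"id": 2}]
```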

Continuous quality monitoring (Calabi Catalogue)

Beyond pipeline-time tests, Calabi Catalogue runs scheduled table profiling jobs that compute:

  • Row counts and trends over time
  • Null rate per column
  • Cardinality and value distributions
  • Min / max / mean / percentile statistics
  • Schema change detection (new columns, dropped columns, type changes)

Anomaly detection compares each profiling run against historical baselines and raises alerts when metrics deviate beyond configured thresholds — for example, if daily row count drops by more than 20% compared to the trailing 7-day average.
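The row-count example above can be sketched directly: compare today's count against the trailing baseline and flag a drop beyond the threshold. A conceptual illustration only; the real detector is ML-based and the counts below are made up.

```python
def volume_anomaly(history, today, threshold=0.2):
    """Flag today's row count if it drops more than `threshold` below the trailing average."""
    baseline = sum(history) / len(history)
    drop = (baseline - today) / baseline
    return drop > threshold, round(drop, 3)

# Hypothetical trailing 7-day daily row counts (baseline average: 1000).
history = [1000, 980, 1020, 1010, 990, 1005, 995]

# A 30% drop exceeds the 20% threshold, so an alert would fire.
anomalous, drop = volume_anomaly(history, today=700)
```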

Quality scores

Calabi Catalogue assigns a data quality score to each table, combining:

  • Test pass rate (inline and scheduled)
  • Profiling freshness
  • Documentation completeness
  • Ownership assignment

Scores are visible in search results, data product pages, and the CalabiIQ quality dashboard.
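One plausible way to combine those four signals is a weighted blend, sketched below. The weights and the 0-100 scale are illustrative assumptions, not Calabi Catalogue's actual scoring formula.

```python
def quality_score(test_pass_rate, profiling_fresh, documented_ratio, has_owner):
    """Blend the four quality signals into a 0-100 score (weights are hypothetical)."""
    weights = {"tests": 0.5, "freshness": 0.2, "docs": 0.2, "owner": 0.1}
    score = (
        weights["tests"] * test_pass_rate                    # inline + scheduled test pass rate
        + weights["freshness"] * (1.0 if profiling_fresh else 0.0)
        + weights["docs"] * documented_ratio                 # share of documented columns
        + weights["owner"] * (1.0 if has_owner else 0.0)
    )
    return round(100 * score)

score = quality_score(
    test_pass_rate=0.96, profiling_fresh=True, documented_ratio=0.8, has_owner=True
)
```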


Data Governance — Calabi Catalogue

All Tiers

Calabi Catalogue is the unified governance surface for all data assets in the platform. It is automatically populated by metadata from Calabi Connect, Calabi Transform, Calabi Pipelines, and CalabiIQ — no manual registration required.

Key capabilities

Discovery

  • Full-text search across all tables, columns, dashboards, pipelines, and ML models
  • Filter by domain, owner, tag, classification, quality score, or data tier
  • Recently viewed and most popular assets surfaced on the home page

Lineage

  • Automatic column-level lineage from Calabi Transform runs
  • End-to-end lineage from source system → Bronze → Silver → Gold → Dashboard
  • Impact analysis: "If I change this column, what downstream assets break?"
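Impact analysis is, at its core, a graph traversal: collect everything transitively downstream of a column in the lineage graph. A minimal sketch with a hypothetical lineage map (the asset names are invented for illustration):

```python
from collections import deque

# Hypothetical column-level lineage edges: source column -> downstream assets.
lineage = {
    "bronze.orders.amount": ["silver.orders_clean.amount"],
    "silver.orders_clean.amount": ["gold.fact_orders.revenue"],
    "gold.fact_orders.revenue": ["dashboard.revenue_kpi"],
}

def downstream_impact(column):
    """All assets transitively fed by `column` (breadth-first over lineage edges)."""
    impacted, queue = set(), deque([column])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

impact = downstream_impact("bronze.orders.amount")
```

Changing `bronze.orders.amount` here would surface one Silver model, one Gold fact, and one dashboard as at-risk assets.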

Business Glossary

  • Canonical definitions for business terms (e.g., "Monthly Active User", "Net Revenue")
  • Terms linked to the specific columns and tables that implement them
  • Approved by Data Stewards; versioned and auditable

Data Classification

  • Tag columns with sensitivity labels: PII, Confidential, Internal, Public
  • Classification tags drive access policies and masking rules automatically
  • Auto-classification suggestions using pattern matching on column names and sample values
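Name-based auto-classification can be sketched as simple pattern matching. The patterns below are illustrative assumptions; the real suggestions also inspect sample values, which this sketch omits.

```python
import re

# Hypothetical column-name patterns that suggest personally identifiable information.
PII_PATTERNS = [r"email", r"phone", r"ssn", r"birth", r"first_?name", r"last_?name"]

def suggest_classification(column_name):
    """Suggest a sensitivity label from the column name alone."""
    name = column_name.lower()
    if any(re.search(p, name) for p in PII_PATTERNS):
        return "PII"
    return "Internal"  # default label; a steward can still override it

suggestions = {
    c: suggest_classification(c)
    for c in ["customer_email", "order_total", "FirstName"]
}
```

Suggestions like these are exactly that: a steward confirms or overrides them before the label drives masking or access policy.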

Data Products

  • Curate collections of related tables, dashboards, and metrics into a Data Product
  • Assign an owner team and SLA
  • Enable self-service access requests, approved by the owning Data Steward

Go to Calabi Catalogue documentation →


Putting It All Together

A complete, production-grade data pipeline in Calabi looks like this:

  1. Calabi Connect syncs raw data from Salesforce, Stripe, and Postgres into raw_* schemas (Bronze)
  2. Calabi Pipelines triggers the Calabi Transform silver run; Silver models clean, deduplicate, and conform the data into *_clean schemas
  3. Silver tests run — if any fail, the Gold run is blocked and an alert fires
  4. Gold models build *_mart schemas with business facts, dimensions, and aggregates
  5. Calabi Catalogue automatically indexes the new Gold tables, traces lineage back to Bronze, and updates quality scores
  6. CalabiIQ dashboards connected to *_mart schemas refresh with the latest data
  7. Calabi AI Agent can answer natural language questions using the trusted Gold layer
  8. The owning Data Steward receives a quality score summary and approves the Data Product for self-service access

What's Next