Data Management Overview
Data management in Calabi covers the complete journey of data — from the moment it arrives from an external source, through transformation and quality validation, to its final governed state as a trusted, documented asset available to analysts and AI.
Every stage of this lifecycle is handled by a purpose-built Calabi component, and each component is designed to hand off cleanly to the next. Lineage, quality signals, and ownership metadata flow automatically from stage to stage — you never have to wire them up manually.
The Calabi Data Lifecycle
Data Ingestion — Calabi Connect
Calabi Connect is the ingestion layer of the platform. It connects to over 90 data sources — relational databases, cloud APIs, SaaS tools, files, and event streams — and loads data into your S3 data lake or warehouse.
Key capabilities
- 90+ pre-built connectors — covers the most common enterprise data sources without custom code
- Batch sync — scheduled full or incremental syncs on any cadence (hourly, daily, custom cron)
- Change Data Capture (CDC) — real-time or near-real-time replication from databases that support log-based CDC (Postgres, MySQL, SQL Server, Oracle)
- Schema evolution — automatically detects and propagates upstream schema changes without breaking pipelines
- Per-connector configuration — normalisation, column selection, sync frequency, and destination schema all configurable per source (see the sketch after this list)
- Data residency — all synced data lands in your AWS account; nothing passes through Calabi-operated infrastructure
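For illustration, the sketch below registers a Postgres source with log-based CDC and triggers an ad-hoc sync. It assumes a Python client for Calabi Connect; the `ConnectClient` class, its import path, and the configuration keys shown are illustrative rather than a documented API.

```python
# Hypothetical sketch only: ConnectClient, its import path, and these
# configuration keys are illustrative, not the documented Calabi Connect API.
from calabi.connect import ConnectClient  # assumed import path

client = ConnectClient(profile="prod")

# Register a Postgres source using log-based Change Data Capture (CDC)
client.create_connection(
    connection_id="postgres_orders",
    source_type="postgres",
    config={
        "host": "orders-db.internal",        # illustrative hostname
        "database": "orders",
        "replication_method": "cdc",         # log-based CDC replication
        "tables": ["public.orders", "public.order_items"],
    },
    destination_schema="raw_orders",         # Bronze landing schema
    sync_frequency="hourly",
)

# Trigger an ad-hoc sync outside the regular schedule
client.trigger_sync("postgres_orders")
```

The destination schema name here follows the `raw_*` Bronze convention used throughout this page.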
Connector categories
| Category | Examples |
|---|---|
| Relational databases | Postgres, MySQL, MS SQL Server, Oracle, Snowflake, Redshift |
| NoSQL & document stores | DynamoDB, MongoDB, Firestore |
| SaaS CRM & marketing | Salesforce, HubSpot, Marketo, Intercom |
| SaaS finance | Stripe, QuickBooks, NetSuite, Chargebee |
| Productivity & files | Google Sheets, S3, SFTP, Azure Blob Storage |
| Event streams | Kinesis, Kafka, Pub/Sub |
| Custom sources | HTTP API (generic), JDBC, webhook |
Go to Calabi Connect documentation →
Data Transformation — Calabi Transform
Calabi Transform is the SQL transformation layer. It provides a structured, version-controlled environment for authoring, testing, and deploying data models that turn raw Bronze-layer data into clean Silver and business-ready Gold assets.
Key capabilities
- SQL-first modelling — write standard SQL `SELECT` statements; Calabi Transform handles materialisation (tables, views, incremental models)
- Automated testing — built-in tests for uniqueness, not-null constraints, referential integrity, and custom SQL assertions
- Column-level lineage — every model run pushes fine-grained lineage into Calabi Catalogue, showing which source columns feed which output columns
- Documentation-as-code — model descriptions and column definitions live alongside SQL in version control, auto-published to Calabi Catalogue
- Environments — separate `dev`, `staging`, and `prod` environments with schema isolation (see the sketch after this list)
- Incremental models — efficiently process only new or changed records, not full table scans
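To make environments and model selection concrete, here is a minimal sketch of triggering a run programmatically. It assumes a Python entry point for Calabi Transform; `TransformClient` and its methods are illustrative rather than a documented API. The `select` syntax mirrors the tag selectors used by the pipeline example later on this page.

```python
# Hypothetical sketch only: TransformClient and its methods are
# illustrative, not the documented Calabi Transform API.
from calabi.transform import TransformClient  # assumed import path

# Environments are isolated at the schema level (dev / staging / prod)
client = TransformClient(environment="dev")

# Build only the Silver commerce models and run their declared tests
result = client.run(select="tag:silver,tag:commerce", test=True)

for model in result.models:
    print(model.name, model.status)  # e.g. customers  success
```

Selecting by tags keeps environment promotion simple: the same selection can run unchanged against `dev`, `staging`, and `prod`.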
Project structure
```
calabi-transform/
├── models/
│   ├── silver/                  # Cleaned, conformed models
│   │   ├── commerce_clean/
│   │   │   ├── customers.sql
│   │   │   └── customers.yml
│   │   └── finance_clean/
│   └── gold/                    # Business aggregates and marts
│       ├── commerce_mart/
│       │   ├── fact_orders.sql
│       │   └── dim_customer.sql
│       └── finance_mart/
├── tests/                       # Custom SQL assertion tests
├── macros/                      # Reusable SQL macros
└── dbt_project.yml
```
Go to Calabi Transform documentation →
Data Orchestration — Calabi Pipelines
Calabi Pipelines is the workflow orchestration engine. It schedules, executes, monitors, and alerts on all data pipeline DAGs — from ingestion triggers to transformation runs to ML model training jobs.
Key capabilities
- Python-based DAG authoring — write pipelines as code with full Python expressiveness
- Dependency management — define explicit task dependencies; tasks execute in the correct order with automatic retry on failure
- Rich scheduling — cron expressions, data-aware scheduling (trigger a DAG when an upstream dataset updates; see the sketch after the example DAG), or manual triggers
- Pre-built operators — native operators for Calabi Connect syncs, Calabi Transform runs, S3 operations, Redshift queries, and AWS Glue
- Monitoring & alerting — real-time DAG run status, task duration metrics, SLA miss detection, and email/Slack alerting on failure
- Role-based DAG access — control which teams can view, trigger, or edit specific DAGs
Example DAG structure
```python
# calabi-pipelines/dags/medallion_commerce_daily.py
from calabi.pipelines import DAG  # assumed import path for the DAG class
from calabi.operators import CalabiConnectOperator, CalabiTransformOperator

with DAG("medallion_commerce_daily", schedule="0 6 * * *", catchup=False):
    ingest = CalabiConnectOperator(
        task_id="sync_all_commerce_sources",
        connection_ids=["salesforce_prod", "stripe_prod", "postgres_orders"],
    )
    transform_silver = CalabiTransformOperator(
        task_id="run_silver_models",
        select="tag:silver,tag:commerce",
        test=True,  # Block Gold if Silver tests fail
    )
    transform_gold = CalabiTransformOperator(
        task_id="run_gold_models",
        select="tag:gold,tag:commerce",
        test=True,
    )

    # Ingest first, then Silver models, then Gold models
    ingest >> transform_silver >> transform_gold
```
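The DAG above runs on a fixed cron schedule. For the data-aware scheduling described under Key capabilities, a dataset-triggered variant might look like the sketch below; the `Dataset` class and the list-valued `schedule` argument are assumptions about the Calabi Pipelines API, not confirmed syntax.

```python
# Hypothetical sketch: Dataset and the list-valued schedule are assumed,
# not confirmed Calabi Pipelines syntax.
from calabi.pipelines import DAG, Dataset  # assumed import path
from calabi.operators import CalabiTransformOperator

silver_customers = Dataset("commerce_clean.customers")

# Run Gold models whenever the upstream Silver dataset is updated,
# instead of on a fixed cron schedule.
with DAG("gold_commerce_on_update", schedule=[silver_customers], catchup=False):
    CalabiTransformOperator(
        task_id="run_gold_models",
        select="tag:gold,tag:commerce",
        test=True,
    )
```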
Go to Calabi Pipelines documentation →
Data Quality
Data quality in Calabi operates at two levels: inline tests embedded in Calabi Transform models (run at pipeline time), and continuous quality monitoring via profiling and anomaly detection on live tables.
Inline testing (pipeline-time)
Every Calabi Transform model can declare tests in its .yml configuration. Tests run automatically after each transformation job. Failures block downstream models from running and surface alerts in Calabi Pipelines.
Built-in test types:
| Test | What it checks |
|---|---|
| `not_null` | Column contains no null values |
| `unique` | Column values are distinct |
| `accepted_values` | Column only contains values from a defined list |
| `relationships` | Foreign key exists in the referenced table |
| Custom SQL | Any business logic expressible as a SQL `WHERE` clause (see the sketch below) |
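The runnable sketch below shows the semantics of a custom SQL assertion: the test query selects rows that violate the rule, and the test passes only when no rows come back. SQLite is used purely as a stand-in engine; in Calabi Transform the equivalent check would live under `tests/` and run against your warehouse.

```python
# Conceptual illustration of a custom SQL assertion: the test passes when
# the query selecting *violating* rows returns zero. SQLite stands in for
# the warehouse purely for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?)",
    [(1, 120.0), (2, 35.5), (3, 980.0)],
)

# Business rule: every order amount must be positive
violations = conn.execute(
    "SELECT COUNT(*) FROM fact_orders WHERE amount <= 0"
).fetchone()[0]

assert violations == 0, f"{violations} rows violate the amount > 0 rule"
print("custom assertion passed")
```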
Continuous quality monitoring (Calabi Catalogue)
Beyond pipeline-time tests, Calabi Catalogue runs scheduled table profiling jobs that compute:
- Row counts and trends over time
- Null rate per column
- Cardinality and value distributions
- Min / max / mean / percentile statistics
- Schema change detection (new columns, dropped columns, type changes)
Anomaly detection compares each profiling run against historical baselines and raises alerts when metrics deviate beyond configured thresholds — for example, if the daily row count drops by more than 20% compared to the trailing 7-day average.
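A minimal sketch of that comparison, with invented values:

```python
# Illustrative sketch of the row-count anomaly check, with invented values;
# this is the comparison described above, not Calabi Catalogue internals.
from statistics import mean

trailing_7d = [10_420, 10_180, 10_655, 10_390, 10_505, 10_240, 10_610]
today = 7_950

baseline = mean(trailing_7d)
drop = (baseline - today) / baseline

THRESHOLD = 0.20  # alert when the daily row count drops > 20% vs baseline
if drop > THRESHOLD:
    print(f"ALERT: row count down {drop:.0%} vs trailing 7-day average")
```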
Quality scores
Calabi Catalogue assigns a data quality score to each table, combining:
- Test pass rate (inline and scheduled)
- Profiling freshness
- Documentation completeness
- Ownership assignment
Scores are visible in search results, data product pages, and the CalabiIQ quality dashboard.
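Conceptually, the score is a weighted roll-up of those signals. The sketch below is purely illustrative; the signal values and weights are invented for the example and are not Calabi Catalogue's actual formula.

```python
# Purely illustrative weighting: the signal values and weights below are
# invented, not Calabi Catalogue's actual scoring formula.
signals = {
    "test_pass_rate": 0.96,       # share of inline + scheduled tests passing
    "profiling_freshness": 1.0,   # profiled within the expected window
    "documentation": 0.80,        # share of columns with descriptions
    "ownership": 1.0,             # owner team assigned
}
weights = {
    "test_pass_rate": 0.4,
    "profiling_freshness": 0.2,
    "documentation": 0.2,
    "ownership": 0.2,
}

score = sum(signals[k] * weights[k] for k in signals)
print(f"quality score: {score:.2f}")  # -> quality score: 0.94
```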
Data Governance — Calabi Catalogue
Calabi Catalogue is the unified governance surface for all data assets in the platform. It is populated automatically with metadata from Calabi Connect, Calabi Transform, Calabi Pipelines, and CalabiIQ — no manual registration required.
Key capabilities
Discovery
- Full-text search across all tables, columns, dashboards, pipelines, and ML models
- Filter by domain, owner, tag, classification, quality score, or data tier
- Recently viewed and most popular assets surfaced on the home page
Lineage
- Automatic column-level lineage from Calabi Transform runs
- End-to-end lineage from source system → Bronze → Silver → Gold → Dashboard
- Impact analysis: "If I change this column, what downstream assets break?"
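For example, an impact-analysis lookup might read like the sketch below; `CatalogueClient` and its `downstream()` method are hypothetical, introduced only to illustrate the shape of the question.

```python
# Hypothetical sketch: CatalogueClient and downstream() are illustrative,
# not a documented Calabi Catalogue API. The column name is invented.
from calabi.catalogue import CatalogueClient  # assumed import path

catalogue = CatalogueClient()

# Impact analysis: everything downstream of one Silver column
impacted = catalogue.downstream("commerce_clean.customers.customer_id")

for asset in impacted:
    # Downstream assets can be Gold tables, dashboards, or ML models
    print(asset.type, asset.qualified_name)
```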
Business Glossary
- Canonical definitions for business terms (e.g., "Monthly Active User", "Net Revenue")
- Terms linked to the specific columns and tables that implement them
- Approved by Data Stewards; versioned and auditable
Data Classification
- Tag columns with sensitivity labels: `PII`, `Confidential`, `Internal`, `Public`
- Classification tags drive access policies and masking rules automatically
- Auto-classification suggestions using pattern matching on column names and sample values
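A name-based suggester can be as simple as the following sketch. It illustrates the pattern-matching idea; the patterns shown are invented, and the real suggester also inspects sample values.

```python
import re

# Invented name patterns for the illustration; Calabi Catalogue's actual
# rules also inspect sample values, not just column names.
PII_NAME_PATTERN = re.compile(
    r"(email|phone|ssn|passport|first_name|last_name|date_of_birth)", re.I
)

def suggest_classification(column_name: str) -> str:
    """Suggest a sensitivity label from the column name alone."""
    return "PII" if PII_NAME_PATTERN.search(column_name) else "Internal"

print(suggest_classification("customer_email"))  # -> PII
print(suggest_classification("order_total"))     # -> Internal
```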
Data Products
- Curate collections of related tables, dashboards, and metrics into a Data Product
- Assign an owner team and SLA
- Enable self-service access requests, approved by the owning Data Steward
Go to Calabi Catalogue documentation →
Putting It All Together
A complete, production-grade data pipeline in Calabi looks like this:
- Calabi Connect syncs raw data from Salesforce, Stripe, and Postgres into `raw_*` schemas (Bronze)
- Calabi Pipelines triggers the Calabi Transform Silver run; Silver models clean, deduplicate, and conform the data into `*_clean` schemas
- Silver tests run — if any fail, the Gold run is blocked and an alert fires
- Gold models build `*_mart` schemas with business facts, dimensions, and aggregates
- Calabi Catalogue automatically indexes the new Gold tables, traces lineage back to Bronze, and updates quality scores
- CalabiIQ dashboards connected to `*_mart` schemas refresh with the latest data
- Calabi AI Agent can answer natural language questions using the trusted Gold layer
- The owning Data Steward receives a quality score summary and approves the Data Product for self-service access
What's Next
- Calabi Connect — Configure your first data source connection
- Calabi Transform — Author your first Silver and Gold models
- Calabi Pipelines — Schedule your end-to-end medallion pipeline
- Calabi Catalogue — Explore, govern, and document your data assets
- Medallion Architecture — Best practices for Bronze, Silver, and Gold layer design