Data Management Overview
Data management in Calabi covers the complete journey of data — from the moment it arrives from an external source, through transformation and quality validation, to its final governed state as a trusted, documented asset available to analysts and AI.
Every stage of this lifecycle is handled by a purpose-built Calabi component, and each component is designed to hand off cleanly to the next. Lineage, quality signals, and ownership metadata flow automatically from stage to stage — you never have to wire them up manually.
The Calabi Data Lifecycle
Data Ingestion — Calabi Connect
Calabi Connect is the ingestion layer of the platform. It connects to over 90 data sources — relational databases, cloud APIs, SaaS tools, files, and event streams — and loads data into your S3 data lake or warehouse.
Key capabilities
- 90+ pre-built connectors — covers the most common enterprise data sources without custom code
- Batch sync — scheduled full or incremental syncs on any cadence (hourly, daily, custom cron)
- Change Data Capture (CDC) — real-time or near-real-time replication from databases that support log-based CDC (Postgres, MySQL, SQL Server, Oracle)
- Schema evolution — automatically detects and propagates upstream schema changes without breaking pipelines
- Per-connector configuration — normalisation, column selection, sync frequency, and destination schema all configurable per source (see the sketch after this list)
- Data residency — all synced data lands in your AWS account; nothing passes through Calabi-operated infrastructure
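For illustration, the sketch below registers a Postgres source with log-based CDC and triggers an ad-hoc sync. It assumes a Python client for Calabi Connect; the `ConnectClient` class, its import path, and the configuration keys shown are illustrative rather than a documented API.

```python
# Hypothetical sketch only: ConnectClient, its import path, and these
# configuration keys are illustrative, not the documented Calabi Connect API.
from calabi.connect import ConnectClient  # assumed import path

client = ConnectClient(profile="prod")

# Register a Postgres source using log-based Change Data Capture (CDC)
client.create_connection(
    connection_id="postgres_orders",
    source_type="postgres",
    config={
        "host": "orders-db.internal",        # illustrative hostname
        "database": "orders",
        "replication_method": "cdc",         # log-based CDC replication
        "tables": ["public.orders", "public.order_items"],
    },
    destination_schema="raw_orders",         # Bronze landing schema
    sync_frequency="hourly",
)

# Trigger an ad-hoc sync outside the regular schedule
client.trigger_sync("postgres_orders")
```

The destination schema name here follows the `raw_*` Bronze convention used throughout this page.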
Connector categories
| Category | Examples |
|---|---|
| Relational databases | Postgres, MySQL, MS SQL Server, Oracle, Snowflake, Redshift |
| NoSQL & document stores | DynamoDB, MongoDB, Firestore |
| SaaS CRM & marketing | Salesforce, HubSpot, Marketo, Intercom |
| SaaS finance | Stripe, QuickBooks, NetSuite, Chargebee |
| Productivity & files | Google Sheets, S3, SFTP, Azure Blob Storage |
| Event streams | Kinesis, Kafka, Pub/Sub |
| Custom sources | HTTP API (generic), JDBC, webhook |
Go to Calabi Connect documentation →
Data Transformation — Calabi Transform
Calabi Transform is the SQL transformation layer. It provides a structured, version-controlled environment for authoring, testing, and deploying data models that turn raw Bronze-layer data into clean Silver and business-ready Gold assets.
Key capabilities
- SQL-first modelling — write standard SQL `SELECT` statements; Calabi Transform handles materialisation (tables, views, incremental models)
- Automated testing — built-in tests for uniqueness, not-null constraints, referential integrity, and custom SQL assertions
- Column-level lineage — every model run pushes fine-grained lineage into Calabi Catalogue, showing which source columns feed which output columns
- Documentation-as-code — model descriptions and column definitions live alongside SQL in version control, auto-published to Calabi Catalogue
- Environments — separate `dev`, `staging`, and `prod` environments with schema isolation (see the sketch after this list)
- Incremental models — efficiently process only new or changed records, not full table scans
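To make environments and model selection concrete, here is a minimal sketch of triggering a run programmatically. It assumes a Python entry point for Calabi Transform; `TransformClient` and its methods are illustrative rather than a documented API. The `select` syntax mirrors the tag selectors used by the pipeline example later on this page.

```python
# Hypothetical sketch only: TransformClient and its methods are
# illustrative, not the documented Calabi Transform API.
from calabi.transform import TransformClient  # assumed import path

# Environments are isolated at the schema level (dev / staging / prod)
client = TransformClient(environment="dev")

# Build only the Silver commerce models and run their declared tests
result = client.run(select="tag:silver,tag:commerce", test=True)

for model in result.models:
    print(model.name, model.status)  # e.g. customers  success
```

Selecting by tags keeps environment promotion simple: the same selection can run unchanged against `dev`, `staging`, and `prod`.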
Project structure
```
calabi-transform/
├── models/
│   ├── silver/                  # Cleaned, conformed models
│   │   ├── commerce_clean/
│   │   │   ├── customers.sql
│   │   │   └── customers.yml
│   │   └── finance_clean/
│   └── gold/                    # Business aggregates and marts
│       ├── commerce_mart/
│       │   ├── fact_orders.sql
│       │   └── dim_customer.sql
│       └── finance_mart/
├── tests/                       # Custom SQL assertion tests
├── macros/                      # Reusable SQL macros
└── dbt_project.yml
```
Go to Calabi Transform documentation →
Data Orchestration — Calabi Pipelines
Calabi Pipelines is the workflow orchestration engine. It schedules, executes, monitors, and alerts on all data pipeline DAGs — from ingestion triggers to transformation runs to ML model training jobs.
Key capabilities
- Python-based DAG authoring — write pipelines as code with full Python expressiveness
- Dependency management — define explicit task dependencies; tasks execute in the correct order with automatic retry on failure
- Rich scheduling — cron expressions, data-aware scheduling (trigger a DAG when an upstream dataset updates; see the sketch after the example DAG), or manual triggers
- Pre-built operators — native operators for Calabi Connect syncs, Calabi Transform runs, S3 operations, Redshift queries, and AWS Glue
- Monitoring & alerting — real-time DAG run status, task duration metrics, SLA miss detection, and email/Slack alerting on failure
- Role-based DAG access — control which teams can view, trigger, or edit specific DAGs
Example DAG structure
```python
# calabi-pipelines/dags/medallion_commerce_daily.py
from calabi.pipelines import DAG  # assumed import path for the DAG class
from calabi.operators import CalabiConnectOperator, CalabiTransformOperator

with DAG("medallion_commerce_daily", schedule="0 6 * * *", catchup=False):
    ingest = CalabiConnectOperator(
        task_id="sync_all_commerce_sources",
        connection_ids=["salesforce_prod", "stripe_prod", "postgres_orders"],
    )
    transform_silver = CalabiTransformOperator(
        task_id="run_silver_models",
        select="tag:silver,tag:commerce",
        test=True,  # Block Gold if Silver tests fail
    )
    transform_gold = CalabiTransformOperator(
        task_id="run_gold_models",
        select="tag:gold,tag:commerce",
        test=True,
    )

    # Ingest first, then Silver models, then Gold models
    ingest >> transform_silver >> transform_gold
```
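The DAG above runs on a fixed cron schedule. For the data-aware scheduling described under Key capabilities, a dataset-triggered variant might look like the sketch below; the `Dataset` class and the list-valued `schedule` argument are assumptions about the Calabi Pipelines API, not confirmed syntax.

```python
# Hypothetical sketch: Dataset and the list-valued schedule are assumed,
# not confirmed Calabi Pipelines syntax.
from calabi.pipelines import DAG, Dataset  # assumed import path
from calabi.operators import CalabiTransformOperator

silver_customers = Dataset("commerce_clean.customers")

# Run Gold models whenever the upstream Silver dataset is updated,
# instead of on a fixed cron schedule.
with DAG("gold_commerce_on_update", schedule=[silver_customers], catchup=False):
    CalabiTransformOperator(
        task_id="run_gold_models",
        select="tag:gold,tag:commerce",
        test=True,
    )
```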
Go to Calabi Pipelines documentation →
Data Quality
Data quality in Calabi operates at two levels: inline tests embedded in Calabi Transform models (run at pipeline time), and continuous quality monitoring via profiling and anomaly detection on live tables.
Inline testing (pipeline-time)
Every Calabi Transform model can declare tests in its .yml configuration. Tests run automatically after each transformation job. Failures block downstream models from running and surface alerts in Calabi Pipelines.
Built-in test types:
| Test | What it checks |
|---|---|
| `not_null` | Column contains no null values |
| `unique` | Column values are distinct |
| `accepted_values` | Column only contains values from a defined list |
| `relationships` | Foreign key exists in the referenced table |
| Custom SQL | Any business logic expressible as a SQL `WHERE` clause (see the sketch below) |
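The runnable sketch below shows the semantics of a custom SQL assertion: the test query selects rows that violate the rule, and the test passes only when no rows come back. SQLite is used purely as a stand-in engine; in Calabi Transform the equivalent check would live under `tests/` and run against your warehouse.

```python
# Conceptual illustration of a custom SQL assertion: the test passes when
# the query selecting *violating* rows returns zero. SQLite stands in for
# the warehouse purely for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?)",
    [(1, 120.0), (2, 35.5), (3, 980.0)],
)

# Business rule: every order amount must be positive
violations = conn.execute(
    "SELECT COUNT(*) FROM fact_orders WHERE amount <= 0"
).fetchone()[0]

assert violations == 0, f"{violations} rows violate the amount > 0 rule"
print("custom assertion passed")
```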
Continuous quality monitoring (Calabi Catalogue)
Beyond pipeline-time tests, Calabi Catalogue runs scheduled table profiling jobs that compute:
- Row counts and trends over time
- Null rate per column
- Cardinality and value distributions
- Min / max / mean / percentile statistics
- Schema change detection (new columns, dropped columns, type changes)
Anomaly detection compares each profiling run against historical baselines and raises alerts when metrics deviate beyond configured thresholds — for example, if the daily row count drops by more than 20% compared to the trailing 7-day average.
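A minimal sketch of that comparison, with invented values:

```python
# Illustrative sketch of the row-count anomaly check, with invented values;
# this is the comparison described above, not Calabi Catalogue internals.
from statistics import mean

trailing_7d = [10_420, 10_180, 10_655, 10_390, 10_505, 10_240, 10_610]
today = 7_950

baseline = mean(trailing_7d)
drop = (baseline - today) / baseline

THRESHOLD = 0.20  # alert when the daily row count drops > 20% vs baseline
if drop > THRESHOLD:
    print(f"ALERT: row count down {drop:.0%} vs trailing 7-day average")
```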
Quality scores
Calabi Catalogue assigns a data quality score to each table, combining:
- Test pass rate (inline and scheduled)
- Profiling freshness
- Documentation completeness
- Ownership assignment
Scores are visible in search results, data product pages, and the CalabiIQ quality dashboard.
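Conceptually, the score is a weighted roll-up of those signals. The sketch below is purely illustrative; the signal values and weights are invented for the example and are not Calabi Catalogue's actual formula.

```python
# Purely illustrative weighting: the signal values and weights below are
# invented, not Calabi Catalogue's actual scoring formula.
signals = {
    "test_pass_rate": 0.96,       # share of inline + scheduled tests passing
    "profiling_freshness": 1.0,   # profiled within the expected window
    "documentation": 0.80,        # share of columns with descriptions
    "ownership": 1.0,             # owner team assigned
}
weights = {
    "test_pass_rate": 0.4,
    "profiling_freshness": 0.2,
    "documentation": 0.2,
    "ownership": 0.2,
}

score = sum(signals[k] * weights[k] for k in signals)
print(f"quality score: {score:.2f}")  # -> quality score: 0.94
```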
Data Governance — Calabi Catalogue
Calabi Catalogue is the unified governance surface for all data assets in the platform. It is populated automatically with metadata from Calabi Connect, Calabi Transform, Calabi Pipelines, and CalabiIQ — no manual registration required.
Key capabilities
Discovery
- Full-text search across all tables, columns, dashboards, pipelines, and ML models
- Filter by domain, owner, tag, classification, quality score, or data tier
- Recently viewed and most popular assets surfaced on the home page
Lineage
- Automatic column-level lineage from Calabi Transform runs
- End-to-end lineage from source system → Bronze → Silver → Gold → Dashboard
- Impact analysis: "If I change this column, what downstream assets break?"
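For example, an impact-analysis lookup might read like the sketch below; `CatalogueClient` and its `downstream()` method are hypothetical, introduced only to illustrate the shape of the question.

```python
# Hypothetical sketch: CatalogueClient and downstream() are illustrative,
# not a documented Calabi Catalogue API. The column name is invented.
from calabi.catalogue import CatalogueClient  # assumed import path

catalogue = CatalogueClient()

# Impact analysis: everything downstream of one Silver column
impacted = catalogue.downstream("commerce_clean.customers.customer_id")

for asset in impacted:
    # Downstream assets can be Gold tables, dashboards, or ML models
    print(asset.type, asset.qualified_name)
```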
Business Glossary
- Canonical definitions for business terms (e.g., "Monthly Active User", "Net Revenue")
- Terms linked to the specific columns and tables that implement them
- Approved by Data Stewards; versioned and auditable
Data Classification
- Tag columns with sensitivity labels: `PII`, `Confidential`, `Internal`, `Public`
- Classification tags drive access policies and masking rules automatically
- Auto-classification suggestions using pattern matching on column names and sample values
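A name-based suggester can be as simple as the following sketch. It illustrates the pattern-matching idea; the patterns shown are invented, and the real suggester also inspects sample values.

```python
import re

# Invented name patterns for the illustration; Calabi Catalogue's actual
# rules also inspect sample values, not just column names.
PII_NAME_PATTERN = re.compile(
    r"(email|phone|ssn|passport|first_name|last_name|date_of_birth)", re.I
)

def suggest_classification(column_name: str) -> str:
    """Suggest a sensitivity label from the column name alone."""
    return "PII" if PII_NAME_PATTERN.search(column_name) else "Internal"

print(suggest_classification("customer_email"))  # -> PII
print(suggest_classification("order_total"))     # -> Internal
```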
Data Products
- Curate collections of related tables, dashboards, and metrics into a Data Product
- Assign an owner team and SLA
- Enable self-service access requests, approved by the owning Data Steward
Go to Calabi Catalogue documentation →
Putting It All Together
A complete, production-grade data pipeline in Calabi looks like this:
- Calabi Connect syncs raw data from Salesforce, Stripe, and Postgres into `raw_*` schemas (Bronze)
- Calabi Pipelines triggers the Calabi Transform Silver run; Silver models clean, deduplicate, and conform the data into `*_clean` schemas
- Silver tests run — if any fail, the Gold run is blocked and an alert fires
- Gold models build `*_mart` schemas with business facts, dimensions, and aggregates
- Calabi Catalogue automatically indexes the new Gold tables, traces lineage back to Bronze, and updates quality scores
- CalabiIQ dashboards connected to `*_mart` schemas refresh with the latest data
- Calabi AI Agent can answer natural language questions using the trusted Gold layer
- The owning Data Steward receives a quality score summary and approves the Data Product for self-service access
What's Next
- Calabi Connect — Configure your first data source connection
- Calabi Transform — Author your first Silver and Gold models
- Calabi Pipelines — Schedule your end-to-end medallion pipeline
- Calabi Catalogue — Explore, govern, and document your data assets
- Medallion Architecture — Best practices for Bronze, Silver, and Gold layer design