Medallion Lakehouse Architecture
The medallion architecture is the recommended pattern for structuring data in Calabi. It organises your lakehouse into three progressively refined layers — Bronze, Silver, and Gold — each serving a distinct purpose and audience. This approach makes data pipelines easier to debug, improves data quality incrementally, and provides a clear contract between producer and consumer teams.
What is Medallion Architecture?
| Layer | Also called | What it contains | Primary audience |
|---|---|---|---|
| Bronze | Raw / Landing | Exact copies of source data — no transformations | Data Engineers |
| Silver | Cleaned / Conformed | Deduplicated, typed, standardised, joined | Data Engineers, Analysts |
| Gold | Business / Mart | Aggregated, business-ready metrics and dimensions | Analysts, Executives, AI |
The key principle is immutability at each layer boundary: Bronze is never modified after ingestion; Silver is rebuilt from Bronze; Gold is rebuilt from Silver. This means any layer can be fully reconstructed from upstream without data loss, making incident recovery straightforward and auditable.
How Calabi Implements Medallion Architecture
Bronze Layer — Raw Ingestion
Calabi Connect is responsible for landing raw data into the Bronze layer. Connectors pull from source systems (databases, SaaS APIs, files, event streams) and write records directly to S3 in their original structure, or into a raw_* schema in your warehouse.
Characteristics of Bronze data:
- Schema-on-read — data is stored exactly as received from the source; structure is interpreted downstream at transform time
- Append-only — new syncs append records; history is never overwritten
- Watermarked — each record includes a `_calabi_extracted_at` column set at ingestion time
- Unmodified — nulls, duplicates, and type inconsistencies are preserved intentionally
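The append-only, watermarked behaviour described above can be sketched in a few lines of Python. This is illustrative only — `watermark_records` is a hypothetical helper, not part of Calabi Connect:

```python
from datetime import datetime, timezone

def watermark_records(records):
    """Append-only Bronze write: stamp each raw record with an
    ingestion-time watermark, leaving source fields untouched."""
    extracted_at = datetime.now(timezone.utc).isoformat()
    return [{**r, "_calabi_extracted_at": extracted_at} for r in records]

# Raw source rows are preserved exactly — nulls and duplicates included.
raw = [{"id": 1, "email": None}, {"id": 1, "email": None}]
bronze = watermark_records(raw)
```

Note that the duplicate row and the null email survive into Bronze by design; cleaning happens in Silver.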
Storage location convention:
```
s3://{tenant}-datalake/bronze/{source_name}/{table_name}/year=YYYY/month=MM/day=DD/
```
Or within your warehouse:
```
-- Schema per source system
schema: raw_{source_name}   -- e.g., raw_salesforce, raw_stripe, raw_postgres
tables: {source_table_name} -- e.g., accounts, orders, events
```
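For illustration, the S3 prefix convention above can be generated mechanically. The `bronze_s3_prefix` helper below is hypothetical, not a Calabi API:

```python
from datetime import date

def bronze_s3_prefix(tenant, source_name, table_name, day=None):
    """Build a Bronze S3 prefix following the convention above,
    with Hive-style year/month/day partitions."""
    d = day or date.today()
    return (
        f"s3://{tenant}-datalake/bronze/{source_name}/{table_name}/"
        f"year={d:%Y}/month={d:%m}/day={d:%d}/"
    )

prefix = bronze_s3_prefix("acme", "stripe", "charges", date(2024, 3, 5))
# -> s3://acme-datalake/bronze/stripe/charges/year=2024/month=03/day=05/
```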
Silver Layer — Cleaned and Conformed
Calabi Transform (SQL-based transformation) reads from Bronze and produces the Silver layer. Silver models are where data quality guarantees begin.
Silver models perform:
- Type casting — parse source strings into proper timestamps, decimals, and booleans
- Deduplication — apply row-number or watermark logic to keep the latest record per primary key
- Null handling — apply default values or flag records with quality classification columns
- Standardisation — normalise phone numbers, country codes, currency codes, and date formats to canonical forms
- Entity resolution — join related Bronze tables into conformed, denormalised entities
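The deduplication step — keep the latest record per primary key — is typically a `ROW_NUMBER()` pattern in SQL. A minimal Python equivalent, purely for illustration:

```python
def deduplicate_latest(rows, key="customer_id", watermark="_calabi_extracted_at"):
    """Keep only the most recently extracted row per primary key —
    the same effect as ROW_NUMBER() OVER (PARTITION BY key
    ORDER BY watermark DESC) = 1 in SQL."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[watermark] > latest[k][watermark]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"customer_id": 1, "email": "old@x.com", "_calabi_extracted_at": "2024-01-01"},
    {"customer_id": 1, "email": "new@x.com", "_calabi_extracted_at": "2024-02-01"},
    {"customer_id": 2, "email": "b@x.com",   "_calabi_extracted_at": "2024-01-15"},
]
silver = deduplicate_latest(rows)  # two rows; customer 1 keeps the newer email
```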
Storage location convention:
```
s3://{tenant}-datalake/silver/{domain}/{entity}/
```
Or within your warehouse:
```
schema: {domain}_clean -- e.g., commerce_clean, finance_clean, product_clean
tables: {entity_name}  -- e.g., orders, customers, products, sessions
```
Every Silver model must carry data quality tests:
```yaml
# calabi-transform/models/silver/commerce_clean/customers.yml
models:
  - name: customers
    description: Deduplicated, type-cast customer records from all ingestion sources.
    columns:
      - name: customer_id
        description: Surrogate key — unique per customer across all source systems
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null
      - name: created_at
        description: Parsed timestamp from source; UTC normalised
        tests:
          - not_null
```
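For intuition, the `not_null` and `unique` tests above amount to the following checks — an illustrative Python sketch, not how Calabi Transform actually executes tests:

```python
def run_tests(rows, tests):
    """Evaluate the two most common column tests from the YAML above:
    not_null and unique. Returns a list of failure messages."""
    failures = []
    for column, checks in tests.items():
        values = [r.get(column) for r in rows]
        if "not_null" in checks and any(v is None for v in values):
            failures.append(f"{column}: not_null failed")
        if "unique" in checks:
            non_null = [v for v in values if v is not None]
            if len(non_null) != len(set(non_null)):
                failures.append(f"{column}: unique failed")
    return failures

rows = [
    {"customer_id": 1, "email": "a@x.com"},
    {"customer_id": 1, "email": None},  # duplicate key, null email
]
failures = run_tests(rows, {"customer_id": ["not_null", "unique"],
                            "email": ["not_null"]})
# Both violations are reported; a non-empty list would fail the pipeline gate.
```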
Gold Layer — Business-Ready Aggregates
Calabi Transform produces Gold layer models (data marts) from Silver. These are the tables that analysts, dashboards, and AI agents consume directly.
Gold models include:
- Fact tables — transactional data at the grain of business events (orders, payments, sessions)
- Dimension tables — slowly-changing descriptive attributes (dim_customer, dim_product, dim_date)
- Aggregate tables — pre-computed metrics for BI performance (daily_revenue, monthly_active_users)
- Wide / one-big-table models — fully denormalised tables optimised for direct BI consumption
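A pre-computed aggregate such as daily revenue is conceptually a group-by over the fact table. A minimal Python sketch for illustration (in practice Gold models are built in SQL by Calabi Transform):

```python
from collections import defaultdict

def agg_daily_revenue(fact_orders):
    """Pre-compute a daily revenue aggregate from order-grain fact rows."""
    totals = defaultdict(float)
    for order in fact_orders:
        totals[order["order_date"]] += order["amount"]
    # Sorted by date so the aggregate table is deterministic.
    return [{"order_date": d, "revenue": v} for d, v in sorted(totals.items())]

orders = [
    {"order_date": "2024-03-01", "amount": 100.0},
    {"order_date": "2024-03-01", "amount": 50.0},
    {"order_date": "2024-03-02", "amount": 75.0},
]
daily = agg_daily_revenue(orders)
```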
Storage location convention:
```
schema: {domain}_mart -- e.g., commerce_mart, finance_mart, product_mart
tables: fact_{name}   -- e.g., fact_orders, fact_sessions
        dim_{name}    -- e.g., dim_customer, dim_product, dim_date
        agg_{name}    -- e.g., agg_daily_revenue, agg_mau
```
After Gold tables are built:
- Calabi Catalogue automatically indexes them and surfaces owners, tags, and data product definitions
- CalabiIQ connects to Gold schemas for all dashboard and SQL Lab queries
- Calabi AI Agent uses Gold tables as its primary semantic query layer for natural language questions
Full Medallion Flow Diagram
Orchestrating the Medallion Pipeline
Calabi Pipelines schedules and monitors the end-to-end medallion flow. A well-structured pipeline DAG enforces layer ordering and prevents bad data from propagating downstream:
```
medallion_commerce_daily
├── ingest_sources            ← Calabi Connect sync tasks
│   ├── sync_salesforce_accounts
│   ├── sync_stripe_charges
│   └── sync_postgres_orders
│
├── transform_silver          ← Calabi Transform — Silver models
│   ├── dbt_run  --select tag:silver
│   └── dbt_test --select tag:silver   ← Gate: fails here? Gold does not run.
│
└── transform_gold            ← Calabi Transform — Gold models
    ├── dbt_run  --select tag:gold
    └── dbt_test --select tag:gold
```
Each stage is gated by the tests of the previous stage. If Silver tests fail, Gold models do not run — protecting all downstream dashboards and AI queries from corrupted data.
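The gating logic can be sketched as follows. This is illustrative Python — `run_medallion` and its callbacks are hypothetical stand-ins, not the Calabi Pipelines API:

```python
def run_medallion(run_silver, test_silver, run_gold, test_gold):
    """Enforce layer ordering: Gold never runs if Silver tests fail."""
    run_silver()
    if not test_silver():
        # Gate: downstream dashboards keep the last good Gold build.
        raise RuntimeError("Silver tests failed — Gold skipped")
    run_gold()
    if not test_gold():
        raise RuntimeError("Gold tests failed")

calls = []
try:
    run_medallion(
        run_silver=lambda: calls.append("silver"),
        test_silver=lambda: False,  # simulate a failing Silver test gate
        run_gold=lambda: calls.append("gold"),
        test_gold=lambda: True,
    )
except RuntimeError:
    pass  # "gold" is never appended: the gate stopped the run
```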
Best Practices
Naming Conventions
| Layer | Schema pattern | Table pattern | Concrete example |
|---|---|---|---|
| Bronze | raw_{source} | {source_table} | raw_salesforce.accounts |
| Silver | {domain}_clean | {entity} | commerce_clean.customers |
| Gold — Facts | {domain}_mart | fact_{name} | commerce_mart.fact_orders |
| Gold — Dims | {domain}_mart | dim_{name} | commerce_mart.dim_customer |
| Gold — Aggs | {domain}_mart | agg_{name} | commerce_mart.agg_daily_revenue |
Schema Separation
Never mix layers in the same schema. Bronze, Silver, and Gold must live in separate schemas (or S3 prefixes). This gives you:
- Access control — grant analysts read access to `*_mart` schemas only
- Cost attribution — warehouse scan costs traceable per layer
- Lineage clarity — Calabi Catalogue can clearly trace cross-schema lineage
- Blast radius containment — a bad Silver run cannot corrupt Gold if schemas are separate
Testing at Each Layer
| Layer | Minimum required tests |
|---|---|
| Bronze | Row count > 0 after each sync; _calabi_extracted_at is not null |
| Silver | Primary key uniqueness; not-null on key business columns; referential integrity checks |
| Gold | Metric consistency (Gold revenue sum = Silver order sum); row count stability vs. prior run |
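The Gold-layer consistency test — Gold revenue must reconcile with the Silver order total — reduces to a comparison like this (illustrative sketch; the `tolerance` parameter is an assumption to absorb floating-point rounding):

```python
def check_metric_consistency(silver_orders, gold_daily_revenue, tolerance=0.01):
    """Cross-layer check: the Gold aggregate must reconcile
    with the Silver order total it was built from."""
    silver_total = sum(o["amount"] for o in silver_orders)
    gold_total = sum(r["revenue"] for r in gold_daily_revenue)
    return abs(silver_total - gold_total) <= tolerance

silver_orders = [{"amount": 100.0}, {"amount": 50.0}]
gold = [{"order_date": "2024-03-01", "revenue": 150.0}]
ok = check_metric_consistency(silver_orders, gold)  # totals reconcile
```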
Incremental vs. Full Refresh
| Layer | Recommended strategy | Rationale |
|---|---|---|
| Bronze | Always append (or CDC) | Preserves full history; never truncate raw data |
| Silver | Incremental on _calabi_extracted_at | Efficient; full refresh weekly or on schema change |
| Gold — Aggs | Full refresh | Simple, predictable; small table sizes |
| Gold — Facts | Incremental | Large tables; append new event partitions |
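Incremental Silver processing on _calabi_extracted_at amounts to a high-watermark filter: select only rows newer than the last run, then advance the watermark. A minimal sketch (`incremental_batch` is a hypothetical helper, not a Calabi API):

```python
def incremental_batch(bronze_rows, last_watermark):
    """Select only rows extracted since the previous Silver run,
    returning them together with the new high-watermark.
    ISO-8601 timestamps compare correctly as strings."""
    fresh = [r for r in bronze_rows if r["_calabi_extracted_at"] > last_watermark]
    new_watermark = max(
        (r["_calabi_extracted_at"] for r in fresh), default=last_watermark
    )
    return fresh, new_watermark

bronze_rows = [
    {"id": 1, "_calabi_extracted_at": "2024-03-01T00:00:00Z"},
    {"id": 2, "_calabi_extracted_at": "2024-03-02T00:00:00Z"},
]
fresh, wm = incremental_batch(bronze_rows, "2024-03-01T12:00:00Z")
# Only row 2 is newer than the previous watermark
```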
Lineage in Calabi Catalogue
Calabi Catalogue automatically traces lineage across all three layers. Every Calabi Transform run pushes lineage metadata so you can see exactly which Bronze tables feed a Gold mart — and which CalabiIQ dashboards or AI sessions depend on it.
Navigate to Calabi Catalogue → Data Assets → [any Gold table] → Lineage to see the full upstream and downstream graph, including column-level lineage where supported.
What's Next
- Data Management Overview — Full lifecycle: Connect, Transform, Pipelines, Quality, Catalogue
- Governance Overview — Apply owners, tags, and access policies to your medallion assets
- Platform Architecture — Understand the infrastructure that runs the medallion pipeline