Medallion Lakehouse Architecture
The medallion architecture is the recommended pattern for structuring data in Calabi. It organises your lakehouse into three progressively refined layers — Bronze, Silver, and Gold — each serving a distinct purpose and audience. This approach makes data pipelines easier to debug, improves data quality incrementally, and provides a clear contract between producer and consumer teams.
What is Medallion Architecture?
| Layer | Also called | What it contains | Primary audience |
|---|---|---|---|
| Bronze | Raw / Landing | Exact copies of source data — no transformations | Data Engineers |
| Silver | Cleaned / Conformed | Deduplicated, typed, standardised, joined | Data Engineers, Analysts |
| Gold | Business / Mart | Aggregated, business-ready metrics and dimensions | Analysts, Executives, AI |
The key principle is immutability at each layer boundary: Bronze is never modified after ingestion; Silver is rebuilt from Bronze; Gold is rebuilt from Silver. This means any layer can be fully reconstructed from upstream without data loss, making incident recovery straightforward and auditable.
How Calabi Implements Medallion Architecture
Bronze Layer — Raw Ingestion
Calabi Connect is responsible for landing raw data into the Bronze layer. Connectors pull from source systems (databases, SaaS APIs, files, event streams) and write records directly to S3 in their original structure, or into a raw_* schema in your warehouse.
Characteristics of Bronze data:
- Schema-on-read — data is stored exactly as received from the source; structure is interpreted downstream at transform time
- Append-only — new syncs append records; history is never overwritten
- Watermarked — each record includes a `_calabi_extracted_at` column set at ingestion time
- Unmodified — nulls, duplicates, and type inconsistencies are preserved intentionally
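The append-only, watermarked behaviour described above can be sketched in a few lines of Python. This is illustrative only — `watermark_records` is a hypothetical helper, not part of Calabi Connect:

```python
from datetime import datetime, timezone

def watermark_records(records):
    """Append-only Bronze write: stamp each raw record with an
    ingestion-time watermark, leaving source fields untouched."""
    extracted_at = datetime.now(timezone.utc).isoformat()
    return [{**r, "_calabi_extracted_at": extracted_at} for r in records]

# Raw source rows are preserved exactly — nulls and duplicates included.
raw = [{"id": 1, "email": None}, {"id": 1, "email": None}]
bronze = watermark_records(raw)
```

Note that the duplicate row and the null email survive into Bronze by design; cleaning happens in Silver.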
Storage location convention:
```
s3://{tenant}-datalake/bronze/{source_name}/{table_name}/year=YYYY/month=MM/day=DD/
```
Or within your warehouse:
```
-- Schema per source system
schema: raw_{source_name}   -- e.g., raw_salesforce, raw_stripe, raw_postgres
tables: {source_table_name} -- e.g., accounts, orders, events
```
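For illustration, the S3 prefix convention above can be generated mechanically. The `bronze_s3_prefix` helper below is hypothetical, not a Calabi API:

```python
from datetime import date

def bronze_s3_prefix(tenant, source_name, table_name, day=None):
    """Build a Bronze S3 prefix following the convention above,
    with Hive-style year/month/day partitions."""
    d = day or date.today()
    return (
        f"s3://{tenant}-datalake/bronze/{source_name}/{table_name}/"
        f"year={d:%Y}/month={d:%m}/day={d:%d}/"
    )

prefix = bronze_s3_prefix("acme", "stripe", "charges", date(2024, 3, 5))
# -> s3://acme-datalake/bronze/stripe/charges/year=2024/month=03/day=05/
```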
Silver Layer — Cleaned and Conformed
Calabi Transform (SQL-based transformation) reads from Bronze and produces the Silver layer. Silver models are where data quality guarantees begin.
Silver models perform:
- Type casting — parse source strings into proper timestamps, decimals, and booleans
- Deduplication — apply row-number or watermark logic to keep the latest record per primary key
- Null handling — apply default values or flag records with quality classification columns
- Standardisation — normalise phone numbers, country codes, currency codes, and date formats to canonical forms
- Entity resolution — join related Bronze tables into conformed, denormalised entities
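The deduplication step — keep the latest record per primary key — is typically a `ROW_NUMBER()` pattern in SQL. A minimal Python equivalent, purely for illustration:

```python
def deduplicate_latest(rows, key="customer_id", watermark="_calabi_extracted_at"):
    """Keep only the most recently extracted row per primary key —
    the same effect as ROW_NUMBER() OVER (PARTITION BY key
    ORDER BY watermark DESC) = 1 in SQL."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[watermark] > latest[k][watermark]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"customer_id": 1, "email": "old@x.com", "_calabi_extracted_at": "2024-01-01"},
    {"customer_id": 1, "email": "new@x.com", "_calabi_extracted_at": "2024-02-01"},
    {"customer_id": 2, "email": "b@x.com",   "_calabi_extracted_at": "2024-01-15"},
]
silver = deduplicate_latest(rows)  # two rows; customer 1 keeps the newer email
```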
Storage location convention:
```
s3://{tenant}-datalake/silver/{domain}/{entity}/
```
Or within your warehouse:
```
schema: {domain}_clean -- e.g., commerce_clean, finance_clean, product_clean
tables: {entity_name}  -- e.g., orders, customers, products, sessions
```
Every Silver model must carry data quality tests:
```yaml
# calabi-transform/models/silver/commerce_clean/customers.yml
models:
  - name: customers
    description: Deduplicated, type-cast customer records from all ingestion sources.
    columns:
      - name: customer_id
        description: Surrogate key — unique per customer across all source systems
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null
      - name: created_at
        description: Parsed timestamp from source; UTC normalised
        tests:
          - not_null
```
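For intuition, the `not_null` and `unique` tests above amount to the following checks — an illustrative Python sketch, not how Calabi Transform actually executes tests:

```python
def run_tests(rows, tests):
    """Evaluate the two most common column tests from the YAML above:
    not_null and unique. Returns a list of failure messages."""
    failures = []
    for column, checks in tests.items():
        values = [r.get(column) for r in rows]
        if "not_null" in checks and any(v is None for v in values):
            failures.append(f"{column}: not_null failed")
        if "unique" in checks:
            non_null = [v for v in values if v is not None]
            if len(non_null) != len(set(non_null)):
                failures.append(f"{column}: unique failed")
    return failures

rows = [
    {"customer_id": 1, "email": "a@x.com"},
    {"customer_id": 1, "email": None},  # duplicate key, null email
]
failures = run_tests(rows, {"customer_id": ["not_null", "unique"],
                            "email": ["not_null"]})
# Both violations are reported; a non-empty list would fail the pipeline gate.
```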
Gold Layer — Business-Ready Aggregates
Calabi Transform produces Gold layer models (data marts) from Silver. These are the tables that analysts, dashboards, and AI agents consume directly.
Gold models include:
- Fact tables — transactional data at the grain of business events (orders, payments, sessions)
- Dimension tables — slowly-changing descriptive attributes (dim_customer, dim_product, dim_date)
- Aggregate tables — pre-computed metrics for BI performance (daily_revenue, monthly_active_users)
- Wide / one-big-table models — fully denormalised tables optimised for direct BI consumption
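A pre-computed aggregate such as daily revenue is conceptually a group-by over the fact table. A minimal Python sketch for illustration (in practice Gold models are built in SQL by Calabi Transform):

```python
from collections import defaultdict

def agg_daily_revenue(fact_orders):
    """Pre-compute a daily revenue aggregate from order-grain fact rows."""
    totals = defaultdict(float)
    for order in fact_orders:
        totals[order["order_date"]] += order["amount"]
    # Sorted by date so the aggregate table is deterministic.
    return [{"order_date": d, "revenue": v} for d, v in sorted(totals.items())]

orders = [
    {"order_date": "2024-03-01", "amount": 100.0},
    {"order_date": "2024-03-01", "amount": 50.0},
    {"order_date": "2024-03-02", "amount": 75.0},
]
daily = agg_daily_revenue(orders)
```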
Storage location convention:
```
schema: {domain}_mart -- e.g., commerce_mart, finance_mart, product_mart
tables: fact_{name}   -- e.g., fact_orders, fact_sessions
        dim_{name}    -- e.g., dim_customer, dim_product, dim_date
        agg_{name}    -- e.g., agg_daily_revenue, agg_mau
```
After Gold tables are built:
- Calabi Catalogue automatically indexes them and surfaces owners, tags, and data product definitions
- CalabiIQ connects to Gold schemas for all dashboard and SQL Lab queries
- Calabi AI Agent uses Gold tables as its primary semantic query layer for natural language questions
Full Medallion Flow Diagram
Orchestrating the Medallion Pipeline
Calabi Pipelines schedules and monitors the end-to-end medallion flow. A well-structured pipeline DAG enforces layer ordering and prevents bad data from propagating downstream:
```
medallion_commerce_daily
├── ingest_sources            ← Calabi Connect sync tasks
│   ├── sync_salesforce_accounts
│   ├── sync_stripe_charges
│   └── sync_postgres_orders
│
├── transform_silver          ← Calabi Transform — Silver models
│   ├── dbt_run  --select tag:silver
│   └── dbt_test --select tag:silver   ← Gate: fails here? Gold does not run.
│
└── transform_gold            ← Calabi Transform — Gold models
    ├── dbt_run  --select tag:gold
    └── dbt_test --select tag:gold
```
Each stage is gated by the tests of the previous stage. If Silver tests fail, Gold models do not run — protecting all downstream dashboards and AI queries from corrupted data.
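The gating logic can be sketched as follows. This is illustrative Python — `run_medallion` and its callbacks are hypothetical stand-ins, not the Calabi Pipelines API:

```python
def run_medallion(run_silver, test_silver, run_gold, test_gold):
    """Enforce layer ordering: Gold never runs if Silver tests fail."""
    run_silver()
    if not test_silver():
        # Gate: downstream dashboards keep the last good Gold build.
        raise RuntimeError("Silver tests failed — Gold skipped")
    run_gold()
    if not test_gold():
        raise RuntimeError("Gold tests failed")

calls = []
try:
    run_medallion(
        run_silver=lambda: calls.append("silver"),
        test_silver=lambda: False,  # simulate a failing Silver test gate
        run_gold=lambda: calls.append("gold"),
        test_gold=lambda: True,
    )
except RuntimeError:
    pass  # "gold" is never appended: the gate stopped the run
```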
Best Practices
Naming Conventions
| Layer | Schema pattern | Table pattern | Concrete example |
|---|---|---|---|
| Bronze | raw_{source} | {source_table} | raw_salesforce.accounts |
| Silver | {domain}_clean | {entity} | commerce_clean.customers |
| Gold — Facts | {domain}_mart | fact_{name} | commerce_mart.fact_orders |
| Gold — Dims | {domain}_mart | dim_{name} | commerce_mart.dim_customer |
| Gold — Aggs | {domain}_mart | agg_{name} | commerce_mart.agg_daily_revenue |
Schema Separation
Never mix layers in the same schema. Bronze, Silver, and Gold must live in separate schemas (or S3 prefixes). This gives you:
- Access control — grant analysts read access to `*_mart` schemas only
- Cost attribution — warehouse scan costs traceable per layer
- Lineage clarity — Calabi Catalogue can clearly trace cross-schema lineage
- Blast radius containment — a bad Silver run cannot corrupt Gold if schemas are separate
Testing at Each Layer
| Layer | Minimum required tests |
|---|---|
| Bronze | Row count > 0 after each sync; _calabi_extracted_at is not null |
| Silver | Primary key uniqueness; not-null on key business columns; referential integrity checks |
| Gold | Metric consistency (Gold revenue sum = Silver order sum); row count stability vs. prior run |
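The Gold-layer consistency test — Gold revenue must reconcile with the Silver order total — reduces to a comparison like this (illustrative sketch; the `tolerance` parameter is an assumption to absorb floating-point rounding):

```python
def check_metric_consistency(silver_orders, gold_daily_revenue, tolerance=0.01):
    """Cross-layer check: the Gold aggregate must reconcile
    with the Silver order total it was built from."""
    silver_total = sum(o["amount"] for o in silver_orders)
    gold_total = sum(r["revenue"] for r in gold_daily_revenue)
    return abs(silver_total - gold_total) <= tolerance

silver_orders = [{"amount": 100.0}, {"amount": 50.0}]
gold = [{"order_date": "2024-03-01", "revenue": 150.0}]
ok = check_metric_consistency(silver_orders, gold)  # totals reconcile
```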
Incremental vs. Full Refresh
| Layer | Recommended strategy | Rationale |
|---|---|---|
| Bronze | Always append (or CDC) | Preserves full history; never truncate raw data |
| Silver | Incremental on _calabi_extracted_at | Efficient; full refresh weekly or on schema change |
| Gold — Aggs | Full refresh | Simple, predictable; small table sizes |
| Gold — Facts | Incremental | Large tables; append new event partitions |
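Incremental Silver processing on _calabi_extracted_at amounts to a high-watermark filter: select only rows newer than the last run, then advance the watermark. A minimal sketch (`incremental_batch` is a hypothetical helper, not a Calabi API):

```python
def incremental_batch(bronze_rows, last_watermark):
    """Select only rows extracted since the previous Silver run,
    returning them together with the new high-watermark.
    ISO-8601 timestamps compare correctly as strings."""
    fresh = [r for r in bronze_rows if r["_calabi_extracted_at"] > last_watermark]
    new_watermark = max(
        (r["_calabi_extracted_at"] for r in fresh), default=last_watermark
    )
    return fresh, new_watermark

bronze_rows = [
    {"id": 1, "_calabi_extracted_at": "2024-03-01T00:00:00Z"},
    {"id": 2, "_calabi_extracted_at": "2024-03-02T00:00:00Z"},
]
fresh, wm = incremental_batch(bronze_rows, "2024-03-01T12:00:00Z")
# Only row 2 is newer than the previous watermark
```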
Lineage in Calabi Catalogue
Calabi Catalogue automatically traces lineage across all three layers. Every Calabi Transform run pushes lineage metadata so you can see exactly which Bronze tables feed a Gold mart — and which CalabiIQ dashboards or AI sessions depend on it.
Navigate to Calabi Catalogue → Data Assets → [any Gold table] → Lineage to see the full upstream and downstream graph, including column-level lineage where supported.
What's Next
- Data Management Overview — Full lifecycle: Connect, Transform, Pipelines, Quality, Catalogue
- Governance Overview — Apply owners, tags, and access policies to your medallion assets
- Platform Architecture — Understand the infrastructure that runs the medallion pipeline