Medallion Lakehouse Architecture

Available on Starter, Professional, and Enterprise plans

The medallion architecture is the recommended pattern for structuring data in Calabi. It organises your lakehouse into three progressively refined layers — Bronze, Silver, and Gold — each serving a distinct purpose and audience. This approach makes data pipelines easier to debug, improves data quality incrementally, and provides a clear contract between producer and consumer teams.


What is Medallion Architecture?

| Layer  | Also called         | What it contains                                  | Primary audience            |
|--------|---------------------|---------------------------------------------------|-----------------------------|
| Bronze | Raw / Landing       | Exact copies of source data — no transformations  | Data Engineers              |
| Silver | Cleaned / Conformed | Deduplicated, typed, standardised, joined         | Data Engineers, Analysts    |
| Gold   | Business / Mart     | Aggregated, business-ready metrics and dimensions | Analysts, Executives, AI    |

The key principle is immutability at each layer boundary: Bronze is never modified after ingestion; Silver is rebuilt from Bronze; Gold is rebuilt from Silver. This means any layer can be fully reconstructed from upstream without data loss, making incident recovery straightforward and auditable.


How Calabi Implements Medallion Architecture

Bronze Layer — Raw Ingestion

Calabi Connect is responsible for landing raw data into the Bronze layer. Connectors pull from source systems (databases, SaaS APIs, files, event streams) and write records directly to S3 in their original structure, or into a raw_* schema in your warehouse.

Characteristics of Bronze data:

  • Schema-on-read — data is stored exactly as received from the source; structure is imposed only when the data is read
  • Append-only — new syncs append records; history is never overwritten
  • Watermarked — each record includes a _calabi_extracted_at column set at ingestion time
  • Unmodified — nulls, duplicates, and type inconsistencies are preserved intentionally

Storage location convention:

s3://{tenant}-datalake/bronze/{source_name}/{table_name}/year=YYYY/month=MM/day=DD/

Or within your warehouse:

-- Schema per source system
schema: raw_{source_name} -- e.g., raw_salesforce, raw_stripe, raw_postgres
tables: {source_table_name} -- e.g., accounts, orders, events
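Because Bronze is append-only, a single logical record may appear many times across syncs. One way to look at only the most recent sync is to filter on the watermark; in this sketch, raw_stripe.charges is an illustrative table name, while _calabi_extracted_at is the watermark column described above:

```sql
-- Inspect only the latest sync landed in Bronze.
-- raw_stripe.charges is a hypothetical source table; _calabi_extracted_at
-- is the watermark column added at ingestion time.
SELECT *
FROM raw_stripe.charges
WHERE _calabi_extracted_at = (
    SELECT MAX(_calabi_extracted_at)
    FROM raw_stripe.charges
);
```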

Silver Layer — Cleaned and Conformed

Calabi Transform (SQL-based transformation) reads from Bronze and produces the Silver layer. Silver models are where data quality guarantees begin.

Silver models perform:

  • Type casting — parse source strings into proper timestamps, decimals, and booleans
  • Deduplication — apply row-number or watermark logic to keep the latest record per primary key
  • Null handling — apply default values or flag records with quality classification columns
  • Standardisation — normalise phone numbers, country codes, currency codes, and date formats to canonical forms
  • Entity resolution — join related Bronze tables into conformed, denormalised entities
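A minimal Silver model combining several of these steps might look like the following sketch. The source table and column names are assumptions; the pattern is the standard row-number deduplication keyed on the Bronze watermark:

```sql
-- Sketch of a Silver model: cast, standardise, then deduplicate,
-- keeping the latest record per primary key.
WITH ranked AS (
    SELECT
        id                            AS customer_id,
        LOWER(TRIM(email))            AS email,       -- standardisation
        CAST(created_at AS TIMESTAMP) AS created_at,  -- type casting
        ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY _calabi_extracted_at DESC
        )                             AS rn
    FROM raw_salesforce.accounts
)
SELECT customer_id, email, created_at
FROM ranked
WHERE rn = 1;  -- deduplication: keep only the newest row per key
```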

Storage location convention:

s3://{tenant}-datalake/silver/{domain}/{entity}/

Or within your warehouse:

schema: {domain}_clean  -- e.g., commerce_clean, finance_clean, product_clean
tables: {entity_name}   -- e.g., orders, customers, products, sessions

Every Silver model must carry data quality tests:

# calabi-transform/models/silver/commerce_clean/customers.yml
models:
  - name: customers
    description: Deduplicated, type-cast customer records from all ingestion sources.
    columns:
      - name: customer_id
        description: Surrogate key — unique per customer across all source systems
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null
      - name: created_at
        description: Parsed timestamp from source; UTC normalised
        tests:
          - not_null
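Checks that do not fit the built-in tests can be expressed as standalone SQL assertions that fail when the query returns any rows. A sketch in the dbt singular-test style; the file path and the rule itself are assumptions:

```sql
-- calabi-transform/tests/silver/assert_no_future_created_at.sql (hypothetical path)
-- Fails (returns rows) when a customer record claims a creation time in the future.
SELECT customer_id, created_at
FROM {{ ref('customers') }}
WHERE created_at > CURRENT_TIMESTAMP
```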

Gold Layer — Business-Ready Aggregates

Calabi Transform produces Gold layer models (data marts) from Silver. These are the tables that analysts, dashboards, and AI agents consume directly.

Gold models include:

  • Fact tables — transactional data at the grain of business events (orders, payments, sessions)
  • Dimension tables — slowly-changing descriptive attributes (dim_customer, dim_product, dim_date)
  • Aggregate tables — pre-computed metrics for BI performance (daily_revenue, monthly_active_users)
  • Wide / one-big-table models — fully denormalised tables optimised for direct BI consumption

Storage location convention:

schema: {domain}_mart   -- e.g., commerce_mart, finance_mart, product_mart
tables: fact_{name}     -- e.g., fact_orders, fact_sessions
        dim_{name}      -- e.g., dim_customer, dim_product, dim_date
        agg_{name}      -- e.g., agg_daily_revenue, agg_mau
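As an illustration, an aggregate model such as agg_daily_revenue can be a small rollup over a Silver entity. Column names here are assumptions; the important property is that Gold reads from Silver, never directly from Bronze:

```sql
-- Sketch of a Gold aggregate: daily revenue rolled up from Silver orders.
SELECT
    CAST(ordered_at AS DATE)  AS order_date,
    currency_code,
    COUNT(DISTINCT order_id)  AS orders,
    SUM(order_total)          AS gross_revenue
FROM commerce_clean.orders
GROUP BY 1, 2
```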

After Gold tables are built:

  • Calabi Catalogue automatically indexes them and surfaces owners, tags, and data product definitions
  • CalabiIQ connects to Gold schemas for all dashboard and SQL Lab queries
  • Calabi AI Agent uses Gold tables as its primary semantic query layer for natural language questions

Full Medallion Flow Diagram

External Sources
  Databases:       Postgres · MySQL · Oracle · DynamoDB
  SaaS APIs:       Salesforce · HubSpot · Stripe · Zendesk
  Files:           CSV · Parquet · JSON · Google Sheets
  Event Streams:   Kinesis · Kafka · SNS · EventBridge
        │
        ▼  Calabi Connect — ELT Ingestion
Bronze Layer — raw, as-is from source
  S3 Data Lake:    bronze/ prefix · Parquet · partitioned by date
  Warehouse:       raw_* schemas — immutable copy of source data
  Full History:    no transformations — always reprocessable
        │
        ▼  Calabi Transform — Clean · Deduplicate · Cast
Silver Layer — cleaned & conformed
  S3:              silver/ prefix — validated · typed · deduplicated
  Warehouse:       *_clean schemas — staging models · conformed dimensions
  Quality Tests:   not_null · unique · referential integrity
  Lineage:         column-level lineage tracked from source → silver
        │
        ▼  Calabi Transform — Aggregate · Model · Test
Gold Layer — business-ready, trusted & documented
  Warehouse:       *_mart schemas — fact & dimension tables · aggregates
  Certified Assets: owner assigned · description · quality score
  Semantic Metrics: Revenue · MAU · Churn — canonical definitions
        │
        ▼  Serve to downstream consumers
Consumption Layer — analytics · AI · governance
  CalabiIQ:         dashboards · SQL Lab · scheduled reports
  Calabi AI Agent:  natural language queries over Gold tables
  Calabi Catalogue: discovery · lineage · glossary · data products
  Calabi ML:        training datasets from Gold · feature stores

Data flows from raw sources through progressively cleaner layers to trusted, business-ready assets.

Orchestrating the Medallion Pipeline

Calabi Pipelines schedules and monitors the end-to-end medallion flow. A well-structured pipeline DAG enforces layer ordering and prevents bad data from propagating downstream:

medallion_commerce_daily
├── ingest_sources                       ← Calabi Connect sync tasks
│   ├── sync_salesforce_accounts
│   ├── sync_stripe_charges
│   └── sync_postgres_orders
├── transform_silver                     ← Calabi Transform — Silver models
│   ├── dbt_run --select tag:silver
│   └── dbt_test --select tag:silver     ← Gate: fails here? Gold does not run.
└── transform_gold                       ← Calabi Transform — Gold models
    ├── dbt_run --select tag:gold
    └── dbt_test --select tag:gold

Each stage is gated by the tests of the previous stage. If Silver tests fail, Gold models do not run — protecting all downstream dashboards and AI queries from corrupted data.


Best Practices

Naming Conventions

| Layer        | Schema pattern  | Table pattern  | Concrete example                |
|--------------|-----------------|----------------|---------------------------------|
| Bronze       | raw_{source}    | {source_table} | raw_salesforce.accounts         |
| Silver       | {domain}_clean  | {entity}       | commerce_clean.customers        |
| Gold — Facts | {domain}_mart   | fact_{name}    | commerce_mart.fact_orders       |
| Gold — Dims  | {domain}_mart   | dim_{name}     | commerce_mart.dim_customer      |
| Gold — Aggs  | {domain}_mart   | agg_{name}     | commerce_mart.agg_daily_revenue |

Schema Separation

Never mix layers in the same schema. Bronze, Silver, and Gold must live in separate schemas (or S3 prefixes). This gives you:

  • Access control — grant analysts read access to *_mart schemas only
  • Cost attribution — warehouse scan costs traceable per layer
  • Lineage clarity — Calabi Catalogue can clearly trace cross-schema lineage
  • Blast radius containment — a bad Silver run cannot corrupt Gold if schemas are separate
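With layers in separate schemas, access control reduces to schema-level grants. A sketch using Snowflake-flavoured SQL; the role and schema names are examples, and the syntax varies by warehouse:

```sql
-- Analysts see Gold only; no grants are issued on raw_* or *_clean schemas.
GRANT USAGE  ON SCHEMA commerce_mart TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA commerce_mart TO ROLE analyst;
```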

Testing at Each Layer

| Layer  | Minimum required tests                                                                       |
|--------|----------------------------------------------------------------------------------------------|
| Bronze | Row count > 0 after each sync; _calabi_extracted_at is not null                              |
| Silver | Primary key uniqueness; not-null on key business columns; referential integrity checks       |
| Gold   | Metric consistency (Gold revenue sum = Silver order sum); row count stability vs. prior run  |
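The Gold metric-consistency check can be written as a reconciliation query that returns rows only on drift, so it plugs into the same fails-on-rows test convention. The table and column names below are assumptions:

```sql
-- Fails (returns a row) if Gold revenue drifts from the Silver source of truth.
WITH gold AS (
    SELECT SUM(gross_revenue) AS revenue
    FROM commerce_mart.agg_daily_revenue
),
silver AS (
    SELECT SUM(order_total) AS revenue
    FROM commerce_clean.orders
)
SELECT gold.revenue AS gold_revenue, silver.revenue AS silver_revenue
FROM gold CROSS JOIN silver
WHERE ABS(gold.revenue - silver.revenue) > 0.01
```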

Incremental vs. Full Refresh

| Layer        | Recommended strategy                 | Rationale                                        |
|--------------|--------------------------------------|--------------------------------------------------|
| Bronze       | Always append (or CDC)               | Preserves full history; never truncate raw data  |
| Silver       | Incremental on _calabi_extracted_at  | Efficient; full refresh weekly or on schema change |
| Gold — Aggs  | Full refresh                         | Simple, predictable; small table sizes           |
| Gold — Facts | Incremental                          | Large tables; append new event partitions        |
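An incremental model on the Bronze watermark typically follows the standard dbt pattern: on incremental runs, process only records newer than the model's current high-water mark. A sketch; the model and column names other than _calabi_extracted_at are assumptions:

```sql
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT
    order_id,
    customer_id,
    CAST(amount AS DECIMAL(18, 2)) AS amount,
    _calabi_extracted_at
FROM raw_postgres.orders
{% if is_incremental() %}
  -- Incremental run: only rows newer than what the model already holds.
  WHERE _calabi_extracted_at > (SELECT MAX(_calabi_extracted_at) FROM {{ this }})
{% endif %}
```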

Lineage in Calabi Catalogue

Calabi Catalogue automatically traces lineage across all three layers. Every Calabi Transform run pushes lineage metadata so you can see exactly which Bronze tables feed a Gold mart — and which CalabiIQ dashboards or AI sessions depend on it.

Navigate to Calabi Catalogue → Data Assets → [any Gold table] → Lineage to see the full upstream and downstream graph, including column-level lineage where supported.


What's Next