Sources & Raw Data
In Calabi Transform, a source is a declaration of a raw table that lives in your data warehouse but was not created by Calabi Transform itself. Sources are populated by Calabi Connect sync jobs, database replications, or other ingestion processes. By declaring sources explicitly in YAML, Calabi Transform can test them, check their freshness, and track their lineage all the way to your final mart models.
Why Declare Sources?
Without source declarations, you would reference raw tables with hardcoded identifiers:
-- Without sources — fragile
select * from raw.jaffle_shop.orders
With source declarations, you use the {{ source() }} function:
-- With sources — environment-aware, testable, lineage-tracked
select * from {{ source('jaffle_shop', 'orders') }}
Benefits:
- Lineage — Calabi Catalogue shows the full lineage from raw tables to marts.
- Freshness checks — Calabi Transform can alert you when a source table has not been updated recently.
- Data tests — Apply generic tests (
not_null,unique) directly to source columns. - Documentation — Add descriptions to sources and columns in one place.
Defining Sources in YAML
Sources are defined in .yml files alongside your models. By convention these are named _sources.yml or placed in a sources/ subdirectory.
Minimal Example
# models/staging/_sources.yml
version: 2
sources:
- name: jaffle_shop
schema: jaffle_shop # The schema in your warehouse where raw tables live
tables:
- name: orders
- name: customers
- name: payments
With this declaration, you can reference {{ source('jaffle_shop', 'orders') }} in any model.
Full Example with Documentation and Tests
version: 2
sources:
- name: jaffle_shop
description: >
Replica of the Jaffle Shop production database, loaded
nightly by Calabi Connect. Contains transactional order data.
database: raw # Optional: override the database
schema: jaffle_shop
loader: calabi_connect # Informational — who loaded this data
loaded_at_field: _etl_loaded_at # Column used for freshness checks
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: orders
description: "One row per customer order."
columns:
- name: id
description: "Primary key."
data_tests:
- unique
- not_null
- name: status
description: "Order lifecycle status."
data_tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
- name: customer_id
description: "Foreign key to customers."
data_tests:
- not_null
- relationships:
to: source('jaffle_shop', 'customers')
field: id
- name: customers
description: "One row per registered customer."
columns:
- name: id
data_tests:
- unique
- not_null
- name: email
data_tests:
- not_null
- name: payments
description: "One row per payment attempt."
columns:
- name: id
data_tests:
- unique
- not_null
- name: payment_method
data_tests:
- accepted_values:
values: ['credit_card', 'bank_transfer', 'gift_card', 'coupon']
Referencing Sources in Models
Use the {{ source(source_name, table_name) }} Jinja function:
-- models/staging/stg_orders.sql
select
id as order_id,
customer_id,
status,
amount / 100.0 as amount_dollars,
created_at
from {{ source('jaffle_shop', 'orders') }}
Calabi Transform resolves {{ source('jaffle_shop', 'orders') }} to the fully qualified table name (e.g., raw.jaffle_shop.orders) based on your project configuration.
Multi-Database Sources
If raw data lands in a different database from your transformation output, specify database on the source:
sources:
- name: salesforce_raw
database: raw_prod # Database containing raw data
schema: salesforce
tables:
- name: account
- name: opportunity
- name: contact
Source Freshness Checks
Source freshness checks verify that raw tables are being updated as expected. This catches pipeline failures upstream of Calabi Transform — for example, if a Calabi Connect sync job stopped running.
Configuring Freshness
Freshness is configured at the source or table level using warn_after and error_after thresholds. Calabi Transform checks the maximum value of loaded_at_field and compares it to the current time.
sources:
- name: salesforce_raw
loaded_at_field: _fivetran_synced # or _etl_loaded_at, updated_at, etc.
freshness:
warn_after: {count: 6, period: hour} # Warn if data is >6h old
error_after: {count: 12, period: hour} # Error if data is >12h old
tables:
- name: account
- name: opportunity
# Override freshness for a specific table
freshness:
warn_after: {count: 24, period: hour}
error_after: {count: 48, period: hour}
Valid period values: minute, hour, day.
Running Freshness Checks
# Check all sources
dbt source freshness
# Check sources for a specific source name
dbt source freshness --select source:salesforce_raw
# Output freshness results to a file
dbt source freshness --output target/sources.json
Freshness Results
| Result | Meaning |
|---|---|
pass | Data is within warn_after threshold |
warn | Data is older than warn_after but newer than error_after |
error | Data is older than error_after threshold |
Trigger dbt source freshness as the first step in your Calabi Pipelines DAG before running models. If any source returns error, fail the pipeline task — this prevents downstream models from running against stale data.
Source Tests
All standard Calabi Transform generic tests (unique, not_null, accepted_values, relationships) can be applied directly to source tables in the YAML definition. This allows you to catch data quality issues before they propagate into your models.
Run source tests with:
# Test all sources
dbt test --select source:*
# Test a specific source
dbt test --select source:jaffle_shop
Source test failures appear in the Calabi Transform run results and can be surfaced in Calabi Catalogue.
Staging Models from Sources
The recommended pattern is to create one staging model per source table. Staging models perform light cleaning (renaming, type casting, filtering) and nothing else. All joins and business logic belong in intermediate models.
-- models/staging/stg_customers.sql
{{ config(materialized='view') }}
select
id as customer_id,
first_name,
last_name,
first_name || ' ' || last_name as customer_name,
lower(email) as email,
cast(created_at as timestamp) as created_at
from {{ source('jaffle_shop', 'customers') }}
-- models/staging/stg_payments.sql
{{ config(materialized='view') }}
select
id as payment_id,
order_id,
payment_method,
amount / 100.0 as amount_dollars,
status
from {{ source('jaffle_shop', 'payments') }}
where status != 'failed'
Identifier Override
When the physical table name in the warehouse differs from the name you want to use in your models, use the identifier property:
sources:
- name: salesforce_raw
schema: salesforce
tables:
- name: account # Name used in {{ source('salesforce_raw', 'account') }}
identifier: Account__c # Actual table name in the warehouse
This is common with Salesforce sources where Calabi Connect may load tables with auto-generated suffix names.
Quoting
If your source table names contain special characters, spaces, or are case-sensitive, use the quoting config:
sources:
- name: legacy_system
schema: legacy
quoting:
identifier: true # Wrap table names in quotes
tables:
- name: "Order Details" # Quoted in generated SQL
Next Steps
- Data Tests — Apply generic and singular tests to sources and models
- Models — Build transformation logic on top of sources
- Running Transformations — Execute and schedule your Calabi Transform project