
Data Profiling

Professional+

Data Profiling automatically scans your tables and computes per-column statistics — null counts, distinct values, min/max ranges, value distributions, and more. Profiling results are surfaced on every asset page, empowering analysts to assess data shape and quality before writing a single query.


What Profiling Computes

Table-level statistics

| Statistic | Description |
| --- | --- |
| Row count | Total rows in the table at the time of profiling |
| Column count | Number of columns |
| Timestamp | When the profile was computed |
| Size | Approximate table size on disk (where available) |

Column-level statistics

| Statistic | Applies to | Description |
| --- | --- | --- |
| Null count | All types | Number of rows where the column is NULL |
| Null % | All types | Null count / total rows × 100 |
| Distinct count | All types | Approximate count of unique values |
| Distinct % | All types | Distinct count / total rows × 100 |
| Min | Numeric, Date | Smallest value |
| Max | Numeric, Date | Largest value |
| Mean | Numeric | Arithmetic average |
| Median | Numeric | 50th percentile value |
| Std deviation | Numeric | Standard deviation |
| Sum | Numeric | Column total |
| Histogram | Numeric | Value distribution across N buckets |
| Value frequency | String, low-cardinality | Top N most common values with counts |
| Min length | String | Shortest string length (characters) |
| Max length | String | Longest string length (characters) |
| Mean length | String | Average string length |
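For intuition, the column-level statistics above can be sketched in a few lines of Python. This is a simplified illustration, not the Catalogue's implementation — in particular, the real profiler computes distinct counts approximately rather than exactly:

```python
from statistics import mean, median, stdev

def profile_numeric_column(values):
    """Compute simplified column-level statistics for a numeric column.

    None represents NULL. Mirrors the statistics table above; a production
    profiler would use approximate distinct counting (e.g. HyperLogLog).
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_count = total - len(non_null)
    return {
        "null_count": null_count,
        "null_pct": round(null_count / total * 100, 1),
        "distinct_count": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "mean": round(mean(non_null), 1),
        "median": median(non_null),
        "stddev": round(stdev(non_null), 1),
        "sum": sum(non_null),
    }

stats = profile_numeric_column([23, 87, None, 100, 87, 54])
```

Note that Distinct % is computed against total rows, so a column with many NULLs can have a low distinct ratio even when every non-null value is unique.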

Profiling in the Calabi Catalogue

Profiling results appear in two places on every table's asset page:

  1. Schema tab — per-column statistics are shown inline next to each column name.
  2. Profiling tab — a full-page view of all column statistics with histograms and distribution charts.

Reading the inline profiling view on the Schema tab

Column: score                          INTEGER
Null %: 0.0% Distinct: 94 Min: 23 Max: 100 Mean: 72.4
[=======] Distribution ▶ click to expand histogram

Reading the Profiling tab

The Profiling tab shows a card per column. Numeric columns show a histogram. String columns show a value frequency bar chart.


Setting Up Profiling

Step 1 — Create a Profiler pipeline

  1. Navigate to the service that contains your database (e.g., Services → Database Services → Redshift Production).
  2. Click Ingestion tab → + Add ingestion.
  3. Select Profiler as the pipeline type.
  4. Configure which schemas and tables to profile:

```yaml
# Profiler scope configuration
schemaFilterPattern:
  includes:
    - psychometric_reporting
    - dim
tableFilterPattern:
  includes:
    - scores
    - candidates
    - exam_metadata
  excludes:
    - _tmp_*
    - stg_*
```

  5. Configure sampling (see below).
  6. Set the schedule.
  7. Click Deploy.

Step 2 — Configure sampling

Profiling on large tables can be slow and expensive. Use sampling to profile a representative subset:

| Sampling method | When to use |
| --- | --- |
| Percentage sample | Profile a random N% of rows |
| Row limit | Profile only the first N rows |
| Full table | Profile all rows |
```yaml
# Example: 10% sample
profilerConfiguration:
  profileSample: 10          # percentage
  profileSampleType: PERCENTAGE
```

```yaml
# Example: fixed 500,000-row limit
profilerConfiguration:
  profileSample: 500000
  profileSampleType: ROWS
```
Full-table profiling on large tables

Profiling a 500M-row table without sampling can take hours and consume significant compute. Always start with a 5–10% sample and validate results before enabling full-table profiling.
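The sampling trade-off is easy to see empirically. The sketch below (illustrative only — the profiler samples inside the database, not in Python) compares the mean computed over a full synthetic column against a 10% random sample:

```python
import random

random.seed(42)  # fixed seed so the comparison is reproducible
population = [random.gauss(72, 15) for _ in range(100_000)]

# 10% random sample, analogous to profileSampleType: PERCENTAGE
sample = random.sample(population, k=len(population) // 10)

full_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

# The sampled estimate lands very close to the full-table value
# while scanning one tenth of the rows
print(f"full={full_mean:.1f} sample={sample_mean:.1f}")
```

For summary statistics like mean, null %, and value frequency, a 5–10% sample is usually indistinguishable from the full scan; exact extremes (true min/max) are the main thing a sample can miss.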


Scheduling Profiling

| Data change frequency | Recommended profiling schedule |
| --- | --- |
| Hourly / real-time data | Daily (profiling lag is acceptable) |
| Daily ETL | After the ETL window completes |
| Weekly batch | Weekly |
| Reference / lookup tables | Monthly or on schema change |

Configure the schedule in the Profiler pipeline:

Cron: 0 7 * * *    # Daily at 7 AM UTC, after overnight ETL

Excluding Columns from Profiling

Some columns should not be profiled — for example, PII columns (to avoid logging sensitive value samples), or columns with very high cardinality (free-text, UUIDs) that produce expensive distinct-count computations.

```yaml
# Exclude specific columns from profiling
columnFilterPattern:
  excludes:
    - email_address
    - full_name
    - national_id
    - raw_response_json
```

Or from the UI:

  1. Open the table's asset page → Schema tab.
  2. Click a column.
  3. Toggle Exclude from profiling → Save.

Column Value Distribution (Histogram)

For numeric columns, profiling computes a histogram with configurable bucket counts:

score distribution (100 rows sample):
0–20 | ▌ 2.1%
21–40 | ██▌ 8.4%
41–60 | █████ 16.9%
61–80 | ████████ 28.3%
81–100 | ████████████ 44.3%

This immediately reveals:

  • Whether the data is normally distributed, skewed, or bimodal.
  • Whether the full expected range is represented.
  • Whether outlier values exist at the extremes.
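The bucketing behind a histogram like the one above can be sketched as follows. This is an illustrative equal-width bucketing with fixed edges, not the profiler's exact algorithm (which chooses bucket edges from the observed range):

```python
from collections import Counter

def histogram(values, bucket_count=5, lo=0, hi=100):
    """Assign numeric values to equal-width buckets over [lo, hi],
    returning (label, percentage) pairs like the score distribution above."""
    width = (hi - lo) / bucket_count
    counts = Counter(
        min(int((v - lo) / width), bucket_count - 1)  # clamp hi into last bucket
        for v in values
    )
    total = len(values)
    return [
        (f"{int(lo + i * width)}-{int(lo + (i + 1) * width)}",
         round(counts.get(i, 0) / total * 100, 1))
        for i in range(bucket_count)
    ]

buckets = histogram([12, 35, 55, 61, 75, 78, 82, 90, 95, 100])
```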

Value Frequency (String Columns)

For string and low-cardinality columns, profiling shows the top N most common values:

region value frequency:
EMEA | ██████████████ 42.1%
AMER | ████████████ 38.7%
APAC | ████ 15.3%
OTHER | █ 3.9%

This instantly reveals:

  • Expected vs unexpected values.
  • Data quality issues (e.g., Emea and EMEA as separate values indicating inconsistent casing).
  • Distribution bias.
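Both the top-N computation and the casing check described above are straightforward to sketch; the snippet below is an illustration using a hypothetical region column, not the Catalogue's implementation:

```python
from collections import Counter

regions = ["EMEA", "AMER", "EMEA", "Emea", "APAC", "AMER", "EMEA", "AMER"]

# Top-N most common values, as in the value-frequency chart above
freq = Counter(regions).most_common(3)

# Casing check: distinct raw values that collapse to the same
# upper-cased form (e.g. "Emea" vs "EMEA") signal inconsistent entry
distinct = set(regions)
collisions = {
    u for u in {v.upper() for v in distinct}
    if sum(1 for v in distinct if v.upper() == u) > 1
}
```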

Profiling Results in Asset Pages

Finding profiling stats on the Schema tab

  1. Open any catalogued table.
  2. Click the Schema tab.
  3. Each column shows:
    • Null % — displayed as a small bar filling based on the null percentage.
    • Distinct % — distinct value ratio.
    • Quick min/max for numerics, or character length stats for strings.
  4. Click a column to expand the full profiling card with histogram.

Profiling tab

For a full-page view of all column profiles:

  1. Open the table.
  2. Click the Profiling tab (if not visible, it may be under ··· → Profiling).

The Profiling tab also shows the profile history — how statistics have changed across profile runs — which helps detect data drift.


Data Drift Detection

When profiling runs on a schedule, you can track how column statistics change over time.

| Metric to watch | Drift signal |
| --- | --- |
| Null % | Sudden spike in nulls → upstream pipeline failure |
| Distinct count | Unexplained growth → possible data duplication |
| Min / Max | New extreme value → out-of-range data from upstream |
| Mean | Shift in average → segment mix change or data error |
| Row count | Unexpected drop → data loss; unexpected spike → double-loading |

The Profiling tab shows trend lines for each statistic across historical profile runs. Combined with Data Quality alerts, this gives you early warning of pipeline issues before they affect downstream reporting.
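A drift check of this kind amounts to comparing two profile snapshots against thresholds. The sketch below is illustrative — the field names and thresholds are assumptions, not the Catalogue's alerting API:

```python
def detect_drift(previous, current, null_pct_jump=5.0, row_drop_ratio=0.5):
    """Compare two profile runs and flag drift signals from the table above.

    previous/current: dicts of profile statistics for one column/table.
    Thresholds are illustrative and should be tuned per table.
    """
    alerts = []
    if current["null_pct"] - previous["null_pct"] > null_pct_jump:
        alerts.append("null % spike: possible upstream pipeline failure")
    if current["row_count"] < previous["row_count"] * row_drop_ratio:
        alerts.append("row count drop: possible data loss")
    if current["max"] > previous["max"]:
        alerts.append("new max value: out-of-range data from upstream")
    return alerts

alerts = detect_drift(
    {"null_pct": 0.0, "row_count": 1_000_000, "max": 100},
    {"null_pct": 12.5, "row_count": 980_000, "max": 100},
)
```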


Profiling and PII

When PII-tagged columns are included in profiling scope, the Catalogue applies protective measures:

  • Value frequency samples for PII columns are suppressed in the UI (shown as ***).
  • Profiling statistics (null %, distinct count, min/max) are still computed and shown for schema understanding.
  • Actual row samples from profiling are never stored.

To completely exclude a PII column from all profiling:

```yaml
columnFilterPattern:
  excludes:
    - email_address
    - full_name
```

Profiling Metrics Reference

| Metric | Numeric | String | Date | Boolean |
| --- | --- | --- | --- | --- |
| Null count | Yes | Yes | Yes | Yes |
| Null % | Yes | Yes | Yes | Yes |
| Distinct count | Yes | Yes | Yes | Yes |
| Min | Yes | | Yes | |
| Max | Yes | | Yes | |
| Mean | Yes | | | |
| Median | Yes | | | |
| Std deviation | Yes | | | |
| Sum | Yes | | | |
| Histogram | Yes | | Yes | |
| Value frequency | | Yes | | Yes |
| Min length | | Yes | | |
| Max length | | Yes | | |

Troubleshooting Profiling

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| Profile job times out | Table too large, no sampling configured | Add profileSample: 10 (10%) |
| Null % always shows 0 even for known-null columns | Column excluded from profiling filter | Check columnFilterPattern config |
| Histogram shows only one bucket | All values are the same | Expected: data has no variance |
| Distinct count very high for string column | Free-text or UUID column | Exclude from profiling or accept result |
| Profile not showing on asset page | Pipeline has not run successfully | Check ingestion pipeline logs |
| Old profile data showing | Schedule not configured or pipeline not running | Verify cron schedule in pipeline config |