Data Profiling
Data Profiling automatically scans your tables and computes per-column statistics — null counts, distinct values, min/max ranges, value distributions, and more. Profiling results are surfaced on every asset page, empowering analysts to assess data shape and quality before writing a single query.
What Profiling Computes
Table-level statistics
| Statistic | Description |
|---|---|
| Row count | Total rows in the table at the time of profiling |
| Column count | Number of columns |
| Timestamp | When the profile was computed |
| Size | Approximate table size on disk (where available) |
Column-level statistics
| Statistic | Applies to | Description |
|---|---|---|
| Null count | All types | Number of rows where the column is NULL |
| Null % | All types | Null count / total rows × 100 |
| Distinct count | All types | Approximate count of unique values |
| Distinct % | All types | Distinct count / total rows × 100 |
| Min | Numeric, Date | Smallest value |
| Max | Numeric, Date | Largest value |
| Mean | Numeric | Arithmetic average |
| Median | Numeric | 50th percentile value |
| Std deviation | Numeric | Standard deviation |
| Sum | Numeric | Column total |
| Histogram | Numeric | Value distribution across N buckets |
| Value frequency | String, Low-cardinality | Top N most common values with counts |
| Min length | String | Shortest string length (characters) |
| Max length | String | Longest string length (characters) |
| Mean length | String | Average string length |
Profiling in the Calabi Catalogue
Profiling results appear in two places on every table's asset page:
- Schema tab — per-column statistics are shown inline next to each column name.
- Profiling tab — a full-page view of all column statistics with histograms and distribution charts.
Reading the Schema tab profiling inline view
Column: score INTEGER
Null %: 0.0% Distinct: 94 Min: 23 Max: 100 Mean: 72.4
[=======] Distribution ▶ click to expand histogram
Reading the Profiling tab
The Profiling tab shows a card per column. Numeric columns show a histogram. String columns show a value frequency bar chart.
Setting Up Profiling
Step 1 — Create a Profiler pipeline
- Navigate to the service that contains your database (e.g., Services → Database Services →
Redshift Production). - Click Ingestion tab → + Add ingestion.
- Select Profiler as the pipeline type.
- Configure which schemas and tables to profile:
# Profiler scope configuration
schemaFilterPattern:
includes:
- psychometric_reporting
- dim
tableFilterPattern:
includes:
- scores
- candidates
- exam_metadata
excludes:
- _tmp_*
- stg_*
- Configure sampling (see below).
- Set the schedule.
- Click Deploy.
Step 2 — Configure sampling
Profiling on large tables can be slow and expensive. Use sampling to profile a representative subset:
| Sampling method | When to use |
|---|---|
| Percentage sample | Profile a random N% of rows |
| Row limit | Profile only the first N rows |
| Full table | Profile all rows |
# Example: 10% sample
profilerConfiguration:
profileSample: 10 # percentage
profileSampleType: PERCENTAGE
# Example: fixed 500,000 row limit
profilerConfiguration:
profileSample: 500000
profileSampleType: ROWS
Profiling a 500M-row table without sampling can take hours and consume significant compute. Always start with a 5–10% sample and validate results before enabling full-table profiling.
Scheduling Profiling
| Data change frequency | Recommended profiling schedule |
|---|---|
| Hourly / real-time data | Daily (profiling lag is acceptable) |
| Daily ETL | After the ETL window completes |
| Weekly batch | Weekly |
| Reference / lookup tables | Monthly or on schema change |
Configure the schedule in the Profiler pipeline:
Cron: 0 7 * * * # Daily at 7 AM UTC, after overnight ETL
Excluding Columns from Profiling
Some columns should not be profiled — for example, PII columns (to avoid logging sensitive value samples), or columns with very high cardinality (free-text, UUIDs) that produce expensive distinct-count computations.
# Exclude specific columns from profiling
columnFilterPattern:
excludes:
- email_address
- full_name
- national_id
- raw_response_json
Or from the UI:
- Open the table's asset page → Schema tab.
- Click a column.
- Toggle Exclude from profiling → Save.
Column Value Distribution (Histogram)
For numeric columns, profiling computes a histogram with configurable bucket counts:
score distribution (100 rows sample):
0–20 | ▌ 2.1%
21–40 | ██▌ 8.4%
41–60 | █████ 16.9%
61–80 | ████████ 28.3%
81–100 | ████████████ 44.3%
This immediately reveals:
- Whether the data is normally distributed, skewed, or bimodal.
- Whether the full expected range is represented.
- Whether outlier values exist at the extremes.
Value Frequency (String Columns)
For string and low-cardinality columns, profiling shows the top N most common values:
region value frequency:
EMEA | ██████████████ 42.1%
AMER | ████████████ 38.7%
APAC | ████ 15.3%
OTHER | █ 3.9%
This instantly reveals:
- Expected vs unexpected values.
- Data quality issues (e.g.,
EmeaandEMEAas separate values indicating inconsistent casing). - Distribution bias.
Profiling Results in Asset Pages
Finding profiling stats on the Schema tab
- Open any catalogued table.
- Click the Schema tab.
- Each column shows:
- Null % — displayed as a small bar filling based on the null percentage.
- Distinct % — distinct value ratio.
- Quick min/max for numerics, or character length stats for strings.
- Click a column to expand the full profiling card with histogram.
Profiling tab
For a full-page view of all column profiles:
- Open the table.
- Click the Profiling tab (if not visible, it may be under ··· → Profiling).
The Profiling tab also shows the profile history — how statistics have changed across profile runs — which helps detect data drift.
Data Drift Detection
When profiling runs on a schedule, you can track how column statistics change over time.
| Metric to watch | Drift signal |
|---|---|
| Null % | Sudden spike in nulls → upstream pipeline failure |
| Distinct count | Unexplained growth → possible data duplication |
| Min / Max | New extreme value → out-of-range data from upstream |
| Mean | Shift in average → segment mix change or data error |
| Row count | Unexpected drop → data loss; unexpected spike → double-loading |
The Profiling tab shows trend lines for each statistic across historical profile runs. Combined with Data Quality alerts, this gives you early warning of pipeline issues before they affect downstream reporting.
Profiling and PII
When PII-tagged columns are included in profiling scope, the Catalogue applies protective measures:
- Value frequency samples for PII columns are suppressed in the UI (shown as
***). - Profiling statistics (null %, distinct count, min/max) are still computed and shown for schema understanding.
- Actual row samples from profiling are never stored.
To completely exclude a PII column from all profiling:
columnFilterPattern:
excludes:
- email_address
- full_name
Profiling Metrics Reference
| Metric | Numeric | String | Date | Boolean |
|---|---|---|---|---|
| Null count | Yes | Yes | Yes | Yes |
| Null % | Yes | Yes | Yes | Yes |
| Distinct count | Yes | Yes | Yes | Yes |
| Min | Yes | — | Yes | — |
| Max | Yes | — | Yes | — |
| Mean | Yes | — | — | — |
| Median | Yes | — | — | — |
| Std deviation | Yes | — | — | — |
| Sum | Yes | — | — | — |
| Histogram | Yes | — | Yes | — |
| Value frequency | — | Yes | — | Yes |
| Min length | — | Yes | — | — |
| Max length | — | Yes | — | — |
Troubleshooting Profiling
| Symptom | Likely cause | Resolution |
|---|---|---|
| Profile job times out | Table too large, no sampling configured | Add profileSample: 10 (10%) |
| Null % always shows 0 even for known-null columns | Column excluded from profiling filter | Check columnFilterPattern config |
| Histogram shows only one bucket | All values are the same | Expected — data has no variance |
| Distinct count very high for string column | Free-text or UUID column | Exclude from profiling or accept result |
| Profile not showing on asset page | Pipeline has not run successfully | Check ingestion pipeline logs |
| Old profile data showing | Schedule not configured or pipeline not running | Verify cron schedule in pipeline config |