
Data Profiling

Professional+

Data Profiling automatically scans your tables and computes per-column statistics — null counts, distinct values, min/max ranges, value distributions, and more. Profiling results are surfaced on every asset page, empowering analysts to assess data shape and quality before writing a single query.


What Profiling Computes

Table-level statistics

| Statistic | Description |
| --- | --- |
| Row count | Total rows in the table at the time of profiling |
| Column count | Number of columns |
| Timestamp | When the profile was computed |
| Size | Approximate table size on disk (where available) |

Column-level statistics

| Statistic | Applies to | Description |
| --- | --- | --- |
| Null count | All types | Number of rows where the column is NULL |
| Null % | All types | Null count / total rows × 100 |
| Distinct count | All types | Approximate count of unique values |
| Distinct % | All types | Distinct count / total rows × 100 |
| Min | Numeric, Date | Smallest value |
| Max | Numeric, Date | Largest value |
| Mean | Numeric | Arithmetic average |
| Median | Numeric | 50th percentile value |
| Std deviation | Numeric | Standard deviation |
| Sum | Numeric | Column total |
| Histogram | Numeric | Value distribution across N buckets |
| Value frequency | String, low-cardinality | Top N most common values with counts |
| Min length | String | Shortest string length (characters) |
| Max length | String | Longest string length (characters) |
| Mean length | String | Average string length |
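For intuition, the column-level statistics above can be sketched in a few lines of Python. This is a simplified illustration, not the Catalogue's implementation — in particular, the real profiler computes distinct counts approximately rather than exactly:

```python
from statistics import mean, median, stdev

def profile_numeric_column(values):
    """Compute simplified column-level statistics for a numeric column.

    None represents NULL. Mirrors the statistics table above; a production
    profiler would use approximate distinct counting (e.g. HyperLogLog).
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_count = total - len(non_null)
    return {
        "null_count": null_count,
        "null_pct": round(null_count / total * 100, 1),
        "distinct_count": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "mean": round(mean(non_null), 1),
        "median": median(non_null),
        "stddev": round(stdev(non_null), 1),
        "sum": sum(non_null),
    }

stats = profile_numeric_column([23, 87, None, 100, 87, 54])
```

Note that Distinct % is computed against total rows, so a column with many NULLs can have a low distinct ratio even when every non-null value is unique.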

Profiling in the Calabi Catalogue

Profiling results appear in two places on every table's asset page:

  1. Schema tab — per-column statistics are shown inline next to each column name.
  2. Profiling tab — a full-page view of all column statistics with histograms and distribution charts.

Reading the inline profiling view on the Schema tab

Column: score                          INTEGER
Null %: 0.0% Distinct: 94 Min: 23 Max: 100 Mean: 72.4
[=======] Distribution ▶ click to expand histogram

Reading the Profiling tab

The Profiling tab shows a card per column. Numeric columns show a histogram. String columns show a value frequency bar chart.


Setting Up Profiling

Step 1 — Create a Profiler pipeline

  1. Navigate to the service that contains your database (e.g., Services → Database Services → Redshift Production).
  2. Click Ingestion tab → + Add ingestion.
  3. Select Profiler as the pipeline type.
  4. Configure which schemas and tables to profile:

```yaml
# Profiler scope configuration
schemaFilterPattern:
  includes:
    - psychometric_reporting
    - dim
tableFilterPattern:
  includes:
    - scores
    - candidates
    - exam_metadata
  excludes:
    - _tmp_*
    - stg_*
```

  5. Configure sampling (see below).
  6. Set the schedule.
  7. Click Deploy.

Step 2 — Configure sampling

Profiling on large tables can be slow and expensive. Use sampling to profile a representative subset:

| Sampling method | When to use |
| --- | --- |
| Percentage sample | Profile a random N% of rows |
| Row limit | Profile only the first N rows |
| Full table | Profile all rows |
```yaml
# Example: 10% sample
profilerConfiguration:
  profileSample: 10          # percentage
  profileSampleType: PERCENTAGE
```

```yaml
# Example: fixed 500,000-row limit
profilerConfiguration:
  profileSample: 500000
  profileSampleType: ROWS
```
Full-table profiling on large tables

Profiling a 500M-row table without sampling can take hours and consume significant compute. Always start with a 5–10% sample and validate results before enabling full-table profiling.
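The sampling trade-off is easy to see empirically. The sketch below (illustrative only — the profiler samples inside the database, not in Python) compares the mean computed over a full synthetic column against a 10% random sample:

```python
import random

random.seed(42)  # fixed seed so the comparison is reproducible
population = [random.gauss(72, 15) for _ in range(100_000)]

# 10% random sample, analogous to profileSampleType: PERCENTAGE
sample = random.sample(population, k=len(population) // 10)

full_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

# The sampled estimate lands very close to the full-table value
# while scanning one tenth of the rows
print(f"full={full_mean:.1f} sample={sample_mean:.1f}")
```

For summary statistics like mean, null %, and value frequency, a 5–10% sample is usually indistinguishable from the full scan; exact extremes (true min/max) are the main thing a sample can miss.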


Scheduling Profiling

| Data change frequency | Recommended profiling schedule |
| --- | --- |
| Hourly / real-time data | Daily (profiling lag is acceptable) |
| Daily ETL | After the ETL window completes |
| Weekly batch | Weekly |
| Reference / lookup tables | Monthly or on schema change |

Configure the schedule in the Profiler pipeline:

Cron: 0 7 * * *    # Daily at 7 AM UTC, after overnight ETL

Excluding Columns from Profiling

Some columns should not be profiled — for example, PII columns (to avoid logging sensitive value samples), or columns with very high cardinality (free-text, UUIDs) that produce expensive distinct-count computations.

```yaml
# Exclude specific columns from profiling
columnFilterPattern:
  excludes:
    - email_address
    - full_name
    - national_id
    - raw_response_json
```

Or from the UI:

  1. Open the table's asset page → Schema tab.
  2. Click a column.
  3. Toggle Exclude from profiling → Save.

Column Value Distribution (Histogram)

For numeric columns, profiling computes a histogram with configurable bucket counts:

score distribution (100 rows sample):
0–20 | ▌ 2.1%
21–40 | ██▌ 8.4%
41–60 | █████ 16.9%
61–80 | ████████ 28.3%
81–100 | ████████████ 44.3%

This immediately reveals:

  • Whether the data is normally distributed, skewed, or bimodal.
  • Whether the full expected range is represented.
  • Whether outlier values exist at the extremes.
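The bucketing behind a histogram like the one above can be sketched as follows. This is an illustrative equal-width bucketing with fixed edges, not the profiler's exact algorithm (which chooses bucket edges from the observed range):

```python
from collections import Counter

def histogram(values, bucket_count=5, lo=0, hi=100):
    """Assign numeric values to equal-width buckets over [lo, hi],
    returning (label, percentage) pairs like the score distribution above."""
    width = (hi - lo) / bucket_count
    counts = Counter(
        min(int((v - lo) / width), bucket_count - 1)  # clamp hi into last bucket
        for v in values
    )
    total = len(values)
    return [
        (f"{int(lo + i * width)}-{int(lo + (i + 1) * width)}",
         round(counts.get(i, 0) / total * 100, 1))
        for i in range(bucket_count)
    ]

buckets = histogram([12, 35, 55, 61, 75, 78, 82, 90, 95, 100])
```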

Value Frequency (String Columns)

For string and low-cardinality columns, profiling shows the top N most common values:

region value frequency:
EMEA | ██████████████ 42.1%
AMER | ████████████ 38.7%
APAC | ████ 15.3%
OTHER | █ 3.9%

This instantly reveals:

  • Expected vs unexpected values.
  • Data quality issues (e.g., Emea and EMEA as separate values indicating inconsistent casing).
  • Distribution bias.
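Both the top-N computation and the casing check described above are straightforward to sketch; the snippet below is an illustration using a hypothetical region column, not the Catalogue's implementation:

```python
from collections import Counter

regions = ["EMEA", "AMER", "EMEA", "Emea", "APAC", "AMER", "EMEA", "AMER"]

# Top-N most common values, as in the value-frequency chart above
freq = Counter(regions).most_common(3)

# Casing check: distinct raw values that collapse to the same
# upper-cased form (e.g. "Emea" vs "EMEA") signal inconsistent entry
distinct = set(regions)
collisions = {
    u for u in {v.upper() for v in distinct}
    if sum(1 for v in distinct if v.upper() == u) > 1
}
```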

Profiling Results in Asset Pages

Finding profiling stats on the Schema tab

  1. Open any catalogued table.
  2. Click the Schema tab.
  3. Each column shows:
    • Null % — displayed as a small bar filling based on the null percentage.
    • Distinct % — distinct value ratio.
    • Quick min/max for numerics, or character length stats for strings.
  4. Click a column to expand the full profiling card with histogram.

Profiling tab

For a full-page view of all column profiles:

  1. Open the table.
  2. Click the Profiling tab (if not visible, it may be under ··· → Profiling).

The Profiling tab also shows the profile history — how statistics have changed across profile runs — which helps detect data drift.


Data Drift Detection

When profiling runs on a schedule, you can track how column statistics change over time.

| Metric to watch | Drift signal |
| --- | --- |
| Null % | Sudden spike in nulls → upstream pipeline failure |
| Distinct count | Unexplained growth → possible data duplication |
| Min / Max | New extreme value → out-of-range data from upstream |
| Mean | Shift in average → segment mix change or data error |
| Row count | Unexpected drop → data loss; unexpected spike → double-loading |

The Profiling tab shows trend lines for each statistic across historical profile runs. Combined with Data Quality alerts, this gives you early warning of pipeline issues before they affect downstream reporting.
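A drift check of this kind amounts to comparing two profile snapshots against thresholds. The sketch below is illustrative — the field names and thresholds are assumptions, not the Catalogue's alerting API:

```python
def detect_drift(previous, current, null_pct_jump=5.0, row_drop_ratio=0.5):
    """Compare two profile runs and flag drift signals from the table above.

    previous/current: dicts of profile statistics for one column/table.
    Thresholds are illustrative and should be tuned per table.
    """
    alerts = []
    if current["null_pct"] - previous["null_pct"] > null_pct_jump:
        alerts.append("null % spike: possible upstream pipeline failure")
    if current["row_count"] < previous["row_count"] * row_drop_ratio:
        alerts.append("row count drop: possible data loss")
    if current["max"] > previous["max"]:
        alerts.append("new max value: out-of-range data from upstream")
    return alerts

alerts = detect_drift(
    {"null_pct": 0.0, "row_count": 1_000_000, "max": 100},
    {"null_pct": 12.5, "row_count": 980_000, "max": 100},
)
```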


Profiling and PII

When PII-tagged columns are included in profiling scope, the Catalogue applies protective measures:

  • Value frequency samples for PII columns are suppressed in the UI (shown as ***).
  • Profiling statistics (null %, distinct count, min/max) are still computed and shown for schema understanding.
  • Actual row samples from profiling are never stored.

To completely exclude a PII column from all profiling:

```yaml
columnFilterPattern:
  excludes:
    - email_address
    - full_name
```

Profiling Metrics Reference

| Metric | Numeric | String | Date | Boolean |
| --- | --- | --- | --- | --- |
| Null count | Yes | Yes | Yes | Yes |
| Null % | Yes | Yes | Yes | Yes |
| Distinct count | Yes | Yes | Yes | Yes |
| Min | Yes | | Yes | |
| Max | Yes | | Yes | |
| Mean | Yes | | | |
| Median | Yes | | | |
| Std deviation | Yes | | | |
| Sum | Yes | | | |
| Histogram | Yes | | Yes | |
| Value frequency | | Yes | | Yes |
| Min length | | Yes | | |
| Max length | | Yes | | |

Troubleshooting Profiling

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| Profile job times out | Table too large, no sampling configured | Add profileSample: 10 (10%) |
| Null % always shows 0 even for known-null columns | Column excluded from profiling filter | Check columnFilterPattern config |
| Histogram shows only one bucket | All values are the same | Expected: data has no variance |
| Distinct count very high for string column | Free-text or UUID column | Exclude from profiling or accept result |
| Profile not showing on asset page | Pipeline has not run successfully | Check ingestion pipeline logs |
| Old profile data showing | Schedule not configured or pipeline not running | Verify cron schedule in pipeline config |