Comparing Runs

Professional+

After running multiple experiments, Calabi ML provides tools to compare runs side-by-side, visualise metric trends, and identify the best-performing configuration. This page covers the comparison UI, chart types, programmatic comparison via the Python SDK, and how to export results for reporting.


Selecting Runs to Compare

In the Calabi ML UI

  1. Navigate to Calabi ML and open the experiment containing the runs you want to compare.
  2. The Runs table shows all runs with their logged parameters, metrics, and tags.
  3. Check the checkboxes next to two or more runs.
  4. Click Compare (appears in the toolbar once two or more runs are selected).

You can select up to 50 runs for simultaneous comparison.

Filtering Runs Before Comparing

Use the filter bar above the runs table to narrow down candidates before selecting:

Filter Type        Example                    Use Case
──────────────────────────────────────────────────────────────────────────────
Metric filter      val_auc > 0.85             Find runs that cleared a minimum threshold
Parameter filter   n_estimators = 300         Compare runs with a specific hyperparameter value
Tag filter         dataset_version = v3       Compare only runs trained on the same data
Status filter      FINISHED                   Exclude crashed or still-running runs
Date filter        Last 7 days                Focus on recent experiments
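
These UI filters map onto the SDK's filter_string DSL, which is covered in more detail below. A minimal sketch, assuming an experiment named churn-prediction-v2 and that dataset_version was logged as a tag:

import mlflow

mlflow.set_tracking_uri("https://calabi.<your-domain>/mlflow")

# Each UI filter corresponds to a clause in filter_string
runs = mlflow.search_runs(
    experiment_names=["churn-prediction-v2"],
    filter_string=(
        "metrics.val_auc > 0.85 "             # metric filter
        "AND params.n_estimators = '300' "    # parameter filter
        "AND tags.dataset_version = 'v3' "    # tag filter
        "AND status = 'FINISHED'"             # status filter
    ),
)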

The Comparison View

The comparison view presents runs in a structured layout with several panels.

Parameters Panel

A side-by-side table of parameter values across all selected runs. Cells where values differ are highlighted, making it easy to see what changed between runs.

Parameter          Run A          Run B          Run C
─────────────────────────────────────────────────────
n_estimators       100            300            500            ← different
max_depth          6              6              4              ← different
learning_rate      0.1            0.05           0.05           ← different
subsample          0.8            0.8            0.9            ← different
colsample_bytree   0.8            0.8            0.8
dataset_version    v3             v3             v3
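
The same diff can be reproduced programmatically from a search_runs DataFrame; a minimal sketch, assuming the tracking URI is configured as in the SDK examples below:

import mlflow

runs = mlflow.search_runs(experiment_names=["churn-prediction-v2"])

# Keep only parameter columns, then report the ones that vary across runs
params = runs.filter(like="params.")
differing = params.loc[:, params.nunique() > 1]
print(differing)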

Metrics Panel

A side-by-side table of all logged metrics across the selected runs. By default, Calabi ML shows the last logged value for each metric. For time-series metrics (logged with step), you can switch to show min, max, or mean.

Metric             Run A          Run B          Run C
─────────────────────────────────────────────────────
train_auc          0.921          0.957          0.953
val_auc            0.847          0.891          0.903          ← Run C wins
test_auc           0.841          0.886          0.899          ← Run C wins
test_f1            0.812          0.858          0.871
training_time_s    42.1           78.4           142.6
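
The same last/min/max/mean switch can be reproduced with the metric history API; a minimal sketch, assuming a run ID copied from the runs table (the ID shown here is a placeholder):

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="https://calabi.<your-domain>/mlflow")

# Full step-level history of one metric for one run
history = client.get_metric_history("<run-id>", "val_auc")
values = [m.value for m in history]

print(f"last={values[-1]:.4f}  min={min(values):.4f}  "
      f"max={max(values):.4f}  mean={sum(values) / len(values):.4f}")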

Metric Comparison Charts

Switch from the table view to chart view to see how metrics evolve across runs or over training steps.

Metric Comparison Bar Chart

Select any metric and Calabi ML renders a bar chart with one bar per run. Useful for comparing final metric values at a glance.
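
Outside the UI, a similar chart can be sketched with matplotlib from a search_runs DataFrame (matplotlib is an assumption here, not something Calabi ML requires):

import matplotlib.pyplot as plt
import mlflow

runs = mlflow.search_runs(experiment_names=["churn-prediction-v2"])

# One bar per run, labelled with a shortened run ID
plt.bar(runs["run_id"].str[:8], runs["metrics.val_auc"])
plt.ylabel("val_auc")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()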

Training Curves

For metrics logged with step (e.g., loss per epoch), Calabi ML overlays the training curves for all selected runs on one chart. This lets you compare:

  • Convergence speed (which model reaches peak performance earliest)
  • Overfitting patterns (gap between train and val curves)
  • Stability (smooth vs. noisy curves)

To view training curves:

  1. In the comparison view, click the Charts tab.
  2. Select a step-level metric (e.g., val_loss).
  3. Toggle individual runs on/off using the legend.
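
To reproduce the overlay outside the UI, pull each run's metric history and plot it; a minimal sketch, assuming matplotlib and placeholder run IDs:

import matplotlib.pyplot as plt
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="https://calabi.<your-domain>/mlflow")

# Overlay the val_loss curve of each selected run (placeholder run IDs)
for run_id in ["<run-id-a>", "<run-id-b>", "<run-id-c>"]:
    history = client.get_metric_history(run_id, "val_loss")
    plt.plot([m.step for m in history],
             [m.value for m in history],
             label=run_id[:8])

plt.xlabel("step")
plt.ylabel("val_loss")
plt.legend()
plt.show()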

Scatter Plot

Plot any metric vs. parameter as a scatter plot to identify relationships:

  • Do runs with higher n_estimators consistently achieve better val_auc?
  • Is there a max_depth sweet spot after which performance degrades?
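
Both questions can be checked quickly from a search_runs DataFrame; a minimal sketch, assuming matplotlib and the parameter names used in the examples below:

import matplotlib.pyplot as plt
import mlflow

runs = mlflow.search_runs(experiment_names=["churn-prediction-v2"])

# Parameters are logged as strings, so convert before plotting
plt.scatter(runs["params.n_estimators"].astype(float), runs["metrics.val_auc"])
plt.xlabel("n_estimators")
plt.ylabel("val_auc")
plt.show()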

Parallel Coordinates Plot

The parallel coordinates chart is the most powerful visualisation for hyperparameter analysis. It places each parameter and metric as a vertical axis and draws one line per run connecting all its values.

To open it:

  1. From the comparison view, click Parallel Coordinates.
  2. All logged parameters and metrics appear as axes.
  3. Drag axes to reorder them — put your primary metric (e.g., val_auc) on the far right.
  4. Drag a selection range on any axis to filter to runs within that range.

What to look for:

  • Lines that converge towards high val_auc on the right — trace them left to identify the winning hyperparameter combinations.
  • Axes where lines cluster together — that parameter doesn't matter much.
  • Axes where lines fan out dramatically — that parameter strongly influences performance.
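
If you want a similar chart outside the UI, plotly express can render one from a search_runs DataFrame; a minimal sketch, assuming plotly is installed and the parameters shown earlier were logged:

import mlflow
import plotly.express as px

runs = mlflow.search_runs(experiment_names=["churn-prediction-v2"])

# Axes must be numeric, and parameters are logged as strings
dims = ["params.n_estimators", "params.max_depth",
        "params.learning_rate", "metrics.val_auc"]
runs[dims[:3]] = runs[dims[:3]].astype(float)

fig = px.parallel_coordinates(runs, dimensions=dims, color="metrics.val_auc")
fig.show()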

Finding the Best Run

Via the UI

Sort the runs table by your primary metric:

  1. Click the column header of your target metric (e.g., val_auc).
  2. Click again to sort descending (highest value first).
  3. The top row is your best run.

Via the Python SDK

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="https://calabi.<your-domain>/mlflow")

# Find all finished runs in an experiment
experiment = client.get_experiment_by_name("churn-prediction-v2")

runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="status = 'FINISHED' AND metrics.val_auc > 0.85",
    order_by=["metrics.val_auc DESC"],
    max_results=10,
)

best_run = runs[0]
print(f"Best run ID: {best_run.info.run_id}")
print(f"Best val AUC: {best_run.data.metrics['val_auc']:.4f}")
print(f"Parameters: {best_run.data.params}")

Searching with Multiple Criteria

runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string=(
        "status = 'FINISHED' "
        "AND metrics.val_auc > 0.88 "
        "AND metrics.training_time_s < 120 "
        "AND params.dataset_version = 'v3'"
    ),
    order_by=["metrics.val_auc DESC", "metrics.training_time_s ASC"],
)

The filter DSL supports:

  • metrics.<name> — logged metric values
  • params.<name> — logged parameter values
  • tags.<name> — run tags
  • status — RUNNING, FINISHED, FAILED, KILLED
  • start_time — Unix timestamp in milliseconds
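
For example, a sketch combining tag and start-time clauses, reusing the client and experiment from the example above (the 7-day window is illustrative):

import time

# Runs on dataset v3 started within the last 7 days
week_ago_ms = int((time.time() - 7 * 24 * 3600) * 1000)
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string=f"tags.dataset_version = 'v3' AND start_time > {week_ago_ms}",
)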

Programmatic Comparison

Build a comparison DataFrame for custom analysis:

import mlflow
import pandas as pd

mlflow.set_tracking_uri("https://calabi.<your-domain>/mlflow")

runs = mlflow.search_runs(
    experiment_names=["churn-prediction-v2"],
    filter_string="status = 'FINISHED'",
)

# runs is a Pandas DataFrame with columns:
# run_id, status, start_time, end_time, metrics.*, params.*, tags.*

# Show top 10 runs by val_auc
top_runs = (
    runs[["run_id", "params.n_estimators", "params.max_depth",
          "params.learning_rate", "metrics.val_auc", "metrics.test_auc"]]
    .sort_values("metrics.val_auc", ascending=False)
    .head(10)
)
print(top_runs.to_string(index=False))

Computing Summary Statistics

print("val_auc statistics across all runs:")
print(runs["metrics.val_auc"].describe())

print(" Best parameters by val_auc quartile:")
print(
runs.groupby(pd.qcut(runs["metrics.val_auc"], 4))
[["params.n_estimators", "params.max_depth", "params.learning_rate"]]
.mean()
)

Downloading Comparison as CSV

From the UI

  1. In the experiment's runs table, select the runs you want to export.
  2. Click Download CSV (top-right of the runs table).
  3. The downloaded file contains all parameters, metrics, and tags for the selected runs.

Via the Python SDK

runs = mlflow.search_runs(experiment_names=["churn-prediction-v2"])
runs.to_csv("churn_experiment_results.csv", index=False)
print(f"Exported {len(runs)} runs to CSV")

Comparing Runs Across Experiments

When you have related experiments (e.g., different model architectures trained on the same data), compare them by passing multiple experiment names or IDs:

runs = mlflow.search_runs(
    experiment_names=[
        "churn/logistic_baseline",
        "churn/gradient_boosting",
        "churn/neural_network",
    ],
    filter_string="status = 'FINISHED'",
    order_by=["metrics.val_auc DESC"],
)

# Add experiment name column for clarity
runs["experiment_name"] = runs["experiment_id"].map(
{exp.experiment_id: exp.name
for exp in mlflow.search_experiments()}
)

print(runs[["experiment_name", "metrics.val_auc", "metrics.test_auc"]].head(20))

Acting on Comparison Results

Once you've identified the best run, the typical next steps are:

  1. Register the model — promote it to the Calabi ML Model Registry.
  2. Document the winning configuration — update the run description with a summary of why this run was selected.
  3. Tag the winning run — apply a status: champion tag for easy future discovery.

For example, using the client from earlier and the sorted runs DataFrame from the search examples above:

client = MlflowClient(tracking_uri="https://calabi.<your-domain>/mlflow")
best_run_id = runs.iloc[0]["run_id"]

# Tag the best run
client.set_tag(best_run_id, "status", "champion")
client.set_tag(best_run_id, "selected_by", "alice@example.com")
client.set_tag(best_run_id, "selection_date", "2026-04-06")

# Register the model
mlflow.register_model(
    model_uri=f"runs:/{best_run_id}/model",
    name="churn-predictor",
)

Next Steps

  • Model Registry — Promote the best run's model to staging and production
  • Logging Runs — Ensure you're logging the metrics needed for meaningful comparison
  • Experiments — Organise your runs into well-named experiments