Troubleshooting Pipelines

Professional+

This guide covers the most common failure patterns in Calabi Pipelines, how to read task logs effectively, and a structured debug checklist to work through when a pipeline stops working as expected.


Debug Checklist

When a pipeline fails, work through this checklist in order before diving into specific symptoms (a scripted version of the task-state checks follows the list):

  1. Is the pipeline unpaused? Check the toggle in the DAGs list: a grey toggle means the pipeline is paused, so no new runs are created and tasks in existing runs are not scheduled.
  2. What is the task state? Navigate to the Grid view and identify which task is failed, up_for_retry, or skipped unexpectedly.
  3. Read the full task log. Scroll to the bottom — most errors appear at the end, not in the middle.
  4. Check the attempt number. If the task retried, compare attempt logs to see if the error changed between attempts.
  5. Check external dependencies. Is the database reachable? Did the source API return an error? Did an upstream pipeline fail?
  6. Check resource usage. Is the task being killed by an OOM error or hitting a CPU/time limit?
  7. Check for recent code changes. Was the DAG file modified recently? Does the error appear after a deployment?
  8. Check the scheduler health. If multiple pipelines are failing simultaneously, the issue may be platform-level rather than DAG-specific.
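
If you prefer to script steps 2 through 5, the same checks can be run against the REST API. The sketch below assumes your Calabi Pipelines deployment exposes the standard Airflow stable REST API under /api/v1; the base URL, credentials, and DAG ID are placeholders.

# Minimal sketch: list problem task instances for the most recent run of a DAG.
# Assumes the stock Airflow stable REST API (/api/v1) is reachable; BASE_URL,
# AUTH, and the DAG ID are placeholders for your environment.
import requests

BASE_URL = "https://pipelines.example.com/api/v1"
AUTH = ("username", "password")
dag_id = "customer_orders"

# Fetch the most recent run first.
runs = requests.get(
    f"{BASE_URL}/dags/{dag_id}/dagRuns",
    params={"order_by": "-execution_date", "limit": 1},
    auth=AUTH,
).json()["dag_runs"]

if runs:
    run_id = runs[0]["dag_run_id"]
    tis = requests.get(
        f"{BASE_URL}/dags/{dag_id}/dagRuns/{run_id}/taskInstances",
        auth=AUTH,
    ).json()["task_instances"]
    for ti in tis:
        if ti["state"] in ("failed", "up_for_retry", "upstream_failed"):
            print(ti["task_id"], ti["state"], "attempt", ti["try_number"])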

Common Failure Patterns

1. Import Error / Syntax Error at Parse Time

Symptom: Pipeline does not appear in the DAGs list, or shows a red broken-pipe icon. No task instances are created.

Cause: The DAG Python file has a syntax error, a bad import, or a top-level exception.

How to diagnose:

# From the Calabi Pipelines worker container or local dev environment:
python /path/to/your_dag.py

Or check the Import Errors panel in the UI: DAGs → Import Errors (visible only when errors exist).
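
The same information is available programmatically from a worker shell, which helps when the panel is not visible. This is a minimal sketch and assumes the /opt/airflow/dags folder shown in the log example below; adjust the path for your deployment.

# Minimal sketch: parse the DAG folder and print any import errors,
# mirroring the Import Errors panel. The folder path is an assumption.
from airflow.models import DagBag

dagbag = DagBag(dag_folder="/opt/airflow/dags", include_examples=False)

for filepath, error in dagbag.import_errors.items():
    print(f"--- {filepath} ---")
    print(error)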

Log example:

[ERROR] Failed to import: /opt/airflow/dags/customer_orders.py
Traceback (most recent call last):
File "customer_orders.py", line 5, in <module>
from mypackage import helper_fn
ModuleNotFoundError: No module named 'mypackage'

Resolution:

  • Fix the import or syntax error.
  • Ensure the Python package is installed in the Calabi Pipelines image.
  • If using a custom package, verify it is listed in requirements.txt or the Calabi Pipelines image build.

2. Task Stuck in queued State

Symptom: Task instances show as queued (purple) for an unusually long time without transitioning to running.

Causes and resolutions (a sketch of the relevant DAG-level settings follows the table):

Cause | Resolution
--- | ---
All worker slots are occupied | Increase worker concurrency or scale up worker replicas
max_active_tasks limit reached | The DAG-level or global limit is capping parallelism; increase it or stagger pipelines
Executor is unhealthy | Check the executor logs (Celery / Kubernetes) for errors
Task pools are exhausted | Navigate to Admin → Pools and check the pool's used vs. total slots
Worker pod failed to start (Kubernetes) | Check Kubernetes pod events for image pull errors, resource quota limits, etc.
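
The DAG-level limits and pool assignment referenced in the table are set in the DAG file itself. A minimal sketch with illustrative values and placeholder names (customer_orders, heavy_pool):

# Minimal sketch: concurrency-related settings referenced above.
# The DAG name, pool name, and limits are placeholders.
import pendulum
from airflow.decorators import dag, task


@dag(
    schedule="@hourly",                                   # Airflow 2.4+ "schedule" parameter
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    catchup=False,
    max_active_tasks=8,   # cap on concurrently running tasks across runs of this DAG
    max_active_runs=1,    # cap on concurrently running DAG runs
)
def customer_orders():

    @task(pool="heavy_pool")  # consumes a slot from Admin → Pools
    def load():
        ...

    load()


customer_orders()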

3. Task Fails with Killed or Exit Code 137

Symptom: The task log ends abruptly with Killed, or the process exits with code 137 (128 + SIGKILL, which in practice usually means the process was killed for exceeding its memory limit).

Cause: The task exceeded the memory limit of its worker container and was killed by the OS or Kubernetes.

Log example:

[2026-04-04, 08:15:02 UTC] {subprocess.py:175} INFO - Output:
Processing batch 1 of 200...
Processing batch 2 of 200...
Killed
[2026-04-04, 08:15:02 UTC] {taskinstance.py:1780} ERROR - Task exited with return code 137

Resolutions:

  • Reduce memory usage: process data in smaller chunks, avoid loading full datasets into memory.
  • Use streaming reads (e.g., Pandas chunksize, database cursors); a chunked-read sketch follows the operator example below.
  • Request a higher memory limit for the task by using a KubernetesPodOperator with explicit resource requests:
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

heavy_task = KubernetesPodOperator(
    task_id="heavy_computation",
    image="my-registry/heavy-job:latest",
    container_resources=k8s.V1ResourceRequirements(
        requests={"memory": "4Gi", "cpu": "2"},
        limits={"memory": "8Gi", "cpu": "4"},
    ),
)
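
For the first two bullets, chunked processing often removes the need for a larger container. A minimal sketch with pandas; the file path, column name, and chunk size are illustrative:

# Minimal sketch: process a large CSV in bounded-memory chunks instead of
# loading the whole file at once. Path, column, and chunk size are illustrative.
import pandas as pd

total = 0.0
for chunk in pd.read_csv("/data/orders.csv", chunksize=100_000):
    # Aggregate (or write out) each chunk, then let it be garbage collected.
    total += chunk["amount"].sum()

print(f"Total order amount: {total}")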

4. Task Fails with AirflowTaskTimeout

Symptom: Task fails with a timeout error after a fixed duration.

Cause: The task exceeded execution_timeout set on the operator, or the DAGRun exceeded dagrun_timeout.

Log example:

airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 12345

Resolution:

  • Increase execution_timeout if the task legitimately needs more time:
from datetime import timedelta
from airflow.decorators import task

@task(execution_timeout=timedelta(hours=2))
def slow_process():
    ...
  • If the task hangs rather than just being slow, look for a deadlock or infinite loop in your code.
  • For database queries, check for missing indexes or runaway queries in the database slow-query log.

5. Zombie Tasks

Symptom: A task remains in running state for far longer than expected. The worker may have crashed or the process may have been killed without notifying the scheduler.

Cause: The worker process died (OOM, pod eviction, network partition) without setting the task's final state in the metadata database. The scheduler eventually detects the orphaned task and marks it as a zombie.

How to resolve:

  1. Navigate to Browse → Task Instances and filter by state running.
  2. Identify task instances running far longer than their normal duration.
  3. Click the task instance and select Clear (which resets the state to none so it can be re-queued).
  4. Alternatively, use the Mark Failed option if you want the retry logic to kick in.

Prevention:

  • Ensure workers have sufficient memory to avoid OOM kills.
  • Set an appropriate dagrun_timeout so the scheduler detects stale runs (see the sketch after this list).
  • Use health checks and liveness probes on worker pods in Kubernetes deployments.
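
A minimal sketch of the dagrun_timeout setting from the list above; the schedule, dates, and timeout are illustrative:

# Minimal sketch: give the whole run a deadline so stale runs are detected
# and failed by the scheduler instead of lingering. Values are illustrative.
from datetime import timedelta

import pendulum
from airflow.decorators import dag


@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    catchup=False,
    dagrun_timeout=timedelta(hours=4),  # fail the run if it is still going after 4 hours
)
def nightly_exports():
    ...


nightly_exports()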

6. Sensor Timeout / up_for_reschedule Loop

Symptom: A sensor task (e.g., S3KeySensor, HttpSensor) never completes and eventually fails with AirflowSensorTimeout.

Cause: The condition being polled never became true within the timeout window.

Log example:

airflow.exceptions.AirflowSensorTimeout: Snap. Time is OUT.

Resolution:

  • Verify that the thing being waited for actually arrives (check the source system).
  • Increase the timeout if the arrival time is variable:
from airflow.models import Variable
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_upload",
    bucket_name=Variable.get("s3_input_bucket"),
    bucket_key="data/{{ ds }}/orders.csv",
    aws_conn_id="aws_default",
    timeout=60 * 60 * 6,   # 6 hours
    poke_interval=300,     # check every 5 minutes
    mode="reschedule",     # release the worker slot between checks
)
  • Use mode="reschedule" (not mode="poke") to avoid holding a worker slot while waiting.

7. Connection / Credential Errors

Symptom: Task fails immediately with an authentication or connection error.

Log examples:

psycopg2.OperationalError: FATAL: password authentication failed for user "datauser"
botocore.exceptions.ClientError: An error occurred (403) when calling the GetObject operation: Forbidden

Resolution checklist:

  1. Navigate to Admin → Connections and verify the connection ID matches what is used in the DAG (postgres_conn_id, aws_conn_id, etc.).
  2. Test the connection using the Test button in the connection editor (a command-line alternative follows this checklist).
  3. Check that credentials have not expired (OAuth tokens, temporary AWS credentials).
  4. For database connections, verify network access: the Calabi Pipelines worker must be able to reach the database host and port.
  5. Check for recent password rotations or permission changes in the source system.
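
If the UI is unavailable, the connection lookup can also be done from a worker shell. A minimal sketch; postgres_default is a placeholder connection ID:

# Minimal sketch: confirm that a connection ID resolves and inspect its
# non-secret fields. "postgres_default" is a placeholder.
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection("postgres_default")
print(conn.conn_type, conn.host, conn.port, conn.schema, conn.login)

# For an end-to-end check, open a session through the matching hook, e.g.
# PostgresHook(postgres_conn_id="postgres_default").get_conn()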

8. Scheduler Not Creating Runs

Symptom: A pipeline with a scheduled interval is not producing new runs at the expected time.

Causes and checks (a DAG definition showing these scheduling settings follows the table):

Check | How to Verify
--- | ---
Pipeline is paused | Check the toggle in the DAGs list; it must be active (blue)
start_date is in the future | Calabi Pipelines will not create runs before start_date
end_date is set and has passed | The pipeline stops scheduling after end_date
catchup=False and no missed runs | With catchup=False, only the next scheduled run is created; no backfill
Scheduler process unhealthy | Check platform health in the admin panel or Kubernetes pod logs
DAG parse error | Broken DAG files prevent schedule evaluation
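
A minimal sketch showing where the scheduling settings from the table are defined in a DAG file; the cron expression and dates are illustrative:

# Minimal sketch: the scheduling settings referenced in the table above.
# Cron expression and dates are illustrative.
import pendulum
from airflow.decorators import dag


@dag(
    schedule="0 6 * * *",                                 # daily at 06:00 UTC
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),   # runs are only created after start_date
    end_date=None,                                        # no end date, so scheduling never stops
    catchup=False,                                        # do not backfill missed intervals
)
def daily_orders():
    ...


daily_orders()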

9. XCom / TaskFlow Data Transfer Failures

Symptom: A downstream task cannot read data returned by an upstream @task function. Error mentions None or missing XCom key.

Cause: The upstream task may have been skipped, failed, or returned None unexpectedly. XComs are stored in the metadata database and are limited in size (default 48 KB on standard installs).

Log example:

ValueError: XCom value for key 'return_value' from task 'compute_revenue' is None

Resolution:

  • Ensure the upstream task is genuinely succeeding and returning the expected value.
  • If transferring large datasets, do not use XComs: write to intermediate storage (S3, a database table) and pass only the reference (path, query, key) through XComs, as sketched after this list.
  • For large XComs, consider enabling the S3 XCom backend.
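
A minimal sketch of passing a reference instead of the dataset itself; the bucket and path are placeholders and the actual write is elided:

# Minimal sketch: return only a storage reference from the upstream task so
# the XCom stays small. Bucket and path are placeholders; the write is elided.
from airflow.decorators import task


@task
def compute_revenue(ds=None) -> str:
    # `ds` (the logical date) is injected into TaskFlow functions by Airflow.
    path = f"s3://example-bucket/revenue/{ds}/revenue.parquet"
    # ... write the dataframe to `path` here ...
    return path  # only this short string is stored as an XCom


@task
def publish_report(path: str) -> None:
    # The downstream task re-reads the data from storage using the reference.
    print(f"Loading revenue data from {path}")

# Wired inside a DAG as: publish_report(compute_revenue())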

Reading Task Logs Effectively

  1. Always start at the bottom — the final error and traceback appear at the end.
  2. Look for the first ERROR line — this is usually the root cause, before any cascade of subsequent errors.
  3. Check the attempt number — if the task retried, open each attempt's log to see if the error changed.
  4. Search for keywords — use browser Ctrl+F to search for ERROR, Exception, Traceback, or the name of an external service.
  5. Check timestamps — unusual gaps in timestamps can indicate the task was waiting (for a lock, a slow query, a rate limit).
  6. Look for SIGTERM / SIGKILL — these indicate the task was killed externally (OOM, timeout, pod eviction).

Getting Platform-Level Help

If your issue cannot be resolved at the DAG level:

  • Check the Calabi Pipelines Scheduler logs in the platform admin panel or Kubernetes pod logs (calabi-pipelines-scheduler pod).
  • Review the Calabi Pipelines Worker logs for executor-level errors.
  • For persistent platform issues, contact your Calabi platform administrator or raise a support ticket through the Calabi support portal.

Next Steps