Error Handling and Reliability
Build workflows that recover gracefully from failures and never lose data.
The Reliability Imperative
Production workflows must be reliable. A workflow that fails 5% of the time in production is broken — it just takes longer to notice. Every failure mode needs a handled path.
The Failure Taxonomy
- Transient failures — network timeouts, rate limits, temporary service unavailability. Fix: retry with backoff.
- Data failures — unexpected input format, missing required fields, type mismatches. Fix: validate inputs before processing, route invalid data to a review queue.
- Logic failures — your workflow has a bug. Fix: test coverage, monitoring, immediate alerting.
- Dependency failures — an upstream service is down. Fix: circuit breaker, degraded mode, alerting.
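For dependency failures, a circuit breaker stops hammering a downed upstream service and fails fast instead. Here is a minimal sketch; the class name, thresholds, and cooldown are illustrative, not from any specific library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: open the circuit after N
    consecutive failures, allow a trial call after a cooldown.
    Thresholds here are illustrative defaults."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # circuit open: fail fast so callers can use a degraded mode
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Callers catch the fast `RuntimeError` and serve a degraded response (cached data, a placeholder) rather than waiting on timeouts.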
Retry Logic
Implement exponential backoff for all external API calls:
- Attempt 1: immediate
- Attempt 2: wait 1s
- Attempt 3: wait 2s
- Attempt 4: wait 4s
- Attempt 5: fail permanently
Add jitter (random variation) to prevent thundering herd when many workflows fail simultaneously.
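The schedule above, with full jitter, can be sketched in a few lines. The function name and defaults are illustrative:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Retry fn with exponential backoff and full jitter.
    Delays grow as base_delay * 2^attempt; the final attempt
    fails permanently by re-raising the last exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: fail permanently
            delay = base_delay * (2 ** attempt)
            # full jitter: sleep a random fraction of the backoff window
            time.sleep(random.uniform(0, delay))
```

Full jitter (a uniform draw over the whole window) spreads retries out so that many workflows failing at once do not all retry at the same instant.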
The Dead Letter Queue
Any item that fails after all retries goes to a Dead Letter Queue (DLQ) — not the trash. The DLQ stores failed items with full context (input data, error, timestamp) for manual review.
Without a DLQ, failures are silent and data is lost. With a DLQ, failures are visible and recoverable.
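A DLQ sketch, assuming the queue is a simple in-memory list (in production it would be a durable queue or table); the function names and record fields are illustrative:

```python
import time
import traceback

def send_to_dlq(dlq, item, error):
    """Store a failed item with full context for manual review."""
    dlq.append({
        "input": item,
        "error": repr(error),
        "traceback": traceback.format_exc(),
        "timestamp": time.time(),
    })

def process_batch(items, handler, dlq):
    """Process each item; route failures to the DLQ instead of
    dropping them, so the batch keeps going and nothing is lost."""
    results = []
    for item in items:
        try:
            results.append(handler(item))
        except Exception as e:
            send_to_dlq(dlq, item, e)
    return results
```

Because each record carries the original input, a reviewer can fix the cause and replay the item straight from the DLQ.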
Alerting
Alert on:
- Any item entering the DLQ
- Workflow error rate > 1%
- Workflow execution time > 2x normal
- Workflow stopped executing (missed expected runs)
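The four conditions above reduce to a simple health check. This is a sketch; the `stats` keys and function name are assumptions, not a real monitoring API:

```python
def check_workflow_health(stats, baseline_duration):
    """Evaluate the alert conditions against run statistics.
    Expected stats keys (illustrative): dlq_count, errors, runs,
    duration, last_run_age, expected_interval (seconds)."""
    alerts = []
    if stats["dlq_count"] > 0:
        alerts.append("items in DLQ")
    if stats["runs"] and stats["errors"] / stats["runs"] > 0.01:
        alerts.append("error rate above 1%")
    if stats["duration"] > 2 * baseline_duration:
        alerts.append("execution time above 2x normal")
    if stats["last_run_age"] > stats["expected_interval"]:
        alerts.append("missed expected run")
    return alerts
```

Run a check like this on a schedule and page on any non-empty result.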
Errors that don't alert are errors you'll find out about from users.