Monitoring Agent Systems
Observability for agent fleets: what to track, how to alert, and how to debug.
The Observability Challenge
Traditional software observability focuses on latency, error rates, and throughput. Agent systems add a new dimension: quality. An agent can execute successfully (no errors, normal latency) while producing completely wrong results.
You need metrics for both operational health and output quality.
Key Metrics to Track
Operational:
- Task completion rate (% that reach a defined end state)
- Error rate by error type (tool failure, timeout, max iterations, etc.)
- Latency percentiles (p50, p95, p99)
- Token usage and cost per task type
- Queue depth (for async systems)
Quality:
- Eval score on sample of production outputs
- Human rating on sampled outputs
- Re-prompt rate (proxy for first-attempt quality)
- Task abandonment rate (user gave up)
Tracing Agent Execution
Every agent invocation should produce a trace: a timeline of every tool call, model call, and decision, with inputs and outputs at each step.
Use a tracing tool (Langfuse, LangSmith, or custom) to capture these traces. Good traces turn debugging from guesswork into inspection.
Alert Thresholds
| Metric | Warning | Critical |
|--------|---------|----------|
| Error rate | >2% | >10% |
| Latency p95 | >5s | >30s |
| Cost/task | 2x baseline | 5x baseline |
| Completion rate | <95% | <80% |
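The table above translates directly into a threshold check. Note that completion rate alerts when the value falls *below* its threshold, unlike the others. A sketch (the metric keys and threshold values mirror the table; treat the exact numbers as starting points, not standards):

```python
# direction "above": alert when value exceeds threshold.
# direction "below": alert when value drops under it.
THRESHOLDS = {
    "error_rate":      {"warning": 0.02, "critical": 0.10, "direction": "above"},
    "latency_p95_s":   {"warning": 5.0,  "critical": 30.0, "direction": "above"},
    "cost_multiple":   {"warning": 2.0,  "critical": 5.0,  "direction": "above"},
    "completion_rate": {"warning": 0.95, "critical": 0.80, "direction": "below"},
}


def alert_level(metric: str, value: float) -> str:
    t = THRESHOLDS[metric]
    if t["direction"] == "above":
        if value > t["critical"]:
            return "critical"
        if value > t["warning"]:
            return "warning"
    else:
        if value < t["critical"]:
            return "critical"
        if value < t["warning"]:
            return "warning"
    return "ok"
```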
The Debugging Loop
When an agent fails:

1. Pull the full trace for the failing task
2. Identify the exact step where things went wrong
3. Reproduce the failure in isolation
4. Fix the specific component
5. Run the full task again to verify
6. Add a regression test
Never debug agent failures by running the full system repeatedly — too slow and too noisy.
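The "reproduce in isolation, then add a regression test" steps can be sketched as a unit test that replays the exact inputs from the failing trace step against the single component that misbehaved. Everything here is hypothetical: a toy argument parser standing in for whatever component failed in your system:

```python
def parse_tool_args(raw: str) -> dict:
    """Hypothetical component under test: a tool-argument parser that
    previously crashed on empty input (fixed to return {})."""
    if not raw.strip():
        return {}
    return dict(pair.split("=", 1) for pair in raw.split(","))


def test_empty_args_regression():
    # Inputs copied verbatim from the failing trace step.
    assert parse_tool_args("") == {}


def test_normal_args_still_work():
    assert parse_tool_args("query=foo,limit=5") == {"query": "foo", "limit": "5"}
```

Running only these tests exercises the fixed component in milliseconds, which is exactly why this beats re-running the full agent loop to verify a fix.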