Monitoring Agent Systems
Observability for agent fleets: what to track, how to alert, and how to debug.
The Observability Challenge
Traditional software observability focuses on latency, error rates, and throughput. Agent systems add a new dimension: quality. An agent can execute successfully (no errors, normal latency) while producing completely wrong results.
You need metrics for both operational health and output quality.
Key Metrics to Track
Operational:
- Task completion rate (% that reach a defined end state)
- Error rate by error type (tool failure, timeout, max iterations, etc.)
- Latency percentiles (p50, p95, p99)
- Token usage and cost per task type
- Queue depth (for async systems)
Quality:
- Eval score on sample of production outputs
- Human rating on sampled outputs
- Re-prompt rate (proxy for first-attempt quality)
- Task abandonment rate (user gave up)
Tracing Agent Execution
Every agent invocation should produce a trace: a timeline of every tool call, model call, and decision, with inputs and outputs at each step.
Use a tracing tool (Langfuse, LangSmith, or custom) to capture these traces. Good traces turn debugging from guesswork into inspection.
Alert Thresholds
| Metric | Warning | Critical |
|--------|---------|----------|
| Error rate | >2% | >10% |
| Latency p95 | >5s | >30s |
| Cost/task | 2x baseline | 5x baseline |
| Completion rate | <95% | <80% |
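The table above translates directly into a threshold check. Note that completion rate alerts when the value falls *below* its threshold, unlike the others. A sketch (the metric keys and threshold values mirror the table; treat the exact numbers as starting points, not standards):

```python
# direction "above": alert when value exceeds threshold.
# direction "below": alert when value drops under it.
THRESHOLDS = {
    "error_rate":      {"warning": 0.02, "critical": 0.10, "direction": "above"},
    "latency_p95_s":   {"warning": 5.0,  "critical": 30.0, "direction": "above"},
    "cost_multiple":   {"warning": 2.0,  "critical": 5.0,  "direction": "above"},
    "completion_rate": {"warning": 0.95, "critical": 0.80, "direction": "below"},
}


def alert_level(metric: str, value: float) -> str:
    t = THRESHOLDS[metric]
    if t["direction"] == "above":
        if value > t["critical"]:
            return "critical"
        if value > t["warning"]:
            return "warning"
    else:
        if value < t["critical"]:
            return "critical"
        if value < t["warning"]:
            return "warning"
    return "ok"
```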
The Debugging Loop
When an agent fails:

1. Pull the full trace for the failing task
2. Identify the exact step where things went wrong
3. Reproduce the failure in isolation
4. Fix the specific component
5. Run the full task again to verify
6. Add a regression test
Never debug agent failures by running the full system repeatedly — too slow and too noisy.
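The "reproduce in isolation, then add a regression test" steps can be sketched as a unit test that replays the exact inputs from the failing trace step against the single component that misbehaved. Everything here is hypothetical: a toy argument parser standing in for whatever component failed in your system:

```python
def parse_tool_args(raw: str) -> dict:
    """Hypothetical component under test: a tool-argument parser that
    previously crashed on empty input (fixed to return {})."""
    if not raw.strip():
        return {}
    return dict(pair.split("=", 1) for pair in raw.split(","))


def test_empty_args_regression():
    # Inputs copied verbatim from the failing trace step.
    assert parse_tool_args("") == {}


def test_normal_args_still_work():
    assert parse_tool_args("query=foo,limit=5") == {"query": "foo", "limit": "5"}
```

Running only these tests exercises the fixed component in milliseconds, which is exactly why this beats re-running the full agent loop to verify a fix.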