
Monitoring Agent Systems

Observability for agent fleets: what to track, how to alert, and how to debug.

The Observability Challenge

Traditional software observability focuses on latency, error rates, and throughput. Agent systems add a new dimension: quality. An agent can execute successfully (no errors, normal latency) while producing completely wrong results.

You need metrics for both operational health and output quality.

Key Metrics to Track

Operational:

  • Task completion rate (% that reach a defined end state)
  • Error rate by error type (tool failure, timeout, max iterations, etc.)
  • Latency percentiles (p50, p95, p99)
  • Token usage and cost per task type
  • Queue depth (for async systems)

Quality:

  • Eval score on sample of production outputs
  • Human rating on sampled outputs
  • Re-prompt rate (proxy for first-attempt quality)
  • Task abandonment rate (user gave up)
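Quality metrics usually run on a sample, not every output. One sketch of deterministic sampling, so the same task id is always in or out of the review set regardless of which process evaluates it (the 5% rate and the function name are illustrative assumptions):

```python
import hashlib

def sample_for_review(task_id: str, rate: float = 0.05) -> bool:
    """Stable hash-based sampling: a given task id always lands in the
    same bucket, unlike random.random(), so eval and human-rating
    pipelines agree on which outputs to score."""
    digest = hashlib.sha256(task_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < rate * 10_000
```

Stable sampling also lets you widen the rate later without re-scoring: raising `rate` from 0.05 to 0.10 keeps the original 5% in the set.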

Tracing Agent Execution

Every agent invocation should produce a trace: a timeline of every tool call, model call, and decision, with inputs and outputs at each step.

Use a tracing tool (Langfuse, LangSmith, or custom) to capture these traces. A good trace turns debugging from re-running the system into reading a timeline.
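A minimal sketch of what such a trace capture might look like if built by hand (the `Trace`/`span` names are our own, not Langfuse's or LangSmith's API):

```python
import time
import uuid
from contextlib import contextmanager

class Trace:
    """Collects one timeline of spans per agent invocation."""

    def __init__(self, task_id: str):
        self.task_id = task_id
        self.spans: list[dict] = []

    @contextmanager
    def span(self, name: str, inputs: dict):
        entry = {"id": str(uuid.uuid4()), "name": name,
                 "inputs": inputs, "start": time.time()}
        try:
            yield entry            # caller attaches outputs to the entry
            entry["status"] = "ok"
        except Exception as e:
            entry["status"] = "error"
            entry["error"] = repr(e)
            raise
        finally:
            entry["end"] = time.time()
            self.spans.append(entry)

# usage: wrap every tool call and model call in a span
trace = Trace("task-42")
with trace.span("tool:web_search", {"query": "agent observability"}) as s:
    s["outputs"] = {"results": 3}   # record the tool's output on the span
```

The key property is that every span carries its inputs and outputs, so a failing step can later be replayed in isolation.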

Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Error rate | >2% | >10% |
| Latency p95 | >5s | >30s |
| Cost/task | 2x baseline | 5x baseline |
| Completion rate | <95% | <80% |
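The fixed thresholds in the table can be encoded directly; completion rate inverts the comparison because falling *below* the threshold is the problem. A sketch (the cost/task rule is omitted here since it is relative to a baseline you would have to pass in):

```python
from enum import Enum

class Severity(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2

# Thresholds from the table above. "low_is_bad" flips the comparison
# for metrics where a low value, not a high one, signals trouble.
THRESHOLDS = {
    "error_rate":      {"warning": 0.02, "critical": 0.10, "low_is_bad": False},
    "latency_p95_s":   {"warning": 5.0,  "critical": 30.0, "low_is_bad": False},
    "completion_rate": {"warning": 0.95, "critical": 0.80, "low_is_bad": True},
}

def evaluate(metric: str, value: float) -> Severity:
    t = THRESHOLDS[metric]
    if t["low_is_bad"]:
        if value < t["critical"]:
            return Severity.CRITICAL
        if value < t["warning"]:
            return Severity.WARNING
    else:
        if value > t["critical"]:
            return Severity.CRITICAL
        if value > t["warning"]:
            return Severity.WARNING
    return Severity.OK
```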

The Debugging Loop

When an agent fails:

  1. Pull the full trace for the failing task
  2. Identify the exact step where things went wrong
  3. Reproduce the failure in isolation
  4. Fix the specific component
  5. Run the full task again to verify
  6. Add a regression test

Never debug agent failures by running the full system repeatedly — too slow and too noisy.
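Step 3 above, reproducing the failure in isolation, falls out naturally if traces record inputs per span. A hypothetical sketch: `replay_step`, the JSON trace layout, and the `call_tool` callback are all assumptions for illustration, not any tool's real interface:

```python
import json

def replay_step(trace_path: str, step_name: str, call_tool):
    """Re-run a single step from a captured trace with its exact
    recorded inputs, instead of re-running the whole agent."""
    with open(trace_path) as f:
        trace = json.load(f)
    step = next(s for s in trace["spans"] if s["name"] == step_name)
    return call_tool(step_name, **step["inputs"])
```

Once the isolated reproduction passes after your fix, the same trace file doubles as the fixture for the regression test in step 6.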
