SLOs for agents.
Most teams measure model quality and ship hoping reliability follows. It does not. Agents need their own signals.
Most teams shipping agents to production today measure “eval scores.” Eval scores are model-quality metrics — accuracy on a benchmark, BLEU or ROUGE against a reference, win-rate against a baseline.
They are not reliability metrics. That difference determines whether the agent is actually production-ready, and it is the gap most agent programs fall into.
The signals are not the same
The Google SRE team gave the industry the four golden signals — latency, traffic, errors, saturation — and a generation of services were instrumented against them. There is no equivalent for agents yet.
The reason is not that agents are too new. The reason is that the existing signals do not translate. A retry is not an error — it is often the correct response to a transient failure. A refusal is not a failure — it is often the correct response to ambiguity. A correct answer at the wrong moment, or against the wrong tool, is not a success in any sense the customer cares about.
Agents need their own signals, and most teams do not yet know what they are.
What the gap actually costs
Most enterprise agent programs measure model quality and then ship to production hoping reliability follows. It does not. MIT Project NANDA reported in 2025 that roughly 95% of GenAI pilots produce no measurable return, and a significant portion of those failures are reliability failures, not quality failures.
The agent works in demo. It does not work at three a.m., under load, on a slightly malformed request, against a downstream dependency that returned a partial response. The eval score does not catch that. Reliability is the gap between the eval and the customer, and it is where most agent programs lose their value.
The cost is paid by the on-call engineer at three a.m. and by the customer the next morning. It is also paid by the next program — once the first agent has produced an outage that the team cannot explain, the engineering organization quietly reverts to manual workflows. The agent is technically still deployed. It is also no longer trusted. That state is worse than not having shipped at all.
Six SLOs that matter
These are the six SLO categories I think every team shipping a production agent should be writing against. Some of them have analogs in traditional service reliability. None of them are imported wholesale.
1. Resolution rate. Of all requests received, what percentage reach an acceptable terminal state? Not “no error returned” — actually solved, or correctly escalated to a human, or correctly refused. This is the agent equivalent of availability, but with a sharper definition: an agent that returns no error but punts every request to a fallback is not available, even if its uptime dashboard is green.
2. Time to resolution. From request received to terminal state, including all retries and escalations, measured at p50, p95, and p99. Single-call latency is the wrong number — end-to-end is what the customer experiences. An agent that retries five times in two seconds is not slow. An agent that returns “success” in two hundred milliseconds while having quietly handed off to a human is fast on the wrong metric.
3. Cost per resolution. Tokens, tool calls, downstream service costs, and compute, all aggregated per resolved request rather than per agent invocation. This is the metric the CFO will eventually demand, and it is the one that catches retries-as-cost-amplifiers before they show up in the bill. A 2% increase in retries can blow a budget by 20% if it is not measured at the resolution level. A minimal rollup sketch covering the first three SLOs follows the list.
4. Refusal correctness. When the agent declines to act, was it right to decline? An agent that refuses when genuinely uncertain is doing its job. An agent that refuses on tasks it should have completed is broken, even if its refusal looked polite. This SLO is tracked as a quality dimension on the refusal stream specifically, separate from the main quality metric, because refusals tend to hide failure modes that aggregated quality scores miss.
5. Quality regression. A held-out test set evaluated continuously against the deployed agent, with the SLO defined on the delta over time rather than the absolute score. The agent that scored 92% at launch and now scores 89% has regressed, and a three-point regression that goes undetected for two months is exactly how production agents silently degrade. The eval suite is not just a pre-deployment gate. It is a production signal. A sketch of the delta check, alongside the drift signal below, also follows the list.
6. Behavioral drift. Distribution shift in inputs, outputs, or model behavior over time, independent of quality scores. A drift signal can fire while quality looks stable — and it is often the first indication that something around the agent has changed: new request types from a feature launch upstream, new tool failures from a dependency upgrade, new edge cases the eval suite has not yet learned to cover. Drift is a leading indicator. Treat it as a first-class signal, not a quarterly report.
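To make the first three concrete, here is a minimal sketch, assuming the orchestration layer emits one record per request with an explicit terminal-state label, an end-to-end duration, and a rolled-up cost. The names (ResolutionRecord, TerminalState, and so on) are illustrative, not any particular framework's API, and which terminal states count as acceptable is a policy decision each team makes for itself.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import quantiles


class TerminalState(Enum):
    RESOLVED = "resolved"      # task actually completed
    ESCALATED = "escalated"    # correctly handed to a human
    REFUSED = "refused"        # agent declined to act
    FALLBACK = "fallback"      # punted to a default path
    ERRORED = "errored"        # unhandled failure


# Escalations and refusals belong in the "acceptable" set only when they were
# the right call, which is what the refusal-correctness review stream is for.
ACCEPTABLE = {TerminalState.RESOLVED, TerminalState.ESCALATED, TerminalState.REFUSED}


@dataclass
class ResolutionRecord:
    terminal_state: TerminalState
    duration_s: float    # request received -> terminal state, retries included
    cost_usd: float      # tokens + tool calls + downstream spend, rolled up
    retries: int


def resolution_rate(records: list[ResolutionRecord]) -> float:
    """SLO 1: share of requests reaching an acceptable terminal state."""
    return sum(r.terminal_state in ACCEPTABLE for r in records) / len(records)


def time_to_resolution(records: list[ResolutionRecord]) -> dict[str, float]:
    """SLO 2: end-to-end latency at p50 / p95 / p99, retries included."""
    cuts = quantiles([r.duration_s for r in records], n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def cost_per_resolution(records: list[ResolutionRecord]) -> float:
    """SLO 3: all spend, including failed attempts, divided by resolved requests."""
    resolved = sum(r.terminal_state is TerminalState.RESOLVED for r in records)
    return sum(r.cost_usd for r in records) / max(resolved, 1)
```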
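The quality-regression and drift SLOs are comparisons over time rather than point-in-time scores. A sketch of both, assuming a recurring eval run against the held-out set and a categorical view of the input stream (request type, for example); the population stability index and the thresholds in the comment are common monitoring conventions, not requirements.

```python
import math
from collections import Counter


def quality_delta(launch_score: float, current_score: float) -> float:
    """SLO 5: alert on the drop from the launch baseline, not on the absolute score."""
    return launch_score - current_score   # e.g. 0.92 - 0.89 = 0.03, page above 0.02


def population_stability_index(reference: Counter, current: Counter) -> float:
    """SLO 6: crude drift score between two categorical distributions,
    e.g. request types last month vs. this week. Higher means more shift."""
    ref_total, cur_total = sum(reference.values()), sum(current.values())
    psi = 0.0
    for category in set(reference) | set(current):
        # Small floor avoids log(0) for categories seen in only one window.
        p_ref = max(reference[category] / ref_total, 1e-6)
        p_cur = max(current[category] / cur_total, 1e-6)
        psi += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return psi


# Common rule of thumb: PSI below 0.1 is stable, 0.1 to 0.25 is worth a look,
# above 0.25 means the input mix has genuinely shifted, even if the quality
# score still looks fine.
```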
What to do with this
These six SLOs do not require new tools. They require treating the agent as a service with its own definitions of success, error, and degradation, and then instrumenting accordingly.
Most of the signal comes from the orchestration layer — the layer that decides what runs when, retries what fails, and records what happened. If the orchestration layer was built for prototyping rather than production, the signals will not be there. The retry counter is missing. The terminal-state classification is missing. The cost rollup is missing. Build the orchestration for the SLOs you intend to write, not the other way around.
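What that looks like in its smallest form: a wrapper that counts retries, classifies the terminal state explicitly, and rolls up cost as part of the control flow, rather than reconstructing all three from logs after the fact. The step and classification hooks below are placeholders for whatever the real orchestration layer provides.

```python
import time


class TransientError(Exception):
    """Placeholder for whatever the tool layer raises on a retryable failure."""


def run_with_instrumentation(request, step_fn, classify_fn, max_retries=3):
    """Run one request and emit the fields the SLOs above need.

    step_fn(request) -> (result, cost_usd); classify_fn(result) -> one of
    "resolved" / "escalated" / "refused" / "fallback". Both are hypothetical
    hooks standing in for the real orchestration layer.
    """
    started = time.monotonic()
    retries, total_cost = 0, 0.0
    terminal_state = "errored"

    while retries <= max_retries:
        try:
            result, cost = step_fn(request)
            total_cost += cost
            # Terminal state is classified explicitly, never inferred from "no exception".
            terminal_state = classify_fn(result)
            break
        except TransientError:
            retries += 1          # recorded as a retry, not counted as an error
        except Exception:
            break                 # unhandled failure stays classified as "errored"

    return {
        "terminal_state": terminal_state,
        "duration_s": time.monotonic() - started,   # end to end, retries included
        "cost_usd": total_cost,   # a real rollup would also count spend on failed attempts
        "retries": retries,
    }
```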
When to talk to us
If you are scoping an agent program and want a second set of eyes on the reliability side of the work — what to measure, what to instrument, what to alert on — talk to us. Reliability is what determines whether the agent is trusted. Trust is what determines whether it ships.