Reliability is not a feeling
You have a monitoring dashboard from L-0544. It shows you the status of your most important agents. But status is a snapshot — a single frame from a film. Reliability is the film itself. It is the answer to a question that snapshots cannot address: over time, does this agent do what it should, when it should, consistently?
Most people answer this question with feelings. "I'm pretty consistent with my morning routine." "I usually catch myself before I react in anger." "I'm fairly good at planning my week." These self-assessments are not measurements. They are narrative summaries contaminated by recency bias, self-serving distortion, and the fundamental human tendency to remember successes and forget failures. You would never accept "the server is pretty reliable" from an engineering team. You should not accept it from yourself about your own cognitive agents.
Reliability metrics replace narrative with arithmetic. They give you numbers — specific, comparable, trackable numbers — that describe how consistently each agent in your cognitive system performs its function. This lesson teaches you what those numbers are, where they come from, and why they are the foundation on which every other monitoring metric depends.
The engineering origin: SLIs, SLOs, and the reliability stack
The most rigorous framework for measuring system reliability comes from Site Reliability Engineering, formalized by Google in their foundational SRE book (Beyer et al., 2016). The framework defines three layers of reliability measurement.
A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service provided. For a web service, an SLI might be "the proportion of requests that return a successful response within 200 milliseconds." The critical word is "proportion." An SLI is always a ratio: good events divided by total events, measured over a defined window.
A Service Level Objective (SLO) is a target value for an SLI. If your SLI is request success rate, your SLO might be 99.9% — meaning you accept that one in a thousand requests can fail. The SLO is not a description of current performance. It is a declaration of acceptable performance. It defines the boundary between "this system is working well enough" and "this system needs intervention."
A Service Level Agreement (SLA) adds consequences to an SLO. If the SLO is breached, something happens — a penalty, a credit, a contractual obligation. SLAs are the external, enforceable version of SLOs.
This three-layer stack — measurement, target, consequence — is the same structure you need for your cognitive agents. Your SLI is the reliability rate: the proportion of trigger events where the agent fires correctly. Your SLO is the reliability target you set based on the agent's importance. And your SLA is the commitment you make to yourself about what happens when reliability drops below the target — what corrective action you will take, and when.
The reason this framework endures is that it separates observation from aspiration from accountability. You cannot improve what you measure only with feelings. You cannot set meaningful targets without first measuring current performance. And targets without consequences are wishes.
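The measurement-versus-target distinction is mechanical enough to express in a few lines. This is a minimal sketch, not code from the SRE book; the function names and event counts are hypothetical.

```python
# Minimal sketch of the SLI/SLO layers: an SLI is good events divided by
# total events over a window, and the SLO is the target it is compared to.
# (Names and counts here are illustrative, not from any real system.)
def sli(good_events: int, total_events: int) -> float:
    """Service Level Indicator: proportion of good events in the window."""
    return good_events / total_events if total_events else 0.0

def within_slo(measured: float, target: float) -> bool:
    """True when measured reliability meets or exceeds the objective."""
    return measured >= target

rate = sli(good_events=997, total_events=1000)
print(f"SLI = {rate:.1%}, within 99.9% SLO: {within_slo(rate, 0.999)}")
```

Note that the SLI is pure observation; only `within_slo` encodes an aspiration, which keeps the two layers separable.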
Four numbers that define reliability
Reliability engineering in software systems tracks a family of metrics that reduce complex system behavior to comparable quantities. Four of these translate directly to cognitive agent monitoring.
Mean Time Between Failures (MTBF) measures how long a system operates correctly before it fails. For a cognitive agent, MTBF is the average number of trigger events between misses. If your focused-work agent fails to activate roughly once every five work sessions, your MTBF is five sessions. Higher is better — it means the agent runs longer between failures.
Mean Time To Recovery (MTTR) measures how quickly a system is restored after a failure. For a cognitive agent, MTTR is how long it takes you to notice the agent failed and re-engage it. If your planning agent fails to fire on Monday morning and you do not notice until Wednesday, your MTTR is two days. If you notice by Monday afternoon and re-plan, your MTTR is a few hours. Lower is better — it means you detect and correct failures quickly.
Availability is calculated as MTBF / (MTBF + MTTR). A cognitive agent with an MTBF of twenty trigger events and an MTTR of one trigger event has an availability of 20/21 = 95.2%. This single number captures both how often the agent fails and how quickly you recover. It is the most comprehensive single reliability metric you can track.
Failure Rate is simply 1 / MTBF — how often failures occur per unit of operation. An agent that fails once every ten trigger events has a failure rate of 0.1 failures per trigger event, or 10%. This is the inverse lens on the same data, useful when you want to compare agents that operate at different frequencies.
These four numbers — MTBF, MTTR, availability, and failure rate — give you a reliability profile for each agent. Two agents can have the same failure rate but dramatically different availability if one has fast recovery and the other does not. The distinction matters because the intervention differs: one agent needs failure prevention, the other needs faster failure detection.
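As a sanity check on the arithmetic, here is a small illustrative sketch that reproduces the worked numbers from the definitions above:

```python
def availability(mtbf: float, mttr: float) -> float:
    """Fraction of operation the agent is 'up': MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def failure_rate(mtbf: float) -> float:
    """Failures per unit of operation: the inverse of MTBF."""
    return 1.0 / mtbf

# The worked examples from the text: an MTBF of 20 trigger events with an
# MTTR of 1 event, and an agent that fails once every 10 trigger events.
print(f"availability = {availability(20, 1):.1%}")  # prints 95.2%
print(f"failure rate = {failure_rate(10):.0%}")     # prints 10%
```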
Signal detection: the reliability you are actually measuring
There is a subtlety that pure engineering metrics miss, and it comes from psychology rather than software. When you measure whether a cognitive agent "fires when it should," you are making a detection judgment — and detection judgments have a structure that the field of signal detection theory (Green & Swets, 1966) maps with precision.
Signal detection theory distinguishes four outcomes when a system must decide whether a signal (the trigger condition) is present:
- Hit: The trigger is present, and the agent fires. This is correct activation.
- Miss: The trigger is present, and the agent does not fire. This is a failure of sensitivity.
- False alarm: The trigger is absent, and the agent fires anyway. This is a failure of specificity.
- Correct rejection: The trigger is absent, and the agent correctly does not fire. This is appropriate restraint.
Your agent's sensitivity (hit rate) is: hits / (hits + misses). It measures how good the agent is at activating when it should. Your agent's specificity (correct rejection rate) is: correct rejections / (correct rejections + false alarms). It measures how good the agent is at staying quiet when it should.
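Both rates reduce to four tallies. A minimal sketch, with hypothetical counts:

```python
def sensitivity(hits: int, misses: int) -> float:
    """Hit rate: how often the agent fires when the trigger is present."""
    return hits / (hits + misses)

def specificity(correct_rejections: int, false_alarms: int) -> float:
    """Correct-rejection rate: how often the agent stays quiet correctly."""
    return correct_rejections / (correct_rejections + false_alarms)

# Hypothetical tallies from a month of observation.
print(f"sensitivity = {sensitivity(hits=19, misses=3):.1%}")
print(f"specificity = {specificity(correct_rejections=6, false_alarms=2):.1%}")
```

Because the two functions draw on disjoint tallies, improving one says nothing about the other, which is the independence argument made above in numeric form.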
These two dimensions are independent. An agent can be highly sensitive but poorly specific — it fires at everything, including situations that do not warrant it. Think of an anxiety agent that activates at genuine threats but also at harmless emails, ambiguous comments, and uncertain weather forecasts. Or an agent can be highly specific but poorly sensitive — it almost never fires incorrectly, but it also misses half the genuine triggers. Think of a conflict-resolution agent that works beautifully when it activates but fails to activate in seventy percent of actual conflicts.
The engineering reliability metrics (MTBF, failure rate) capture sensitivity — how often the agent fires when it should. But they do not capture specificity — how often the agent fires when it should not. Signal detection theory adds the dimension that makes your reliability picture complete. A truly reliable agent scores high on both sensitivity and specificity. It fires when appropriate and stays quiet when not.
The error budget: how much unreliability you can afford
One of the most counterintuitive insights from Google's SRE practice is the concept of the error budget. If your SLO is 99.9% availability, then your error budget is 0.1% — the amount of unreliability you have agreed to tolerate. That 0.1% is not a failure. It is a resource. You can "spend" it on deployments, experiments, or changes that carry some risk of disruption, because you have budgeted for that disruption.
The error budget resolves a tension that exists in every system — and in every person. If you pursue 100% reliability, you cannot take risks. You cannot experiment. You cannot change. The system becomes perfectly reliable and perfectly stagnant. The error budget formalizes the recognition that some unreliability is not just tolerable but necessary for growth.
Applied to your cognitive agents: if your daily planning agent has a 90% reliability SLO, your error budget is 10%. That means roughly three days per month where the agent can fail without triggering corrective action. Those three days are not disasters. They are the cost of being human — of having variable energy, unexpected disruptions, and competing demands. The error budget gives you permission to be imperfect within defined bounds, while still holding yourself accountable when imperfection exceeds those bounds.
This is the difference between self-compassion and self-deception. Self-deception says "it's fine, I'll do better next time" with no measurement and no threshold. Self-compassion says "I have a 10% error budget, I have used 7% this month, and I am within bounds." Both are kind. Only one is precise.
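The budget arithmetic is simple enough to automate. A sketch under the assumption that you log misses per observation window; the counts are illustrative:

```python
def error_budget_used(misses: int, total_triggers: int, slo: float) -> float:
    """Fraction of the error budget consumed: observed misses divided by
    the number of misses the SLO permits in this window."""
    allowed = (1.0 - slo) * total_triggers
    return misses / allowed if allowed else float("inf")

# A 90% SLO over 22 trigger days permits 2.2 misses. Two misses leaves
# you inside the budget; three misses overspends it.
print(f"{error_budget_used(2, 22, 0.90):.0%}")  # under 100%: within bounds
print(f"{error_budget_used(3, 22, 0.90):.0%}")  # over 100%: budget spent
```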
What habit tracking gets right — and wrong
The behavioral science literature on habit tracking provides a natural bridge between engineering reliability metrics and personal practice. Research has consistently found that monitoring goal progress significantly increases rates of goal attainment. A meta-analysis of over 19,000 participants demonstrated that simply tracking a behavior — exercise, diet, study habits — makes people more likely to sustain it (Harkin et al., 2016).
The mechanism is straightforward: tracking externalizes the measurement. Instead of relying on your narrative memory ("I think I exercised most days"), you have a record. The record provides an SLI — a quantitative indicator of performance — whether or not you call it that.
But habit tracking research also reveals a critical failure mode. Binary streak tracking — "did I do it today, yes or no?" — collapses the richness of reliability into a single dimension. One line of work in behavioral economics found that individuals expend roughly 40% more effort to maintain a streak than to achieve the same behavior without streak tracking. The streak becomes the objective rather than the behavior it was meant to measure. When the streak breaks, motivation collapses disproportionately — not because the underlying behavior changed, but because the tracking method created an all-or-nothing frame.
Longer-term follow-ups point the same way: tracking groups that measured multiple dimensions of behavior maintained significantly higher performance than streak-only groups. The lesson for agent monitoring is clear: do not track your agents with a single streak counter. Track the full reliability profile — sensitivity, specificity, MTBF, MTTR. A broken streak is not a reliability crisis. It is one data point in a distribution. The distribution is what matters.
Applying reliability metrics to cognitive agents
Here is what reliability measurement looks like in practice for three common cognitive agents.
The daily planning agent. Trigger condition: weekday morning, no travel, no illness. Observation window: past 30 days. You count 22 valid trigger days. The agent fired (you actually planned your day) on 19 of them. It missed on 3. It also fired on 2 weekend mornings that you had designated as unstructured. Sensitivity: 19/22 = 86.4%. Specificity: of the 8 non-trigger days, the agent correctly stayed quiet on 6, so 6/8 = 75%. MTBF: approximately 7.3 trigger events between failures. Your SLO for this agent is 90%. You are below target. Your error budget is spent. Corrective action: examine the three miss days. What was different? Were they Mondays? Were they days after poor sleep? The pattern in your misses tells you where the agent's trigger mechanism is weak.
The active-listening agent. Trigger condition: a conversation where the other person is expressing something emotionally significant. This is harder to count precisely, but you can approximate over the past two weeks. You estimate 10 trigger conversations. The agent fired well in 7 — you were genuinely present, asked clarifying questions, reflected back what you heard. It missed in 3 — you caught yourself rehearsing your response, checking your phone, or mentally elsewhere. It also fired in 1 conversation that was purely logistical, where active listening was unnecessary overhead. Sensitivity: 7/10 = 70%. You have significant reliability work to do. The useful question is not "how do I become a better listener" — as an instruction, that is useless. The question is: what distinguishes the 7 successful fires from the 3 misses? Time of day? Fatigue level? The specific person? The topic? The pattern is where the fix lives.
The emotional regulation agent. Trigger condition: a situation producing anger, frustration, or defensiveness that warrants a pause before reacting. You identify 6 trigger events in the past two weeks. The agent fired correctly in 5 — you paused, took a breath, responded rather than reacted. It missed in 1 — you snapped at a colleague during a stressful meeting. Your sensitivity is 83.3%. But here is where specificity matters: the agent also fired 3 times in situations that were actually low-stakes and did not require regulation — you suppressed a perfectly reasonable expression of mild annoyance because the regulation agent over-activated. Your specificity is poor, and the cost is emotional flattening. Over-reliable regulation is its own failure mode.
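All three walkthroughs follow one recipe: count hits, misses, correct rejections, and false alarms, then derive the profile. A sketch of that recipe, populated with the planning-agent counts from the first example:

```python
def reliability_profile(hits, misses, correct_rejections, false_alarms):
    """Roll raw counts for one agent into its reliability profile."""
    triggers = hits + misses
    return {
        "sensitivity": hits / triggers,
        "specificity": correct_rejections / (correct_rejections + false_alarms),
        "mtbf": triggers / misses if misses else float("inf"),
    }

# The daily planning agent: 19 hits, 3 misses, 6 correct rejections,
# 2 false alarms over a 30-day window.
p = reliability_profile(19, 3, 6, 2)
print({k: round(v, 3) for k, v in p.items()})
# {'sensitivity': 0.864, 'specificity': 0.75, 'mtbf': 7.333}
```

Running the same function over each agent's counts makes the profiles directly comparable, which is the point of reducing behavior to ratios.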
Reliability is not the whole picture
Reliability answers one question: does the agent fire when it should and stay quiet when it should not? This is necessary but insufficient. An agent can fire with perfect reliability and still produce poor outcomes. Your planning agent might activate every single morning — 100% sensitivity — and produce a terrible plan every time.
This is why reliability is lesson five in a twenty-lesson phase, not lesson twenty. It is the foundation metric, not the complete picture. Reliability tells you whether the system is operational. It does not tell you whether the system is effective. That distinction — between firing and producing the desired result — is precisely what L-0546 addresses with effectiveness metrics.
The relationship between reliability and effectiveness is sequential: you must establish reliability before effectiveness measurement becomes meaningful. If an agent only fires 50% of the time, you cannot assess its effectiveness because half your data is missing. Reliability gives you a stable observation base. Effectiveness tells you what to do with it.
Think of it in engineering terms. Availability must be established before performance can be measured. You cannot benchmark the throughput of a server that is down half the time. You fix uptime first. Then you optimize throughput. For your cognitive agents: fix reliability first. Then optimize effectiveness.
Your monitoring dashboard from L-0544 now has a reliability layer. Each agent has a sensitivity rate, a specificity rate, an MTBF, and an SLO. You know which agents are within their error budget and which have exceeded it. You know which need sensitivity work and which need specificity work. This is the measurement infrastructure that makes the next step — effectiveness metrics in L-0546 — possible. Reliability tells you the system is running. Effectiveness will tell you whether it is running well.
Sources:
- Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media.
- Green, D. M., & Swets, J. A. (1966). Signal Detection Theory and Psychophysics. Wiley.
- Harkin, B., et al. (2016). "Does Monitoring Goal Progress Promote Goal Attainment? A Meta-Analysis of the Experimental Evidence." Psychological Bulletin, 142(2), 198-229.
- Ebeling, C. E. (2010). An Introduction to Reliability and Maintainability Engineering (2nd ed.). Waveland Press.
- Google Cloud Blog. (2024). "SRE Fundamentals: SLIs, SLAs, and SLOs."
- Journal of Applied Behavior Analysis. (2010). "Applying Signal-Detection Theory to the Study of Observer Accuracy and Bias in Behavioral Assessment."
- Datadog. (2025). "Machine Learning Model Monitoring in Production: Best Practices."