You are already monitoring. You are just doing it badly.
Every cognitive agent you run — every habit, heuristic, review process, and decision protocol — already has an implicit success criterion. When you abandon a morning routine, you've concluded it's failing. When you keep using a particular decision framework, you've concluded it works. You are already evaluating your agents. The problem is that you're doing it with vague feelings instead of defined metrics, which means your evaluations are slow, inconsistent, and vulnerable to every cognitive bias in the book.
The previous lesson established that you cannot improve what you do not monitor. This lesson asks the harder question: monitor what, exactly?
Monitoring without defined success metrics is like opening your car's hood and staring at the engine. You're looking, but you don't know what "good" looks like for any specific component. Is the oil level fine? Is the belt tension correct? Without criteria, observation is just attention without insight.
Operational definitions: making the invisible measurable
In 1927, physicist Percy Bridgman published The Logic of Modern Physics and introduced the concept of operationalism — the principle that a scientific concept is defined entirely by the operations used to measure it. "Length" is not an abstract property; it is the number of times a standard ruler fits end-to-end alongside an object. "Temperature" is not heat you feel; it is the reading on a calibrated thermometer under specified conditions.
Bridgman's insight was born from necessity. Working at pressures nearly 100 times higher than anyone had previously achieved, he found that all existing pressure gauges broke down. He couldn't measure pressure by referencing a shared understanding of what pressure "is." He had to define it operationally — as the output of a specific measurement procedure — and build new instruments from there. The definition was the measurement procedure.
This matters for your cognitive agents because most people describe agent success in terms that cannot be operationalized. "My weekly review should help me stay on track." What does "on track" mean? How would you measure it? At what threshold does "on track" become "off track"? Until you can answer these questions with specific procedures, you are not defining success — you are describing a feeling.
An operational definition of success for a weekly review agent might be: "During each review, I identify at least two tasks that were blocking progress without my awareness, and I reassign or eliminate them. Success rate: percentage of reviews where this identification occurs." Now you have something you can count, trend, and act on.
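The counting rule above can be sketched in a few lines. This is a hypothetical example — the log structure, the field names, and the threshold of two blockers are all illustrative assumptions, not a prescribed format:

```python
# Hypothetical log of weekly reviews: each entry records how many
# previously invisible blockers were identified during that review.
reviews = [
    {"week": 1, "blockers_found": 3},
    {"week": 2, "blockers_found": 1},
    {"week": 3, "blockers_found": 2},
    {"week": 4, "blockers_found": 0},
]

# Operational definition: a review "succeeds" when at least two
# blockers are identified and reassigned or eliminated.
THRESHOLD = 2

successes = sum(1 for r in reviews if r["blockers_found"] >= THRESHOLD)
success_rate = successes / len(reviews)
print(f"Success rate: {success_rate:.0%}")  # 2 of 4 reviews -> 50%
```

The point is not the code but the discipline: once "success" is a predicate over logged data, the rate can be counted, trended, and acted on.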
The anatomy of a good metric
Not all metrics are created equal. Locke and Latham's goal-setting theory — built on over 1,000 studies across 88 different tasks since the 1960s — established that specific, difficult goals lead to performance over 250% higher than vague "do your best" instructions. But the research also reveals a critical nuance: the goals must be clear enough that the person pursuing them can unambiguously determine whether they've succeeded.
A well-defined success metric for a cognitive agent has four properties:
Observable. Someone other than you could verify it. "I felt productive" is not observable. "I completed all three planned deep-work blocks before 2pm" is observable. The test is simple: could a camera capture the evidence?
Bounded. It has a threshold that separates success from failure. Not "more reading," but "finished at least 30 pages per session" or "completed two articles from my reading queue per week." The boundary is what transforms a vague aspiration into a testable claim.
Timely. It is measured at a defined frequency. A metric you check "eventually" is not a metric — it is a hope. Specify whether you evaluate daily, weekly, or per-instance. This directly connects to the next lesson on monitoring frequency.
Resistant to gaming. This is where most personal metrics fail. If you can hit the metric while completely undermining the purpose of the agent, the metric is wrong. Checking off "journal entry completed" while writing three meaningless sentences hits the metric and defeats the agent.
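One way to enforce the four properties is to refuse to call anything a metric until all four are written down. The schema below is a sketch under that assumption — the class, field names, and the example metric are invented for illustration:

```python
from dataclasses import dataclass

# Hypothetical schema: a metric is only "defined" once all four
# properties from the text have been made explicit.
@dataclass
class MetricSpec:
    name: str
    observation: str   # Observable: what a camera or log could capture
    threshold: str     # Bounded: the line between success and failure
    cadence: str       # Timely: how often it is measured
    gaming_check: str  # Resistant to gaming: how distortion is detected

    def is_defined(self) -> bool:
        # A metric missing any property is an aspiration, not a metric.
        return all([self.observation, self.threshold,
                    self.cadence, self.gaming_check])

deep_work = MetricSpec(
    name="deep-work blocks",
    observation="calendar entries marked 'deep work', verified by time log",
    threshold="at least 3 completed blocks before 2pm",
    cadence="daily",
    gaming_check="weekly spot-check of one block for actual output produced",
)
print(deep_work.is_defined())  # True
```

Filling in the `gaming_check` field is usually the hardest part, which is exactly why forcing it to be non-empty is useful.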
Leading and lagging: the two kinds of signal
Every agent produces two types of measurable signal. Confusing them is one of the most common metric failures.
Lagging indicators tell you what already happened. They confirm or deny that the agent produced its intended outcome. Revenue is a lagging indicator of sales activity. Body weight is a lagging indicator of nutrition and exercise habits. "Number of decisions I regretted this quarter" is a lagging indicator for a decision-making agent.
Lagging indicators are authoritative but slow. By the time a lagging indicator shows a problem, the damage is already done. You can't steer a car by looking at where you've been.
Leading indicators predict future outcomes. They measure the inputs and behaviors that cause the lagging indicator to move. Number of sales calls is a leading indicator of revenue. Daily calorie tracking is a leading indicator of body weight. "Percentage of decisions where I explicitly wrote out three alternatives before choosing" is a leading indicator for a decision-making agent.
Leading indicators are actionable but uncertain. They tell you whether the process is running, not whether it is producing results. You can execute the leading indicator perfectly and still get a bad outcome — but over time, strong leading indicators reliably predict strong lagging indicators.
The mistake most people make is measuring only lagging indicators because they feel more "real." You track whether you lost weight but not whether you logged your food. You track whether your decisions turned out well but not whether you applied your decision protocol. This is like measuring a company's quarterly earnings and ignoring whether anyone is making sales calls. By the time the lagging indicator fails, you have no data to diagnose why.
Every agent needs at least one of each. A leading indicator tells you whether the agent is running its process. A lagging indicator tells you whether the process is producing the right results.
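The pairing can be made concrete with a two-column log. The data and field names here are hypothetical, but the diagnostic logic follows directly from the text: compare process execution against outcome quality:

```python
# Hypothetical weekly log for a decision-making agent, pairing a
# leading indicator (was the protocol followed?) with a lagging one
# (was the decision later regretted?).
log = [
    {"protocol_followed": True,  "regretted": False},
    {"protocol_followed": True,  "regretted": False},
    {"protocol_followed": False, "regretted": True},
    {"protocol_followed": True,  "regretted": True},
]

leading = sum(d["protocol_followed"] for d in log) / len(log)
lagging = 1 - sum(d["regretted"] for d in log) / len(log)

print(f"Leading (process ran):   {leading:.0%}")  # 75%
print(f"Lagging (good outcomes): {lagging:.0%}")  # 50%
# Leading green while lagging is red suggests the process itself,
# not the execution, is what needs fixing.
```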
The metric trap: Goodhart, Campbell, and McNamara
Defining metrics creates a new failure mode that is arguably worse than having no metrics at all.
Charles Goodhart, a British economist, observed in 1975 what has become known as Goodhart's Law. His original formulation: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." The paraphrase by anthropologist Marilyn Strathern is sharper: "When a measure becomes a target, it ceases to be a good measure."
Donald Campbell, working independently in social psychology, arrived at a parallel conclusion now called Campbell's Law: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."
The most devastating historical example is what Daniel Yankelovich called the McNamara fallacy. Robert McNamara, U.S. Secretary of Defense from 1961 to 1968, applied Ford Motor Company's management-by-metrics philosophy to the Vietnam War. The chosen metric was enemy body count. Officers were promoted based on kill numbers. The metric was precise, quantitative, and completely wrong — it incentivized false reporting, ignored strategic context, and provided a green dashboard while the war was being lost. Yankelovich described the progression: first measure whatever can be easily measured, then disregard what cannot be measured, then presume that what cannot be measured is not important, then declare that what cannot be measured does not exist.
This is not just a government problem. It is a you problem. Every metric you define for your cognitive agents creates an incentive gradient. If you measure "number of items processed in weekly review," you will unconsciously speed through items to hit the number. If you measure "hours spent in deep work," you will sit at your desk in a fog because the clock is running. The metric shapes behavior — and if the metric is a poor proxy for what you actually care about, it shapes behavior in exactly the wrong direction.
Management accounting research identifies the underlying mechanism as surrogation: the tendency for the measure to psychologically replace the construct it was designed to represent. You stop asking "is my reading agent making me a better thinker?" and start asking "did I read 30 pages today?" The proxy consumes the purpose.
The mitigation is not to abandon metrics. It is to treat every metric as a hypothesis about what matters, and to pair quantitative metrics with qualitative judgment. The research is specific: incorporating subjective narratives alongside metrics reduces both surrogation and distortion by restoring context that numbers alone cannot carry.
Precision and recall: a framework borrowed from machine learning
Machine learning offers a useful vocabulary for thinking about agent metrics, because ML engineers face the same fundamental problem: a model can fail in two different directions, and the metric you choose determines which failure you see.
Precision asks: of all the times the agent fired, how many times did it produce a correct result? A high-precision decision heuristic means that when it triggers a decision, the decision is usually right — but it might miss cases where it should have triggered.
Recall asks: of all the cases where the agent should have fired, how many did it actually catch? A high-recall reading filter means you rarely miss an important article — but you might also flag a lot of irrelevant ones.
You cannot maximize both. Improving precision (fewer false positives) typically reduces recall (more false negatives), and vice versa. A spam filter that never lets spam through will also block legitimate emails. A decision protocol that never produces a bad decision will also prevent you from making many good ones.
The F1 score, the harmonic mean of precision and recall, weights the two equally, but the real insight is that the right balance depends on the cost of each type of error. In medical diagnosis, missing a cancer (false negative) is far worse than a false alarm (false positive), so you optimize for recall. In spam filtering, blocking a critical email (false positive) might be worse than letting spam through, so you optimize for precision.
For your cognitive agents, ask: is it worse when this agent fires incorrectly, or when it fails to fire at all? A financial decision heuristic that triggers unnecessary caution (low precision, high recall) might be acceptable. A trust-assessment heuristic that misses betrayal signals (high precision, low recall) might be catastrophic. The cost asymmetry determines which metric matters more.
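The standard definitions make the trade-off auditable. Below is a minimal worked example; the trust-heuristic scenario and its case data are invented for illustration:

```python
# Precision/recall for a hypothetical trust-assessment heuristic,
# scored against cases where the ground truth later became known.
# Each pair is (heuristic_fired, should_have_fired).
cases = [
    (True, True), (True, True), (True, False),
    (False, True), (False, True), (False, False),
]

tp = sum(1 for fired, truth in cases if fired and truth)      # 2 hits
fp = sum(1 for fired, truth in cases if fired and not truth)  # 1 false alarm
fn = sum(1 for fired, truth in cases if not fired and truth)  # 2 misses

precision = tp / (tp + fp)  # of all firings, fraction that were right
recall = tp / (tp + fn)     # of all true cases, fraction caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the two misses (false negatives) outnumber the single false alarm — for a trust heuristic, where missing a betrayal signal is the costly error, that recall of 0.50 is the number to worry about.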
What changes when AI enters the measurement loop
Every metric discussion above assumes you are the sole measurer of your own agents. This is a bottleneck. You have limited attention, inconsistent memory, and strong motivated reasoning about whether your own systems are working.
An AI system with access to your externalized thinking can measure things you cannot measure about yourself. It can track how often your decision journal entries show genuine consideration of alternatives versus post-hoc rationalization. It can identify patterns in when your weekly review produces actionable insights versus when it degenerates into list maintenance. It can flag that your reading agent's output has shifted from diverse sources to an increasingly narrow set — a form of agent drift that is nearly impossible to detect from the inside.
The progression mirrors the earlier lessons on externalization. Your biological system generates the agents. Your second brain (the capture and documentation system) preserves the data those agents produce. Your third brain (AI) can analyze that data with a consistency and pattern-recognition capacity that human self-monitoring cannot match.
But this only works if the metrics are operationally defined. AI cannot evaluate "did my morning routine make me feel good?" It can evaluate "did my morning routine complete all steps within the target time window, and was my self-reported energy rating above 3?" The operational definition is what makes the metric legible to both human and machine evaluation.
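A machine-checkable version of that morning-routine metric might look like the sketch below. Everything here — the log fields, the 9am window, the energy threshold of 3 — is an illustrative assumption standing in for whatever you actually define:

```python
from datetime import time

# Hypothetical machine-legible log for one morning-routine run.
routine_log = {
    "steps_planned":   ["wake", "exercise", "journal", "plan"],
    "steps_completed": ["wake", "exercise", "journal", "plan"],
    "finished_by": time(8, 30),
    "energy_rating": 4,  # self-reported, 1-5 scale
}

TARGET_END = time(9, 0)

passed = (
    set(routine_log["steps_planned"]) <= set(routine_log["steps_completed"])
    and routine_log["finished_by"] <= TARGET_END
    and routine_log["energy_rating"] > 3
)
print(passed)  # True
```

Nothing in this check requires AI — but because every clause is explicit, either a human or a machine can evaluate it, trend it, and flag drift.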
The protocol
Defining success metrics for your agents is not a one-time exercise. It is a recurring practice that sharpens over time:
- Start with the purpose, not the metric. For each agent, write one sentence: "This agent exists to produce [specific outcome]." If you cannot articulate the outcome, the agent has no success criteria to define.
- Operationalize ruthlessly. Convert every abstract term in your purpose statement into an observable, measurable quantity. "Better decisions" becomes "percentage of decisions where I identified at least three alternatives and evaluated each against stated criteria before choosing." If you can't measure it, you can't monitor it.
- Assign one leading and one lagging indicator. The leading indicator measures process execution. The lagging indicator measures outcome quality. Track both. When the lagging indicator fails but the leading indicator is green, your process is wrong. When the leading indicator fails, your execution is wrong. The diagnostic depends on having both signals.
- Set a threshold, not just a direction. "More" is not a metric. "At least 80% completion rate per week" is a metric. The threshold is what separates "this agent is performing" from "this agent needs intervention." Without it, you are collecting data with no trigger for action.
- Review the metric itself. Every metric is a hypothesis. After 30 days, ask: is hitting this metric actually correlated with the outcome I care about? If you've been nailing your reading agent's page-count target but can't recall or apply anything you've read, the metric is wrong. Replace it. The metric serves the agent, not the other way around.
The purpose of defining success metrics is not to turn your inner life into a dashboard. It is to replace "I think this is working" with "here is the evidence that this is working." Evidence is what makes the next lesson — choosing how often to monitor — a tractable question instead of a guess.