Opinion is not optimization
You built an agent. It works. You have opinions about how to make it better — a different prompt structure, a new model, a changed temperature setting, a reorganized workflow. You implement the change. It feels better. You move on.
That is not optimization. That is preference masquerading as progress.
Optimization requires comparison, and comparison requires structure. You need two versions running under the same conditions, measured against the same criteria, with the data — not your intuition — determining the winner. This is A/B testing: the simplest, most powerful experimental method available for improving any system, including the cognitive agents you are building across this curriculum.
L-0565 established that you need to know when to stop optimizing — that "good enough" is a legitimate standard. This lesson gives you the method for determining whether a proposed change actually moves you from good enough to genuinely better, or whether it just feels different.
Fisher's field: where controlled comparison began
The intellectual foundation of A/B testing traces to a single location: Rothamsted Experimental Station in Hertfordshire, England, where a statistician named Ronald A. Fisher arrived in 1919 to make sense of agricultural data that had been accumulating since 1843.
Fisher faced a deceptively simple problem. Farmers wanted to know which fertilizer produced better wheat yields. The obvious approach — apply fertilizer A to one field and fertilizer B to another — was useless, because the two fields differed in soil composition, drainage, sunlight, and dozens of other variables. Any difference in yield could be attributed to the fertilizer or to the field. There was no way to separate the signal from the noise.
Fisher's solution, published formally in The Design of Experiments (1935), was radical in its simplicity: randomization. Instead of assigning treatments to entire fields, divide each field into small plots and randomly assign treatments to plots. Randomization does not eliminate confounding variables — it distributes them evenly across both groups, so that any systematic difference in the outcome can be attributed to the treatment rather than to the background conditions.
Fisher integrated randomization with two additional principles — replication (run the experiment on enough plots to detect a real effect) and blocking (group similar plots together to reduce noise). Together, these three principles form the foundation of every controlled experiment conducted since, from clinical drug trials to the A/B tests running on every major technology platform today.
The insight that matters for your agents: you cannot evaluate a change by looking at its results in isolation. You can only evaluate a change by comparing it to what would have happened without the change, under equivalent conditions. That requires running both versions simultaneously, not sequentially, and measuring both against the same criteria.
From fields to factories to software: the A/B test scales
Fisher's agricultural experiments required physical plots of land and months of growing seasons. The randomized controlled trial (RCT), formalized in medicine by Austin Bradford Hill in the 1940s, required human subjects and years of follow-up. Both were expensive and slow.
The internet changed the economics of experimentation entirely.
In 2000, Amazon engineer Greg Linden ran one of the first large-scale A/B tests on a consumer product, testing Amazon's recommendation engine. The marginal cost of showing different versions to different users was effectively zero. The sample sizes were enormous. The feedback was immediate. Ron Kohavi, who later led Microsoft's Experimentation Platform, recognized that this combination — zero marginal cost, massive samples, fast feedback — made online controlled experiments the single most powerful innovation methodology available to technology companies.
By the time Kohavi, Diane Tang, and Ya Xu published Trustworthy Online Controlled Experiments (2020), A/B testing had become the primary decision-making mechanism at every major technology company. Bing alone runs approximately 300 experiment treatments per week. Google runs thousands simultaneously. In one documented case, a single Bing A/B test identified a change that increased revenue by over $100 million annually in the US alone — a change that no amount of expert analysis had predicted.
The lesson from the technology industry is not about technology. It is about epistemology. The companies that improved fastest were not the ones with the best intuitions about what would work. They were the ones that tested the most alternatives against real data. Kohavi's summary of two decades of experimentation research converges on a sobering finding: across Microsoft, Google, and other large platforms, only about one-third of ideas that seemed promising actually produced measurable improvement when tested. The other two-thirds either had no effect or made things worse. Expert judgment, no matter how sophisticated, was wrong more often than it was right. Only the experiment revealed the truth.
The anatomy of a valid comparison
An A/B test has five structural requirements. Violate any of them and you are not testing — you are guessing with extra steps.
A control version (A). This is your current system, unchanged. The control is not a placeholder — it is your baseline for comparison. Without it, you have no way to know whether version B is better than what you already had or merely different.
A treatment version (B). This is the modified system — one change, isolated and intentional. The discipline of limiting B to a single modification is what distinguishes an experiment from a scramble. If you change three things at once and performance improves, you have learned nothing about which change mattered.
Random assignment. The inputs to your test — whether they are users, tasks, days, or prompts — must be assigned to A or B randomly, not by your judgment. You might think you can assign "easy tasks to A and hard tasks to B and account for the difference." You cannot. Selection bias is invisible, pervasive, and fatal to valid inference. Fisher proved this a century ago. It remains true.
A pre-defined metric. Before you run the test, you must decide what "better" means. Faster? More accurate? Higher satisfaction? Lower cost? If you decide after seeing the results, you will unconsciously choose the metric that makes your preferred version look better. This is not dishonesty — it is the human pattern-matching engine doing what it does. Pre-registration of your evaluation criteria is the antidote.
Sufficient sample size. A single comparison between A and B tells you nothing. You need enough data points to distinguish a real difference from random noise. The exact number depends on the size of the effect you are looking for and how much natural variability exists in your system. In tech A/B testing, the standard is 95% statistical confidence: if there were truly no difference between the versions, an observed gap this large would arise by chance less than 5% of the time. For personal agent optimization, you do not need formal statistics, but you do need enough repetitions to trust the pattern.
These five elements — control, treatment, randomization, pre-defined metric, and sufficient sample — are not optional features of a good test. They are the definition of a valid test. Remove any one of them and you are back to opinion.
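As a concrete illustration (not taken from the source), the five elements can be sketched in a few lines of standard-library Python: random assignment of inputs, a metric and sample sizes fixed in advance, and a two-proportion z-test to judge whether the observed gap clears the conventional 95% threshold. The task counts and success numbers below are invented for the example.

```python
import math
import random

def assign(items, seed=0):
    """Random assignment: each input goes to control ('A') or treatment ('B')."""
    rng = random.Random(seed)
    return {item: rng.choice("AB") for item in items}

def two_proportion_z(wins_a, n_a, wins_b, n_b):
    """z-statistic for the difference between two success rates."""
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (wins_b / n_b - wins_a / n_a) / se

# Hypothetical results: the metric (task success rate) and the sample
# sizes were pre-committed before the test began.
groups = assign(f"task{i}" for i in range(200))
z = two_proportion_z(wins_a=78, n_a=100, wins_b=90, n_b=100)
significant = abs(z) > 1.96  # 95% confidence threshold, two-sided
```

Here a success rate of 90% versus 78% over 100 trials each yields a z-statistic of about 2.3, past the 1.96 cutoff — with these (invented) numbers, you would keep version B.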
Beyond the classic split: the multi-armed bandit
Traditional A/B testing has a structural limitation: it splits your resources evenly between A and B for the entire duration of the test, even after one version is clearly outperforming the other. This is the explore-exploit tradeoff. While you are exploring (gathering data about which version is better), you are failing to exploit (directing all resources to the better version). The cost of exploration is the performance you sacrifice by continuing to use the inferior version.
The multi-armed bandit algorithm, named after a gambler facing a row of slot machines with unknown payoff rates, addresses this directly. Instead of a fixed 50/50 split, the bandit algorithm dynamically shifts traffic toward the better-performing version while maintaining enough exploration to detect if the underperformer improves or if a third option emerges.
Thompson Sampling, the most widely used bandit approach, works by maintaining a probability estimate for each version and sampling from those estimates to make allocation decisions. Early in the test, when uncertainty is high, allocation is roughly even — the algorithm explores broadly. As data accumulates and one version's superiority becomes clear, allocation shifts dramatically — the algorithm exploits what it has learned. The beauty of Thompson Sampling is that it requires no explicit switch point. The transition from exploration to exploitation happens naturally as confidence grows.
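As an illustration only, Thompson Sampling for two versions can be sketched with Beta posteriors using Python's standard library. The success rates below are invented; the point is the behavior — allocation starts roughly even and drifts toward the stronger version as evidence accumulates.

```python
import random

def thompson_choose(stats, rng):
    # Draw a plausible success rate for each version from its Beta
    # posterior (wins + 1, losses + 1) and play the highest draw.
    draws = {v: rng.betavariate(s["wins"] + 1, s["losses"] + 1)
             for v, s in stats.items()}
    return max(draws, key=draws.get)

def run_bandit(true_rates, rounds=2000, seed=1):
    """Simulate a bandit against hypothetical true success rates."""
    rng = random.Random(seed)
    stats = {v: {"wins": 0, "losses": 0} for v in true_rates}
    for _ in range(rounds):
        v = thompson_choose(stats, rng)
        if rng.random() < true_rates[v]:  # simulated outcome
            stats[v]["wins"] += 1
        else:
            stats[v]["losses"] += 1
    return stats

# B is truly better; over the run, most pulls should go to B.
stats = run_bandit({"A": 0.30, "B": 0.60})
```

Note that there is no explicit switch from exploration to exploitation anywhere in the code: the shift falls out of the posterior draws narrowing as data accumulates.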
For personal agent optimization, the bandit principle translates to a practical heuristic: you do not need to commit to equal time on both versions for the entire test. Start with equal allocation. As one version begins to show consistent superiority, shift more of your usage toward it while maintaining occasional checks on the other. This is not abandoning rigor — it is applying the same mathematical principle that powers recommendation engines at Netflix and content optimization at every major media company.
The key difference between bandits and intuition: the bandit shifts allocation based on accumulated data against pre-defined metrics. Intuition shifts allocation based on feelings about the most recent data point. One compounds knowledge. The other chases noise.
Hyperparameter tuning: A/B testing at machine scale
Machine learning engineering applies the A/B testing principle at a scale and speed that reveals something fundamental about the nature of optimization.
When training a machine learning model, engineers must choose values for parameters that control the learning process — learning rate, batch size, regularization strength, network architecture. These are called hyperparameters because they sit above the parameters the model learns on its own. The right hyperparameters can mean the difference between a model that achieves 70% accuracy and one that achieves 95%.
The simplest approach is grid search: define a range for each hyperparameter, test every possible combination, and keep the best. This is exhaustive A/B testing — every combination against every other combination. It is thorough and catastrophically slow. A model with five hyperparameters, each tested at ten values, requires 100,000 training runs.
Random search, proposed by James Bergstra and Yoshua Bengio in 2012, demonstrated something counterintuitive: testing random combinations of hyperparameters often finds better configurations faster than testing every combination systematically. The reason is that some hyperparameters matter far more than others. Grid search wastes most of its budget exhaustively varying parameters that barely affect performance. Random search, by distributing trials across the full space, has a better chance of stumbling into the high-impact dimensions.
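Random search itself is only a few lines. The sketch below uses a toy search space and an invented objective in which learning_rate dominates the score and batch_size barely matters — mirroring the point that a few hyperparameters carry most of the effect.

```python
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Try n_trials random configurations; keep the best scorer."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy space and objective, both invented for illustration.
space = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "batch_size": [16, 32, 64, 128],
}

def objective(cfg):
    # learning_rate is the high-impact dimension; batch_size is noise.
    return (-100 * abs(cfg["learning_rate"] - 0.01)
            - 0.001 * abs(cfg["batch_size"] - 32))

best, score = random_search(objective, space)
```

With 50 random trials over this 16-point space, the search reliably lands on the high-impact setting (learning_rate = 0.01) without ever enumerating the grid.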
Bayesian optimization goes further still. Instead of testing blindly (grid search) or randomly (random search), Bayesian optimization builds a model of how hyperparameters relate to performance and uses that model to choose the next combination to test. It learns from each experiment and directs future experiments toward the most promising regions of the search space. Research published at NeurIPS and ICML has consistently shown that Bayesian optimization achieves better results in fewer evaluations than both grid and random search.
The progression from grid search to random search to Bayesian optimization maps directly onto how you should think about optimizing your personal agents:
Level 1 (Grid search thinking): Test every idea you have, systematically. This is thorough but slow. Good for your first optimization pass.
Level 2 (Random search thinking): Do not try to be systematic. Generate varied alternatives and test them. You will cover more ground than a methodical sweep and are more likely to discover unexpected improvements.
Level 3 (Bayesian thinking): Learn from each test. Let your results inform what you test next. If changing the prompt structure improved performance, test further variations of prompt structure rather than switching to an unrelated variable. Direct your experiments toward the regions where improvement is most likely.
A/B testing for the agents in your life
Everything above applies to your personal epistemic agents — the cognitive systems, routines, AI tools, and decision processes you are building throughout this curriculum.
Testing AI prompts. You have a prompt that generates your weekly review. Write a modified version — perhaps one that asks different questions, or structures the output differently, or emphasizes different dimensions of your week. Run both for three weeks. Compare the outputs against specific criteria you define in advance: usefulness, completeness, actionability. Let the data tell you which prompt serves you better.
Testing routines. Your morning routine is an agent — a repeatable process that produces an outcome. Suspect that exercise before deep work outperforms exercise after? Run version A (exercise first) for two weeks and version B (exercise after) for two weeks. Track energy levels, focus quality, and output quantity. The two-week minimum accounts for novelty effects and weekly variation.
Testing decision processes. You use a specific framework for making decisions — perhaps a pros-and-cons list, or a pre-mortem, or a decision journal. Design an alternative. For the next month, randomly assign decisions to framework A or framework B. After the month, review which framework produced decisions you were more satisfied with in retrospect.
Testing information diets. You consume information through specific channels — newsletters, podcasts, social feeds, books. Suspect that one channel produces more actionable insight per hour than another? Track it. Spend two weeks emphasizing channel A and two weeks emphasizing channel B. Measure what you retained, what you applied, and what changed your thinking.
The common pattern: take something you currently do one way, design a specific alternative, run both under comparable conditions, measure against criteria you chose before starting, and let the comparison tell you what your intuition cannot.
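That pattern can be captured in a small, hypothetical self-experiment helper (a sketch under assumed conventions, not a prescribed tool): pre-register the metric and sample size, randomize each trial, and refuse to compare until all the data is in.

```python
import random
import statistics

class PersonalABTest:
    """Minimal log for a self-experiment: pre-register the metric and
    sample size, randomize each trial, compare only when done."""

    def __init__(self, metric_name, trials_per_version, seed=None):
        self.metric_name = metric_name
        self.target = trials_per_version      # committed before starting
        self.rng = random.Random(seed)
        self.scores = {"A": [], "B": []}

    def next_version(self):
        # Random assignment, not judgment; a version stops being offered
        # once it has reached its pre-committed sample size.
        open_versions = [v for v, s in self.scores.items()
                         if len(s) < self.target]
        return self.rng.choice(open_versions) if open_versions else None

    def record(self, version, score):
        self.scores[version].append(score)

    def result(self):
        if any(len(s) < self.target for s in self.scores.values()):
            return None  # honor the sample size: no peeking
        means = {v: statistics.mean(s) for v, s in self.scores.items()}
        return max(means, key=means.get), means
```

The design choice worth noticing is that `result()` returns nothing until the pre-committed sample is complete — the structure itself enforces the discipline the next section describes.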
The discipline that makes comparison trustworthy
A/B testing is structurally simple. Its difficulty is entirely in the discipline.
The first temptation is to peek at results early and stop the test when one version is ahead. This is called the peeking problem, and it systematically inflates false positives. A version that is ahead after three data points may be behind after thirty. Commit to your sample size in advance and honor it.
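The inflation is easy to demonstrate with a simulation (illustrative numbers, not from the source): run many A/A tests, where both versions are identical so any "significant" result is a false positive, and compare a fixed-sample analysis against an analyst who checks for significance at every interim look.

```python
import math
import random

def z_stat(wins_a, n_a, wins_b, n_b):
    """z-statistic for the difference between two success rates."""
    p = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (wins_b / n_b - wins_a / n_a) / se

def false_positive_rates(runs=1000, looks=10, batch=50, rate=0.5, seed=0):
    """A/A simulation: both versions share the same true rate."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(runs):
        wa = wb = na = nb = 0
        peeked = False
        for _ in range(looks):
            wa += sum(rng.random() < rate for _ in range(batch)); na += batch
            wb += sum(rng.random() < rate for _ in range(batch)); nb += batch
            if abs(z_stat(wa, na, wb, nb)) > 1.96:
                peeked = True  # would have stopped and declared a winner
        if peeked:
            peeking_hits += 1
        if abs(z_stat(wa, na, wb, nb)) > 1.96:  # single look at the end
            final_hits += 1
    return peeking_hits / runs, final_hits / runs

peek_rate, fixed_rate = false_positive_rates()
```

The fixed-sample analysis produces false positives at roughly the nominal 5% rate; checking at ten interim looks multiplies that rate several times over, even though nothing about the underlying data changed.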
The second temptation is to explain away results you do not like. Version B performed worse, but you had a bad week, or the tasks were unusually hard, or you were not really trying. These explanations may be true. They are also exactly the kind of post-hoc rationalization that controlled experiments are designed to overcome. If your test was properly designed — random assignment, sufficient sample, pre-defined metric — trust the data over the narrative.
The third temptation is to test too many things at once. You have seven ideas for improving your agent. Testing them one at a time feels painfully slow. So you bundle three changes into version B. This is the confounding problem that connects directly to L-0567. When version B wins, you do not know which of the three changes caused the improvement. When version B loses, you do not know whether one brilliant change was dragged down by two bad ones. Sequential single-variable tests are slower but produce actual knowledge. Bundled tests are faster but produce stories.
The underlying principle is this: an A/B test is not a way to prove that your idea works. It is a way to find out whether your idea works. The distinction matters because it determines your relationship to the result. If you are trying to prove something, negative results feel like failure. If you are trying to find out something, negative results are exactly as informative as positive ones. Kohavi found that two-thirds of promising ideas at Microsoft produced no improvement. Each negative result was as valuable as each positive one — it prevented the company from shipping a change that would have degraded the product while feeling like progress.
From comparison to isolation
You now have a method: run two versions, measure the difference, keep the winner. This is the engine of data-driven improvement for every agent in your system — AI agents, cognitive routines, decision frameworks, information processes.
But the engine has a critical dependency. The validity of your comparison rests entirely on one condition: that the only difference between A and B is the variable you intended to change. If other variables crept in — different conditions, different inputs, different contexts — your comparison is contaminated. You might keep the wrong version or discard the right one, and the data would look clean either way.
L-0567 addresses this dependency directly. Isolating variables when optimizing is the discipline that ensures your A/B tests actually measure what you think they measure. Without it, you have the form of experimentation without the substance. With it, every comparison you run compounds into genuine understanding of what makes your agents better.
Sources:
- Fisher, R. A. (1935). The Design of Experiments. Oliver & Boyd. (Randomization, replication, and blocking as foundations of experimental design)
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. (One-third success rate of ideas, Bing experimentation at scale)
- Kohavi, R. et al. (2020). "Online Randomized Controlled Experiments at Scale: Lessons and Extensions to Medicine." Trials, 21, 150. (Bing revenue experiment, 300 experiments per week)
- Bergstra, J. & Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research, 13, 281-305. (Random search outperforming grid search)
- Thompson, W. R. (1933). "On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples." Biometrika, 25(3/4), 285-294. (Thompson Sampling for multi-armed bandits)
- Roberts, S. (2004). "Self-experimentation as a Source of New Ideas." Behavioral and Brain Sciences, 27(2), 227-262. (N-of-1 personal experimentation methodology)
- Bischl, B. et al. (2023). "Hyperparameter Optimization: Foundations, Algorithms, Best Practices, and Open Challenges." WIREs Data Mining and Knowledge Discovery. (Bayesian optimization superiority in fewer evaluations)