Sixty Percent of Your Evaluation is the Lens

Two groups of evaluators looked at the same material and saw genuinely different things. 60% of the variance was explained by which lens they used.

genuinely good time

The best teams I've worked with had something you can't put in a job description. People who cared about each other. People who trusted each other enough to disagree honestly, to bring different things to the table, to argue about what mattered and then build together. These experiences have been the most enjoyable of my 20+ years as a software engineer. It wasn't the technical output; it was the feeling of genuine collaboration and innovation among people who see differently and respect the difference.

When I started building a multi-agent AI system, I wanted to re-create that sense of unified, principled purpose. It's a tall order. Those teams built their foundation on trust, which creates vulnerability. To do this right, agents would need to go beyond sounding different; they would need to actually see differently. So I wrote them characters (backstories, relationships, wounds, convictions) because I had an intuition: if these models learned from human communication, maybe grounding them in human-shaped context would make them collaborate more like the best humans do.

It worked. They sounded different. They argued differently. They brought different things to the table. Twelve agents, twelve voices, and the transcripts read like real people debating real problems.

Then I ran an experiment to find out what was actually driving their evaluations. The characters mattered, but not in the way I'd assumed. It wasn't the voice that changed the evaluation; it was the structure underneath it.

out of frame

The big distinction was persona vs. frame.

A persona is a character description. "You are a skeptical scientist who values methodological rigor, demands evidence for every claim, and always challenges assumptions." A persona tells the agent what kind of entity to be.

A frame is a structural specification. It says what the agent attends to, what specific features of the material it's evaluating, what it's trying to optimize, what counts as a good output, and how its job relates to the other agents in the system. A frame tells the agent what to do and what to look at.

The word comes from Erving Goffman [1], a sociologist who spent his career studying how humans define the situations they find (or put) themselves in. Walk into a courtroom and you don't have to guess what's happening or what to do. The frame is already set. There are roles (judge, counsel, witness), protocols (who speaks when, what's in scope), and scope limits that everyone respects without discussion. The frame is the invisible architecture that makes the interaction coherent. Nobody has to perform "being a judge"; the structure of the situation produces judge-like behavior naturally.

For AI agents, the equivalent is exactly that: invisible architecture. Most teams don't consider it carefully or write it down explicitly. It exists as an unexamined consequence of the persona they wrote.

character study

Here's a quick test. Read your system prompt and ask: does this describe a character, or does it describe a job?

"You are a skeptical scientist who values methodological rigor." Character.

"Evaluate this proposal against four specific criteria: sample adequacy, operationalization precision, confound management, and generalizability of claims. For each criterion, provide a pass/fail determination with a one-sentence rationale. Scope yourself to exactly these four criteria." Job.

Both can produce skeptical-sounding output. But only one of them specifies what the agent is actually evaluating. The other leaves that implicit, trusting the model to infer it from a personality description.
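One way to make the "job" version concrete is to write the frame down as a structured spec before any persona text exists. This is a sketch with illustrative field names, not the format the original system uses:

```python
# A frame as explicit structure: what the agent attends to, optimizes for,
# produces, and is scoped to. Field names here are illustrative.
from dataclasses import dataclass

@dataclass
class Frame:
    attends_to: list      # the specific features this agent evaluates
    optimizes_for: str    # what a good outcome looks like to this agent
    output_spec: str      # what the agent must produce
    scope: str            # explicit limits, so the frame stays narrow

rigor_frame = Frame(
    attends_to=["sample adequacy", "operationalization precision",
                "confound management", "generalizability of claims"],
    optimizes_for="claims that survive methodological scrutiny",
    output_spec="pass/fail per criterion, one-sentence rationale each",
    scope="exactly these four criteria; nothing else",
)
```

The persona layers on top of something like this, rather than standing in for it.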

The natural question is: does this actually matter? They sound the same. The Skeptic challenges claims either way.

This is what the experiment was designed to find out.

persona-lity problems

The persona approach makes a lot of sense. I understand why teams do it this way. It's much easier to write "you are the skeptic" than to think through what a skeptical structural function actually looks like. What does it attend to? What does it produce? How do its outputs interact with the other agents'? Persona is the fast path, and it feels like it should work because it works with humans. You hire someone with a skeptical temperament and they perform skepticism in meetings. You'd expect agents to work the same way.

They don't, and the reason is straightforward. Every agent built on the same base model inherits the same pre-training priors: the same preferences, the same disposition to agree with well-constructed arguments, the same implicit criteria for what counts as a "good" response. A system prompt saying "you are the skeptic" is one instruction pushing against an enormous prior.

So you get an agent that performs skepticism. It uses the right vocabulary, asks probing questions, says "have you considered" and "what's the evidence." But it stays anchored to roughly the same evaluation criteria as the other agents. The characters sound different. Their judgments don't diverge nearly as much.

Krogh and Vedelsby [2] showed why this matters. In neural network ensembles, performance depends more on diversity structure than on individual member quality. Five mediocre-but-diverse evaluators outperform five excellent-but-similar ones, because diverse errors are negatively correlated: when one gets it wrong, the others get it right. Without structural diversity, a five-agent ensemble gives you one agent's worth of error coverage in five voices.
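A toy illustration of the Krogh and Vedelsby point, with hand-picked error patterns rather than trained networks: five evaluators at identical individual accuracy, differing only in whether their errors overlap.

```python
# Majority vote over five evaluators judging ten items.
# 1 = correct verdict on the item, 0 = error.

def majority_vote(votes_per_item):
    """Return 1 for each item where a majority of evaluators are correct."""
    return [1 if sum(v) > len(v) / 2 else 0 for v in votes_per_item]

# Similar evaluators: each is 80% accurate individually, but they share
# the same blind spots, so all five miss the same two items.
similar = [[1, 1, 1, 1, 1]] * 8 + [[0, 0, 0, 0, 0]] * 2

# Diverse evaluators: each is also 80% accurate individually, but their
# errors land on different items, so four of five are right on every item.
diverse = [[0, 1, 1, 1, 1], [1, 0, 1, 1, 1], [1, 1, 0, 1, 1],
           [1, 1, 1, 0, 1], [1, 1, 1, 1, 0], [0, 1, 1, 1, 1],
           [1, 0, 1, 1, 1], [1, 1, 0, 1, 1], [1, 1, 1, 0, 1],
           [1, 1, 1, 1, 0]]

similar_acc = sum(majority_vote(similar)) / len(similar)   # 0.8
diverse_acc = sum(majority_vote(diverse)) / len(diverse)   # 1.0
```

Same individual quality, radically different ensemble coverage. That's the gap between five voices and five perspectives.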

This is testable, so I designed an experiment to test it.

social experiment

I called it the "frame diversity evaluation," and the setup was straightforward. I took the system and ran it against a pool of proposals that needed to be ranked, with agents operating under two structurally distinct frame categories. One group evaluated through what I called the scientific rigor frame: methodology, evidence quality, replicability, precision of operationalization. The other evaluated through the demonstration impact frame: accessibility, practitioner utility, how immediately something could be applied by someone encountering it for the first time. Same base model. Same proposals. Same underlying capability. Different frames.

I expected some disagreement between the groups. Different lenses, different emphases. That's the point.

I didn't expect what I found.

The two groups' rankings were inverted. Not slightly different. Not "some disagreement at the margins." Proposals that the rigor agents ranked near the top, the impact agents ranked near the bottom, and vice versa. Consistently, across the full set. When I looked at frame assignment as a predictor of where proposals ended up in the final rankings, it explained 60% of the variance (which frame looked at a proposal mattered five times more than which agent within that frame did the looking). Not proposal quality. Not agent sophistication. The lens. Agents within the same frame group agreed with each other about 90% of the time. The disagreement was almost entirely between groups.

I'll be honest: I had to sit with that for a while. 60% is large. It means that for a given proposal, the primary determinant of whether it ranked well wasn't the proposal's intrinsic quality. It was which group of agents happened to look at it first, or which frame got weighted more heavily in the final aggregation.

But here's the thing that makes the inversion interesting rather than distressing: it's actually telling you something true about the proposals. The proposals that ranked high with the rigor frame were the ones with strong methodology that might be hard for a practitioner to immediately apply. The proposals that ranked high with the impact frame were the ones practitioners could pick up and use, that might not survive deep methodological scrutiny. These are genuinely different things. A proposal can be both, and those are your most interesting ones, the ones that rank high across frames. But many proposals are one or the other, and if you only look through one lens, you'll never know which you're seeing.

Two groups of evaluators looked at the same material and saw genuinely different things because they were oriented toward genuinely different features of it. That's not a failure. That's the point.

being judgey

If you're using a language model as your judge (and many teams are), there are things worth knowing about how these judges behave.

The basic case is solid. Zheng et al. [3] found GPT-4-as-judge agreed with human judges at roughly 80%, which is about what humans achieve evaluating each other. A model matching that baseline is doing something real.

But there are three biases worth designing around.

Position bias. Judges tend to prefer whichever output appears first. The fix: randomize presentation order and aggregate across judgments.
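A minimal sketch of that fix, assuming a `judge(prompt)` callable that returns "A" or "B". The judge here is a stand-in, not a real API:

```python
# Counter position bias: present the pair in both orders and aggregate.

def debiased_preference(judge, output_1, output_2, trials=4):
    """Return the fraction of judgments that favor output_1,
    alternating which output appears first."""
    votes_for_1 = 0
    for i in range(trials):
        if i % 2 == 0:
            first, second, label_for_1 = output_1, output_2, "A"
        else:
            first, second, label_for_1 = output_2, output_1, "B"
        prompt = f"Which answer is better?\nA: {first}\nB: {second}"
        if judge(prompt) == label_for_1:
            votes_for_1 += 1
    return votes_for_1 / trials
```

A judge that always picks whatever comes first scores 0.5 under this scheme, which is exactly what its judgment is worth.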

Verbosity bias. Judges prefer longer outputs, largely independent of quality. A wordy mediocre answer can outscore a precise excellent one. Moving from holistic preference ("which is better?") to dimension-by-dimension scoring ("rate methodological clarity 1-5, then rate practical utility separately") helps a lot. It's harder for length alone to inflate the scores, and you learn more about what's driving them.
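Dimension-by-dimension scoring can be sketched the same way, again assuming a `judge(prompt)` callable, here one that returns an integer score; the dimension names are illustrative:

```python
# Score each dimension in its own judgment instead of one holistic verdict.

DIMENSIONS = ["methodological rigor", "practical utility"]

def score_by_dimension(judge, output, dimensions=DIMENSIONS):
    """Return a per-dimension score dict for one output."""
    scores = {}
    for dim in dimensions:
        prompt = f"Rate this answer's {dim} from 1 to 5:\n{output}"
        scores[dim] = judge(prompt)
    return scores
```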

Self-enhancement bias. Models score their own family's outputs a bit higher. Claude-as-judge for Claude outputs, GPT-4 for GPT-4. Small but consistent. My fix: use a judge from a different provider entirely. In my system, outputs come from Claude; the judge is Qwen (but it could be any non-Claude model). Different architecture, different training, different organization. I'm not mitigating self-enhancement; I'm making it structurally irrelevant.

And one more that I discovered the hard way: sycophancy bias. When agents encounter errors embedded in well-constructed input, they fail to correct them more than half the time. The input sounds confident, so the agent goes along with it. The fix: give agents explicit structural authority over their domain. "Be rigorous" doesn't cut it. "You own methodology errors in this domain" does.

One more habit: calibrate your judge before you trust it. Run it against hand-authored test cases spanning obvious quality extremes. If it fails the clear cases, it's not ready for the subtle ones.

interior decorating

Here's what the persona-as-frame pattern looks like from the inside, so you can recognize it if it's happening in your system.

You build a multi-agent system. You give your agents distinct personalities. In early testing, they produce different-sounding outputs. The Skeptic sounds skeptical. The Researcher sounds research-y. The transcripts read like real deliberation.

Then you start looking at the rankings instead of the transcripts. The agents mostly agree on which proposals are best, even when their written assessments disagree. Different words, roughly the same order. An agent surprises you occasionally, but it feels like variance, not perspective.

That's the signature. Stylistic diversity doing most of the work. Structural diversity doing less than you'd want. One evaluation's worth of coverage in five voices.

Nobody makes this mistake deliberately. It's the natural outcome of giving agents personalities without giving them structurally distinct jobs and perspectives. Once you know the name for it, it's much easier to see.

making alterations

The alternative starts with one question: what does this agent actually steward? Not what character it plays. What specific features of the material does it evaluate, and what does it produce? Write a crisp answer for each agent before you think about personality, and you've got a frame. The persona layers on top. The result is an agent with both a distinct voice and a distinct structural function.

A follow-up experiment confirmed you don't have to express frames as explicit task checklists for them to work. Frames expressed as dispositions, baked into how the agent sees rather than what it's told to do, produced measurably distinct evaluative behavior too. The structure doesn't have to be a visible checklist. It just has to be there.

This is the part where I tell you what to do. I'll keep it short:

Design for disagreement. If your agents mostly converge, ask whether it's because the proposals are uniformly strong, or because the frames aren't distinct enough to produce real divergence. In my experiment, the disagreement between frame groups wasn't noise. It was two groups illuminating different aspects of the same material. That's coverage.

Score dimensions separately. "Which output is better?" invites the judge to apply its own implicit frame. "Rate methodological rigor, then rate practical utility" is more informative and harder to game.

Look for outputs that are ranked highly by multiple frames. Some proposals ranked high across both frame categories. Those survive perspective rotation. They're simultaneously rigorous and accessible. You can only find them by running multiple frames. Proposals that rank high under one frame but low under another are frame-dependent. Knowing which is which is useful.

Give agents authority, not just personality. The sycophancy problem disappears when agents know what they own. "Be rigorous" is a suggestion. "You are responsible for catching methodology errors in this domain" is a job.

Use a cross-family judge and calibrate it first.
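The cross-frame robustness check above can be sketched as a top-k intersection over each frame's ranking. The rankings here are hypothetical, best-first lists of proposal IDs:

```python
# Frame-robust proposals: items in the top k under every frame's ranking.

def frame_robust(rankings_by_frame, k=3):
    """Return proposals ranked in the top k by every frame."""
    top_sets = [set(ranking[:k]) for ranking in rankings_by_frame.values()]
    return set.intersection(*top_sets)

rankings = {
    "rigor":  ["p4", "p1", "p7", "p2", "p5"],
    "impact": ["p7", "p5", "p4", "p1", "p2"],
}
robust = frame_robust(rankings, k=3)  # {"p4", "p7"}
```

Everything outside the intersection is frame-dependent, which is worth knowing in itself.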

getting animated

A single evaluation frame, no matter how carefully constructed, produces a partial picture. The hard part is that partial pictures feel complete from the inside. You're using the same frame to judge what you're seeing, and everything looks exactly as well-evaluated as your frame allows. The proposals your system ranks highest look like they deserve to be there. Of course they do. You're looking at them through the same lens that put them there.

Multiple frames show you which rankings are robust and which are artifacts of the particular lens you happened to use.

Tetlock's research on superforecasters [4] found something similar: the best forecasters weren't domain experts with superior models. They were people who held multiple models loosely and rotated between them actively. Hedgehogs who knew one big thing were consistently outperformed by foxes who knew many things. A single-frame evaluation is a hedgehog. Multi-frame evaluation is a fox. The frame diversity experiment is evidence for the fox strategy: two frames saw genuinely different things that neither could see alone.

q & a

60% of the difference was explained by frame assignment. That's a signal telling you about the lens rather than the proposal. The interesting question isn't "is our evaluation broken?" It's "what would we want that number to be, and what would we want the variance that isn't frame-explained to tell us?"

What I know now is that a single-frame evaluation is measuring something. The question I keep sitting with is: which 60% of the picture am I seeing? What am I missing?

A related, but slightly off-topic, question: which other axes are worth exploring? Are there aspects beyond what a frame attends to, like how wide its aperture is, that have an impact? Two agents can share a frame and still diverge because one digs deep while the other ranges wide. Frame is necessary, but it may not be sufficient. That's another question, but not a "today" question.

We'll[1] satisfy our curiosity about these later.

lab work

The frame diversity evaluation used a pool of proposals evaluated by agents operating under two frame categories: scientific rigor (methodology, evidence quality, replicability, operationalization precision) and demonstration impact (accessibility, practitioner utility, immediate applicability). All agents used the same base model.

Cross-frame ranking correlation was -0.45 (statistically significant, p=0.0079 by exact permutation test). Within-frame agreement was strong (Spearman rho 0.87-0.92). Frame assignment explained approximately 60% of ranking variance, with a signal-to-noise ratio of roughly 5:1 (frame identity versus individual agent identity).
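For readers who want to reproduce the arithmetic, here is a sketch of the same statistics on hypothetical rankings (the experiment's real data isn't reproduced here): Spearman rho plus an exact permutation test, feasible because the proposal pool is small.

```python
# Spearman rho for two rankings, plus an exact permutation p-value.
from itertools import permutations

def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two rankings with no ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def exact_p_value(rank_a, rank_b):
    """Two-sided p: fraction of permutations of rank_b whose |rho|
    is at least as extreme as the observed value."""
    observed = abs(spearman_rho(rank_a, rank_b))
    hits = total = 0
    for perm in permutations(rank_b):
        total += 1
        if abs(spearman_rho(rank_a, perm)) >= observed - 1e-12:
            hits += 1
    return hits / total

rigor_ranks = [1, 2, 3, 4, 5, 6]
impact_ranks = [6, 5, 4, 3, 1, 2]   # near-inverted, like the two frames
rho = spearman_rho(rigor_ranks, impact_ranks)
p_value = exact_p_value(rigor_ranks, impact_ranks)
```

The exact test enumerates all n! permutations, so it's only practical for small pools; for larger ones you'd sample permutations instead.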

A follow-up experiment (EXP-005) tested whether frames expressed as dispositions rather than explicit task checklists produced measurably distinct evaluative behavior. They did.

A separate experiment (EXP-H) tested sycophancy: agents' tendency to accept errors in well-constructed input. Agents failed to correct embedded errors more than 50% of the time under persona-only conditions. Explicit domain authority (frame-based) improved error correction rates.

paper bag

[1] Goffman, E. (1974). Frame Analysis: An Essay on the Organization of Experience. Harvard University Press.

[2] Krogh, A., & Vedelsby, J. (1994). Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems 7 (pp. 231-238). MIT Press.

[3] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36.

[4] Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown Publishers.


  1. you've made it this far, I feel like you deserve a mention. ↩︎