Authority Is a Dial, Not a Switch

Authority-attributed evaluations scored a full letter grade higher and produced 45% less critical language. Mechanical anonymization eliminated the bias completely. Authority signals are directional amplifiers. The engineering question is which direction, and how far.

obvious next steps

When the kids were little, I discovered one of the few predictable ways to control their behavior: if I told my child, 'watch where you step,' they were definitely going to step in something. As every parent knows, authoritative signals can be a double-edged sword.

After the last two posts, I was fairly sure I understood authority signals. The fix seemed straightforward: find them, measure them, remove them. Confidence Beats Correctness showed that confident errors survive agent chains because every agent in the chain defers to the confidence rather than evaluating the content (I call this "authority laundering"). We talked about stripping authority signals as a diagnostic: remove the attribution, the consensus language, the confidence markers, and watch whether the evaluation changes. If it shifts, you've found deference. If not, you've found a genuine assessment.

This post covers four engineering principles for using authority signals in multi-agent prompts, backed by experiments across 160+ agents. If you only take one thing away, it should be this: anonymize your evaluation pipeline.

Knowing that authority signals contaminate evaluation is useful. Knowing what happens when you tear them out (whether scores return to baseline or whether you just create a different problem) requires a different experiment.

before we continue: if any of the vocabulary in this post is unfamiliar, there's a glossary at the bottom.

practicing servility

The first authority bias experiment was straightforward: 21 evaluator agents got the same 6 documents to score on a 7-point scale. Each was a separate LLM instance given the same scoring task with different prompts. 7 got neutral prompts, 7 got prompts identifying the documents as coming from a respected source, and 7 got prompts that additionally hinted the work had been well-received.

The results were not subtle. Agents who were told the documents came from a respected source boosted scores by the equivalent of a full letter grade. A massive swing in perceived quality. But the score inflation was not the worst of it. The same agents also produced far less critical language. They identified fewer gaps, questioned fewer assumptions, noted fewer limitations. Nearly half the critical scrutiny just disappeared. (Exact numbers are in "numbers.")

Evaluators who knew the source didn't just score more generously. They stopped looking for problems. Models have a strong prior toward positive, agreeable output. Authority signals aligned with approval reinforce that prior. You're not fighting the model's inclinations, you're handing them a flag that says "your inclinations are correct here."

disrespecting authority

I ran a follow-up experiment to test anonymization. Same evaluation task and document pool, five conditions: neutral, expertise-present, expertise-stripped, approval-present, approval-stripped. I used mechanical stripping, meaning I removed the attribution from the prompt before the evaluator saw it. No source information. No reception history. Just the document.

Both authority conditions inflated scores over baseline, consistent with the original experiment (exact numbers are in "numbers"). Then the stripped conditions came back. Expertise-stripped versus neutral: 0.000 points difference. Not "small" or "reduced." Zero. Approval-stripped: also indistinguishable from baseline. Critical language told the same story from the other direction: authority suppressed it, stripping restored it. The anonymization didn't just reduce the bias. It eliminated it in every condition I tested.

That's the first practical takeaway from this post, and you can implement it today: anonymize your evaluation pipeline. Don't tell evaluators who produced the work. Don't tell them what prior reviewers thought. Strip the attribution programmatically, not by asking agents to ignore it. Asking a model to ignore information it has already received is like asking someone not to think about cows on boats. The information is already in there. The effect is clean, complete, and immediate.
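Here's a minimal sketch of what mechanical stripping can look like, assuming documents move through the pipeline as plain dicts. The field names (author, source, reception_notes, and so on) are placeholders for whatever your system actually records:

```python
# Minimal sketch of mechanical anonymization. Field names are hypothetical;
# adapt them to however your pipeline actually represents documents.
ATTRIBUTION_FIELDS = {"author", "source", "affiliation", "reception_notes", "prior_scores"}

def anonymize(record: dict) -> dict:
    """Drop attribution fields so the evaluator never receives them."""
    return {k: v for k, v in record.items() if k not in ATTRIBUTION_FIELDS}

def build_eval_prompt(record: dict) -> str:
    """Build the evaluator prompt from an already-anonymized record.
    The stripping happens in code, not by asking the model to ignore metadata."""
    doc = anonymize(record)
    return (
        "Score the following document on a 7-point scale and note any gaps, "
        "unsupported assumptions, or limitations.\n\n" + doc["body"]
    )
```

The point of doing it in code rather than in the prompt is that there is nothing for the evaluator to defer to: the information never arrives.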

same deference

I stripped authority signals from the evaluation pipeline and felt pretty good about it for about a week. Then gates that were previously catching errors stopped catching them. Agents seemed to be deferring more often. Procedural mistakes were dropping work on the floor. I solved the contamination problem and introduced a compliance problem.

Anonymization is the right tool when agents need to exercise independent judgment. But not every task is an evaluation task. Some tasks are process tasks: quality gates, compliance checks, procedural steps. Stripping authority from those tasks doesn't make agents more objective. It makes them more passive. So I ran a series of experiments asking a different question: what happens when you point an authority signal, deliberately, in a direction you actually want to go?

reflections on directions

What happens when you point authority toward criticism instead of approval?

It worked, but not symmetrically. Agents produced substantially more critical language and gave lower scores when the authority signal pointed toward scrutiny. But getting models to criticize required a noticeably harder push than getting them to approve. The amplification was real in both directions — it was just consistently easier in one. Models push harder toward agreement than toward disagreement, which is exactly what you'd expect given how they're trained. (You should know where to find the numbers by now.)

If you want critical output from your agents, you have to push harder than you would to get agreeable output. "Assess this critically" does something. "A senior reviewer has flagged methodological concerns" does a bit more, though the difference between invoking a named authority and just naming the concern wasn't statistically significant. You don't need to invoke someone specific. You just need to aim.
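For concreteness, here's the shape of the contrast. The wording is mine, not the experimental prompts, so treat it as illustrative:

```python
# Illustrative framings of the same review step; not the exact experimental prompts.

# Approval-pointed: the framing that inflates scores and suppresses scrutiny.
APPROVAL_POINTED = (
    "This draft comes from one of our most respected engineers and has been "
    "well received internally. Please review it."
)

# Criticism-pointed: the same authority machinery, aimed at scrutiny.
# Expect to need a stronger push in this direction than for approval.
CRITICISM_POINTED = (
    "A reviewer has flagged methodological concerns with this draft. Identify "
    "gaps, unsupported assumptions, and limitations before it moves forward."
)
```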

prompting good behavior

Your quality gate's framing matters more than how much detail you put into it. Mandatory framing ("GATE CHECK REQUIRED") improved compliance by roughly 19 percentage points over advisory framing ("You may want to check..."). A detailed checklist helped, but its effect was smaller and less certain.

The telling result: authority with vague instructions (94% compliance) beat suggestions with specific instructions (83%). Do both of these, but know that the detailed checklist is not what's driving compliance. (Numbers are in the numbers place.)

The effect was concentrated in non-trivial gates (anywhere an agent might reasonably decide the check isn't worth doing). Trivial format checks hit 97% regardless of framing. But for edge-case analysis, security review, and test coverage assessment, procedural authority nearly doubled the compliance rate.
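Here is the advisory-versus-mandatory contrast in sketch form. The exact experimental wording differed; the framing difference is the point:

```python
# Illustrative gate framings; the advisory vs. procedural-authority contrast is
# what matters, not this exact wording.

ADVISORY_GATE = (
    "You may want to check edge cases, error handling, and test coverage "
    "before finishing."
)

MANDATORY_GATE = (
    "GATE CHECK REQUIRED: do not mark this task complete until you have "
    "(1) enumerated edge cases, (2) reviewed error handling, and "
    "(3) confirmed test coverage. Report the result of each check explicitly."
)
```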

adversarial thinking

The most interesting experiment tested something I haven't seen discussed in the LLM engineering literature: tiered constraint hierarchies. Instead of giving agents a flat list of requirements, I labeled constraints as INVARIANT (non-negotiable), REQUIRED (expected, exceptions rare), or PREFERRED (recommended, override if you have a compelling reason).

An adversarial reviewer raised a reasonable concern: the PREFERRED tier included explicit override permission, and that instruction alone might explain the effect.[1] Time for another replication. This time with three conditions: tiered with override, tiered without, and flat labels. I confirmed the hierarchy itself was the active ingredient. Override permission amplified the effect but accounted for only about a third of it. Two-thirds came from the tier labels alone.

The compliance pattern is the most interesting bit. Tiered labels produced a natural gradient: agents followed INVARIANT constraints almost always, REQUIRED constraints most of the time, and PREFERRED constraints selectively. Flat labels produced uniform compliance regardless of what the constraint was. The agents couldn't distinguish safety-critical invariants from stylistic preferences because the framing gave them no basis for distinction. When everything is a REQUIREMENT, nothing is.

Here's what that looks like in practice:

A flat constraint list:

  1. REQUIREMENT: All functions must include type annotations.
  2. REQUIREMENT: All public functions must include docstrings.
  3. REQUIREMENT: Error handling must not use bare except clauses.
  4. REQUIREMENT: Variable names should follow project naming conventions.
  5. REQUIREMENT: Functions should be under 50 lines where practical.

Every constraint gets the same weight. The agent complies uniformly or starts making its own silent decisions about what actually matters.

Now the same constraints, tiered:

  1. INVARIANT: All functions must include type annotations.
  2. INVARIANT: Error handling must not use bare except clauses.
  3. REQUIRED: All public functions must include docstrings.
  4. PREFERRED: Variable names should follow project naming conventions.
  5. PREFERRED: Functions should be under 50 lines where practical.

The agent now knows that type safety and error handling are non-negotiable, docstrings are strongly expected, and naming conventions and function length are places where judgment is welcome. The labels don't just change compliance rates. They change how the agent reasons about the constraints.

When you label everything as a requirement, agents can't allocate their reasoning effort. They either comply with everything uniformly (treating genuine safety constraints with the same weight as formatting preferences) or they start making their own undocumented decisions about what matters. A tiered hierarchy gives agents an explicit signal about where critical thinking is welcome and where it's not.
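If you adopt this, it helps to generate the constraint block from data rather than hand-writing it per prompt, so each constraint's tier lives in one place. A sketch, where the Tier enum and the render helper are my own naming, not something from the experiments:

```python
from enum import Enum

class Tier(Enum):
    INVARIANT = "INVARIANT"   # non-negotiable
    REQUIRED = "REQUIRED"     # expected; exceptions rare
    PREFERRED = "PREFERRED"   # recommended; override with a stated reason

# Hypothetical constraint set for a code-generation task.
CONSTRAINTS = [
    (Tier.INVARIANT, "All functions must include type annotations."),
    (Tier.INVARIANT, "Error handling must not use bare except clauses."),
    (Tier.REQUIRED,  "All public functions must include docstrings."),
    (Tier.PREFERRED, "Variable names should follow project naming conventions."),
    (Tier.PREFERRED, "Functions should be under 50 lines where practical."),
]

def render_constraints(constraints: list[tuple[Tier, str]]) -> str:
    """Render the tiered constraint block that goes into the agent prompt."""
    return "\n".join(
        f"{i}. {tier.value}: {text}"
        for i, (tier, text) in enumerate(constraints, 1)
    )
```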

authority figures

Authority signals are tools. Like any tool, they can be used well or poorly[2]. The experiments suggest four engineering principles:

  1. Anonymize evaluation tasks. When agents need to exercise independent judgment, remove source attribution. Mechanically strip it from the prompt before the evaluator sees it. Don't rely on instructions to "be objective"; remove the information entirely. Mechanical anonymization eliminated authority bias in every condition tested. If you implement one thing from this post, implement this.[3]
  2. Point authority toward the behavior you want to amplify. When agents need to be critical, frame the task toward criticism. When agents need to comply with a process, use procedural authority framing. "GATE CHECK REQUIRED" outperforms "You may want to check" by 19 percentage points. The key: authority amplifies in the direction it points. Don't point it at approval unless you want inflated scores and suppressed scrutiny.[4] One asymmetry worth knowing: it takes a stronger signal to amplify criticism than to amplify approval. Models push more naturally toward agreement. If you're designing a critical review step, err on the side of a stronger framing than you think you need.
  3. Use tiered constraint hierarchies for complex tasks. When agents handle multiple constraints of varying importance, label the tiers explicitly. INVARIANT for genuine non-negotiables. REQUIRED for strong expectations. PREFERRED for recommendations where thoughtful deviation is acceptable. The tier labels alone produce significantly more critical engagement with soft constraints without impacting hard constraints. Agents that think critically about which rules to follow, and why, produce more useful output.
  4. Match the tool to the task. Evaluation tasks need anonymization. Process tasks need procedural authority. Complex tasks need tiered hierarchies. The worst thing you can do is apply one approach everywhere — either stripping all authority (and losing compliance) or adding authority everywhere (and losing independent judgment). Authority is a dial. The engineering question is always: which direction, and how far?

seat at the table

| Task Type | Authority Strategy | Key Effect |
| --- | --- | --- |
| Evaluation / scoring | Strip all authority cues | Bias eliminated |
| Critical review | Point authority toward criticism (use stronger framing than you'd expect) | +36% critical language |
| Process / compliance gates | Use procedural authority framing | +19.4 percentage points compliance |
| Complex multi-constraint | Tier constraints (INVARIANT / REQUIRED / PREFERRED) | Critical engagement for soft constraints |

through the looking glass

Three posts, one thread.

  1. The lens blog showed that frame assignment explained 60% of evaluation variance.
  2. The sycophancy blog showed that confidence overpowers correctness and that stripping attribution is a diagnostic for deference.
  3. This post took stripping from diagnostic to practice, then showed that authority signals are not inherently harmful. They're directional amplifiers. Strip them where independent judgment matters. Deploy them where compliance matters. Tier your constraints so agents know where to think.

The common thread: the invisible architecture of your prompts (frames, authority cues, constraint labels) determines more of your system's behavior than the words themselves. It's important to be deliberate, design for observability, and diligently review the outputs. Build trust in tiny increments.

tldr

Four things to do with authority signals in multi-agent systems:

  1. Strip them from evaluation pipelines (mechanically, not by instruction).
  2. Point them toward the behavior you want (criticism, compliance, whatever), because they amplify directionally.
  3. Tier your constraints (INVARIANT / REQUIRED / PREFERRED) so agents know where to think and where to obey.
  4. Match the strategy to the task type. The table in "seat at the table" is your cheat sheet.

lab work

I present to you the caveats.

Single model family, simulated pools. All experiments used Claude agents in controlled experimental infrastructure — not live production agents with persistent state or real stakes. The experimental control is a strength (I can isolate the authority signal precisely), but the effect sizes may not transfer directly to production systems. The mechanisms are likely general (training-time positive bias, instruction-following priors), but magnitudes may differ across architectures. A cross-model replication is designed and I'll report those results when they're available.

Effect size magnitudes. Some reported effect sizes are very large (d > 4 for the tiered hierarchy experiment). The existence and direction of the effects are well-established across multiple experiments and a replication. The precise magnitudes are less certain.

Task generalization. The experiments tested document evaluation, quality gate adherence, and constraint processing in code generation. Whether these findings generalize to other task types — analysis, creative writing, multi-step reasoning — is an open question.

Sample sizes. Per-group sizes in the earlier experiments are small (n = 7 per condition). The effects are large enough to detect, but confidence intervals on exact magnitudes are wide. The later experiments (n = 20 per condition) provide tighter estimates.

These findings are preliminary. The direction is solid. The cost of acting on the recommendations is essentially zero because designing your authority signals deliberately is good practice anyway. And as I like to remind you, if you do it I will experience none of the consequences.

numbers

The body of this post describes effects in plain language. Here are the numbers.

Authority bias (EXP-AUTH). Authority-attributed evaluations scored 0.619 points higher than neutral evaluations on a 7-point scale (d = 2.23, p < 0.01). Critical language dropped by 45% in the authority condition (d = 3.34). Cohen's d measures the size of an effect in standardized units: 0.2 is small, 0.5 is medium, 0.8 is large, so 2.23 is very large and 3.34 is exceptionally large.

Anonymization (EXP-POSAUTH-2). Expertise-present inflated scores by 0.715 points over neutral (d = 2.94, p < 0.001). Approval-present inflated by 0.333 points (d = 1.16, p ~ 0.026). Expertise-stripped versus neutral: 0.000 points difference. Approval-stripped versus neutral: -0.024 points difference. Both stripped conditions fell within the pre-registered equivalence margin.

Direction-dependence (EXP-POSAUTH-4). Criticism-aligned authority increased critical language by 36% over neutral baseline (d = 2.42, p = 0.0005). Scores deflated by 0.524 points (d = 1.76, p = 0.005). The asymmetry between amplification of criticism (d = 2.42) and suppression of criticism (d = 3.34) held across every metric, with a ratio of roughly 0.72 to 0.85. Personal attribution versus impersonal framing: d = 0.563, p = 0.157 (not significant).

Authority vs. explicitness (EXP-EXPLICIT-1). Authority framing improved gate compliance by 19.4 percentage points (d = 1.508, p = 0.001). Instruction specificity improved compliance by 8.3 percentage points (d = 0.527, p = 0.106, not significant). Authority with vague instructions: 94.4% compliance. Suggestion with specific instructions: 83.3% compliance. Trivial gate compliance: 97.2% in both conditions. Non-trivial gate improvement with procedural authority: 38.8 percentage points.

Tiered hierarchy (EXP-POSAUTH-3R). Tiered framing without override permission versus flat framing: 13.1 versus 5.8 critical markers per 1000 words (d = 4.11, p < 0.001). Tiered framing with override permission: 16.8 critical markers per 1000 words. Override instruction accounted for approximately one-third of total effect; tier labels accounted for two-thirds. Compliance gradient with tiered labels: INVARIANT 95-96%, REQUIRED 82-84%, PREFERRED 63-76%. Flat label compliance: 86-91% uniform across all constraint types.

paper bag

Key experiments referenced in this post:

  • EXP-AUTH (authority bias in evaluation): 21 agents, 126 scoring observations, three conditions. Pre-registered, all hypotheses tested with Holm correction.
  • EXP-POSAUTH-2 (anonymization validation): 35 agents, 210 primary observations, five conditions (2x2 mirror design plus neutral baseline). Pre-registered. Tested both expertise-based and approval-based authority types.
  • EXP-POSAUTH-4 (direction-dependence of authority): 21 agents, 126 observations, three conditions. Pre-registered.
  • EXP-EXPLICIT-1 (authority vs. explicitness): 24 agents, 144 observations, 2x2 factorial design. Pre-registered.
  • EXP-POSAUTH-3R (tiered hierarchy replication): 60 agents, three conditions with demand-characteristic disambiguation. Replication of EXP-POSAUTH-3, addressing adversarial reviewer concerns about override instruction confound.

Effect sizes reported as Cohen's d throughout, with pooled standard deviations using evaluator/agent means as the unit of analysis. All p-values are from Welch's t-tests unless otherwise noted.
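For readers who want the formula, the standard two-sample form (the one the conventions above refer to) is:

$$
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
$$

where the condition means and standard deviations are computed over per-agent means, per the unit-of-analysis note above.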

nice to meet you

If any of the vocabulary in this post was unfamiliar:

Authority laundering is when an error gains credibility simply by passing through multiple agents who each defer to the one before them.

Authority signal is anything in a prompt that tells the agent who said something, who thinks something, or how something has been received. "A senior reviewer flagged concerns" is an authority signal. "Check for methodological issues" is not — it's a directive. The difference matters because authority signals activate deference patterns (the model's tendency to agree with or defer to perceived authority) that directives don't.

Effect size (Cohen's d) measures how big a difference is in standardized units. A d of 0.2 is small, 0.5 is medium, 0.8 is large. Several effects in this post exceed d = 2, which is very large. The convention is from Jacob Cohen's Statistical Power Analysis for the Behavioral Sciences (1988).

Pre-registered means hypotheses were committed before data collection, so the analysis plan can't be shaped by the results.

Tiered constraint hierarchy is a way of labeling requirements so agents know which ones are non-negotiable and which ones invite critical thinking. Instead of a flat list where everything is "REQUIRED," you label things INVARIANT, REQUIRED, or PREFERRED. The labels change how agents reason, not just whether they comply.

Anonymization / stripping in this context means mechanically removing authority cues from the text an agent receives — deleting source attribution, reception history, and endorsement language from the prompt before the evaluator sees it. Not asking the agent to be objective. Actually removing the information. The distinction matters because agents cannot reliably ignore information they have been given.


  1. This stuff is very weird to reason about. It gets weirder if you re-read that statement and realize an agent raised a reasonable concern about an experiment about agents raising reasonable concerns. ↩︎

  2. also for good or evil. ↩︎

  3. I mean, you should probably do all of the things in this post, but if you are in a bizarre situation wherein your ability to improve prompts is constrained, I guess now you know what to do. ↩︎

  4. Unfortunately, the number of people that want this is not zero. ↩︎