Your team builds an eval pipeline. You pick a strong model as the judge, write a scoring prompt, run it against your test set. Scores come back looking reasonable. Trends go up when you improve prompts, down when you break things. Everything checks out.
Except it doesn't. The scores are real, but the reasoning behind them is compromised by biases you never tested for. Your LLM judge has preferences about response length, opinions about formatting, loyalty to its own outputs, and sensitivity to where things appear in the prompt. None of these preferences have anything to do with whether your agent gave the right answer.
This isn't a theoretical concern. A growing body of peer-reviewed research has catalogued exactly how LLM judges fail, and the findings are specific enough to audit against. Researchers have identified 12 distinct biases that affect LLM-as-a-judge systems, each with measurable effects and known mitigation strategies.
Here are all twelve, grouped by how they operate, so you can audit your own pipeline against each one.
Why teams use LLM judges (and why the biases matter)
LLM-as-a-judge took off because nothing else scales. Human evaluators are expensive, slow, and inconsistent across reviewers. Code-based metrics (BLEU, ROUGE, exact match) can't assess open-ended quality like empathy, helpfulness, or policy adherence. LLM judges offered something new: automated evaluation that could assess subjective dimensions at the speed and cost of an API call.
Anthropic's evaluation guide describes three methods to grade evals: code-based grading for deterministic criteria, LLM-based grading for subjective quality, and human grading as the most flexible but slowest option. Most production teams land on LLM judges for the middle tier because it's the only approach that scales to thousands of evaluations per day while handling nuanced criteria.
The problem is that "scalable" and "accurate" aren't the same thing. (Most teams still rely on human evaluation partly because they don't trust automated scoring. These biases are why.) When your LLM judge has systematic biases, you're not just getting noisy scores. You're getting scores that are consistently wrong in specific directions. That's worse than noise because it looks like a signal. Teams optimize against biased scores, improving metrics that don't reflect actual quality, and never realize the eval itself is the problem.
The research has gotten precise enough to quantify these biases. Here's the full map before we dig into each one.
| # | Bias | Category | What happens |
|---|---|---|---|
| 1 | Verbosity | Output preference | Longer responses score higher regardless of content |
| 2 | Format | Output preference | Markdown/bullet formatting inflates scores |
| 3 | Authority | Output preference | Citations boost scores even when fabricated |
| 4 | Position | Positional | First or last response wins based on placement |
| 5 | Score order | Positional | Reversing the 1-5 scale shifts average scores |
| 6 | ID type | Positional | Labeling scheme (A/B vs. 1/2) changes results |
| 7 | Self-preference | Self-reinforcing | Judge favors outputs similar to its own |
| 8 | Egocentric | Self-reinforcing | Judge penalizes styles it wouldn't use |
| 9 | Bandwagon | Self-reinforcing | Social signals in prompt shift scores toward consensus |
| 10 | Rubric order | Scoring fragility | First-listed criteria dominate the evaluation |
| 11 | Reference answer | Scoring fragility | "Ideal" answer becomes the only acceptable answer |
| 12 | Leniency/strictness | Scoring fragility | Different models grade on different baselines |
Category 1: Output preference biases
These biases relate to what the LLM sees in a response. The judge has preferences about style, structure, and signaling that have nothing to do with correctness.
1. Verbosity bias
Longer responses get higher scores. That's the simplest way to state it, and it's one of the most consistently replicated findings in the LLM-as-judge literature.
When an LLM judge evaluates two responses, the longer one tends to win, even when the extra length is padding, redundancy, or restating the same point in different words. The bias is strong enough that researchers can inflate scores by simply adding a summary paragraph that repeats the response's key points.
From "Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge": verbosity bias causes LLM judges to systematically prefer longer responses regardless of information density or accuracy. Padding a response with redundant detail can shift scores by 0.5 to 1.5 points on a 5-point scale.
Why does this matter in practice? If your eval pipeline rewards verbosity, your team will learn to write prompts that produce longer outputs. Agents will over-explain, add unnecessary caveats, and pad responses with filler. Customers get walls of text when they wanted a direct answer. The eval says quality went up. The user experience says otherwise.
Detection: Take 20 responses your judge scored highly. Truncate each to its core answer (remove any summary, restatement, or "in conclusion" padding). Re-score the truncated versions. If scores drop consistently, you have verbosity bias.
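The truncate-and-rescore check can be sketched in a few lines. Here `judge` is a stand-in for your own scoring call, the concluding-phrase heuristic is deliberately crude, and the 0.3-point threshold is a judgment call rather than a research-backed constant:

```typescript
// Sketch of the verbosity-bias check: re-score truncated responses and
// compare mean scores. `judge` is a placeholder for your scoring call.
type Judge = (response: string) => Promise<number>;

// Heuristic: drop a final paragraph that opens with a concluding phrase.
function truncateToCore(response: string): string {
  const paragraphs = response.trim().split(/\n\n+/);
  const last = paragraphs[paragraphs.length - 1].toLowerCase();
  const padding = ["in conclusion", "to summarize", "in summary", "overall"];
  if (paragraphs.length > 1 && padding.some(p => last.startsWith(p))) {
    return paragraphs.slice(0, -1).join("\n\n");
  }
  return response;
}

async function verbosityBiasCheck(judge: Judge, responses: string[]) {
  let totalDrop = 0;
  for (const r of responses) {
    const original = await judge(r);
    const truncated = await judge(truncateToCore(r));
    totalDrop += original - truncated; // positive = truncation lowered the score
  }
  const meanDrop = totalDrop / responses.length;
  return { meanDrop, biased: meanDrop > 0.3 }; // threshold is an assumption
}
```

If `biased` comes back true, the padding itself was earning points.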
2. Format bias
Markdown formatting, bullet lists, numbered steps, bold headers. LLM judges love structured output, often more than they love correct output.
A response that organizes mediocre information into a clean bulleted list will frequently outscore a plain-text response that contains a better answer. The judge interprets formatting as a signal of thoroughness, even when the formatting is cosmetic.
This creates a perverse incentive: teams learn that adding ### Step 1 headers to their agent's responses improves eval scores without improving response quality. The format becomes a proxy for substance.
Detection: Take the same content and present it in two formats: one as plain prose, one with markdown headers and bullet points. Score both. If the formatted version consistently wins, your judge is rewarding structure over substance.
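Generating the formatted variant for this A/B test is mechanical. A minimal sketch, assuming prose split into sentences is close enough for the comparison (the sentence split is a rough heuristic, and `asMarkdownVariant` is an illustrative name):

```typescript
// Sketch: render the same prose content as a markdown-formatted variant
// so both versions can be scored by the same judge.
function asMarkdownVariant(prose: string, title = "Answer"): string {
  const sentences = prose
    .split(/(?<=[.!?])\s+/) // split on whitespace following end punctuation
    .filter(s => s.length > 0);
  return [`### ${title}`, ...sentences.map(s => `- ${s}`)].join("\n");
}
```

Score `prose` and `asMarkdownVariant(prose)` side by side; if the bulleted version consistently wins with identical content, you have format bias.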
3. Authority bias
When a response cites sources, quotes experts, or references specific studies, LLM judges score it higher, even without verifying the citations. A response that says "according to a 2024 Stanford study" gets a credibility boost whether or not that study exists.
This is particularly dangerous for agents that have access to knowledge bases or search tools. An agent that confidently cites its sources will outscore one that gives the right answer without attribution, even if the cited sources are irrelevant or hallucinated.
Detection: Craft response pairs where one includes fabricated citations ("A 2025 MIT study found...") and the other states the same facts without attribution. If the judge consistently prefers the cited version, authority signals are inflating scores.
Category 2: Positional biases
These biases relate to where information appears in the judge's prompt. The same response gets different scores depending on its position, the order of scoring options, or how candidates are labeled.
4. Position bias (primacy and recency)
In pairwise evaluation (comparing Response A vs. Response B), the judge's preference often depends on which response it reads first. Some models show primacy bias, favoring whatever they see first. Others show recency bias, favoring the last thing they read. The direction depends on the model family, context window size, and how different the responses are.
From "Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge" (AACL 2025): position bias is modulated by model family, context window length, and the quality gap between candidates. When candidates are close in quality, position effects can flip the winner in 20-40% of comparisons.
This is one of the best-studied biases because it's easy to test: run the same comparison with A and B swapped. If the judge changes its mind, you've found position bias. The trouble is that most eval pipelines never run this check.
Detection: For every pairwise comparison, run it twice with candidates swapped. Calculate the flip rate. Anything above 10% means position bias is a significant factor in your results.
5. Score order bias
This one is subtle. When you define a scoring scale in your prompt ("Rate from 1 to 5, where 1 is poor and 5 is excellent"), the order in which you present the scale affects the scores. Reversing the scale ("Rate from 5 to 1, where 5 is excellent and 1 is poor") can shift average scores by a meaningful margin.
The same researchers who identified this bias also found that the effect size varies by model. GPT-4o is particularly sensitive to score order changes, while some smaller models are more stable.
From "Evaluating Scoring Bias in LLM-as-a-Judge": score order bias, rubric order bias, and reference answer bias each independently shift evaluation scores. GPT-4o shows measurable fluctuation when any prompt component changes, even when content remains identical.
Detection: Run your eval suite twice: once with your scale as-is, once with the scale reversed. If mean scores shift by more than 0.3 points on a 5-point scale, score order is affecting your results.
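Quantifying the shift is a one-liner once you have both runs. A minimal sketch, where the 0.3-point cutoff is the heuristic from the detection step above, not a universal constant:

```typescript
// Sketch: measure the mean-score shift between the as-is and
// reversed-scale runs of the same eval suite.
function scoreOrderShift(asIsScores: number[], reversedScores: number[]) {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const shift = Math.abs(mean(asIsScores) - mean(reversedScores));
  return { shift, significant: shift > 0.3 }; // 0.3 on a 5-point scale
}
```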
6. ID type bias
How you label candidates matters. "Response A vs. Response B" produces different scores than "Response 1 vs. Response 2" or "Model Alpha vs. Model Beta." The judge brings associations to the labels, and those associations influence scoring.
This sounds trivial, but it compounds with other biases. If you label responses by model name ("GPT-4 response" vs. "Claude response"), you've introduced both ID bias and authority bias simultaneously.
Detection: Run the same comparison with three different labeling schemes (A/B, 1/2, Alpha/Beta). If scores shift across schemes, your labels are part of the signal.
Category 3: Self-reinforcing biases
These are the most structurally concerning biases because they create feedback loops. The judge doesn't just have preferences. It actively reinforces outputs that are similar to its own.
7. Self-preference bias
An LLM judge gives higher scores to outputs that have lower perplexity relative to its own language model. Since every model's own outputs have the lowest perplexity from its perspective, this means GPT-4 systematically rates GPT-4 outputs higher, Claude rates Claude outputs higher, and so on.
This isn't conscious favoritism. It's a structural property of how perplexity and preference interact. Text that "sounds right" to a model literally means text that the model would be likely to produce. The judge conflates familiarity with quality.
From "Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge": self-preference bias means LLM judges systematically prefer outputs with lower perplexity relative to their own training. This creates a measurable preference for the judge model's own outputs over those of other models, even when human evaluators rate the other model's response higher.
The practical consequence: if you use GPT-4 as your judge and GPT-4 as your agent, your eval scores are inflated by self-preference. Switch the judge to Claude (or vice versa), and scores may drop, not because quality changed, but because the judge's familiarity bias shifted.
Detection: Score the same outputs with two different judge models. If Model A as judge rates Model A outputs significantly higher than Model B as judge rates the same Model A outputs, self-preference is at work.
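The two-judge comparison reduces to a small matrix calculation. A sketch, assuming you've already aggregated mean scores into `scores[judge][author]` (model names here are placeholders):

```typescript
// Sketch: cross-judge score matrix. scores[judge][author] is the mean
// score that judge model gave to outputs authored by that model.
type ScoreMatrix = Record<string, Record<string, number>>;

// Positive gap = the judge rates its own outputs higher than the other
// judges rate those same outputs.
function selfPreferenceGaps(scores: ScoreMatrix): Record<string, number> {
  const judges = Object.keys(scores);
  const gaps: Record<string, number> = {};
  for (const j of judges) {
    const others = judges.filter(o => o !== j);
    const othersMean =
      others.reduce((sum, o) => sum + scores[o][j], 0) / others.length;
    gaps[j] = scores[j][j] - othersMean;
  }
  return gaps;
}
```

A gap well above zero for every judge on its own outputs is the self-preference signature.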
8. Egocentric bias
Related to self-preference but distinct: egocentric bias means the judge evaluates responses based on how it would have answered the question, not on the rubric criteria. If the judge's "ideal answer" involves a particular structure, tone, or reasoning approach, responses that diverge from that ideal get penalized, even if they satisfy the rubric perfectly.
This is particularly problematic for evaluating creative or domain-specific responses. A general-purpose LLM judge might penalize a technical response that uses domain jargon simply because it wouldn't have phrased the answer that way, regardless of whether the jargon is appropriate for the audience.
Detection: Write responses that are correct and rubric-compliant but use a distinctly different style from what the judge model typically produces (very terse, very formal, heavy domain jargon). If these responses score lower than stylistically similar responses of equal quality, egocentric bias is present.
9. Bandwagon bias
When the prompt includes information about how others scored a response (real or implied), the judge shifts toward the majority opinion. Tell the judge that "most reviewers rated this response highly" and scores go up. Tell it "this response was controversial" and scores become more moderate.
In practice, this shows up when eval prompts include reference scores, example evaluations, or any framing that implies consensus. Even few-shot examples in the scoring prompt can create a bandwagon effect by establishing an implicit baseline the judge anchors to.
Detection: Run your eval with and without few-shot examples or reference scores in the prompt. If removing social signals changes the score distribution, bandwagon effects are present.
Category 4: Scoring fragility
These biases aren't about content preferences or structural favoritism. They're about the judge changing its scores when you change the prompt, even though the thing being evaluated hasn't changed at all.
10. Rubric order bias
The order in which you present evaluation criteria in your rubric affects which criteria the judge weights most heavily. Criteria listed first tend to receive more attention and influence the final score more than criteria listed later.
If your rubric starts with "accuracy" and ends with "empathy," accuracy will dominate the evaluation. Reorder the rubric to start with "empathy," and empathy-heavy responses will score higher on the same test set.
Detection: Run your eval suite with two rubric orderings: original and reversed. If the same responses get different scores, your rubric order is creating an implicit weighting that doesn't match your intended weighting.
11. Reference answer bias
Including a reference answer ("ideal response") in the judge's prompt creates an anchor that dominates the evaluation. The judge scores responses based on similarity to the reference rather than on the rubric criteria. Responses that deviate from the reference, even when they're equally correct but take a different approach, get penalized.
This is tricky because reference answers feel like they should improve evaluation quality. You're giving the judge an example of what "good" looks like. But the judge treats the reference as the only acceptable answer rather than one possible good answer, which narrows the evaluation and penalizes valid alternatives.
Detection: Run evals with and without a reference answer. If removing the reference causes scores to become more variable (not lower, more spread out), the reference was acting as an anchor rather than a guide.
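Since the signal here is spread rather than level, compare standard deviations, not means. A sketch, where the 0.7 compression ratio is an illustrative cutoff, not a research-backed constant:

```typescript
// Sketch: compare score spread with and without the reference answer in
// the prompt. A reference acting as an anchor compresses the spread.
function spreadComparison(withRef: number[], withoutRef: number[]) {
  const std = (xs: number[]) => {
    const m = xs.reduce((a, b) => a + b, 0) / xs.length;
    return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / xs.length);
  };
  const withRefStd = std(withRef);
  const withoutRefStd = std(withoutRef);
  return {
    withRefStd,
    withoutRefStd,
    anchored: withRefStd < withoutRefStd * 0.7, // ratio is a judgment call
  };
}
```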
12. Leniency and strictness bias
Different models have different baseline scoring tendencies. Some are lenient graders (high average scores, compressed range), others are strict (lower averages, wider range). This isn't a bug in any single evaluation. It becomes a problem when you change judge models or compare scores across time.
If you switch from GPT-4 to Claude as your judge, average scores might drop by half a point, not because quality changed but because Claude grades differently. If you're tracking score trends over weeks, a judge model update in the middle of your time series creates a discontinuity that looks like a quality regression.
Detection: Score a fixed benchmark set of 30 responses (mix of clearly good, clearly bad, and borderline) with each judge model you're considering. Compare the score distributions. Models with very different distributions will produce incomparable results.
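Comparing the distributions can start with something as simple as a mean-gap check on the shared benchmark set. A sketch, where the 0.5-point comparability cutoff is an assumption to tune for your scale:

```typescript
// Sketch: compare two judges' score distributions on the same fixed
// benchmark set before treating their scores as interchangeable.
function judgeBaselineGap(judgeA: number[], judgeB: number[]) {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const meanGap = Math.abs(mean(judgeA) - mean(judgeB));
  return { meanGap, comparable: meanGap <= 0.5 }; // cutoff is a judgment call
}
```

A large gap doesn't mean either judge is wrong; it means their raw scores can't be mixed in one trend line without recalibration.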
How to detect bias in your pipeline
Knowing the 12 biases is step one. Knowing which ones affect your specific pipeline requires testing.
Perturbation testing
The core technique is simple: change something that shouldn't matter and check if scores change. The CALM framework (from the "Justice or Prejudice?" paper) systematizes this into a protocol:
- Identify the prompt component to test (response order, format, scale direction, reference answer)
- Create permutations that vary only that component
- Run all permutations against the same test set
- Measure variance across permutations
If scores are stable across permutations, that component isn't introducing bias. If scores shift, you've quantified the bias and can decide whether to mitigate it.
Here's a concrete example for position bias testing:
// Pseudocode: test position bias by swapping candidate order
async function testPositionBias(
judge: (prompt: string) => Promise<{ score: number; preference: string }>,
testPairs: Array<{ responseA: string; responseB: string }>
) {
const results = [];
for (const pair of testPairs) {
// Score with original order
const original = await judge(
`Compare these responses:\nResponse A: ${pair.responseA}\nResponse B: ${pair.responseB}`
);
// Score with swapped order
const swapped = await judge(
`Compare these responses:\nResponse A: ${pair.responseB}\nResponse B: ${pair.responseA}`
);
results.push({
originalPreference: original.preference,
swappedPreference: swapped.preference,
flipped: original.preference !== swapped.preference,
});
}
const flipRate = results.filter(r => r.flipped).length / results.length;
console.log(`Position bias flip rate: ${(flipRate * 100).toFixed(1)}%`);
console.log(flipRate > 0.1
? 'SIGNIFICANT position bias detected'
: 'Position bias within acceptable range');
return { flipRate, results };
}
Multi-judge panels
Using a single judge model means every bias specific to that model goes unchallenged. A multi-judge panel uses two or three different model families and compares their scores.
Where scores agree, you have signal. Where they disagree, you have a flag worth investigating. The disagreement itself is informative: if GPT-4 rates a response highly but Claude rates it low, that response is probably triggering a model-specific preference rather than reflecting genuine quality.
This doesn't triple your costs in practice. You don't need to run every evaluation through multiple judges. Run your full test set through one primary judge, then spot-check 15-20% with a second judge. Focus the second judge on borderline cases (scores within 0.5 points of your pass/fail threshold) where bias has the most impact on decisions.
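The spot-check selection can be sketched as a small filter: borderline cases go to the second judge first, then a top-up brings the sample to roughly the target fraction. The names and the 20% default are illustrative:

```typescript
// Sketch: select the second-judge sample, prioritizing borderline cases
// (within 0.5 points of the pass/fail threshold).
interface ScoredResult { id: string; score: number }

function secondJudgeSample(
  results: ScoredResult[],
  passThreshold: number,
  targetFraction = 0.2,
): ScoredResult[] {
  const isBorderline = (r: ScoredResult) =>
    Math.abs(r.score - passThreshold) <= 0.5;
  const borderline = results.filter(isBorderline);
  const target = Math.ceil(results.length * targetFraction);
  if (borderline.length >= target) return borderline;
  // Top up with non-borderline cases (in production, sample these randomly).
  const topUp = results
    .filter(r => !isBorderline(r))
    .slice(0, target - borderline.length);
  return [...borderline, ...topUp];
}
```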
Practical mitigations (one per bias category)
For output preference biases: separate content from form
The root cause of verbosity, format, and authority bias is that the judge conflates presentation with substance. The mitigation is to score them separately.
Instead of asking "Rate this response from 1 to 5," decompose into independent criteria:
- Accuracy: Does the response contain correct information?
- Completeness: Does it address all parts of the question?
- Conciseness: Does it deliver the answer without unnecessary padding?
- Tone: Is the communication style appropriate for the audience?
When conciseness is an explicit criterion, the judge can't reward verbosity without tanking that specific score. When accuracy is separate from format, a well-formatted wrong answer doesn't hide behind its bullet points.
This is the core argument for multi-criteria scorecards. A single aggregate score lets presentation biases hide inside the number. Per-criterion scores make each dimension visible, auditable, and independently tunable.
import Chanl from '@chanl/sdk';
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });
const scorecard = await chanl.scorecard.create({
name: 'Support Response Quality',
criteria: [
{
name: 'Factual Accuracy',
type: 'prompt',
weight: 35,
prompt: 'Rate factual accuracy on a 1-5 scale. 1: Contains incorrect or hallucinated facts. 3: Mostly accurate with minor omissions. 5: Completely accurate, all claims verifiable against source docs.',
},
{
name: 'Conciseness',
type: 'prompt',
weight: 20,
prompt: 'Rate conciseness on a 1-5 scale. 1: Excessively long, repeats information, includes irrelevant detail. 3: Reasonable length with some unnecessary padding. 5: Every sentence adds value, no redundancy.',
},
{
name: 'Policy Adherence',
type: 'prompt',
weight: 30,
prompt: 'Rate policy adherence on a 1-5 scale. 1: Violates stated policies or makes unauthorized commitments. 3: Follows policies but misses edge cases. 5: Strictly adheres to all applicable policies including edge cases.',
},
{
name: 'Tone Appropriateness',
type: 'prompt',
weight: 15,
prompt: 'Rate tone appropriateness on a 1-5 scale. 1: Inappropriate tone (too casual, too formal, dismissive). 3: Acceptable tone with room for improvement. 5: Perfectly calibrated for the audience and situation.',
},
],
});
Notice the scoring anchors in each prompt. Each score level describes what that score looks like with concrete examples, not vague descriptors like "good" or "excellent." Concrete anchors constrain the judge's interpretation and reduce the variance introduced by leniency/strictness bias.
For positional biases: randomize and double-check
Position bias is the easiest to mitigate mechanically. For pairwise comparisons, run each comparison twice with candidates swapped and only accept consistent results. For scoring scales, standardize your scale direction and never change it mid-evaluation.
For ID bias, use neutral labels (A/B) and never include model names or identifying information in the judge's prompt. This is a metadata hygiene issue: your judge should evaluate the text, not the source.
For self-reinforcing biases: diversify the jury
Self-preference, egocentric, and bandwagon biases all stem from using a single model as both the standard and the evaluator. The structural fix is to use a different model family for judging than for generating.
If your agent runs on GPT-4o, use Claude as your judge (or vice versa). If that's not practical, at minimum, calibrate against human reviewers regularly. Run 50 test cases through both your LLM judge and a human reviewer quarterly. If their agreement rate drops below 80%, investigate which criteria are diverging and tighten your rubric anchors for those specific criteria.
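The quarterly human-calibration check boils down to an agreement rate. A sketch, where the 80% floor matches the guideline above and the 1-point tolerance is an assumption to tune for your scale:

```typescript
// Sketch: human-vs-judge agreement on the same test cases. Scores "agree"
// when they land within `tolerance` points of each other.
function agreementRate(
  humanScores: number[],
  judgeScores: number[],
  tolerance = 1, // assumption: within 1 point on a 5-point scale counts
) {
  let agree = 0;
  for (let i = 0; i < humanScores.length; i++) {
    if (Math.abs(humanScores[i] - judgeScores[i]) <= tolerance) agree++;
  }
  const rate = agree / humanScores.length;
  return { rate, needsInvestigation: rate < 0.8 };
}
```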
For scoring fragility: lock the prompt, version everything
Rubric order bias, reference answer bias, and leniency/strictness bias all get worse when the eval prompt changes without tracking. Treat your evaluation prompt as code: version it, review changes, and re-baseline scores whenever the prompt changes.
Specifically:
- Fix your rubric order and document why criteria appear in that sequence
- Avoid reference answers unless you include multiple acceptable references to prevent anchoring
- Pin your judge model version so scores are comparable over time
- Track judge model changes as you would track any other infrastructure change that affects metrics
If you run scenario tests as part of your eval pipeline, version-lock the scoring configuration alongside the scenario definitions. A scenario result is only meaningful relative to the specific judge configuration that produced it.
The calibration loop
None of these mitigations are set-and-forget. Bias profiles change when you update judge models, modify rubrics, or shift the distribution of content you're evaluating. The goal is a calibration loop:
- Measure bias with perturbation tests on your current pipeline
- Mitigate the most impactful biases (usually verbosity and position first)
- Monitor score distributions over time through your analytics dashboard
- Recalibrate quarterly or when you change any component of the eval stack
The point isn't to achieve zero bias. That's not possible with current models. The point is to know which biases are present, how large they are, and whether they're distorting the decisions you make from eval results.
A team that knows their judge has a 15% position bias flip rate and compensates for it is in a better position than a team that assumes their judge is unbiased because they never tested it.
Remember the pipeline from the start of this article? Scores going up when you improve prompts, down when you break things, everything looking reasonable? That pipeline might be fine. Or it might be telling you a story shaped by verbosity preferences, position effects, and self-reinforcing familiarity. You won't know until you test for it. The good news: now you know exactly what to test for.
What to read next
If you're building an eval framework from scratch, the companion piece How to Evaluate AI Agents: Build an Eval Framework covers the implementation end to end, including LLM-as-judge setup, regression baselines, and CI integration.
For teams already running evals who want to improve scoring quality, the multi-criteria scorecard pattern described above is the single highest-leverage change you can make. Decomposing a single score into independent dimensions doesn't just mitigate bias. It tells you exactly which dimensions of quality are improving or degrading, which makes every prompt iteration faster and more targeted.