Among other things, the evalthat package provides a suite of functionality for model grading or, as it’s often referred to in the literature, LLM-as-a-judge. Model grading entails one LLM evaluating a response produced by another. That is, after asking a question of an LLM and receiving an answer, both the question and the answer are provided to a second language model, which is asked to judge whether the response was satisfactory.
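To make the setup concrete, here is a minimal sketch of the kind of prompt a judge model might receive. The wording and the `judge_prompt()` helper are illustrative assumptions only, not the prompt or interface that evalthat actually uses.

``` r
# Illustrative only: a hypothetical grading prompt that hands the judge both
# the original question and the candidate answer and asks for a verdict.
judge_prompt <- function(question, answer) {
  paste0(
    "You are grading a response produced by another model.\n\n",
    "Question:\n", question, "\n\n",
    "Response:\n", answer, "\n\n",
    "Is this response satisfactory? Answer 'yes' or 'no', then briefly explain."
  )
}

cat(judge_prompt("What does 2 + 2 equal?", "2 + 2 equals 4."))
```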
Given the infamously stochastic and gullible tendencies of LLMs, it’s not at all obvious that this should work. Indeed, many earlier attempts at such a framework were unsuccessful by most measures, and only recently have models advanced to the point where model grading is a helpful tool in an evaluation toolkit.
The design of evalthat’s model grading tools is heavily influenced by the most recent research on LLM-as-a-judge. This vignette will outline some of the findings that guided evalthat’s interface for model grading.
Models should not judge their own output
Ergonomically, it’s totally reasonable that a practitioner would want to use the same model to generate answers as they use to judge them. If I’ve gone to the effort of setting up an API key and configuring access to a model, can’t I just use it to judge its own responses? Unfortunately, this doesn’t work well; LLMs are prone to self-enhancement bias, where they’re likely to prefer their own answer over one supplied by another model. This holds even when models don’t know which model a given response came from (Ye et al. 2024). As such, evalthat will, by default, exclude models from evaluating responses they generated themselves.
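As a rough sketch of what that default means in practice, the hypothetical helper below drops the generating model from a pool of candidate judges. The `eligible_judges()` function and the model names are assumptions for illustration, not part of evalthat’s interface.

``` r
# Hypothetical sketch: never let a model judge its own responses by removing
# the generator from the pool of candidate judges.
eligible_judges <- function(generator, judge_pool) {
  setdiff(judge_pool, generator)
}

eligible_judges(
  generator = "gpt-4o",
  judge_pool = c("gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro")
)
#> [1] "claude-3-5-sonnet" "gemini-1.5-pro"
```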
Judge using “strong” LLMs
To reduce the compute associated with evaluation, many studies have proposed making use of smaller models fine-tuned specifically for judging. While these models have shown promise in some applications (Verga et al. 2024), research has generally shown that larger models intended for broader use cases tend to make better evaluators; “although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, aspect-specific evaluation, and scalability” (Fu et al. 2024; Huang et al. 2024). Colloquially, such models are often referred to as “strong” LLMs (Zheng et al. 2023; Gu et al. 2025).
That said, several of the findings cited here are based on results utilizing—at least in part—smaller, open-source models as judges (Fu et al. 2024; Schroeder and Wood-Doughty 2024).
Prefer pairwise comparisons over scoring
LLMs are often used to evaluate output in two notable ways:
Pairwise comparisons: Two models are asked a question (possibly with a human-written reference answer) and both provide answers. Then, a third model is provided the question, the desired answer, and the two model responses, and is asked to choose one of the two model responses as its preference.
Scoring: A model is asked a question (possibly with a human-written reference answer) and provides an answer. Then, another model is provided the question, the desired answer, and the model’s response, possibly along with a rubric, and is asked to rate the response according to the rubric on some numeric scale.
Scoring methods have been shown to be easily swayed by the connotations of particular wordings and by unrelated pieces of context, while models have been shown to evaluate more reliably when comparing responses pairwise (Ouyang et al. 2022; Schroeder and Wood-Doughty 2024).
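To illustrate the difference between the two setups, the sketch below contrasts the shape of a pairwise prompt with that of a scoring prompt. Both templates and their function names are assumptions made for illustration rather than evalthat’s own prompts.

``` r
# Hypothetical prompt shapes for the two evaluation styles described above.
pairwise_prompt <- function(question, answer_a, answer_b) {
  paste0(
    "Question:\n", question, "\n\n",
    "Response A:\n", answer_a, "\n\n",
    "Response B:\n", answer_b, "\n\n",
    "Which response better answers the question? Reply with 'A' or 'B'."
  )
}

scoring_prompt <- function(question, answer, rubric) {
  paste0(
    "Question:\n", question, "\n\n",
    "Response:\n", answer, "\n\n",
    "Rubric:\n", rubric, "\n\n",
    "Rate the response from 1 to 10 according to the rubric."
  )
}
```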
Position matters
In the context of pairwise comparisons, many language models are vulnerable to position bias, where they tend to prefer the response presented first over the one presented second (or vice versa) regardless of the quality of the responses (Wang et al. 2023). Some models are much more resilient to this than others: for example, GPT-4-turbo was shown to prefer the same response when the order was swapped 80.5% of the time, while LLaMA3-8B-Instruct did so only 38.9% of the time (Gu et al. 2025). Generally, strong LLMs tend to be less susceptible to this bias, and a variety of methods exist to address it.
Run the same evaluation multiple times
Notably, and related to addressing position bias, aggregating judgments across multiple runs on the same content has been shown to lead to better evaluations.
To mitigate position bias, Zheng et al. (2023) call the same judge twice “by swapping the order of two answers and only declare a win when an answer is preferred in both orders.” When the preferences are inconsistent after swapping orders, they declare a tie. In the context of scoring, Wang et al. (2023) take a similar approach, polling the judge with both orders and averaging the resulting scores.
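A minimal sketch of that order-swapping strategy, assuming a user-supplied `judge()` function that takes two responses and returns "A" if it prefers the first and "B" if it prefers the second (both the function and its return convention are assumptions for illustration):

``` r
# Ask the judge twice, swapping the order of the responses, and only declare a
# winner when the same response is preferred in both orders; otherwise, tie.
judge_both_orders <- function(judge, response_1, response_2) {
  first  <- judge(response_1, response_2)  # response_1 presented as "A"
  second <- judge(response_2, response_1)  # response_1 presented as "B"
  if (first == "A" && second == "B") {
    "response_1"
  } else if (first == "B" && second == "A") {
    "response_2"
  } else {
    "tie"  # inconsistent preferences across orders
  }
}
```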
Other studies have proposed that aggregating evaluations across several separate models can mitigate additional biases. For example, many models demonstrate verbosity bias, where they tend to rate longer responses more highly. However, Ye et al. (2024) showed that “response length influences model judgment in complex ways;” some models are much less susceptible to this bias than others, and some even penalize excessively verbose answers. A similar story holds for compassion fade, where models might prefer responses containing references to positively connoted words, or attentional bias, where models might disproportionately incorporate irrelevant information into their evaluation (Koo et al. 2024). Because models vary in their susceptibility to these biases, aggregating responses across many of them often outperforms using a single judge (Verga et al. 2024; Schroeder and Wood-Doughty 2024). Gu et al. (2025) show this to be the case with majority voting specifically.
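As a sketch of what majority voting across a panel of judges might look like, the hypothetical `majority_vote()` helper below tallies pairwise preferences from a list of judge functions. The judge functions and their "A"/"B" return convention are, again, assumptions for illustration rather than evalthat’s interface.

``` r
# Poll each judge in the panel and return the most common preference.
majority_vote <- function(judges, response_1, response_2) {
  votes <- vapply(judges, function(judge) judge(response_1, response_2), character(1))
  tallies <- table(votes)
  names(tallies)[which.max(tallies)]
}

# Stand-in judges that always vote the same way, just to show the mechanics:
judges <- list(
  function(a, b) "A",
  function(a, b) "B",
  function(a, b) "A"
)
majority_vote(judges, "first response", "second response")
#> [1] "A"
```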