Jacky Kwok$^{1}$$^{\dagger}$, Shulu Li$^{2}$, Pranav Atreya$^{2}$, Yuejiang Liu$^{1}$, Marco Pavone$^{1,3}$

Ion Stoica$^{2§}$, Azalia Mirhoseini$^{1§}$

:stanford: Stanford University$^{1}$ :berkeley: UC Berkeley$^{2}$ :nvidia: NVIDIA$^{3}$

$^{\dagger}$Project Lead

*Core Contribution

§Equal Advising

🗓️ Posted: April 9, 2026

<aside> 👑

SOTA on Terminal-Bench & SWE-Bench Verified

Try LLM-as-a-Verifier on GitHub:

From Stanford AI Lab & UC Berkeley Sky Computing Lab

</aside>

Evaluating LLM-as-a-Verifier

Across challenging long-horizon benchmarks such as Terminal-Bench 2.0 and SWE-Bench Verified, LLM-as-a-Verifier outperforms frontier models including Claude Opus 4.6, GPT 5.4, and Gemini models. Results are reported from the official Terminal-Bench and SWE-Bench leaderboards.

Note: We use ForgeCode and mini-swe-agent as the scaffolds. For Terminal-Bench, we sample 5 trajectories from Claude Opus 4.6. For SWE-Bench, we sample 3 trajectories each from Claude Opus 4.6, Gemini 3 Flash, and Claude Opus 4.5. Gemini 2.5 Flash is used as the verifier in our experiments. Our results are fully reproducible and available on GitHub.

TL;DR

We find that verification accuracy consistently improves as we scale scoring granularity, repeated verification, and criteria decomposition. LLM-as-a-Verifier achieves 78.9% pairwise verification accuracy on Terminal-Bench and raises the downstream success rate from 81.8% to 86.4% (SOTA) through test-time scaling and verification.
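The test-time scaling loop described above amounts to best-of-N selection: sample several candidate trajectories, score each with a verifier model, and keep the highest-scoring one. A minimal sketch follows; the `best_of_n` helper and the `stub_score` heuristic are hypothetical stand-ins (in the experiments the scorer would be an LLM call to the verifier, e.g. Gemini 2.5 Flash), not the project's actual implementation.

```python
from typing import Callable, List

def best_of_n(trajectories: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate trajectory that the verifier scores highest."""
    return max(trajectories, key=score)

# Stand-in verifier: a toy heuristic used only to make this sketch runnable.
# In practice, score() would query the verifier LLM on the full trajectory.
def stub_score(trajectory: str) -> float:
    return float(len(trajectory))

candidates = ["a", "abc", "ab"]
best = best_of_n(candidates, stub_score)
print(best)  # -> "abc", the longest (highest-scoring) candidate
```

With 5 Terminal-Bench trajectories per task, this reduces to one verifier pass per candidate followed by an argmax over scores.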

Motivation

Standard LLM-as-a-Judge prompts the model to output a score token (e.g., 1–8) and selects the highest-probability token as the final discrete score. However, this approach suffers from coarse-grained scoring: when comparing complex agent trajectories, the judge often assigns both trajectories the same score (e.g., both receive a 4), producing a tie and failing to discriminate between them. Coarse scoring leads to 27% ties on Terminal-Bench.
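The tie problem above can be made concrete with a small sketch. The token-probability distributions below are invented for illustration, and the probability-weighted "expected score" is one common way to extract a finer-grained signal from the same logits; it is shown here only to illustrate why coarse argmax scoring ties, not as the paper's method.

```python
def discrete_score(token_probs: dict) -> int:
    # Standard LLM-as-a-Judge: keep only the single most likely score token.
    return max(token_probs, key=token_probs.get)

def expected_score(token_probs: dict) -> float:
    # Finer-grained alternative: probability-weighted average over score tokens.
    return sum(score * prob for score, prob in token_probs.items())

# Hypothetical score-token distributions for two trajectories being compared.
traj_a = {3: 0.10, 4: 0.55, 5: 0.35}
traj_b = {3: 0.30, 4: 0.55, 5: 0.15}

# Argmax scoring collapses both to 4 -- a tie that cannot rank them.
assert discrete_score(traj_a) == discrete_score(traj_b) == 4

# The underlying distributions still prefer trajectory A (4.25 vs 3.85).
assert expected_score(traj_a) > expected_score(traj_b)
```

The same logits that yield a 4-vs-4 tie under argmax scoring still carry a preference between the two trajectories, which is the discriminative signal coarse scoring throws away.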
