Jacky Kwok$^{1\dagger}$, Shulu Li$^{2}$, Pranav Atreya$^{2}$, Yuejiang Liu$^{1}$, Marco Pavone$^{1,3}$, Ion Stoica$^{2§}$, Azalia Mirhoseini$^{1§}$
:stanford: Stanford University$^{1}$ :berkeley: UC Berkeley$^{2}$ :nvidia: NVIDIA$^{3}$
$^{\dagger}$Project Lead
*Core Contribution
§Equal Advising
🗓️ Posted: April 9, 2026
<aside> 👑
Try LLM-as-a-Verifier on GitHub:

From Stanford AI Lab & UC Berkeley Sky Computing Lab
</aside>
Across challenging long-horizon benchmarks such as Terminal-Bench 2.0 and SWE-Bench Verified, LLM-as-a-Verifier outperforms frontier models including Claude Opus 4.6, GPT 5.4, and the Gemini models. Results are reported from the official Terminal-Bench and SWE-Bench leaderboards.

Note: We use ForgeCode and mini-swe-agent as the scaffolds. For Terminal-Bench, we sample 5 trajectories from Claude Opus 4.6. For SWE-Bench, we sample 3 trajectories each from Claude Opus 4.6, Gemini 3 Flash, and Claude Opus 4.5. Gemini 2.5 Flash is used as the verifier in our experiments. Our results are fully reproducible and available on GitHub.
We find that verification accuracy consistently improves as we scale three axes: scoring granularity, repeated verification, and criteria decomposition. LLM-as-a-Verifier achieves 78.9% pairwise verification accuracy on Terminal-Bench and lifts the downstream success rate from 81.8% to 86.4% (SOTA) through test-time scaling and verification.
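The test-time scaling recipe above amounts to best-of-N selection: sample several trajectories, score each with the verifier multiple times, and keep the highest-scoring one. A minimal sketch, where `verify` is a hypothetical stand-in for an LLM verifier call (our actual verifier prompt and scoring details are on GitHub):

```python
from statistics import mean
from typing import Callable, Sequence

def select_best(
    trajectories: Sequence[str],
    verify: Callable[[str], float],
    num_repeats: int = 3,
) -> str:
    """Best-of-N selection with repeated verification: score each sampled
    trajectory `num_repeats` times and return the one with the highest
    mean verifier score."""
    scored = [
        (mean(verify(t) for _ in range(num_repeats)), t)
        for t in trajectories
    ]
    return max(scored)[1]

# Toy usage with a stand-in verifier that scores trajectories by length.
best = select_best(
    ["short", "a longer trajectory"],
    verify=lambda t: float(len(t)),
)
```

Averaging over `num_repeats` verifier calls is what "repeated verification" scales; with a stochastic verifier it reduces the variance of each trajectory's score before selection.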

Standard LLM-as-a-Judge :lm-judge: prompts the model to output a score token (e.g., 1–8) and takes the highest-probability token as the final discrete score. However, this approach suffers from coarse-grained scoring: when comparing complex agent trajectories, both trajectories often receive the same score (e.g., both a 4), resulting in a tie and failing to discriminate between them. Coarse scoring leads to 27% ties on Terminal-Bench.
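The tie problem is a direct consequence of discretization: many distinct underlying qualities collapse into the same integer bucket. A simple simulation (a sketch, not our verifier; the 1–8 discretizer and latent-quality model are illustrative assumptions) makes the effect concrete:

```python
import random

def coarse_score(quality: float) -> int:
    """Discretize a latent quality in [0, 1] to a 1-8 judge-style score."""
    return min(8, max(1, round(quality * 7) + 1))

def pairwise_tie_rate(num_pairs: int, scorer, seed: int = 0) -> float:
    """Fraction of random trajectory pairs the scorer cannot discriminate."""
    rng = random.Random(seed)
    ties = 0
    for _ in range(num_pairs):
        a, b = rng.random(), rng.random()  # latent qualities of two trajectories
        if scorer(a) == scorer(b):
            ties += 1
    return ties / num_pairs

# Coarse 1-8 scoring ties on a sizable fraction of pairs; scoring on the
# continuous latent quality itself almost never does.
coarse = pairwise_tie_rate(10_000, coarse_score)
fine = pairwise_tie_rate(10_000, lambda q: q)
print(f"coarse tie rate: {coarse:.2f}, fine tie rate: {fine:.2f}")
```

This is why scaling scoring granularity helps: finer score scales shrink each bucket, so fewer genuinely different trajectories collapse into a tie.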

