Jacky Kwok$^{1\dagger}$, Shulu Li$^{2}$, Pranav Atreya$^{2}$, Yuejiang Liu$^{1}$, Marco Pavone$^{1,3}$, Ion Stoica$^{2§}$, Azalia Mirhoseini$^{1§}$
:stanford: Stanford University$^{1}$ :berkeley: UC Berkeley$^{2}$ :nvidia: NVIDIA$^{3}$
$^{\dagger}$Project Lead
*Core Contribution
§Equal Advising
🗓️ Posted: April 9, 2026
<aside> 👑
Try LLM-as-a-Verifier on GitHub:

From Stanford AI Lab & UC Berkeley Sky Computing Lab
</aside>
Across challenging long-horizon benchmarks such as Terminal-Bench 2.0 and SWE-Bench Verified, LLM-as-a-Verifier outperforms frontier models including Claude Opus 4.6, GPT 5.4, and the Gemini models. Results are reported from the official Terminal-Bench and SWE-Bench leaderboards.

Note: We use ForgeCode and mini-swe-agent as the scaffolds. For Terminal-Bench, we sample 5 trajectories from Claude Opus 4.6. For SWE-Bench, we sample 3 trajectories each from Claude Opus 4.6, Gemini 3 Flash, and Claude Opus 4.5. Gemini 2.5 Flash is used as the verifier in our experiments. Our results are fully reproducible and available on GitHub.
We find that verification accuracy consistently improves as we scale three axes: scoring granularity, repeated verification, and criteria decomposition. LLM-as-a-Verifier achieves 78.9% pairwise verification accuracy on Terminal-Bench and lifts the downstream success rate from 81.8% to 86.4% (SOTA) through test-time scaling and verification.
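The test-time scaling recipe above amounts to best-of-N selection: sample several trajectories, score each with the verifier multiple times, and keep the highest-scoring one. A minimal sketch, where `verify` is a hypothetical stand-in for an LLM verifier call (our actual verifier prompt and scoring details are on GitHub):

```python
from statistics import mean
from typing import Callable, Sequence

def select_best(
    trajectories: Sequence[str],
    verify: Callable[[str], float],
    num_repeats: int = 3,
) -> str:
    """Best-of-N selection with repeated verification: score each sampled
    trajectory `num_repeats` times and return the one with the highest
    mean verifier score."""
    scored = [
        (mean(verify(t) for _ in range(num_repeats)), t)
        for t in trajectories
    ]
    return max(scored)[1]

# Toy usage with a stand-in verifier that scores trajectories by length.
best = select_best(
    ["short", "a longer trajectory"],
    verify=lambda t: float(len(t)),
)
```

Averaging over `num_repeats` verifier calls is what "repeated verification" scales; with a stochastic verifier it reduces the variance of each trajectory's score before selection.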

Standard LLM-as-a-Judge :lm-judge: prompts the model to output a score token (e.g., 1–8) and takes the highest-probability token as the final discrete score. However, this approach suffers from coarse-grained scoring: when comparing complex agent trajectories, both trajectories often receive the same score (e.g., both a 4), resulting in a tie and failing to discriminate between them. Coarse scoring leads to 27% ties on Terminal-Bench.
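The tie problem is a direct consequence of discretization: many distinct underlying qualities collapse into the same integer bucket. A simple simulation (a sketch, not our verifier; the 1–8 discretizer and latent-quality model are illustrative assumptions) makes the effect concrete:

```python
import random

def coarse_score(quality: float) -> int:
    """Discretize a latent quality in [0, 1] to a 1-8 judge-style score."""
    return min(8, max(1, round(quality * 7) + 1))

def pairwise_tie_rate(num_pairs: int, scorer, seed: int = 0) -> float:
    """Fraction of random trajectory pairs the scorer cannot discriminate."""
    rng = random.Random(seed)
    ties = 0
    for _ in range(num_pairs):
        a, b = rng.random(), rng.random()  # latent qualities of two trajectories
        if scorer(a) == scorer(b):
            ties += 1
    return ties / num_pairs

# Coarse 1-8 scoring ties on a sizable fraction of pairs; scoring on the
# continuous latent quality itself almost never does.
coarse = pairwise_tie_rate(10_000, coarse_score)
fine = pairwise_tie_rate(10_000, lambda q: q)
print(f"coarse tie rate: {coarse:.2f}, fine tie rate: {fine:.2f}")
```

This is why scaling scoring granularity helps: finer score scales shrink each bucket, so fewer genuinely different trajectories collapse into a tie.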

